All About Using Python & R for Machine Learning, Data Science, Data Analysis, Deep Learning, and Artificial Intelligence

Friday, November 24, 2017

Implementing Logistic Regression Using Python

In the last tutorial, we implemented Simple Linear Regression to fit a model. Today, we will learn about one of the classification algorithms, Logistic Regression, using Python. Every machine learning algorithm works best under a given set of conditions, and making sure your algorithm fits the assumptions/requirements ensures superior performance; you cannot use just any algorithm in any condition. For example, we cannot use linear regression on a categorical dependent variable, because we would only get extremely low values of adjusted R² and the F statistic. Instead, in such situations, we should try algorithms such as Logistic Regression, Decision Trees, Support Vector Machines, Random Forests, etc.


Brief of Logistic Regression

Logistic Regression is one of the most popular ways to fit models for categorical data, especially for binary response data in data modeling. It is the most important (and probably most used) member of the class of models called generalized linear models. Unlike linear regression, logistic regression can directly predict probabilities (values that are restricted to the (0, 1) interval). Furthermore, those probabilities are well-calibrated compared to the probabilities predicted by some other classifiers, such as Naive Bayes. Logistic regression preserves the marginal probabilities of the training data, and the coefficients of the model also provide some hint of the relative importance of each input variable.


Logistic Regression is used when the dependent variable is categorical. For example: to predict whether a tumor is malignant (1) or not (0), or whether an email is spam (1) or not (0).
Logistic Regression is generally used where the dependent variable is binary or dichotomous, meaning it can take only two possible values such as "Yes or No", "Default or No Default", "Living or Dead", etc. The independent factors or variables can be categorical or numerical.
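
Concretely, logistic regression keeps its predictions inside the (0, 1) interval by passing a linear combination of the inputs through the logistic (sigmoid) function. A tiny sketch of that function (the standard formulation, not tied to any particular dataset):
#logistic (sigmoid) function: maps any real number into the (0, 1) interval
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))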

Let’s code!
In this tutorial we use the 'Social_Network_Ads.csv' dataset.
#importing library
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#importing dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
So, just a quick reminder: this dataset contains information about users of a social network, such as user ID, gender, age, estimated salary, and whether they purchased the product.


This social network has several business clients that can place ads on it. One of these clients is a car company that has just launched its brand new luxury SUV at a ridiculous price, and we are trying to see which users of the social network are going to buy this SUV. We are going to build a model that predicts whether or not a user will buy the product based on two variables, "age" and "estimated salary". In other words, we want to find some correlation between the age and estimated salary of a user and his or her decision to purchase.
X = dataset.iloc[:, [2,3]].values
y = dataset.iloc[:, 4].values

#splitting dataset into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

The X variable contains the Age and Estimated Salary columns.

The y variable contains the Purchased column.

Because the age variable and the salary variable are not on the same scale, this would cause issues for our machine learning model, so we need to apply feature scaling.
#feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
(X_train and X_test before and after feature scaling)

Now, we can start fitting Logistic Regression to our training set.
#fitting logistic regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
So, our classifier learns the correlations between X_train and y_train. By learning those correlations, it will be able to predict new observations, and we can test its predictive power on a different set, the test set.

After fitting them, we are going to predict the test results.
#predicting the test set results
y_pred = classifier.predict(X_test)
(X_test alongside the predicted values y_pred)

So, let's create the confusion matrix to see the correct and incorrect prediction.
#making the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

The result shows that we have 65 + 24 = 89 correct predictions and 8 + 3 = 11 incorrect predictions. That's good, and it is a first step in evaluating the model's performance.
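
As an optional sanity check (a small sketch, assuming the cm variable computed above), we can read the accuracy straight off the confusion matrix, since the correct predictions sit on its diagonal.
#accuracy from the confusion matrix (diagonal = correct predictions)
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
print(accuracy)  #89 correct out of 100 test observations -> 0.89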

Now, we are going to make a graph to clearly see the regions where our logistic regression model predicts "Yes" or "No".
#visualizing the training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()


Then, we visualize the test set results:
#visualizing the test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

So, we can see that the prediction regions fit the real values well. Congratulations, you have implemented your first classification model using Python. See you in the next tutorial.


Wednesday, November 22, 2017

Practical: Simple Linear Regression Using Python

We have learned all about the important data preprocessing steps, and the last thing we learned was feature scaling. Today, we will learn about Simple Linear Regression (SLR).

Adapting from Wikipedia, simple linear regression is a linear regression model with a single explanatory variable. SLR is useful for finding the relationship between two continuous variables: one is the predictor (independent variable) and the other is the response (dependent variable). It looks for a statistical relationship rather than a deterministic relationship. The relationship between two variables is deterministic if one variable can be exactly expressed by the other; for example, from a temperature in degrees Celsius it is possible to exactly compute the temperature in Fahrenheit. A statistical relationship is not exact; for example, the relationship between height and weight.

The core idea is to obtain a line that best fits the data. The best-fit line is the one for which the total prediction error (over all data points) is as small as possible, where the error is the vertical distance from a point to the regression line.
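
In other words, simple linear regression looks for a line of the form y = b0 + b1*x, choosing the intercept b0 and slope b1 that minimize the sum of squared errors over all data points. A small sketch of that quantity (the standard least-squares idea, not tied to this particular dataset):
#sum of squared errors of a candidate line y = b0 + b1*x
import numpy as np
def sum_of_squared_errors(b0, b1, x, y):
    predictions = b0 + b1 * x
    return np.sum((y - predictions) ** 2)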


So, let's look at simple linear regression, because it is the easiest one to discuss, with a specific example where we have experience and salary. Salary is on the vertical axis, and we want to understand how people's salary depends on their experience.


Let's code:
We paste the code from the data preprocessing template. The first step is importing the libraries:
#importing library
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Then we import the dataset using the pandas library:
#importing dataset
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
We can see there are 30 rows (observations) in the dataset, containing the YearsExperience and Salary columns. That is why X is YearsExperience (the independent variable) and y is Salary (the dependent variable).
Then, we split the dataset into training and testing data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
After we split the dataset, we fit the training data using linear regression. Don't worry, there is a library to handle it.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
#predicting the test result
y_pred = regressor.predict(X_test)
Now, it's time to visualize the training set result:
plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Years of Experience vs Salary (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
So, the real values are the red observation points; they contain the real salaries and experience of the employees. The predicted values are on the blue simple regression line. By the way, we have some quite accurate predictions: we can see that the real salaries are close to the predicted salaries.
We write the same code to visualize the test set results.
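A sketch of what that looks like, reusing the regressor and variables from above (the blue regression line is the same whether it is drawn from X_train or X_test):
#visualizing the test set results
plt.scatter(X_test, y_test, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Years of Experience vs Salary (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()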
Amazing. It is easy to practice simple linear regression, isn't it? :-)

Tuesday, November 21, 2017

Data Preprocessing (Feature Scaling)

After learning how to split a dataset into training and testing sets in the previous tutorial, today we are going to cover the last preprocessing topic, feature scaling, which is very important in machine learning. Let's learn what feature scaling is and why we need to do it.


As we can see, we have two columns, Age and Salary, that contain numerical variables. You will notice that the variables are not on the same scale: the ages go from 27 to 50 and the salaries go from around 40,000 to 90,000. Because the age variable and the salary variable do not have the same scale, this will cause issues in our machine learning models. A lot of machine learning models are based on what is called the Euclidean distance, and the Euclidean distance between two data points is the square root of the sum of the squared differences of their coordinates.


Here it is the same idea: we have two variables, age and salary, so we can treat age as the x coordinate and salary as the y coordinate. Many machine learning models then compute Euclidean distances between observation points in this space.
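
As a small illustration (two made-up observations, not actual rows from the dataset), notice how the salary difference completely dominates the Euclidean distance when the features are left unscaled.
#Euclidean distance between two hypothetical (age, salary) points
import numpy as np
a = np.array([27.0, 48000.0])
b = np.array([50.0, 83000.0])
distance = np.sqrt(np.sum((a - b) ** 2))
print(distance)  #about 35000, driven almost entirely by the salary difference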


There are several ways of scaling our data. A very common one is standardization, which means that for each feature we subtract the mean of all the values of that feature and divide by its standard deviation. Another type is normalization, which means that we subtract the minimum of all the feature values from the observation's feature value x and divide by the difference between the maximum and minimum of the feature values.
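
A minimal sketch of those two formulas with NumPy, applied column-wise to a small hypothetical array x (scikit-learn's StandardScaler and MinMaxScaler implement the same ideas):
#standardization: (x - mean) / standard deviation, per column
#normalization (min-max): (x - min) / (max - min), per column
import numpy as np
x = np.array([[27.0, 48000.0], [35.0, 61000.0], [50.0, 83000.0]])  #hypothetical (age, salary) rows
x_standardized = (x - x.mean(axis=0)) / x.std(axis=0)
x_normalized = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))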

Feature scaling means putting the variables in the same range and on the same scale, so that no single variable dominates the others.
Let’s code in Python:
#feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
(X_train and X_test before and after feature scaling)





Congratulations!!! We have learned the feature scaling concept and how to apply it in Python.

Sunday, November 19, 2017

Data Preprocessing (Splitting Dataset)

In the previous tutorial, we encoded the categorical data in the first column ('Country') and created dummy variables, which makes computation for machine learning convenient and fast.

Today, we will split the dataset into a training set and a testing set. In statistics and machine learning we usually split our data into two subsets, training data and testing data (and sometimes three: train, validate, and test), fit our model on the training data, and then make predictions on the test data. When we do that, one of two things might happen: we overfit our model or we underfit our model. We don't want either of these things to happen, because they affect the predictive ability of our model: we might end up using a model that has lower accuracy and/or is ungeneralized (meaning we can't generalize our predictions to other data). Let's see what under- and overfitting actually mean:

Overfitting
Overfitting means that the model we trained has fit the training dataset "too well". This usually happens when the model is too complex (i.e. too many features/variables compared to the number of observations). Such a model will be very accurate on the training data but will probably be very inaccurate on new, unseen data, because it is not generalized. Basically, when this happens, the model learns or describes the "noise" in the training data instead of the actual relationships between the variables. This noise, obviously, isn't part of any new dataset, and the model cannot be applied to it.

Underfitting
In contrast to overfitting, when a model is underfitted, it means that the model does not fit the training data and therefore misses the trends in the data. It also means the model cannot be generalized to new data. As you probably guessed (or figured out!), this is usually the result of a very simple model (not enough predictors/independent variables). It could also happen when, for example, we fit a linear model (like linear regression) to data that is not linear. It almost goes without saying that this model will have poor predictive ability.


It is worth noting that underfitting is not as prevalent as overfitting. Nevertheless, we want to avoid both of these problems in data analysis. You might say we are trying to find the middle ground between underfitting and overfitting our model. As you will see, the train/test split and cross-validation help to avoid overfitting more than underfitting. Let's dive into both of them!

Train/Test Split
As I said before, the data we use is usually split into training data and test data. The training set contains a known output, and the model learns from this data in order to generalize to other data later on. We then use the test dataset (or subset) to test our model's predictions on data it has not seen.


Let’s see how to do that. We’ll do this using the Scikit-Learn library and specifically the train_test_split method.
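
A minimal sketch of that split (assuming X and y have already been prepared as in the previous tutorials):
#splitting dataset into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)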

The test_size=0.2 argument inside the function indicates the percentage of the data that should be held out for testing: roughly 80% of the data is used for training and 20% for testing. random_state is the seed used by the random number generator; if a RandomState instance is passed, it is used as the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Congratulations!!! You have now learned how to split a dataset into training and testing sets.

Saturday, November 18, 2017

Data Preprocessing (Encoding Categorical Data)

After learning how to take care of missing values in the previous tutorial, we now encode the categorical data. The first column (Country) is a text field and a categorical variable, so we will have to label encode it and also one-hot encode it to make sure we are not introducing any artificial ordering (hierarchy).


For this, we will need to import the OneHotEncoder class and the ColumnTransformer class into our code. So let's import these two first.
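A minimal sketch of those imports (OneHotEncoder lives in sklearn.preprocessing and ColumnTransformer in sklearn.compose):
#importing the encoding classes
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer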


Next, we have to create an object of the ColumnTransformer class. But before we can do that, we need to understand the constructor signature of the class. The ColumnTransformer constructor takes quite a few arguments, but we're only interested in two. The first argument is called transformers, which is a list of tuples. Each tuple has the following elements, in this order:
  • name: a name for the column transformer, which will make setting of parameters and searching of the transformer easy.
  • transformer: here we are supposed to provide an estimator. We can also pass "passthrough" or "drop" if we want. But since we are encoding the data in this example, we will use the OneHotEncoder here. Remember that the estimator you use here needs to support fit and transform.
  • column(s): the list of columns which you want to be transformed. In this case, we will only transform the first column.
The second parameter we are interested in is remainder. This tells the transformer what to do with the other columns in the dataset. By default, only the columns which are transformed are returned by the transformer, and all other columns are dropped. But we have the option to tell the transformer what to do with the other columns: we can drop them, pass them through unchanged, or specify another estimator if we want to do some more processing.

Now that we understand the signature of the constructor, let’s go ahead and create an object:
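Here is a sketch of that object; the name "myencoder" and the column index follow the description in the text below.
#creating the column transformer: encode only the first column, keep the rest unchanged
columnTransformer = ColumnTransformer(transformers=[('myencoder', OneHotEncoder(), [0])], remainder='passthrough')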


As you can see from the snippet above, we name the transformer simply "myencoder". We are using the OneHotEncoder() constructor to provide a new instance as the estimator, and we are specifying that only the first column has to be transformed. We are also making sure that the remaining columns are passed through without any changes.

Once we have constructed this columnTransformer object, we have to fit and transform the dataset to label encode and one-hot encode the column. For this, we will use the following simple command:
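A sketch of that step (assuming X is the feature matrix prepared in the earlier tutorials):
#label encode and one hot encode the first column of X
X = columnTransformer.fit_transform(X)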



And for the "y" data (Purchased), we can only use LabelEncoder library to label encode from categorical data to number.


Here, we will use the following simple command:
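A minimal sketch (LabelEncoder also comes from sklearn.preprocessing):
#label encoding the dependent variable
from sklearn.preprocessing import LabelEncoder
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)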



Perfect. See you in the next tutorial.


Friday, November 17, 2017

Data Preprocessing (Taking Care Of Missing Values)

The first problem that we have to deal with is the case where we have some missing data in our dataset, which actually happens quite a lot in real life. So, we have to learn the trick for handling this problem and making the data ready for our machine learning model to run correctly. If you still have problems importing the dataset, see the previous tutorial. And here is the dataset.


As we can see, there are two missing values in the dataset: one missing value in the Age column for Spain and one missing value in the Salary column for Germany. So, we need to figure out a good way to handle this problem, and the most common approach is to replace a missing value with the mean of its column.

So, as usual, we are going to use a library to do this job for us. The library that we are going to use here is scikit-learn's preprocessing module, from which we import the Imputer class.


sklearn (scikit-learn) contains amazing libraries for building machine learning models, and its preprocessing module contains a lot of classes and methods for preprocessing any dataset. From this module we import the Imputer class, which will allow us to take care of those missing values.
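
A minimal sketch of those steps (using the Imputer class that scikit-learn shipped at the time; newer versions provide SimpleImputer in sklearn.impute instead):
#taking care of missing data by replacing NaN with the column mean
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])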

imputer.fit(X[:, 1:3]) means we fit only the columns that contain missing values, i.e. the columns at index 1 and 2. Why do we write 3? Because in Python slicing, the upper bound is excluded.

imputer.transform(X[:, 1:3]) means we replace the NaN values with the mean values using the transform() method.
Here is the result:

Congratulations, now you know how to take care of missing values in Python. You can have fun and try another strategy, such as 'median' or 'most_frequent'. See you in the next tutorial.

Wednesday, November 15, 2017

Data Preprocessing (Importing Dataset)

So, as I explained in the previous tutorial, the best library for importing the dataset is pandas. We are going to declare a new variable that will hold the dataset itself, simply called "dataset".
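
A minimal sketch of that step (with pandas imported under its usual alias):
#importing dataset
import pandas as pd
dataset = pd.read_csv('Data.csv')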


We use the shortcut "pd", which is the alias for pandas, and the read_csv() method, as in the code above. The dataset file is Data.csv.


So, we have four columns: Country, Age, Salary, and Purchased, and we have ten observations (rows). You have to understand that indexing in Python starts at zero.

There is something very important for understanding machine learning in Python: we have a dataset, but we need to distinguish between the matrix of features and the dependent variable vector. We are going to create the matrix of the three independent variables, simply called "X". We also create the dependent variable vector, "y", which is the last column with the ten observations.



Below is how to write the code:
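A minimal sketch of those two lines (matching the explanation that follows):
#creating the matrix of features X and the dependent variable vector y
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values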


for variable "X" (independent variable), we take all the lines of data and -1 means left the last column. So, only the first three column. for variable "y" (dependent variable), 3 means only get column index three.



OK, we have imported the dataset and prepared the data correctly. See you in the next tutorial.
