All About Using Python & R for Machine Learning, Data Science, Data Analysis, Deep Learning, and Artificial Intelligence

Friday, November 24, 2017

Implementing Logistic Regression Using Python

In the last tutorial, we implemented Simple Linear Regression to fit a model. Today, we will learn about one of the classification algorithms, Logistic Regression, using Python. Every machine learning algorithm works best under a given set of conditions, and making sure your algorithm fits its assumptions and requirements ensures superior performance. You cannot use any algorithm under any condition. For example, we cannot use linear regression on a categorical dependent variable: we would end up with extremely low values of adjusted R² and the F statistic. Instead, in such situations, we should try algorithms such as Logistic Regression, Decision Trees, Support Vector Machines, Random Forests, etc.


Brief of Logistic Regression

Logistic Regression is one of the most popular ways to fit models for categorical data, especially for binary response data in data modeling. It is the most important (and probably most used) member of the class of models called generalized linear models. Unlike linear regression, logistic regression can directly predict probabilities (values restricted to the (0, 1) interval). Furthermore, those probabilities are well-calibrated compared to the probabilities predicted by some other classifiers, such as Naive Bayes. Logistic regression also preserves the marginal probabilities of the training data, and the coefficients of the model provide some hint of the relative importance of each input variable.


Logistic Regression is used when the dependent variable is categorical. For example: to predict whether a tumor is malignant (1) or not (0), or to predict whether an email is spam (1) or not (0).
Logistic Regression is generally used where the dependent variable is binary or dichotomous, meaning it can take only two possible values such as “Yes or No”, “Default or No Default”, “Living or Dead”, etc. The independent factors or variables can be categorical or numerical.
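
Under the hood, logistic regression passes a linear combination of the inputs through the sigmoid (logistic) function, which squashes any real number into the (0, 1) interval so the output can be read as a probability. Here is a minimal sketch; the coefficient values are made up, just for illustration:
import numpy as np

def sigmoid(z):
    #squashes any real number into the (0, 1) interval
    return 1 / (1 + np.exp(-z))

#hypothetical coefficients: intercept b0 and weight b1 for a single feature x
b0, b1 = -3.0, 0.8
x = 5.0
probability = sigmoid(b0 + b1 * x)   #about 0.73, so predict class 1 if we use a 0.5 threshold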

Let’s code!
In this tutorial we use the 'Social_Network_Ads.csv' dataset.
#importing library
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#importing dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
So, just a quick reminder: this dataset contains information about the users of a social network, such as user id, gender, age, estimated salary, and purchased.


The social network has several business clients that can put their ads on it. One of these clients is a car company that has just launched a brand new luxury SUV at a ridiculous price, and we are trying to see which users of the social network are going to buy it. We are going to build a model that predicts whether a user buys the product or not based on two variables, "age" and "estimated salary". In other words, we want to find some correlation between the age and estimated salary of a user and his or her decision to purchase.
X = dataset.iloc[:, [2,3]].values
y = dataset.iloc[:, 4].values

#splitting dataset into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

The X variable contains the Age and Estimated Salary columns.

The y variable contains the Purchased column.

Because the age variable and the salary variable are not on the same scale, this can cause issues in many machine learning models. So, we need to do feature scaling.
#feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
(Screenshots: X_train and X_test before and after scaling.)

Now, we can start fitting Logistic Regression to our training set.
#fitting logistic regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
So, our classifier learns the correlations between X_train and y_train. By learning those correlations, it will be able to make predictions, and we can test its predictive power on a different set: the test set.

After fitting the classifier, we are going to predict the test set results.
#predicting the test set results
y_pred = classifier.predict(X_test)
(Screenshots: X_test alongside the predicted values y_pred.)

So, let's create the confusion matrix to see the correct and incorrect predictions.
#making the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

The result shows that we have 65 + 24 = 89 correct predictions and 8 + 3 = 11 incorrect predictions. That's good, and it is a first step in evaluating the model's performance.
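
If you want a single number summarizing this, you can also compute the accuracy, either by hand from the confusion matrix or with scikit-learn's accuracy_score (an optional extra, not required for the rest of the tutorial):
#optional: overall accuracy of the classifier
from sklearn.metrics import accuracy_score
print(cm)
print(accuracy_score(y_test, y_pred))   #(65 + 24) / 100 = 0.89 on this split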

Now, we are going to make a graph to clearly see the regions where our logistic regression model predicts "Yes" and where it predicts "No".
#visualizing the training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()


Then, we visualize the test set:
#visualizing the test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

So, we can see that the prediction regions fit the real values well. Congratulations, you implemented your first classification model using Python. See you in the next tutorial.


Wednesday, November 22, 2017

Practical: Simple Linear Regression Using Python

We have learned all the important data preprocessing steps; the last one we covered was feature scaling. Today, we will learn about Simple Linear Regression (SLR).

Adapting from Wikipedia, simple linear regression is a linear regression model with a single explanatory variable. SLR is useful for finding the relationship between two continuous variables: one is the predictor (independent variable) and the other is the response (dependent variable). It looks for a statistical relationship rather than a deterministic one. The relationship between two variables is deterministic if one variable can be accurately expressed in terms of the other; for example, given a temperature in degrees Celsius it is possible to compute the temperature in Fahrenheit exactly. A statistical relationship is not exact; for example, the relationship between height and weight.

The core idea is to obtain a line that best fits the data. The best fit line is the one for which the total prediction error (over all data points) is as small as possible, where the error is the distance from a point to the regression line.


   
So, let's look at simple linear regression, because it is the easiest one to discuss, with a specific example where we have experience and salary. Salary is on the vertical axis, and we want to understand how people's salary depends on their experience.


Let's code:
We paste the code from the data preprocessing template. The first part is importing the libraries:
#importing library
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Then we import the dataset using the pandas library:
#importing dataset
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
We can see that the dataset has 30 rows (observations) and contains the YearsExperience and Salary columns. X is YearsExperience (the independent variable) and y is Salary (the dependent variable).
Then, we split the dataset into training and testing data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
After we split the dataset, we fit the training data using linear regression. Don't worry, there is a library to handle it:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
#predicting the test result
y_pred = regressor.predict(X_test)
Now, it's time to visualize the training set result:
plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Years of Experience vs Salary (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
The real values are the red observation points; these are the actual salaries and years of experience of the employees. The predicted values lie on the blue simple regression line. We get some fairly accurate predictions: the real salaries are close to the predicted salaries.
We write the same code to visualize the test set results, swapping in the test data:
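A sketch of that code; note that the blue regression line still comes from the regressor fitted on the training set:
#visualizing the test set results
plt.scatter(X_test, y_test, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Years of Experience vs Salary (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()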
Amazing. It is easy to practice simple linear regression, isn't it? :-)

Tuesday, November 21, 2017

Data Preprocessing (Feature Scaling)

After learning about splitting the dataset into training and testing sets in the previous tutorial, today we are going to cover the last preprocessing section, feature scaling, which is very important in machine learning. Let's learn what feature scaling is and why we need to do it.


As we can see, we have two columns, age and salary, that contain numerical variables. You will notice that the variables are not on the same scale: the ages go from 27 to 50 while the salaries go from around 40,000 to 90,000. Because the age variable and the salary variable do not have the same scale, this can cause issues in many machine learning models. A lot of machine learning models are based on what is called Euclidean distance, which between two data points is the square root of the sum of the squared differences of their coordinates.


The same applies here. We have two variables, age and salary, so we can treat age as the x coordinate and salary as the y coordinate; many machine learning models then compute Euclidean distances between observation points.
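
For example, the Euclidean distance between two hypothetical customers can be computed like this; notice how the salary difference completely dominates the result:
import numpy as np

#two hypothetical observations: (age, salary)
p1 = np.array([48.0, 79000.0])
p2 = np.array([27.0, 48000.0])

#square root of the sum of the squared coordinate differences
distance = np.sqrt(np.sum((p1 - p2) ** 2))
print(distance)   #about 31000, driven almost entirely by the salary difference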


There are several ways of scaling our data. A very common one is standardization, which means that for each feature we subtract the mean of all the values of that feature and divide by the standard deviation. Another type is normalization, which means that we subtract the minimum of all the feature values from each observation and divide by the difference between the maximum and minimum of the feature values.
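
In formulas: standardization computes x' = (x - mean) / standard deviation, while normalization (min-max scaling) computes x' = (x - min) / (max - min). A quick sketch of both with NumPy; the values here are just placeholders:
import numpy as np

ages = np.array([27.0, 35.0, 44.0, 50.0])   #placeholder values

#standardization: subtract the mean, divide by the standard deviation
standardized = (ages - ages.mean()) / ages.std()

#normalization (min-max scaling): subtract the minimum, divide by the range
normalized = (ages - ages.min()) / (ages.max() - ages.min())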

In short, feature scaling puts the variables in the same range and on the same scale.
Let’s code in Python:
#feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
(Screenshots: X_train and X_test before and after scaling.)

Congratulations! We have learned the concept of feature scaling and how to do it in Python.

Sunday, November 19, 2017

Data Preprocessing (Splitting Dataset)

In the previous tutorial, we encoded the categorical data from the first column ('Country') and created dummy variables. This makes the data convenient and fast to compute with for machine learning.

Today, we will split the dataset into a training set and a testing set. In statistics and machine learning we usually split our data into two subsets, training data and testing data (and sometimes three: train, validate, and test), and fit our model on the training data in order to make predictions on the test data. When we do that, one of two things might happen: we overfit our model or we underfit our model. We don't want either of these to happen, because they affect the predictability of our model: we might end up with a model that has lower accuracy and/or is ungeneralized (meaning you can't generalize your predictions to other data). Let's see what underfitting and overfitting actually mean:

Overfitting
Overfitting means that the model we trained has trained "too well" and fits too closely to the training dataset. This usually happens when the model is too complex (i.e. too many features/variables compared to the number of observations). Such a model will be very accurate on the training data but will probably be very inaccurate on untrained or new data, because it is not generalized. Basically, when this happens, the model learns or describes the "noise" in the training data instead of the actual relationships between variables in the data. This noise, obviously, isn't part of any new dataset and cannot be applied to it.

Underfitting
In contrast to overfitting, when a model is underfitted, it means that the model does not fit the training data and therefore misses the trends in the data. It also means the model cannot be generalized to new data. As you probably guessed (or figured out!), this is usually the result of a very simple model (not enough predictors/independent variables). It could also happen when, for example, we fit a linear model (like linear regression) to data that is not linear. It almost goes without saying that this model will have poor predictive ability.


It is worth noting that underfitting is not as prevalent as overfitting. Nevertheless, we want to avoid both of these problems in data analysis. You might say we are trying to find the middle ground between underfitting and overfitting our model. As you will see, the train/test split and cross-validation help to avoid overfitting more than underfitting. Let's dive into both of them!

Train/Test Split
As I said before, the data we use is usually split into training data and test data. The training set contains a known output and the model learns on this data in order to be generalized to other data later on. We have the test dataset (or subset) in order to test our model’s prediction on this subset.


Let’s see how to do that. We’ll do this using the Scikit-Learn library and specifically the train_test_split method.
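
A sketch of that code, using the X and y arrays prepared in the previous tutorials (random_state=0 is just one possible seed):
#splitting the dataset into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)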

 
The test_size=0.2 inside the function indicates the percentage of the data that should be held out for testing: roughly 80% for training data and 20% for testing data. random_state is the seed used by the random number generator. If a RandomState instance is passed, random_state is the random number generator itself; if None, the random number generator is the RandomState instance used by np.random.

Congratulations! Now you know how to split a dataset into training and testing sets.

Saturday, November 18, 2017

Data Preprocessing (Encoding Categorical Data)

After learning how to take care of missing values in the previous tutorial, we can start to encode the categorical data. The first column (Country) is a text field and is a categorical variable, so we will have to label encode it and also one hot encode it to make sure we are not introducing any artificial ordering (hierarchy).


For this, we will need to import the OneHotEncoder class and the ColumnTransformer class. So let's import these two first:
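A sketch of those imports; both classes live in scikit-learn:
#importing the ColumnTransformer and OneHotEncoder classes
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder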


Next, we have to create an object of the ColumnTransformer class. But before we can do that, we need to understand the constructor signature of the class. The ColumnTransformer constructor takes quite a few arguments, but we're only interested in two. The first argument is transformers, which is a list of tuples. Each tuple has the following elements, in order:
  • name: a name for the column transformer, which will make setting of parameters and searching of the transformer easy.
  • transformer: here we are supposed to provide an estimator. We can also just "passthrough" or "drop" if we want. But since we are encoding the data in this example, we will use the OneHotEncoder here. Remember that the estimator you use here needs to support fit and transform.
  • column(s): the list of columns which you want to be transformed. In this case, we will only transform the first column.
The second parameter we are interested in is remainder. This will tell the transformer what to do with the other columns in the dataset. By default, only the columns which are transformed will be returned by the transformer. All other columns will be dropped. But we have the option to tell the transformer what to do with the other columns. We can either drop them, pass them through unchanged, or specify another estimator if we want to do some more processing.

Now that we understand the signature of the constructor, let’s go ahead and create an object:
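A sketch of that object, matching the description below (the variable name ct is my choice here):
#creating the column transformer: one hot encode column 0, pass the rest through
ct = ColumnTransformer(transformers=[('myencoder', OneHotEncoder(), [0])],
                       remainder='passthrough')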


As you can see from the snippet above, we name the transformer simply "myencoder". We use the OneHotEncoder() constructor to provide a new instance as the estimator, and we specify that only the first column has to be transformed. We also make sure that the remaining columns are passed through without any changes.

Once we have constructed this ColumnTransformer object, we have to fit and transform the dataset to label encode and one hot encode the column. For this, we will use the following simple command:
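A sketch of that command, assuming the transformer object is called ct as above (np is the numpy alias from our template):
#fitting and transforming the matrix of features
X = np.array(ct.fit_transform(X))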



And for the "y" data (Purchased), we can only use LabelEncoder library to label encode from categorical data to number.


Here, we will use the following simple command:
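A sketch of that command:
#label encoding the dependent variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)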



Perfect, see you on the next tutorial.


Friday, November 17, 2017

Data Preprocessing (Taking Care Of Missing Values)

The first problem that we have to deal with is the case where we have some missing data in our dataset, which happens quite a lot in real life. So, we have to learn the trick to handle this problem and make the data good for our machine learning model to run correctly. If you still have problems importing the dataset, see the previous tutorial. Here is the dataset.


As we can see, there are two missing values in the dataset: one missing value in the age column for Spain and one missing value in the salary column for Germany. So, we need to figure out a good way to handle this problem, and the most common approach is to replace missing data with the mean of the column.

So, as usual, we are going to use a library to do this job for us. The one we are going to use is scikit-learn's preprocessing module, from which we import the Imputer class.
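A sketch of that code, using the Imputer class that scikit-learn provided at the time (newer versions replace it with SimpleImputer from sklearn.impute):
#taking care of missing data with the column mean
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])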


sklearn (scikit-learn) contains amazing libraries for building machine learning models, and its preprocessing module contains many classes and methods for preprocessing any dataset. From this module we import the Imputer class, which allows us to take care of those missing values.

imputer.fit(X[:, 1:3]) means we fit the imputer only on the columns that contain missing values, i.e. the columns with index 1 and 2. Why do we write 3? Because in Python, the upper bound is excluded.

imputer.transform(X[:, 1:3]) means we replace the NaN values with the mean values using the transform() method.
Here is the result:

  
Congratulations, now you know how to take care of missing values in Python. You can have fun and try another strategy such as median or most_frequent. See you in the next tutorial.

Wednesday, November 15, 2017

Data Preprocessing (Importing Dataset)

So, as I explained in the previous tutorial, the best library for importing the dataset is pandas. We are going to declare a new variable that will hold the dataset itself and simply call it "dataset".
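A sketch of that code:
#importing the dataset (pd is the pandas alias from the previous tutorial)
import pandas as pd
dataset = pd.read_csv('Data.csv')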


We use the shortcut "pd", which is the alias for pandas, and its read_csv() method, as in the code above. The dataset file is Data.csv.


So, we have four columns: Country, Age, Salary, and Purchased. We also have ten observations (rows). Keep in mind that indexing in Python starts at zero.

There is something very important to understand for machine learning in Python: we have a dataset, but we need to distinguish the matrix of features from the dependent variable vector. We are going to create the matrix of the three independent variables and simply call it "X". We also create the dependent variable vector, which is going to be the last column with the ten observations.



Below is how to write the code:
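A sketch of that code, matching the explanation below:
#creating the matrix of features X and the dependent variable vector y
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values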


for variable "X" (independent variable), we take all the lines of data and -1 means left the last column. So, only the first three column. for variable "y" (dependent variable), 3 means only get column index three.



OK, we have imported the dataset and prepared the data correctly. See you in the next tutorial.

Tuesday, November 14, 2017

Data Preprocessing (Importing The Libraries)

So, here we will create the data preprocessing file using the Spyder IDE. If you have problems installing Anaconda, see the previous tutorial. To open Spyder, you can click the Windows Start button, select Anaconda3 (64-bit), then Spyder, or type "Spyder" in the Windows search menu and click on it. Spyder will look something like this:


Save the file as data_preprocessing_template.py. We created a new Python file called data preprocessing template because we are starting to build our template. The first step is importing the libraries.

What is a library?
A library is a tool that you can use to do a specific job: you just have to give it inputs and the library will do the job and return some outputs. We are going to rely on libraries throughout these tutorials to make our machine learning models as efficient as possible.

We are going to use many libraries during these tutorials, but there are three essential ones that we will use every time. The first step of data preprocessing is importing these three essential libraries.
  1. numpy
  2. matplotlib.pyplot
  3. pandas
Here is how to write the code:
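The three imports, with the aliases np, plt, and pd that we will use throughout the rest of the tutorials:
#importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd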


numpy is a library that contains mathematical tools. Basically, this is the library we need to include any type of mathematics in our code. Since machine learning models are based on mathematics, we will absolutely need numpy.

matplotlib.pyplot is a library that is going to help us plot nice charts. It contains very intuitive and useful tools.

pandas is a library for importing and managing datasets. It contains very intuitive tools as well. We are going to use it every time we import and manage our datasets.

To run the code, select (highlight) all of it and then press Ctrl + Enter.


These are three essential libraries that we need. So, I look forward to seeing you in the next tutorial.

Monday, November 13, 2017

Installing Python & Anaconda

Before we discuss about machine learning, I recommend installing and using the Anaconda distribution of Python. This section details the installation of the Anaconda distribution of Python on Windows 10. Anaconda is free (although the download is large which can take time) and can be installed on school or work computers where you don't have administrator access or the ability to install new programs. Anaconda comes bundled with about 600 packages pre-installed including NumPy, Matplotlib and Pandas. These three packages are very useful for machine learning and will be discussed in subsequent chapters.

Follow the steps below to install the Anaconda distribution of Python on Windows.
Steps:
  1. Visit anaconda.com/downloads
  2. Select Windows
  3. Download the .exe installer
  4. Open and run the .exe installer
  5. Open the Anaconda Prompt and run some python code
1. Visit the anaconda download page
Go to the following link anaconda.com/downloads
The anaconda download page will look something like this:


2. Select Windows
Select Windows where the three operating systems are listed.

3. Download
Download the most recent Python 3 release. At the time of writing, the most recent release was the Python 3.6 Version. Python 2.7 is legacy Python. For problem solvers, select the Python 3.6 version. If you are unsure if your computer is running a 64-bit or 32-bit version of Windows, select 64-bit as 64-bit Windows is most common.

You may be prompted to enter your email. You can still download Anaconda if you click [No Thanks] and don't enter your Work Email address.


The download is quite large (over 500 MB), so it may take a while for Anaconda to download.


4. Open and run the installer
Once the download completes, open and run the .exe installer.


At the beginning of the install, you need to click Next to confirm the installation.


Then agree to the license.


At the Advanced Installation Options screen, I recommend that you do not check "Add Anaconda to my PATH environment variable"


5. Open the Anaconda Prompt from the Windows start menu
After the installation of Anaconda is complete, you can go to the Windows start menu and select the Anaconda Prompt.


This opens the Anaconda Prompt. Anaconda is the Python distribution and the Anaconda Prompt is a command line shell (a program where you type in commands instead of using a mouse). The black screen and text that makes up the Anaconda Prompt doesn't look like much, but it is really helpful for problem solvers using Python.

At the Anaconda prompt, type 'python' and hit [Enter]. The python command starts the Python interpreter, also called the Python REPL (for Read Evaluate Print Loop).


Note the Python version. You should see something like Python 3.6.1. With the interpreter running, you will see a set of greater-than symbols >>> before the cursor.


To close the Python interpreter, type exit() at the >>> prompt. Note the parentheses at the end of the exit() command; the () is needed to stop the Python interpreter and get back out to the Anaconda Prompt.

To close the Anaconda Prompt, you can either close the window with the mouse or type exit, no parentheses necessary.

When you want to use the Python interpreter again, just click the Windows Start button and select the Anaconda3 (64-bit), then Anaconda Prompt, Jupyter or Spyder.

Saturday, November 11, 2017

Facebook Data Collection with R

Continuing from my previous post, this post will introduce how the data is collected, cleaned, and analyzed.
1. Collecting Data
We want to extract a list of posts from a United Airlines fan page with n = 50.

Facebook has more than a like button: last year, it launched emoji reactions (emoticons). If a post got 1,000 likes, it does not mean everyone really loves the post; the reaction can be happy, sad, or angry. In the code above, I used the "plotly" package to visualize the reactions in an interactive graph. The result is as below:

(Figure: post reactions plot)

2. Cleaning Data
After getting the comments as data, the next step is creating a corpus and removing extra spaces, stopwords, special characters, and other unwanted things using the "tm" package.


3. Analyzing Data
a. Creating Term Document Matrix
A document-term matrix describes the frequency of terms that occur in a collection of documents. Rows correspond to documents in the collection and columns correspond to terms.
Here are the 2,219 extracted words with their frequencies.
(Figure: extracted terms with frequencies)

b. Creating Wordcloud
A word cloud is a visual representation of text data, typically used to depict keyword metadata (tags) on websites or to visualize free-form text. The importance of each tag is shown with font size or color. To create a word cloud in R, we use the "wordcloud" package.

(Figure: word cloud)

The next post will be about Sentiment Analysis.
