All About Using Python & R for Machine Learning, Data Science, Data Analyst, Deep Learning, Artificial Intelligence

Saturday, November 18, 2017

Data Preprocessing (Encoding Categorical Data)

After learning how to take care of missing values in the previous tutorial, we now start to encode the categorical data. The first column (Country) we have here is a text field and a categorical variable. So we have to convert it to numbers, and we one hot encode it (rather than only label encoding it) to be sure we will not be working with any hierarchy among the countries.


For this, we need to import the OneHotEncoder class and the ColumnTransformer class in our code. So let’s import these two first:
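A minimal sketch of these imports, assuming scikit-learn 0.20 or later (the release that introduced ColumnTransformer):

```python
# ColumnTransformer lives in sklearn.compose, OneHotEncoder in sklearn.preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
```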


Next, we have to create an object of the ColumnTransformer class. But before we can do that, we need to understand the constructor signature of the class. The ColumnTransformer constructor takes quite a few arguments, but we’re only interested in two. The first argument is called transformers, and it is a list of tuples. Each tuple has the following elements, in this order:
  • name: a name for the column transformer, which will make setting of parameters and searching of the transformer easy.
  • transformer: here we are supposed to provide an estimator. We can also pass the string "passthrough" or "drop" instead if we want. But since we are encoding the data in this example, we will use the OneHotEncoder here. Remember that the estimator you use here needs to support fit and transform.
  • column(s): the list of columns which you want to be transformed. In this case, we will only transform the first column.
The second parameter we are interested in is remainder. This will tell the transformer what to do with the other columns in the dataset. By default, only the columns which are transformed will be returned by the transformer. All other columns will be dropped. But we have the option to tell the transformer what to do with the other columns. We can either drop them, pass them through unchanged, or specify another estimator if we want to do some more processing.

Now that we understand the signature of the constructor, let’s go ahead and create an object:
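As a sketch of what this construction could look like (the variable name columnTransformer is taken from the text that follows; the column index 0 assumes Country is the first column):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

columnTransformer = ColumnTransformer(
    # (name, transformer, columns): one hot encode only the first column
    transformers=[("myencoder", OneHotEncoder(), [0])],
    # keep every other column unchanged instead of dropping it
    remainder="passthrough",
)
```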


As you can see from the snippet above, we will name the transformer simply "myencoder". We are using the OneHotEncoder() constructor to provide a new instance as the estimator. And then we are specifying that only the first column has to be transformed. We are also making sure that the remaining columns are passed through without any changes.

Once we have constructed this columnTransformer object, we have to fit and transform the dataset to one hot encode the column. For this, we will use the following simple command:
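A small self-contained sketch of this step. The sample rows below (Country, Age, Salary) are invented for illustration and are not the tutorial's actual dataset:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature matrix: Country, Age, Salary (values are made up)
X = np.array([
    ["France", 44, 72000],
    ["Spain", 27, 48000],
    ["Germany", 30, 54000],
    ["Spain", 38, 61000],
], dtype=object)

columnTransformer = ColumnTransformer(
    transformers=[("myencoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)

# Learn the country categories and encode them in one call
X_encoded = columnTransformer.fit_transform(X)

# The three one hot columns come first (categories sorted alphabetically:
# France, Germany, Spain), followed by the untouched Age and Salary columns
print(X_encoded.shape)
print(X_encoded[0])
```

The single Country column becomes one column per country, so no country is given a larger number than another.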



And for the "y" data (Purchased), we only need the LabelEncoder class to convert the categorical labels to numbers.


Here, we will use the following simple command:
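A minimal sketch of the import and the command together. The "Yes"/"No" values for Purchased are an assumption made for illustration, and the variable name labelEncoder is hypothetical:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical target column (Purchased); the values are made up
y = np.array(["No", "Yes", "No", "Yes"])

labelEncoder = LabelEncoder()
# Classes are sorted alphabetically, so "No" becomes 0 and "Yes" becomes 1
y_encoded = labelEncoder.fit_transform(y)

print(labelEncoder.classes_)
print(y_encoded)
```

Label encoding is enough here because the target has only two classes, so the 0/1 codes do not introduce a misleading order.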



Perfect, see you on the next tutorial.
