All About Using Python & R for Machine Learning, Data Science, Data Analyst, Deep Learning, Artificial Intelligence

Friday, November 17, 2017

Data Preprocessing (Taking Care Of Missing Values)

The first problem we have to deal with is missing data in our dataset, which happens quite often in real life. We need a way to handle it so that our machine learning model can run correctly. If you still have trouble importing the dataset, see the previous tutorial. Here is the dataset:


As we can see, there are two missing values in the dataset: one in the age column for Spain and one in the salary column for Germany. So we need a way to fill them in. The most common approach is to replace each missing value with the mean of its column.
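
To make the idea concrete before bringing in a library, here is a minimal sketch of mean replacement done by hand with NumPy. The numbers are made up for illustration and are not the tutorial's dataset; the variable names are likewise only assumptions.

```python
import numpy as np

# Made-up numbers for illustration only; not the tutorial's dataset.
# Columns: age, salary. np.nan marks a missing entry.
data = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, np.nan],
    [np.nan, 61000.0],
])

# np.nanmean skips NaN entries when averaging each column (axis=0).
column_means = np.nanmean(data, axis=0)

# Keep existing values and put the column mean wherever a value is missing.
filled = np.where(np.isnan(data), column_means, data)
print(filled)
```

Each missing entry ends up with the average of the values that are present in the same column, which is what the imputer does for us in the steps that follow.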

As usual, we are going to use a library to do this job for us: the preprocessing library of scikit-learn, from which we import the Imputer class.


sklearn is scikit-learn, which contains many libraries for building machine learning models. Its preprocessing library contains many classes and methods for preprocessing a dataset. From this library we import the Imputer class, which allows us to take care of the missing values.
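
In text form, the import and setup might look like the sketch below. One caveat: the Imputer class in sklearn.preprocessing was removed in scikit-learn 0.22, and SimpleImputer in sklearn.impute is its replacement. The sketch assumes a recent scikit-learn and shows the older form only as a comment, so the exact arguments used in the original screenshot remain an assumption.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Older scikit-learn releases (before 0.22) offered the class this tutorial names:
#     from sklearn.preprocessing import Imputer
#     imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
# Assumption: a recent scikit-learn, where SimpleImputer plays the same role.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
print(imputer)
```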

imputer.fit(X[:, 1:3]) fits the imputer only on the columns that contain missing values, which are the columns at index 1 and 2. We write 3 as the upper bound because in Python the upper bound of a slice is excluded.
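
The slicing and fitting step can be sketched on a small made-up array as follows. The rows are invented for illustration, and SimpleImputer stands in for the older Imputer class on the assumption that a recent scikit-learn is installed.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up rows (country, age, salary) for illustration only.
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", np.nan, 48000.0],
    ["Germany", 30.0, np.nan],
    ["Spain", 38.0, 61000.0],
], dtype=object)

# 1:3 selects columns 1 and 2; the upper bound 3 is excluded.
numeric_part = X[:, 1:3]
print(numeric_part.shape)  # (4, 2)

# fit() only learns the column means; it does not change X.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(numeric_part)
print(imputer.statistics_)  # one learned mean per selected column
```

After fitting, the learned means are stored in the statistics_ attribute, one per column, ready to be used by transform().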

imputer.transform(X[:, 1:3]) replaces each NaN value with the mean of its column, using the transform() method.
Here is the result:
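
For readers who want to reproduce this step as text, the following sketch runs fit and transform on a small made-up array rather than the tutorial's dataset. The rows are invented, and SimpleImputer is assumed in place of the older Imputer class.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up rows (country, age, salary) for illustration only.
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", np.nan, 48000.0],
    ["Germany", 30.0, np.nan],
    ["Spain", 38.0, 61000.0],
], dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X[:, 1:3])

# transform() returns a new array, so assign it back into the same columns.
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)
```

Note that transform() does not modify X in place, which is why its return value is written back into columns 1 and 2.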

  
Congratulations, now you know how to take care of missing values in Python. You can have fun trying other strategies, such as median and most_frequent. See you in the next tutorial.
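
As a starting point for that experiment, here is a minimal sketch that compares the three strategies on one made-up column of ages. It assumes a recent scikit-learn with SimpleImputer; the data and variable names are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One made-up column of ages; the last entry is missing.
ages = np.array([[25.0], [30.0], [30.0], [47.0], [np.nan]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = imputer.fit_transform(ages)  # fit and transform in one call
    results[strategy] = float(filled[-1, 0])  # value that replaced the NaN

print(results)
# {'mean': 33.0, 'median': 30.0, 'most_frequent': 30.0}
```

The median and most_frequent strategies are less sensitive to an unusually large or small value than the mean, which can matter for columns such as salary.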