After learning how to split a dataset into a training set and a test set in the previous tutorial, today we are going to cover the last preprocessing topic: feature scaling, which is very important in machine learning. Let's learn what feature scaling is and why we need it.
As we can see, we have two columns, age and salary, that contain numerical variables. Notice that the variables are not on the same scale: the ages go from 27 to 50, while the salaries go from around 40,000 to 90,000. Because the age variable and the salary variable do not have the same scale, this will cause issues in your machine learning models. Many machine learning models are based on what is called the Euclidean distance. The Euclidean distance between two data points is the square root of the sum of the squared differences of their coordinates: sqrt((x2 - x1)^2 + (y2 - y1)^2).
That is exactly the situation here. We have two variables, age and salary, so we can treat age as the x coordinate and salary as the y coordinate, and the model computes the Euclidean distance between observation points. Since salary has a much wider range than age, the salary differences dominate the distance, and the model behaves as if age barely mattered, as the sketch below shows.
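To make this concrete, here is a minimal sketch (the two observations and their values are made up for illustration) showing how the unscaled salary term dominates the Euclidean distance:

import numpy as np

# two hypothetical observations: (age, salary)
p1 = np.array([27.0, 48000.0])
p2 = np.array([50.0, 83000.0])

# Euclidean distance: square root of the sum of squared coordinate differences
diff = p2 - p1
print(diff ** 2)                    # [5.29e2, 1.225e9] -- the salary term dwarfs the age term
print(np.sqrt(np.sum(diff ** 2)))  # ~35000.0, driven almost entirely by salary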
There are several ways of scaling our data. A very common one is standardization, where for each observation of a feature we subtract the mean of all the values of that feature and divide by the standard deviation: x_stand = (x - mean(x)) / std(x). Another type is normalization, where we subtract the minimum value of the feature from the observation x and divide by the difference between the maximum and the minimum of the feature values: x_norm = (x - min(x)) / (max(x) - min(x)).
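As a quick sketch (the ages below are made-up sample values), both formulas can be checked by hand with NumPy before we rely on scikit-learn:

import numpy as np

ages = np.array([27.0, 35.0, 44.0, 50.0])

# standardization: (x - mean) / standard deviation
standardized = (ages - ages.mean()) / ages.std()

# normalization (min-max scaling): (x - min) / (max - min)
normalized = (ages - ages.min()) / (ages.max() - ages.min())

print(standardized)  # centered around 0, roughly between -1.4 and 1.3
print(normalized)    # squashed into the [0, 1] range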
Feature scaling, then, is putting the variables into the same range and on the same scale, so that no single variable dominates the others.
Let’s code in Python:
# feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)  # fit the scaler on the training set, then transform it
X_test = sc_X.transform(X_test)        # transform the test set with the same fitted scaler
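Note that we call fit_transform on the training set but only transform on the test set: the scaler must learn the mean and standard deviation from the training data alone and then apply those same parameters to the test data, so no information leaks from the test set. If you prefer normalization instead of standardization, a minimal sketch (assuming the same X_train and X_test arrays as above) would swap in scikit-learn's MinMaxScaler:

# alternative: min-max normalization instead of standardization
from sklearn.preprocessing import MinMaxScaler
mm_X = MinMaxScaler()                  # scales each feature to the [0, 1] range
X_train = mm_X.fit_transform(X_train)  # learn min and max from the training set
X_test = mm_X.transform(X_test)        # reuse the training min and max on the test set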
Congratulations! We have learned the concept of feature scaling and how to apply it in Python.