All About Using Python & R for Machine Learning, Data Science, Data Analyst, Deep Learning, Artificial Intelligence

Sunday, November 19, 2017

Data Preprocessing (Splitting Dataset)

In the previous tutorial, we encoded the categorical data in the first column ('Country') and created dummy variables, which makes the data convenient and fast for machine learning algorithms to process.

Today, we will split the dataset into a training set and a testing set. In statistics and machine learning we usually split our data into two subsets, training data and testing data (and sometimes into three: train, validate and test). We fit our model on the training data and then make predictions on the test data. When we do that, one of two things might happen: we overfit our model or we underfit our model. We don’t want either of these to happen, because both hurt the predictive power of our model. We might end up with a model that has lower accuracy and/or does not generalize (meaning you can’t carry its predictions over to other data). Let’s see what underfitting and overfitting actually mean:

Overfitting
Overfitting means that the model we trained has fit the training dataset too closely. This usually happens when the model is too complex (i.e. too many features/variables compared to the number of observations). Such a model will be very accurate on the training data but will probably be inaccurate on unseen or new data, because it does not generalize. Basically, when this happens, the model learns or describes the “noise” in the training data instead of the actual relationships between variables in the data. This noise is not part of any new dataset, so whatever the model learned from it cannot be applied there.
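
To make this concrete, here is a minimal sketch using NumPy's polyfit on hypothetical toy data (the data, the seed and the polynomial degrees are all assumptions chosen for illustration). It fits a simple and a very flexible model to the same ten noisy points and compares the error on the training points with the error on new points.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the toy data is repeatable

# Hypothetical toy data: a straight-line relationship plus random noise.
x_train = np.linspace(0.0, 1.0, 10)
y_train = 2.0 * x_train + rng.normal(scale=0.2, size=x_train.size)
x_new = np.linspace(0.05, 0.95, 10)  # unseen points from the same range
y_new = 2.0 * x_new + rng.normal(scale=0.2, size=x_new.size)


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


# Degree 1 matches the true relationship. Degree 9 has as many
# coefficients as there are training points, so it can chase the noise.
simple_fit = np.polyfit(x_train, y_train, deg=1)
complex_fit = np.polyfit(x_train, y_train, deg=9)

for name, fit in [("degree 1", simple_fit), ("degree 9", complex_fit)]:
    print(name,
          "| train MSE:", mse(fit, x_train, y_train),
          "| new-data MSE:", mse(fit, x_new, y_new))
```

The degree 9 model passes through the training points almost exactly, so its training error is close to zero, yet its error on the new points is typically much larger. That gap between training error and new-data error is the usual sign of overfitting.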

Underfitting
In contrast to overfitting, when a model is underfitted, it means that the model does not fit the training data well and therefore misses the trends in the data. It also means the model cannot be generalized to new data. As you probably guessed (or figured out!), this is usually the result of an overly simple model (not enough predictors/independent variables). It could also happen when, for example, we fit a linear model (like linear regression) to data that is not linear. It almost goes without saying that such a model will have poor predictive ability.
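
The linear-model-on-non-linear-data case can be sketched the same way. The example below is only an illustration on hypothetical toy data (the quadratic relationship, noise level and seed are assumptions): a straight line is fitted to curved data and compared with a quadratic fit.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the toy data is repeatable

# Hypothetical toy data: a curved (quadratic) relationship plus a little noise.
x_train = np.linspace(-3.0, 3.0, 30)
y_train = x_train ** 2 + rng.normal(scale=0.2, size=x_train.size)
x_new = np.linspace(-2.9, 2.9, 30)  # unseen points from the same range
y_new = x_new ** 2 + rng.normal(scale=0.2, size=x_new.size)


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


line_fit = np.polyfit(x_train, y_train, deg=1)   # too simple for curved data
curve_fit = np.polyfit(x_train, y_train, deg=2)  # matches the shape of the data

line_train, line_new = mse(line_fit, x_train, y_train), mse(line_fit, x_new, y_new)
curve_train, curve_new = mse(curve_fit, x_train, y_train), mse(curve_fit, x_new, y_new)

print("line  | train MSE:", line_train, "| new-data MSE:", line_new)
print("curve | train MSE:", curve_train, "| new-data MSE:", curve_new)
```

Unlike the overfitted case, the straight line is poor on the training data and on the new data alike. A high error on the training set itself is what points to underfitting.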


It is worth noting that underfitting is not as prevalent as overfitting. Nevertheless, we want to avoid both of those problems in data analysis. You might say we are trying to find the middle ground between underfitting and overfitting our model. Train/test split and cross validation help to avoid overfitting more than underfitting. In this tutorial we focus on the train/test split.

Train/Test Split
As I said before, the data we use is usually split into training data and test data. The training set contains a known output, and the model learns from this data so that it can be generalized to other data later on. The test dataset (or subset) is held back so that we can check the model’s predictions on data it has not seen during training.


Let’s see how to do that. We’ll do this using the Scikit-Learn library, specifically the train_test_split function.
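
A minimal sketch of the split is shown here. The arrays X and y are small hypothetical stand-ins for the feature matrix and target prepared in the previous tutorial, and the value random_state=0 is likewise an assumption chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the real features and target:
# 10 observations with 2 features each, and one target value per observation.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (8,) (2,)
```

With 10 observations, 8 end up in the training set and 2 in the test set. The rows are shuffled before splitting, and X and y are split consistently so each feature row stays paired with its target value.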

 
The test_size=0.2 argument of train_test_split is the proportion of the data that should be held out for testing, so about 80% of the rows go to the training set and 20% to the testing set. random_state controls the shuffling applied before the split: if it is an integer, it is used as the seed of the random number generator; if it is a RandomState instance, that instance is the random number generator; if it is None, the random number generator is the RandomState instance used by np.random. Passing a fixed integer makes the split reproducible from run to run.
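
The effect of a fixed integer seed can be checked with a short sketch. The array and the seed value 42 below are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)  # hypothetical dataset of 10 observations

# Two separate calls with the same integer seed.
a_train, a_test = train_test_split(data, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=42)

# The same seed gives the same shuffle, hence identical subsets.
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print("Identical splits:", same_split)
```

Leaving random_state as None would instead draw a fresh shuffle on each call, so results measured on the test set could vary between runs.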

Congratulations! You have now learned how to split a dataset into a training set and a testing set.