We have covered the important data preprocessing steps, and the last tutorial covered feature scaling. Today, we will learn about Simple Linear Regression (SLR).
Adapting from Wikipedia, simple linear regression is a linear regression model with a single explanatory variable. SLR is useful for finding the relationship between two continuous variables: one is the predictor, or independent variable, and the other is the response, or dependent variable. It looks for a statistical relationship, not a deterministic one. The relationship between two variables is deterministic if one variable can be exactly expressed in terms of the other; for example, from a temperature in degrees Celsius it is possible to compute the temperature in Fahrenheit exactly. A statistical relationship is not exact; consider, for example, the relationship between height and weight.
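To make the distinction concrete, the Celsius-to-Fahrenheit conversion is deterministic: it can be written as a one-line function with no error term at all (a minimal illustration, separate from the tutorial's dataset):

```python
def celsius_to_fahrenheit(c):
    # Deterministic relationship: F is fully determined by C,
    # so "prediction" here is exact, with no residual error.
    return c * 9 / 5 + 32

print(celsius_to_fahrenheit(0))    # 32.0
print(celsius_to_fahrenheit(100))  # 212.0
```

No such exact formula exists for height versus weight; that is precisely where regression, which models the statistical trend plus noise, is useful.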
The core idea is to obtain the line that best fits the data. The best-fit line is the one for which the total prediction error, over all data points, is as small as possible. The error is the vertical distance between a point and the regression line.
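For ordinary least squares, "as small as possible" means minimizing the sum of squared errors, which has a closed-form solution. A minimal sketch with a tiny made-up dataset (the numbers are illustrative, not from the tutorial's CSV):

```python
import numpy as np

# Hypothetical data: years of experience vs. salary
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([40000.0, 45000.0, 52000.0, 58000.0, 65000.0])

# Closed-form least-squares estimates for the line y = b0 + b1 * x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b1, b0)  # slope and intercept of the best-fit line
```

These are exactly the values a library routine such as `np.polyfit(x, y, 1)` would return; scikit-learn computes the same fit for us below.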
So, let's look at simple linear regression, because it is the easiest one to discuss, with a specific example where we have experience and salary. Salary is on the vertical axis, and we want to understand how people's salaries depend on their experience.
Let's code:
We paste code from the data preprocessing template. The first step is importing the libraries:
# importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Then we import the dataset using the pandas library:
# importing the dataset
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
The dataset has 30 rows (observations) with a YearsExperience and a Salary column. That is why X is YearsExperience (the independent variable) and y is Salary (the dependent variable).
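The `iloc` slicing deserves a quick look: `[:, :-1]` keeps every column except the last (so X stays a 2-D array, which scikit-learn requires), while `[:, 1]` pulls the second column as a 1-D array. A small sketch with a hypothetical stand-in for the CSV (the real file has 30 rows):

```python
import pandas as pd

# Hypothetical stand-in for Salary_Data.csv (values are illustrative)
dataset = pd.DataFrame({
    'YearsExperience': [1.1, 2.0, 3.2, 4.5],
    'Salary': [39343.0, 43525.0, 54445.0, 61111.0],
})

X = dataset.iloc[:, :-1].values  # all columns but the last -> 2-D (n, 1)
y = dataset.iloc[:, 1].values    # second column -> 1-D (n,)

print(X.shape, y.shape)  # (4, 1) (4,)
```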
Then we split the dataset into training and testing sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

After we split the dataset, we fit a linear regression model to the training data. Don't worry, there is a library to handle it.
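With `test_size=1/3`, a third of the 30 observations (10 of them) go to the test set and the remaining 20 to the training set. A self-contained check on mock data with the same shape as the tutorial's:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 30 mock observations, like the tutorial's dataset (values are illustrative)
X = np.arange(30, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

print(len(X_train), len(X_test))  # 20 10
```

Fixing `random_state=0` makes the shuffle reproducible, so everyone following along gets the same split.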
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
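After fitting, the learned line lives in the `coef_` and `intercept_` attributes of the regressor. A sketch on synthetic data where we know the true line, so we can see the fit recover it (the 9000/30000 numbers are assumptions for the demo, not the tutorial's real coefficients):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: salary = 30000 + 9000 * years + noise
rng = np.random.default_rng(0)
X = np.linspace(1, 10, 30).reshape(-1, 1)
y = 30000 + 9000 * X.ravel() + rng.normal(0, 1000, size=30)

regressor = LinearRegression()
regressor.fit(X, y)

# The fitted line is y = intercept_ + coef_[0] * x
print(regressor.coef_[0], regressor.intercept_)  # close to 9000 and 30000
```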
# predicting the test set results
y_pred = regressor.predict(X_test)

Now, it's time to visualize the training set results:
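Before plotting, it can help to compare `y_pred` with `y_test` directly. A self-contained sketch, using noise-free synthetic data (an assumption for the demo) so the predictions land exactly on the true values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical noise-free data: salary = 25000 + 8000 * years
X = np.arange(1, 31, dtype=float).reshape(-1, 1)
y = 25000 + 8000 * X.ravel()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

regressor = LinearRegression().fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Side-by-side: actual test salaries vs. predicted ones
print(np.column_stack((y_test, y_pred)))
```

On real, noisy data the two columns will differ a little; the gap in each row is exactly the prediction error the best-fit line minimizes (in the squared sense).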
plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Years of Experience vs Salary (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
The red points are the real observations: the actual salaries and experience of the employees. The predicted values lie on the blue simple regression line. Some predictions are quite accurate; we can see that the real salaries are close to the predicted ones.
We write the same code to visualize the test set results, only swapping the training observations for the test ones:
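In the test-set plot, only the scattered points change to `X_test`/`y_test`; the blue line stays the one fitted on the training set. A self-contained sketch with synthetic stand-in data (the numbers are assumptions, since Salary_Data.csv may not be at hand):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Salary_Data.csv (values are illustrative)
rng = np.random.default_rng(1)
X = np.linspace(1, 10, 30).reshape(-1, 1)
y = 28000 + 9000 * X.ravel() + rng.normal(0, 3000, size=30)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)

# Scatter the *test* observations; the line is still the one
# learned from the training set, so this shows generalization
plt.scatter(X_test, y_test, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Years of Experience vs Salary (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
```

If the red test points sit close to the blue line, the model generalizes well to data it never saw during fitting.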
Amazing. It is easy to practice simple linear regression, isn't it? :-)