Data Science Project- PART 3
III. Machine learning using scikit-learn.
It is more important to understand how an algorithm works than simply to use it. For this reason, this part of the project explains linear regression before using it to build a predictive model.
Introduction to Linear Regression.
A linear model is one of the most basic yet powerful types of regression. A regression model is created using one dependent feature and one or more independent features. Features are the columns in our dataset. The aim is to find the line of best fit through a scatter plot of the data points. This line is then used to make future predictions given the independent variables. In simpler terms, the model is the line of best fit.
The best way to understand the whole concept is to start from scratch, with what a simple linear regression is. The figure below shows the simple linear regression model formula.
Yi is the value of the dependent variable for the ith trial (row)
β0 is the y-intercept parameter (constant term)
β1 is the slope parameter corresponding to Xi.
Xi is the value of the explanatory/independent/predictor variable in the ith trial (row)
ϵ is the error term: the part of Yi that the line does not explain. Its estimates from the fitted line are known as the residuals.
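Putting the terms above together, the simple linear regression model can be written in its standard textbook form (the subscript on the error term is an assumption that each row has its own error):

```latex
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
```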
The figure below shows a graphical interpretation of the simple linear regression. The best-fitting line can be found using the Ordinary Least Squares (OLS) formula, as shown in the figure below. OLS takes the difference between each actual value of y and its predicted value to get the residuals. The squares of the residuals are then summed. Conceptually, this is repeated for many candidate lines, and the line with the lowest sum of squared residuals is the best-fitting line.
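In practice the best line does not have to be searched for by trial and error, because for a single predictor OLS has a well-known closed-form solution. The sketch below applies it to a small made-up dataset (the numbers and the helper name `sum_squared_residuals` are illustrative assumptions, not part of the TV data) and checks that a slightly different line gives a larger sum of squared residuals.

```python
import numpy as np

# Hypothetical toy data: x is the predictor, y is the response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS estimates for simple linear regression.
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean


def sum_squared_residuals(b0, b1):
    """Sum of squared differences between actual and predicted y."""
    residuals = y - (b0 + b1 * x)
    return np.sum(residuals ** 2)


best = sum_squared_residuals(beta_0, beta_1)
# Any other line, for example a shifted and flatter one, fits worse.
other = sum_squared_residuals(beta_0 + 0.5, beta_1 - 0.1)

print("intercept:", beta_0, "slope:", beta_1)
print("SSR of OLS line:", best, "SSR of another line:", other)
```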
The figure below shows a multiple linear regression model formula.
- yi= dependent variable
- xi= explanatory variables.
- β0= y-intercept.
- βp= slope coefficients for each explanatory variable
- ϵ= the model’s error term.
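Using the symbols listed above, the multiple linear regression model with p explanatory variables can be written in its standard form (the double subscript, row i and variable number, is a notational assumption):

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon
```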
Visualizing a multiple linear regression with two predictors and one dependent variable results in a plane of best fit, as seen below. However, the more predictor variables there are, the harder the model is to visualize or calculate by hand. For this reason, Python with the help of sklearn is an effective and stable tool for this and other machine learning models.
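As a minimal sketch of what sklearn does with two predictors, the example below fits a plane to synthetic data. The data is randomly generated from assumed coefficients (3.0, 2.0 and -1.5), so it only illustrates the idea and is unrelated to the TV dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: two predictors and a response lying close to a known plane.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.1, size=200)

# Fitting the plane of best fit.
plane = LinearRegression().fit(X, y)

# The estimates should land close to the values used to generate the data.
print("intercept:", plane.intercept_)
print("slopes:", plane.coef_)
```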
Linear Regression in action.
We will use the data that was preprocessed (cleaned) and saved as "pigiame-cleaned.csv" in part 2 of this project (https://developers.decoded.africa/data-science-project-part-2/).
Importing the necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Importing the dataset.
tvs = pd.read_csv("pigiame-cleaned.csv")
A glance at the dataset.
Descriptive statistics of variables.
# The include='all' argument enables the inclusion of statistics for non-numerical columns
# such as Product, Condition, Brand, Location and Description.
tvs.describe(include='all')
Checking for missing values.
The output below shows that there are no missing values:
A glance at the dimension of our data frame.
# tvs.shape is a (rows, columns) tuple.
print('Rows: ' + str(tvs.shape[0]))
print('Columns: ' + str(tvs.shape[1]))
Feature selection simplifies models, improves speed and prevents a series of unwanted issues that arise from having too many features. It is done by picking only the columns needed for this linear regression.
Screen Size and Brand are the two columns that are most important for this regression.
The other columns will be dropped, for the reasons below:
- In the Product column up to 70% of the values are unique strings, and in the Description column up to 45%, all of which would need to be encoded. This can be seen below. Encoding them might lead to the curse of dimensionality: adding features without increasing the amount of data available for training.
- Location is 100% Nairobi, so the predictive power of the column has been lost.
- Condition column mostly has ‘new’ type which forms up to 99.5% of the data. Therefore there will be bias over other condition types if we use the column.
Analyzing Products and Description
Prod_Describe = pd.DataFrame(tvs.describe(include='all'))
Prod_Describe.loc['unique', ['Product', 'Description']]
# Dividing the number of unique values by the number of rows (tvs.shape[0]).
Unique_prod_percentage = Prod_Describe.loc['unique', 'Product'] / tvs.shape[0] * 100
Unique_description_percentage = Prod_Describe.loc['unique', 'Description'] / tvs.shape[0] * 100
print('Percentage of unique values in products column is: ' + str(Unique_prod_percentage))
print('Percentage of unique values in description column is: ' + str(Unique_description_percentage))
# Understanding the frequency of all unique values of column Location.
tvs['Location'].value_counts()
The output shows that Nairobi is the only unique value in column Location:
# Understanding the unique values in the Condition column.
condition = tvs['Condition'].value_counts()
# Calculating the percentage of the 'New' condition in the data frame.
new_percentage = condition['New'] / condition.sum() * 100
print('Percentage of new category in Condition column: ' + str(new_percentage))
Dropping columns except for Brand, Screen Size, and Price.
# Dropping columns entails selecting the wanted columns and assigning them to a new data frame.
# .copy() makes tvs_new independent of tvs so that later column assignments are safe.
tvs_new = tvs[['Brand', 'Screen Size', 'Price']].copy()
tvs_new.head(2)
Analyzing Screen size.
Intuitively, the screen size is a great variable to predict the price of a tv.
Therefore, in the cells that follow, we will prepare it for prediction by making the column into numeric form.
tvs_new['Screen Size'] = pd.to_numeric(tvs['Screen Size'])
Handling Categorical Data.
This concept is better understood by first looking at the two types of categorical data: ordinal and nominal.
Ordinal data has some sense of order among them. For instance, shoe sizes as seen in the figure below.
Nominal data, on the other hand, has no concept of ordering (no category is superior to another). Examples include music, movies and cuisine, among others.
Categorical data can be handled by assigning numerical values to the distinct factors in a feature. For instance, the feature Gender can only have one of two finite factors: Male or Female. Machine learning algorithms need numerical data as input. For this reason, Male can be assigned 0 and Female can be assigned 1. This process is known as transformation, after which an encoding scheme can be applied.
For the sake of this tutorial, we will stick to the one-hot encoding process that we have used in the code that follows.
Taking our previous example of gender, the assignment of 0 to male and 1 to female is interpreted as continuous data by machine learning algorithms. This brings the notion that, because female is 1 and male is 0, female is more important than male. However, this is not true because the data is of a nominal type.
One-hot encoding helps with this problem. The idea is to create one dummy feature per factor in our feature (m features for m factors; some schemes drop one and keep m-1 to avoid redundancy), each holding a value of 1 when that factor is active and 0 when it is inactive, as we will see with the TV brands column below.
The process of one-hot encoding can be further studied at https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63.
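The idea can be tried out on a tiny made-up table before applying it to the real data. The sketch below uses an assumed three-brand example and shows both variants: one column per factor, and the m-1 version obtained with `drop_first=True`.

```python
import pandas as pd

# Hypothetical toy data with a nominal 'Brand' column.
toy = pd.DataFrame({
    "Brand": ["LG", "Sony", "Samsung", "LG"],
    "Price": [30000, 55000, 48000, 32000],
})

# One dummy column per brand: 1 where the brand is active, 0 otherwise.
encoded = pd.get_dummies(toy, columns=["Brand"], prefix_sep="_", dtype=int)

# Dropping the first dummy keeps m-1 columns; the dropped brand is
# implied when all remaining brand columns are 0.
reduced = pd.get_dummies(toy, columns=["Brand"], prefix_sep="_",
                         drop_first=True, dtype=int)

print(encoded)
print(reduced)
```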
Brands column factors need to be converted into dummy variables for regression purposes.
# Categorical features to be converted by one-hot encoding.
# The list categ can take more than one column in case of multiple categorical features.
categ = ["Brand"]
# One-hot encoding conversion.
tvs_new = pd.get_dummies(tvs_new, prefix_sep="_", columns=categ)
After encoding, there is usually an increase in the number of columns.
# Checking the new dimensions of our data frame.
print('Rows: ' + str(tvs_new.shape[0]))
print('Columns: ' + str(tvs_new.shape[1]))
Rearranging the columns.
Rearranging is not strictly necessary, but it makes the regression easier to follow. This is done by putting our dependent variable 'Price' as the last column.
col_arranged = ['Brand_Bruhm', 'Brand_HTC', 'Brand_Haier', 'Brand_Hisense', 'Brand_LG',
                'Brand_Leader', 'Brand_Nobel', 'Brand_Other', 'Brand_Phillips', 'Brand_Samsung',
                'Brand_Sanyo', 'Brand_Skyworth', 'Brand_Sony', 'Brand_Star', 'Brand_Syinix',
                'Brand_TCL', 'Brand_Taj', 'Brand_Tornado', 'Brand_Vision', 'Brand_Vitron',
                'Screen Size', 'Price']
tvs_new = tvs_new[col_arranged]
All columns, including the added ones, can be seen with the help of the code below.
The output below shows at a glance the new data frame after one-hot encoding:
Price Log Transformation.
Before diving into log transformation, it is important to know that the normal (Gaussian) distribution is a probability and statistics concept that is widely used for its convenient properties. Among them, its mean, mode and median are equal, and it is fully defined by its mean and variance.
The graphical interpretation below shows a normal distribution.
The logarithm of x to the base b is equal to y when x is equal to b to the power of y (log_b(x) = y because x = bʸ). The base can be any positive number other than 1, but the most common are 2, 10, and e ≈ 2.718282 (the natural log).
The figure below shows the formulas explained.
Log transformation is done to bring the dependent variable closer to a normal distribution when the data is skewed. This step can increase accuracy by improving the linearity between our independent and dependent variables, hence boosting the validity of our statistical analysis.
The two graphs in the figure below show the data before and after a log transformation.
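The same effect can be checked numerically. The sketch below generates right-skewed synthetic "prices" (an assumed log-normal sample, not the TV data) and compares the skewness before and after taking the natural log. A skewness near zero indicates a roughly symmetric distribution.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices drawn from a log-normal distribution.
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=10.3, sigma=0.6, size=5000))

# Natural-log transformation.
log_prices = np.log(prices)

# In right-skewed data the mean sits above the median.
print("mean:", prices.mean(), "median:", prices.median())
print("skewness before:", prices.skew())
print("skewness after: ", log_prices.skew())
```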
Back to analysing our TVs dataset, the code below shows the frequency of TV sizes in order from the highest to the lowest.
The code below shows the distribution of price for 55-inch screens in the form of a boxplot.
# Selecting only the 55-inch TVs.
box = tvs_new[tvs_new['Screen Size'] == 55]
# Recent seaborn versions require x and y to be passed as keyword arguments.
sns.boxplot(x=box['Screen Size'], y=box['Price'])
The output below shows very expensive TVs coming off as outliers. This indicates skewness in our dependent feature. It was seen earlier in the descriptive statistics section, where the mean was greater than the median for the price column.
The code below shows the distribution of tv prices in the dataset before log transformation.
The code below converts the price column into its log form and assigns it to the column 'log_price'.
# Log transformation.
tvs_new['log_price'] = np.log(tvs_new['Price'])
The code below shows the distribution of the new column 'log_price' after log transformation.
# After log transformation.
# histplot with kde=True replaces distplot, which is deprecated in recent seaborn versions.
sns.histplot(tvs_new['log_price'], kde=True)
Since we will only use the 'log_price' column, we can drop the price column with the help of the code below.
tvs_new.drop(['Price'], axis= 1, inplace =True)
Before the actual regression, the code below is used to take a final glance at the dataset.
Linear Regression Model.
Linear Regression needs inputs and target variables.
The target variable in this case is the 'log_price' column, and the inputs (predictor features) are all the other columns. That is why we drop the column 'log_price' when assigning the variable inputs.
target = tvs_new['log_price']
inputs = tvs_new.drop(['log_price'], axis=1)
The code below confirms the form of target and inputs variable.
Feature scaling (also called standardization or normalization) is a data pre-processing step applied to the independent variables or features of the data. It helps to transform the data to a standard scale.
This is done by subtracting the mean from each value in the series and dividing by standard deviation for each feature independently. The figure illustrates more.
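The formula can be verified by hand on a few numbers. The sketch below standardizes an assumed column of screen sizes manually and compares the result with sklearn's `StandardScaler`, which uses the population standard deviation (the NumPy default, `ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical screen sizes in inches, as a single-feature column.
sizes = np.array([[24.0], [32.0], [43.0], [55.0], [65.0]])

# Manual standardization: subtract the mean, divide by the standard deviation.
manual = (sizes - sizes.mean(axis=0)) / sizes.std(axis=0)

# The same transformation done by sklearn.
scaled = StandardScaler().fit_transform(sizes)

print(manual.ravel())
print(scaled.ravel())
```

After scaling, each feature has a mean of 0 and a standard deviation of 1.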
In code, feature scaling is achieved by creating an instance of the StandardScaler class from sklearn and fitting it to the independent features, which stores the mean and standard deviation of each feature in the scaler object. Ideally the scaler is fitted on the train data only; in the code below it is fitted on all the inputs before the split. The fitted scaler can later be used to transform both the train and test data.
# Importing the class used for scaling.
from sklearn.preprocessing import StandardScaler
# Creating an instance of the standard scaler.
scaler = StandardScaler()
# Fitting the scaler, which stores the mean and standard deviation of each feature.
scaler.fit(inputs)
The scaler object can now be used to transform the inputs variable.
# Scaling our inputs and storing them in a variable inputs_scaled.
inputs_scaled = scaler.transform(inputs)
Using the code below, we can see how the scaled data looks.
Train Test Split
This is the process of splitting data into training and testing sets. At times the data can be further split into a validation set as well. The need for splitting is best understood by explaining overfitting and underfitting.
Overfitting is training a model so closely to the training dataset that it loses accuracy on data outside the training set. The model captures the noise rather than describing the relationship between the independent and dependent features.
Underfitting is when the model misses the trend in the data. This may result from too little data or from a model that is too simple. However, it is usually not as common as overfitting.
It is easy to spot underfitting and overfitting with the accuracy of both the train and test data as we will see later on.
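A rough illustration of this comparison, on synthetic data rather than the TV dataset: the sketch below fits a straight line and a degree-9 polynomial to the same noisy linear data, then reports R² on the training and held-out points. The data, the helper `r2` and the chosen degrees are all assumptions made for the example. A training score that is much higher than the test score is the usual sign of overfitting.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic data: a linear trend plus noise.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 30))
y = 2.0 * x + rng.normal(0, 0.2, 30)

# Alternate points go to the training and test sets.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


scores = {}
for degree in (1, 9):
    model = Polynomial.fit(x_train, y_train, degree)
    scores[degree] = (r2(y_train, model(x_train)), r2(y_test, model(x_test)))
    print(f"degree {degree}: train R2 = {scores[degree][0]:.3f}, "
          f"test R2 = {scores[degree][1]:.3f}")
```

The more flexible model always fits the training points at least as well as the straight line; whether that carries over to the test points is what the second number shows.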
The code below uses the train_test_split function from the model_selection module of sklearn to split the data into train and test sets, and further into the inputs (x) and target (y) explained earlier.
from sklearn.model_selection import train_test_split
# Splitting our data into train and test.
# Both train and test have x-values (inputs) and y-values (target).
# The test_size argument ensures train takes 80% of the data and test 20%.
# The random_state argument ensures consistency in the split.
x_train, x_test, y_train, y_test = train_test_split(inputs_scaled, target, test_size=0.2, random_state=365)
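A common variation, shown here only as a sketch on synthetic data, is to split first and fit the scaler on the training inputs alone, so that no information from the test set influences the scaling. The `_demo` variable names are hypothetical and chosen so they do not clash with the tutorial's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic inputs and target standing in for the real data.
rng = np.random.default_rng(365)
X_demo = rng.normal(50, 10, size=(100, 3))
y_demo = rng.normal(10, 1, size=100)

# Split first: 80% train, 20% test.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=365
)

# Fit the scaler on the training inputs only, then transform both sets.
scaler_demo = StandardScaler().fit(X_train_demo)
X_train_scaled = scaler_demo.transform(X_train_demo)
X_test_scaled = scaler_demo.transform(X_test_demo)

print(X_train_scaled.shape, X_test_scaled.shape)
```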
Create a regression.
Regressions are created using the LinearRegression class in the linear_model module of sklearn.
First, the class is imported.
from sklearn.linear_model import LinearRegression
Second, an object of the LinearRegression class is created. In this case 'reg'.
reg = LinearRegression()
Finally, using the object reg, we can fit the model using x_train (inputs) and y_train (outputs).
reg.fit(x_train, y_train)
Training data Prediction.
Predicting our train data based on our model.
This is done using the reg object with the method 'predict'.
y_hat = reg.predict(x_train)
Visual representation of the model performance on the train data.
plt.scatter(y_train, y_hat)
plt.plot([0, 1, 4, 8, 12], [0, 1, 4, 8, 12], linewidth=2, color='red')
plt.xlim(6, 12)
plt.ylim(6, 12)
plt.show()
The output follows the logic that, in the case of a perfect prediction, all points would lie on the red line. The graph therefore shows how close the predicted values of the train data (on the y-axis) are to their corresponding actual values (on the x-axis). The model appears to be doing well, with most of the points clustered around the red line:
# Checking that the errors are normally distributed.
# histplot with kde=True replaces distplot, which is deprecated in recent seaborn versions.
sns.histplot(y_train - y_hat, kde=True)
The accuracy of our model on the training data, measured as the R² value returned by reg.score, is approximately 95%, which is good. We are yet to see how the model performs on the testing data.
# Checking for accuracy.
reg.score(x_train, y_train)
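For a regression model, the `score` method returns the coefficient of determination, R²: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on synthetic data (assumed coefficients, not the TV model) and compares it with the value from `score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(7)
X = rng.normal(size=(50, 2))
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(0, 0.1, 50)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("manual R2:", r2_manual)
print("score():  ", model.score(X, y))
```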
Finding the weights and bias.
The model has a y-intercept/bias of 10.354 in log-price units, or approximately 31,391 KES.
Because the inputs were scaled, the intercept is the prediction when every scaled input is zero, meaning no feature pushes the price up or down. In that case the price of a TV is approximately 31,391 KES.
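Since the model predicts the natural log of the price, the intercept is converted back to shillings by exponentiating it. The short check below uses the rounded intercept of 10.354, so its result is expected to differ slightly from the figure quoted above, which presumably comes from the unrounded value.

```python
import numpy as np

# Rounded intercept reported by the model, in log-price units.
intercept_log = 10.354

# Undoing the log transformation gives the price in KES.
price_kes = np.exp(intercept_log)
print(round(price_kes))
```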
The output shows how different features affect the model with their weights/coefficients:
The code below generates a data frame of the features alongside their corresponding weights, sorted in ascending order of weight.
reg_summary = pd.DataFrame(inputs.columns.values, columns=['Features'])
reg_summary['Weights'] = reg.coef_
reg_summary.sort_values(by=['Weights'], inplace=True, ascending=True)
reg_summary
Since all the independent features were scaled, the weights can be compared directly, because no feature holds an advantage over another through its scale. The interpretation of the output is that Screen Size is the most important feature in predicting price:
Testing data Prediction.
# Predicting the test data using the reg object and its predict method,
# and storing the outcome in the 'y_hat_test' variable.
y_hat_test = reg.predict(x_test)
Visual representation of the model performance.
plt.scatter(y_test, y_hat_test, alpha=0.2)
# The axes show the test targets and the test predictions.
plt.xlabel('Target (y_test)', size=18)
plt.ylabel('Predictions (y_hat_test)', size=18)
# plt.xlim(6,13)
plt.ylim(0, 13)
plt.plot([0, 1, 4, 8, 12], [0, 1, 4, 8, 12], linewidth=2, color='red')
plt.show()
The interpretation of the output is that the closer the points cluster to the red line, the better the model:
The aim of the lines of code below is to analyze the performance of the model on the test data by creating a data frame with the columns: actual values, predictions, residuals and difference%.
- Create a data frame df_pred to hold the predictions.
df_pred = pd.DataFrame(np.exp(y_hat_test), columns=['Prediction'])
- Create data frame df_test to hold the actual price of the tvs in the test data.
df_test = pd.DataFrame(np.exp(y_test))
# Rename the column with a more sensible name. In this case 'Actual_Price'.
df_test.columns = ['Actual_Price']
# Reset the index and drop the old index column for cleaner data.
df_test.reset_index(inplace=True)
df_test.drop(['index'], axis=1, inplace=True)
- Create a column 'Prediction' to join the actual and predicted values in one data frame. Then round the values to 1 decimal place.
df_test['Prediction'] = df_pred['Prediction']
df_test = df_test.round(1)
- Create a column 'Residual' that holds the difference between the prediction and actual price columns.
df_test['Residual'] = df_test['Prediction'] - df_test['Actual_Price']
- Create a column 'difference%' that holds the absolute value of the residual divided by the actual price and multiplied by 100.
df_test['difference%'] = np.absolute(df_test['Residual'] / df_test['Actual_Price'] * 100)
df_test
- Arrange the data frame in descending order of the 'difference%' column in order to view the most poorly predicted prices first.
df_test.sort_values(by=['difference%'], inplace=True, ascending=False)
df_test
Descriptive statistics of the test data performance.
Distribution plot of the residuals
The output shows that the large negative residuals result from actual values that are too high for the dataset, which affects the model. These may be outliers caused by retailers mistyping prices on the pigiame site.
Accuracy.
The accuracy of our model on the test data, measured as the R² value returned by reg.score, is approximately 96%, which is good.
# Checking the accuracy.
reg.score(x_test, y_test)
On the question of whether this model is good, I would say it is, because the accuracy of the model is approximately 95.24% on the train data and 95.87% on the test data. This shows that the model performs well on both the training data and new data, hence it can be trusted when predicting the price of a TV based on inches and brand.
However, the model is still improvable through:
- Outlier removal. From the graph above, we can see high-value negative residuals. This means that there are actual values that are too high and need to be inspected further for clarity.
- Feature engineering can be used to determine 4k and non-4k TVs. Intuitively, this would lead to the pricing of TVs differently.
- Reducing the number of factors/ unique variables in the column ‘brand’. This can be done by only letting the popular brands stand alone with the rest of the less frequent brands clumped under 'other'.
In conclusion, the average salary of a Kenyan, according to different online sources, ranges between 29,467 KES and 140,035 KES. Given that the cheapest smart-screen TV, taken from our model's y-intercept, is 31,391 KES, we can make a general conclusion to the question in part 1: a greater number of families in Kenya can afford a smart flat-screen TV.