Data Pre-processing

II. Data Preprocessing using Pandas:

From part 1 of this series, https://developers.decoded.africa/data-science-project-part-1/, we focused on web scraping with Selenium and file handling, and we stored the scraped data in the 'pigiame.csv' file. In this part of the series, we will first load the CSV file into a Jupyter notebook, clean it and do some Exploratory Data Analysis (EDA) on it.

In case you are new to Jupyter, you can learn more about it at https://developers.decoded.africa/ever-wondered-how-a-simple-data-science-project-looks-like/

The first step before running any code is loading the necessary libraries.

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
  • Pandas will be used for data frame manipulation.
  • Seaborn and matplotlib's sub-library pyplot will be used for visualization.
  • Numpy will be used for its mathematical capabilities.
  • PercentFormatter, from matplotlib's sub-library ticker, will be used to format axis values as percentages.

Loading the dataset is possible with the help of the code below:

# Importing dataset
tvs = pd.read_csv('pigiame.csv', header=None)
tvs.head()

The output shows a glimpse of the dataset after loading:
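Before diving into cleaning, it helps to get a quick sense of the dataset's size and column types. The snippet below is a minimal sketch using standard pandas calls, and it assumes the data frame was loaded into the variable tvs as shown above.

# Quick structural overview of the scraped data: row/column counts, data types and non-null counts.
print(tvs.shape)
tvs.info()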

After loading our scraped dataset, we need to clean it before we use it for analysis. The code below is commented to show the cleaning process, which is unique to every dataset.

  • Duplicate values have to be dropped because some online retailers posted the same listing more than once, leaving the data frame with replicated rows. After dropping them, you will notice that a sizeable chunk of the data is gone.
tvs.drop_duplicates(inplace=True)
  • Renaming columns with logical names.
tvs.columns = ['Product', 'Condition', 'Brand', 'Location', 'Description', 'Price']
tvs.head(2)

The output confirms that the columns have been renamed:

From the above image, we can see that the last column (Price) is a string. However, that column is more useful to us in numerical form, which we can attain through feature engineering.

  • Splitting the currency (KSh) from the price value (e.g. 00,000) to obtain two columns: Currency and Price.
tvs[['Currency','Price']] = tvs['Price'].str.split(' ' , expand=True)
  • Dropping the Currency column because every row holds the same value, so it offers no information to this analysis.
tvs.drop(['Currency'], axis=1, inplace=True)
  • Removing the comma in the 'Price' column by splitting the price values at the comma into two columns, 'Price1' and 'Price2', and rejoining them as Price without the comma.
tvs[['Price1', 'Price2']] = tvs['Price'].str.split(',' , expand=True)
tvs['Price'] = tvs['Price1'] + tvs['Price2']
  • Dropping the 'Price1' and 'Price2' columns because the previous step already gave us a clean 'Price' column.
tvs.drop(['Price1', 'Price2'], axis=1, inplace=True)
tvs.head()

The output is a glimpse of our new Price column:

  • Null values are not good for analysis, therefore they should be dealt with either through deletion or imputation.
    In our case, missing values were introduced while generating the cleaner Price column (for example, a price without a comma produces a null after the split and rejoin).
# Printing out sum of null values
print('Missing values after feature engineering are: '+ str(tvs['Price'].isnull().sum()))

# Dropping null values
tvs.dropna(inplace=True)

# Printing out the sum of null values after dropping.
print('Missing values after dropping null values are: '+ str(tvs['Price'].isnull().sum()))

Output:
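As a side note, the split-induced nulls can be avoided altogether by stripping the commas directly instead of splitting and rejoining. The line below is a minimal alternative sketch using pandas' str.replace, shown here only for comparison with the approach above.

# Remove every comma from the raw price string in one step, with no intermediate columns.
tvs['Price'] = tvs['Price'].str.replace(',', '', regex=False)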

  • Scraped values come in as strings. Therefore, we need to change the Price column data type to integer and then confirm the data types.
# Confirming data types
print('Column data types before \n:'+ str(tvs.dtypes) + '\n')

# Changing Price column data type to integer
tvs['Price'] = tvs['Price'].astype(int)

# Confirming data types
print('Column data types after \n:'+ str(tvs.dtypes))

The output shows that we have converted the column Price to its appropriate data type:
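If astype(int) ever fails because a stray non-numeric string is still hiding in the column, a more forgiving option is pandas' to_numeric with errors='coerce', which turns unparseable values into nulls that can then be dropped. This is a minimal sketch of that alternative, not part of the original workflow.

# Coerce unparseable price strings to NaN instead of raising an error, then drop them and convert.
tvs['Price'] = pd.to_numeric(tvs['Price'], errors='coerce')
tvs.dropna(subset=['Price'], inplace=True)
tvs['Price'] = tvs['Price'].astype(int)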

Descriptive statistics are important because they present a large dataset in a summarised, meaningful way for simpler interpretation. However, they do not explain everything in our data. The code below outputs the descriptive statistics of the pigiame dataset.

tvs.describe(include='all')

The output shows the following:

Count, unique, top and freq are statistical summaries of the categorical columns; in this case Product, Condition, Brand, Location and Description.

  • Count is the number of non-empty cells per column or feature.
  • Unique is the number of distinct values per column.
  • Top is the most frequent value in a column, while freq is its frequency.

Mean, std, min, 25%, 50%, 75% and max are statistical summaries of the numerical columns; in this case Price.

  • mean is the average.
  • std is the standard deviation.
  • 25%, 50% and 75% are percentiles to show the distribution of data.
  • min and max are the minimum and maximum values respectively.

Exploratory Data Analysis is one of the most important steps in data science. Its main purpose is to understand the data in depth: the size, the attributes, and the patterns within or between one or more features in a dataset. Exploration also guides feature engineering and feature selection, and it can reveal ways to improve a model's accuracy. The exploration process is made easier through data visualization.

Data exploration processes include:

  • Univariate analysis: analysis of one feature.
  • Bivariate analysis: analysis between two variables to determine their relationship (a small sketch follows this list).
  • Multivariate analysis: analysis between multiple dependent features.
  • Principal Component Analysis: uses the correlation patterns between features to reduce their number. In our case, it will not be necessary because we have no correlated features.
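As a small illustration of bivariate analysis, the sketch below compares two of our features, Brand and Price, by computing the average price per brand with a groupby. It is only an example of the idea; the relplot later in this article explores the same relationship visually.

# Average price per brand: a simple bivariate summary of Brand against Price.
print(tvs.groupby('Brand')['Price'].mean().sort_values(ascending=False))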

Univariate analysis.

  • Looking into the number of distinct TV brands on pigiame.
print('The number of unique tv brands are: '+ str(len(tvs['Brand'].unique())))

Output:

The code below visualizes the brands by count, showing their popularity in the pigiame database.

# Count the listings per brand and turn the result into a tidy data frame.
tv_brands = pd.DataFrame(tvs['Brand'].value_counts())
tv_brands.reset_index(inplace=True)
tv_brands.columns = ['Brand', 'Count']

# Bar plot of brand popularity, with the brand names rotated for readability.
plt.figure(figsize=(10,6))
chart = sns.barplot(x='Brand', y='Count', data=tv_brands)
chart.set_xticklabels(chart.get_xticklabels(), rotation=45)
plt.show()

Output:

TV brands.

From the above figure, it is clear that Hisense is the most popular brand, followed by Samsung, TCL and Sony. The brands grouped under other categories, along with LG and Skyworth, follow.

  • The Location feature has several unique entries but will later be summarised into Nairobi in general. The code and output can be seen below.
# Get the unique entries, by count, of the Location column in the tvs data frame.
tvs['Location'].value_counts()

The few entries that are not in Nairobi can either be deleted or converted to Nairobi, since they are too few to make a difference.

# Change all entries in Location column into 'Nairobi'
tvs['Location'] = 'Nairobi'
tvs['Location'].value_counts()

Output:

The Product feature has so many unique values that, from an analytical point of view, it might not be very useful on its own. However, from the code and output below, notice that the Product column contains the screen size in inches. This is a very powerful feature, especially for predicting the price of a TV. Therefore, through feature engineering, we can extract it into a new column, 'Screen Size'.

# Code to view the Product column in the tvs data frame.
tvs.head(10)['Product']
First 10 values of the Product column.
# Extract the two-digit inch measurement from the Product column into a new 'Screen Size' column.
tvs['Screen Size'] = tvs['Product'].str.extract(r'(\d\d)', expand=True)

# Checking the number of rows.
print(len(tvs['Screen Size']))
# Removing missing values from the data.
tvs = tvs[pd.notnull(tvs['Screen Size'])]
# Checking the number of rows after removing missing values
print(len(tvs['Screen Size']))

# View the first 10 entries for changes.
tvs.head(10)

The output below shows that the extraction of the inch measurement from Product into the last column, Screen Size, has been successful.

The most popular screen sizes on pigiame can be visualized with the help of the code below.

# Create a data frame from the value counts of all the unique screen sizes.
tv_sizes = pd.DataFrame(tvs['Screen Size'].value_counts())
tv_sizes.reset_index(inplace=True)

# Rename the columns to Size and Count.
tv_sizes.columns = ['Size', 'Count']

# Create a cumpercentage column: the cumulative sum of Count expressed as a percentage of the total.
tv_sizes["cumpercentage"] = tv_sizes["Count"].cumsum()/tv_sizes["Count"].sum()*100

# Create the figure and its axes.
fig, ax = plt.subplots(figsize=(15,6))

# Bar plot with size on the x-axis and count on the y-axis.
ax.bar(tv_sizes["Size"], tv_sizes["Count"], color="C0")

# Create a twin axes that shares the x-axis.
ax2 = ax.twinx()
# Plot the size on the x-axis and cumpercentage on the right y-axis.
ax2.plot(tv_sizes["Size"], tv_sizes["cumpercentage"], color="C1", marker="o", ms=4)

# Format the cumulative values on the right y-axis as percentages.
ax2.yaxis.set_major_formatter(PercentFormatter())

# Visualize the plot.
plt.show()

From the output below, we can say that the most popular screen sizes are  32, 43, 55, 40, 49, 65, 50, 24, and 75.

The code below lets you view the tv_sizes data frame created earlier. The data frame shows that the sizes from 32 to 75 inches in the figure above cover up to 97.9% of the listings.

tv_sizes

The code below plots Price against Brand to reveal the most expensive and the most economical brands.

# relplot is a figure-level function, so it is sized with height and aspect rather than plt.figure().
sns.relplot(x='Price', y='Brand', hue='Screen Size', data=tvs, height=8, aspect=1.25)
plt.show()

The output shows that LG, Samsung and Sony have the most expensive TVs on average.
We can also spot an outlier in Skyworth.
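To take a closer look at such an outlier, you can filter the data frame to the brand in question and sort by price. This is a minimal sketch, using Skyworth only because it is the brand flagged above.

# Inspect the highest-priced Skyworth listings to examine the suspected outlier.
print(tvs[tvs['Brand'] == 'Skyworth'].sort_values('Price', ascending=False).head())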

Cleaning the Screen Size column is important because some rows have minor issues. For instance, screen sizes like 2 or '4o' do not make sense. Visualizing these rows one after another confirms this, and the necessary changes can be made by filtering them out or changing cell values. Making such changes can be done with the following steps.

Step I: Extract and visualize rows by value.
This step can be used for clarity.

Example: extract all rows from the tvs data frame where the Screen Size column equals the string '4k'. Since 4k is a resolution rather than a screen size, we need to handle this issue.

# If the Screen Size column holds strings, the comparison value must be quoted; if it holds numbers, it must not.
filter = tvs['Screen Size'] == '4k'

# filter holds boolean values, so the line below returns only the rows that satisfy the condition above (those whose boolean value is True).
tvs[filter]

The output shows that some inch values ended up in the wrong place in the Screen Size column, so correcting them for cleaner data is mandatory.

Step II: Change the value of a cell.
You can use this step if the cells you need to change are not many. If that is not the case, you might want to use a different approach, for instance deleting the rows in question.

# iloc is an indexer that takes the integer positions of a row and a column, identifying the specific cell so that a new value can be assigned.
# tvs.iloc[row_number, column_number] = new_value_to_replace_with

A good example from our previous image: in the first row shown (584), the Product column describes a TV of size 55 inches rather than 4k. Therefore, to change the corresponding cell in the Screen Size column, we would use the code below.

tvs.iloc[584, 6] = '55'

Here the row is 584 and the Screen Size column is 6; note that iloc uses square brackets and takes the row first. We assign the string value '55' since the column is of type object.
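One caution: iloc works on integer positions, and after dropping duplicates and null rows the positional index no longer matches the labels printed on the left of the data frame. If 584 is the label you see in the output rather than a position, loc is the safer choice. A minimal sketch of that alternative:

# loc addresses the row by its index label and the column by its name, which still works after rows have been deleted.
tvs.loc[584, 'Screen Size'] = '55'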

Step III: Deleting/dropping a row.

We have seen how to drop a column before. The same code holds for dropping rows, with minor changes.
The code blueprint below guides you on how to do just that, and a concrete example follows the list.

tvs.drop([a], axis=0, inplace=True)
  • The first argument, a, takes the index label of the row to drop (or several labels separated by commas).
  • The second argument, axis=0, means we are dropping rows; an axis of 1 would mean columns.
  • The third argument, inplace=True, means the changes are saved back into the original data frame; with inplace=False a new data frame would be returned instead.
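For instance, if the row with index label 584 could not be repaired and you chose to remove it instead, the drop would look like the sketch below (584 is used purely as an illustrative label).

# Drop the row whose index label is 584 and save the change back into the data frame.
tvs.drop([584], axis=0, inplace=True)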

Using the three steps above, the figure below gives a hint of which screen sizes need cleaning.

Some errors may not be spotted and corrected individually since the data frame is large. For this reason, a more general approach is needed so that accuracy is not compromised. Removing outliers achieves this in a few lines of code; step VI of the article https://developers.decoded.africa/ever-wondered-how-a-simple-data-science-project-looks-like/ would be of great assistance here, and the sketch below illustrates the idea.
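As an illustration of that general approach, the sketch below removes price outliers using the common 1.5 × IQR (interquartile range) rule. It is one possible technique, not necessarily the exact method used in the linked article.

# Remove Price outliers that fall outside 1.5 * IQR of the middle 50% of prices.
q1 = tvs['Price'].quantile(0.25)
q3 = tvs['Price'].quantile(0.75)
iqr = q3 - q1
tvs = tvs[(tvs['Price'] >= q1 - 1.5 * iqr) & (tvs['Price'] <= q3 + 1.5 * iqr)]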

The final step after all the necessary cleaning is to store the clean data in another CSV. The code below does just that.

tvs.to_csv(r'Path\pigiame-cleaned.csv', index = False, header=True)
  • Replace 'Path' in the code above with the actual path where you wish to store the clean data, preferably the same folder as this notebook.

After all is done, the next and last article in this series (https://developers.decoded.africa/data-science-project-part-3/) will cement your understanding of data science by teaching you machine learning using linear regression models.
