In light of the use of data science as a tool in decision making, this series of articles answers the question of how much it would cost a household in Nairobi to own a flat-screen TV. What makes the project interesting is that, unlike working with pre-uploaded online datasets, we will get data directly from the internet through web scraping and clean it ourselves for statistical analysis and model creation.

The project aims to show any interested party, from scratch, what it feels like to work on a project from start to end with an idea but no data.

By the end of this series of articles, you will learn:
I. Web Scraping using Selenium.

  1. Loops.
  2. File handling.

II. Data Preprocessing using Pandas:

  1. Loading a Dataset.
  2. Exploring the dataset.
  3. Cleaning the data.
  4. Statistics (Descriptive).
  5. Feature Engineering.
  6. Outlier removal.

III. Machine learning using scikit-learn.

  1. Linear regression.
  2. Feature Selection.
  3. Handling Categorical Data (One-hot encoding).
  4. Feature Scaling (Standardization).
  5. Splitting the dataset into train and test.
  6. Training and testing.

I. Web Scraping using Selenium.

Web scraping is the process of using computational tools to get data from internet sites automatically. Its main benefit is the extraction of large amounts of useful information from online sites such as social media and e-commerce platforms; useful information may include customers, products and stocks, among others. Web scraping works by identifying patterns in how listings are posted on a site and writing code that takes advantage of these patterns, extracting information repeatedly using loops or recursive functions.
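For example, listings on a results page often differ only by an index inside their XPath, so a loop can generate a locator for every listing. The XPath below is a shortened, hypothetical template for illustration; the real templates used later in this article are much longer:

```python
# Hypothetical, shortened XPath template: the index of the listing
# container is the only part that changes from one item to the next.
xpath_template = '/html[1]/body[1]/div[1]/div[{}]/a[1]'

# Generate a locator for each of the first three listings.
locators = [xpath_template.format(i) for i in range(1, 4)]
print(locators)
```

This is exactly the pattern the scraper functions below exploit, with `str(i)` spliced into a long XPath string.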

This article focuses on an e-commerce platform for the project, so we will use web scraping to get the necessary data from the site. The data we aim to collect includes the product, condition, brand, shop, description and price of televisions on the pigiame site. The extracted data will be stored in a CSV file using file handling.
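Writing rows to a CSV file with Python's built-in csv module looks like this (a minimal sketch with made-up sample rows; the real script later appends one row per scraped listing):

```python
import csv

# Made-up sample rows shaped like the data we will scrape:
# product, condition, brand, shop, description, price
rows = [
    ['TV 43"', 'New', 'Samsung', 'Shop A', '43 inch smart tv', 'KSh 30,000'],
    ['TV 32"', 'Used', 'LG', 'Shop B', '32 inch digital tv', 'KSh 12,000'],
]

# Open in append mode ('a') so repeated calls add rows instead of
# overwriting; newline='' prevents blank lines between rows on Windows.
with open('sample.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)

# Read the file back to confirm the rows were written.
with open('sample.csv', newline='') as f:
    print(list(csv.reader(f)))
```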

Importing the necessary libraries.

The main library for web scraping in this case is Selenium. We also need Python's built-in csv module to write the scraped data to file.

The sub-modules we use from this library are:

  • webdriver, for running commands on a browser such as Mozilla Firefox or, as in our case, Google Chrome.
  • exceptions such as ‘NoSuchElementException’, which help catch errors and prevent the program from crashing.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException, NoSuchAttributeException
import csv  # built-in module used to write the scraped rows to a CSV file
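The scraping functions below wrap the element lookups in a try/except so that a listing missing any field is skipped rather than crashing the whole run. The same pattern is shown here with a built-in exception standing in for the Selenium ones:

```python
# Some listings are missing a field; KeyError here stands in for
# NoSuchElementException in the real scraper.
listings = [
    {'product': 'TV 50"', 'price': 'KSh 45,000'},
    {'product': 'TV 24"'},                          # missing price
    {'product': 'TV 65"', 'price': 'KSh 90,000'},
]

scraped = []
for item in listings:
    try:
        scraped.append([item['product'], item['price']])
    except KeyError:
        continue  # skip incomplete listings instead of crashing

print(scraped)  # two complete rows; the incomplete one is skipped
```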

Define the function scraper1, which scrapes elements on the first 5 pages. This is because, after the fifth page, products on the site follow a different pattern.
Every line of code has been commented for better understanding.

def scraper1():
    # loop through 400 items so as to capture listings on the first 5 pages
    for i in range(1, 400):
        data1 = []  # create an empty list "data1"
        # try finding elements by their respective XPath positions below and
        # skip the exceptions specified at the bottom of the function, i.e.
        # NoSuchElement, Timeout and NoSuchAttribute
        try:
            product = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[2]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[1]/div[1]/div[1]').text
            condition = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[2]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[1]/div[1]/div[2]/span[1]').text
            brand = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[2]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[1]/div[1]/div[2]/span[2]').text
            shop = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[2]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[1]/div[1]/div[2]/span[3]').text
            description = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[2]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/p[1]').text
            price = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[2]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[2]/div[1]/span[1]/span[1]').text

            # store elements in the list as standard unicode
            data1 = [product.encode('unicode-escape').decode('utf-8'),
                     condition.encode('unicode-escape').decode('utf-8'),
                     brand.encode('unicode-escape').decode('utf-8'),
                     shop.encode('unicode-escape').decode('utf-8'),
                     description.encode('unicode-escape').decode('utf-8'),
                     price.encode('unicode-escape').decode('utf-8')]

            # open the csv file in append mode and write the row
            with open('pigiame.csv', 'a', newline='') as products:
                thewriter = csv.writer(products)
                thewriter.writerow(data1)
        except (TimeoutException, NoSuchElementException, NoSuchAttributeException):
            continue  # skip listings that raise any of these exceptions

    print('finished scraping this page')  # once the loop completes
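The `.encode('unicode-escape').decode('utf-8')` step used when building `data1` turns any non-ASCII character (such as the non-breaking space that often appears inside prices on web pages) into a plain escape sequence, so every cell written to the CSV is ASCII-safe:

```python
# A price string containing a non-breaking space (U+00A0),
# common in text scraped from web pages.
price = 'KSh\u00a045,000'

# unicode-escape replaces the non-ASCII character with a literal escape code.
safe = price.encode('unicode-escape').decode('utf-8')
print(safe)  # KSh\xa045,000
```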

Define the function scraper2, which scrapes elements on all the remaining pages up to page 3644. However, if there are more pages, one can extend the range of the loop to include them.

def scraper2():
    for i in range(1, 400):
        data1 = []
        try:
            product = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[3]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[1]/div[1]/div[1]').text
            condition = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[3]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[1]/div[1]/div[2]/span[1]').text
            brand = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[3]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[1]/div[1]/div[2]/span[2]').text
            shop = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[3]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[1]/div[1]/div[2]/span[3]').text
            description = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[3]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/p[1]').text
            price = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[3]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[2]/div[1]/span[1]/span[1]').text

            data1 = [product.encode('unicode-escape').decode('utf-8'),
                     condition.encode('unicode-escape').decode('utf-8'),
                     brand.encode('unicode-escape').decode('utf-8'),
                     shop.encode('unicode-escape').decode('utf-8'),
                     description.encode('unicode-escape').decode('utf-8'),
                     price.encode('unicode-escape').decode('utf-8')]

            with open('pigiame.csv', 'a', newline='') as products:
                thewriter = csv.writer(products)
                thewriter.writerow(data1)
        except (TimeoutException, NoSuchElementException, NoSuchAttributeException):
            continue

    print('finished scraping this page.')

The cells that follow will run the program, calling all the functions previously defined.

Create a driver object for the sub-module Chrome and launch a browser window with the help of ‘chromedriver.exe’, downloaded and saved in the same directory as this script.

driver = webdriver.Chrome('chromedriver.exe')

Load the target site (pigiame) and the specific page (TVs).


Scroll through until the maximum number of listings is loaded, to avoid missing elements.

driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
Scrape elements on the first five pages and store them in a comma-separated values file.

scraper1()
Initialize i to 6, where i is the page number; we start from the sixth page since the first 5 pages have already been scraped.

i = 6
Loop through pages in increments of 5 from page 6, scrolling to the bottom of each page to load all listings and scraping the elements. The increment of 5 is because scrolling to the bottom loads 5 more pages. "finished scraping this page" is printed after each page is scraped. The limit of 3644 pages can be increased in case there are more pages.
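The sequence of page numbers visited by the loop below can be sketched as a range starting at 6 with a step of 5:

```python
# Page numbers the loop visits: 6, 11, 16, ... up to (but not past) 3643.
pages = list(range(6, 3644, 5))
print(pages[:4], '...', pages[-1])  # [6, 11, 16, 21] ... 3641
```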

while i < 3644:
    driver.get('{}'.format(i))  # load page i (the page URL template is not shown here)
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    scraper2()
    i += 5

After the whole process is done, we will have a CSV file named ‘pigiame.csv’ in the same folder as this script. It is important to note that the whole scraping process needs a relatively fast and steady internet connection. Now that we are done with data extraction, we can move on to the next part of this series.