In light of the use of Data Science as a tool in decision making, this series of articles answers the question of how much it would cost a household in Nairobi to own a flat-screen TV. What makes the project interesting is that, unlike working with a pre-uploaded online dataset, we will collect data directly from the internet through web scraping and clean it into the form we need for statistical analysis and model building.

The project aims to let any interested party learn from scratch what it feels like to work on a project from start to finish with an idea but no data.

By the end of this series of articles, you will learn:
I. Web Scraping using Selenium:

  1. Loops.
  2. File handling.

II. Data Preprocessing using Pandas:

  1. Loading a Dataset.
  2. Exploring the dataset.
  3. Cleaning the data.
  4. Statistics (Descriptive).
  5. Feature Engineering.
  6. Outlier removal.

III. Machine Learning using scikit-learn:

  1. Linear regression.
  2. Feature Selection.
  3. Handling Categorical Data (One-hot encoding).
  4. Feature Scaling (Standardization).
  5. Splitting the dataset into train and test.
  6. Training and testing.

I. Web Scraping using Selenium.

The web scraping process.

Web scraping is the process of using computational tools to extract data from websites automatically. Its main benefit is the ability to pull large amounts of useful information from online sources such as social media and e-commerce platforms, for example data on customers, products or stock. Scraping works by identifying the patterns in which items are posted on a site and writing code that takes advantage of those patterns, extracting the information repeatedly using loops or recursive functions.

This article uses the e-commerce platform https://www.pigiame.co.ke/tvs for the project, so we will use web scraping to get the necessary data from the site. The data we aim to collect includes the product, condition, brand, shop, description and price of the televisions listed on the PigiaMe site. The extracted data will be stored in a CSV file using basic file handling; an optional step for writing a header row is sketched after the imports below.

Importing the necessary libraries.

The main library used for web scraping here is Selenium.

The Selenium modules used here are:

  • webdriver, for driving a browser such as Mozilla Firefox or Google Chrome (the latter in our case).
  • exceptions, which provides classes such as NoSuchElementException to help catch errors and prevent the program from crashing.

We also import Python's built-in csv module, which is used later to write the scraped rows to a file.

import csv  # built-in module used below to write the scraped rows to a CSV file

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException, NoSuchAttributeException
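
The scraper functions below append rows to 'pigiame.csv' without a header. As an optional, minimal sketch (not part of the original script), a header row can be written once before scraping so the file is self-describing; the column names are our own choice and simply mirror the order in which the fields are written:

# optional: write a header row once before scraping (the names here are an assumption
# and must match the order in which the scraper writes the fields)
with open('pigiame.csv', 'w', newline='') as products:
    header_writer = csv.writer(products)
    header_writer.writerow(['product', 'condition', 'brand', 'shop', 'description', 'price'])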

Define the function scraper1, which scrapes the listings on the first 5 pages. A separate function is needed because, after the fifth page, the site uses a different pattern for its product listings.
Every line of code has been commented for better understanding.

def scraper1():
    # loop through up to 400 items so as to capture the listings on the first 5 pages.
    for i in range(1, 400):
        data1 = []  # create an empty list "data1"
        # try finding the elements by their respective XPath positions below and skip the
        # exceptions specified at the bottom of the function, i.e. NoSuchElementException,
        # TimeoutException and NoSuchAttributeException.
        try:
            product = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[2]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[1]/div[1]/div[1]').text
            condition = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[2]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[1]/div[1]/div[2]/span[1]').text
            brand = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[2]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[1]/div[1]/div[2]/span[2]').text
            shop = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[2]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[1]/div[1]/div[2]/span[3]').text
            description = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[2]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/p[1]').text
            price = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[2]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[2]/div[1]/span[1]/span[1]').text

            # escape any non-ASCII characters and store the fields in the list as plain strings.
            data1 = [product.encode('unicode-escape').decode('utf-8'),
                     condition.encode('unicode-escape').decode('utf-8'),
                     brand.encode('unicode-escape').decode('utf-8'),
                     shop.encode('unicode-escape').decode('utf-8'),
                     description.encode('unicode-escape').decode('utf-8'),
                     price.encode('unicode-escape').decode('utf-8')]

            # append the row to the csv file
            with open('pigiame.csv', 'a', newline='') as products:
                thewriter = csv.writer(products)
                thewriter.writerow(data1)

        except (TimeoutException, NoSuchElementException, NoSuchAttributeException):
            continue

    print('finished scraping this page')  # once the loop completes, report that this page is done

Define the function scraper2, which scrapes the listings on all the remaining pages (the loop further below goes up to page 3644). If there are more pages, the loop's range can be extended to include them.

def scraper2():
    # scraper2 mirrors scraper1; the only difference is the container index in the
    # XPaths (div[3] instead of div[2]), which matches the layout of pages after the fifth.
    for i in range(1, 400):
        data1 = []
        try:
            product = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[3]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[1]/div[1]/div[1]').text
            condition = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[3]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[1]/div[1]/div[2]/span[1]').text
            brand = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[3]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[1]/div[1]/div[2]/span[2]').text
            shop = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[3]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[1]/div[1]/div[2]/span[3]').text
            description = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[3]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/p[1]').text
            price = driver.find_element_by_xpath('/html[1]/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[1]/div[3]/div[1]/div['+str(i)+']/div[1]/a[1]/div[2]/div[1]/div[2]/div[1]/span[1]/span[1]').text
            

            data1 = [product.encode('unicode-escape').decode('utf-8'),
                     condition.encode('unicode-escape').decode('utf-8'),
                     brand.encode('unicode-escape').decode('utf-8'),
                     shop.encode('unicode-escape').decode('utf-8'), 
                     description.encode('unicode-escape').decode('utf-8'), 
                     price.encode('unicode-escape').decode('utf-8')]        

            with open('pigiame.csv', 'a', newline='') as products:
                thewriter = csv.writer(products)
                thewriter.writerow(data1)          
                      

        except (TimeoutException, NoSuchElementException, NoSuchAttributeException):
            continue

    print('finished scraping this page.')

The cells that follow run the program, calling all the functions defined above.

Create the driver object from webdriver's Chrome class; this launches a browser window with the help of 'chromedriver.exe', downloaded and saved in the same directory as this script.

driver = webdriver.Chrome('chromedriver.exe')
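
Note that the call above uses the older Selenium 3 style API. In recent Selenium 4 releases, the positional driver path and the find_element_by_xpath style methods used in this article are no longer available; a minimal sketch of the equivalent setup, assuming Selenium 4 is installed, looks like this:

# Selenium 4 style setup (a sketch; assumes Selenium 4 is installed).
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(service=Service('chromedriver.exe'))

# In Selenium 4, element lookups use find_element with a By locator, e.g.:
# product = driver.find_element(By.XPATH, '/html[1]/body[1]/...').text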

Load the target site (PigiaMe) and the specific page (tvs).

driver.get('https://www.pigiame.co.ke/tvs')

Scroll down until the maximum number of listings has loaded, to avoid missing elements.

driver.execute_script("window.scrollTo(0,document.body.scrollHeight)") 
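
On a slow connection, a single scroll may not trigger all of the lazily loaded listings. A more robust variation, shown here as a sketch rather than part of the original script, keeps scrolling until the page height stops growing:

import time

# keep scrolling until the page height stops increasing, i.e. no more listings load
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)  # give the newly loaded listings time to render
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# note: this can load more than five pages' worth of listings, so the item range
# in the scraper functions may need adjusting if you use it.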

Scrape the elements on the first five pages and store them in a comma-separated values (CSV) file.

scraper1() 

Initialize i to 6, where i is the page number; we start from the sixth page since the first 5 pages have already been scraped.

i=6

Loop through the pages in increments of 5 from page 6, scrolling to the bottom of each page so that all listings load, and scraping the elements. The increment of 5 is because scrolling to the bottom loads 5 more pages. "finished scraping this page" is printed after each page is scraped. The limit of 3644 pages can be increased in case there are more pages.

while i<3644: 
    driver.get('https://www.pigiame.co.ke/tvs?p={}'.format(i)) 
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    scraper2()
    i+=5

After the whole process is done, we will have a CSV file named 'pigiame.csv' in the same folder as this script. Keep in mind that the whole scraping process needs a relatively fast and stable internet connection. Now that we are done extracting the data, we can quickly check the output (see the sketch below) and then move on to the next part of this series.
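
A minimal sanity check, assuming pandas is installed; the column names simply mirror the order in which the scraper writes the fields:

import pandas as pd

# if a header row was written before scraping, the column names are picked up automatically
df = pd.read_csv('pigiame.csv')
# otherwise, supply them explicitly:
# df = pd.read_csv('pigiame.csv', header=None,
#                  names=['product', 'condition', 'brand', 'shop', 'description', 'price'])

print(df.shape)   # number of rows and columns scraped
print(df.head())  # peek at the first few listings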

https://developers.decoded.africa/data-science-project-part-2/
