How to retrieve web data using Python libraries (Beautiful Soup, requests and urllib3)

In this tutorial, we will learn about the various Python libraries used in web scraping, that is, retrieving data from a website or an API such as the Reddit API or the YouTube video API. Web scraping is an essential skill for web developers and data scientists: it enables them to automatically pull data from different sources using a simple script.

Intended audience: who the tutorial is for

This tutorial targets Python developers (beginner and intermediate levels) who want to use Python to download text, image or video data from authorized websites and later use it in machine learning applications such as natural language text classification or visual recognition. It teaches readers how to use basic Python programming and libraries to handle text, image or video data and datasets.

What the tutorial will cover:

The tutorial begins with an introduction to the Python programming language and its installation on Windows, Linux and Mac OS, followed by an introduction to, and use cases of, the requests and urllib3 libraries.

APIs are established interfaces or scripts that enable a user to retrieve information stored on servers; the Reddit and YouTube video APIs, for example, allow registered users to access data. The most commonly used web scraping libraries in Python 3 include Beautiful Soup, Selenium, requests, lxml, MechanicalSoup and urllib3. For this tutorial, we will focus on Beautiful Soup to parse retrieved content, requests to GET and POST web data, and urllib3 to handle the HTTP connections.

urllib3 is a Python package that libraries such as requests and pip depend on. It provides thread safety, connection pooling, client-side TLS/SSL verification, file uploads with multipart encoding, proxy support for HTTP and SOCKS, request retries and HTTP redirect handling.

Web scraping usually involves three steps:

  1. Sending an HTTP request to the web page URL; the server responds by returning the HTML content of the page. At this step, we use a Python HTTP library called requests.
  2. Once the HTML content is returned, it must be parsed to create an organized, hierarchical structure of the data that makes it understandable. This process is known as 'parsing the data', and the Python libraries used in this step are html5lib or lxml.
  3. The final step involves searching through the hierarchical data for the information relevant to the user. To achieve this, we use the Python library Beautiful Soup.

Prerequisites:

  • Working Python 3 installation.
  • Text editor such as Notepad++, Sublime Text, Atom etc.
  • Python libraries, i.e. requests, urllib3, html5lib and BeautifulSoup (bs4).
  • Basic knowledge of Python syntax such as conditional statements, exceptions and errors, methods, functions and operators.
  • Internet connection to make HTTP requests to access the website content.
  • Authorized website URLs to make HTTP requests to; otherwise the connection will be refused.

Installing Python

To follow the tutorial, we first have to install Python. The sections below describe how to install Python on Windows, Linux and Mac OS; the installation process differs across the three operating systems.

Installing Python on Windows

  • Visit the official Python website at http://www.python.org/download/ and download the latest Python installer. The latest version at the time of writing this tutorial is Python 3.9. Click on the downloaded Python executable file and the installation process will commence.

Installing Python on Linux

  • On Linux systems, Python usually comes pre-installed. To check whether Python is installed, type python (or python3) in the terminal command line window; the installed Python version will be displayed on the terminal.

Installing Python on Mac OS

  • Visit the official Python website at http://www.python.org/download/. Look for the Mac OS installer and download the latest 64- or 32-bit installer with the .pkg extension. Double-click the installer and follow the steps to the end. Type python3 on the command line and the installed Python version will be displayed in the terminal window.

Installing Notepad++

  • To write computer programs, you need a text editor such as Notepad++, Sublime Text or whichever one you prefer; this is where you will write the Python code before using the Python interpreter to run it and output the results. Download the Notepad++ executable file from the official website http://www.notepad-plus-plus.org/downloads/, then double-click the executable file and follow the installation steps through to completion.

Introduction to Python Requests Library

Requests is a Python library that allows users to send HTTP requests. To make use of requests in a project, first install it from the command line by running: pip install requests

Then import the library at the top of your script by running: import requests

GET Request
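
The GET example can be written as the following minimal sketch; https://example.com is a hypothetical placeholder for any site you are authorized to scrape, and the line numbers in the notes below refer to this sketch:

    import requests
    response = requests.get('https://example.com')
    print(response.text)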

From the above code:

  • Line 1: imports the requests library
  • Line 2: creates a response object that gets the URL link
  • Line 3: Outputs the response in text format

The code above uses requests to create a response object with the GET method. Save the full script as requests_get.py and run it through the Python interpreter to see the HTML output (extracted data) on the terminal.

Output:

POST Request
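
A minimal sketch of the POST example follows; the httpbin.org test endpoint is an assumed placeholder, and the line numbers in the notes below refer to this sketch:

    import requests
    url = 'https://httpbin.org/post'
    response1 = {'key1': 'value1'}
    response2 = requests.post(url, data=response1)
    print(response2.status_code)
    if response2.status_code == 200:
        print(response2.text)
    elif response2.status_code == 404:
        print('Not found')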

From the above code:

  • Line 1: imports the requests library
  • Line 2: creates a variable containing the URL
  • Line 3: creates a dictionary (response1) holding a key-value pair to send
  • Line 4: creates a second response object (response2) that posts the dictionary to the URL
  • Line 5: prints the connection status_code
  • Line 6: checks if the status_code of response2 is 200
  • Line 7: outputs the text of response2
  • Line 8: checks if the status_code of response2 is 404
  • Line 9: outputs the message 'Not found'

The post() method sends a POST request to a specified URL; the POST method is used to send data to a server. Run the code above, save the Python script as requests_post.py and run it through the Python interpreter by typing python requests_post.py to see the output (extracted data).

Output:

Other important requests methods include:

  • response.encoding # returns the web page encoding, e.g. 'utf-8'
  • response.status_code # returns status code 200 if successful and 404 if an error is encountered
  • response.elapsed # returns the time elapsed getting the response
  • response.url # returns the URL of the web page
  • response.history # returns chronological information about the response(s), e.g. redirects
  • response.headers['Content-Type'] # returns response headers such as 'text/html; charset=utf-8'
  • response.cookies # accesses cookie information
  • response.is_redirect # returns True (HTTP redirect) or False (no HTTP redirect) based on the response obtained
  • response.text # returns the response in text format
  • response.json() # returns the response for JSON-encoded content
  • response.raw # returns the response in raw format
  • response.iter_content(chunk_size=50000) # downloads files in chunks of 50000 bytes
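
As a quick illustration, the short sketch below prints a few of these attributes; the URL is a hypothetical placeholder:

    import requests

    response = requests.get('https://example.com')
    print(response.status_code)              # e.g. 200 on success
    print(response.encoding)                 # e.g. 'utf-8'
    print(response.headers['Content-Type'])  # e.g. 'text/html; charset=utf-8'
    print(response.elapsed)                  # time taken to receive the response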

Part 1: Using requests to access HTML content:

  1. Before starting the tutorial, install Python 3 by downloading the installer from the official download page, if you have not already. You can change or customize the install location, which is usually C:\Python37. Add the Python directory to the Windows PATH environment variable.
  2. Open the command line and install the following Python libraries using pip: requests, urllib3, bs4 (BeautifulSoup) and html5lib
    • pip install requests
    • pip install urllib3
    • pip install bs4
    • pip install html5lib
  3. Access the HTML content of a specific web page using the Python requests library:
    • import requests
    • from bs4 import BeautifulSoup
    • url = "https://wenyenchi.com"
    • response = requests.get(url)
  4. Parse the HTML content using the Beautiful Soup (bs4) library. Beautiful Soup is a Python library used to extract data from HTML and XML files. In this example, we use Python's built-in html.parser instead of lxml. You can use the print command to create a visual representation of the parsed HTML content.
    • soup = BeautifulSoup(response.text, "html.parser")
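
Putting the steps together, the full script might look like the following minimal sketch (lines 6 to 8 add the list and the loop described in the notes below):

    import requests
    from bs4 import BeautifulSoup
    url = "https://wenyenchi.com"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    links = []  # empty list to hold the results
    for link in soup.find_all('a'):
        print(link.get('href'))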

From the above code:

  • Line 1: imports the requests library
  • Line 2: imports the BeautifulSoup library
  • Line 3: creates a variable containing the URL
  • Line 4: creates a response object getting the URL
  • Line 5: creates a soup object that parses the HTML response text
  • Line 6: creates an empty list where the results will be contained
  • Line 7: loops through the HTML contents to find the anchor tags
  • Line 8: outputs all the links found inside the HTML contents

Save the full script above as requests_html.py and run it through the Python interpreter by typing python requests_html.py to see the output (extracted data) on the terminal.

Output:
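
A second variant writes the links to a file instead of printing them; a minimal sketch follows, and the line numbers in the notes below refer to this sketch:

    import requests
    from bs4 import BeautifulSoup
    url = "https://wenyenchi.com"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    file_object = open('test_file.txt', 'w')
    for link in soup.find_all('a'):
        data = str(link.get('href'))
        file_object.write(data)
        file_object.write('\n')
    file_object.close()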

From the above code:

  • Line 1: imports the requests library
  • Line 2: imports the BeautifulSoup library
  • Line 3: creates a variable containing the URL
  • Line 4: creates a response object getting the URL
  • Line 5: creates a soup object that parses the HTML response text
  • Line 6: opens file_object as a text file for writing
  • Line 7: loops through the HTML contents to find the anchor tags
  • Line 8: creates a data object containing each HTML link
  • Line 9: writes each link found to the file
  • Line 10: writes a newline so each link appears on its own line
  • Line 11: closes the file to end the writing

  5. Now that we have a visual representation of the retrieved data, we can parse through the raw HTML content to get the information we are interested in, such as headlines or articles. In HTML, web information is arranged in hierarchies or containers. The containers are usually created with the div (division) HTML tag, headlines or titles usually have the "h" (heading) HTML tags, articles the "p" (paragraph) HTML tag, and URLs the anchor or link tag (<a href=). To understand the layout of the website page, open the HTML source code in the browser or retrieve it with the requests GET method.

  6. To extract URLs and save them for later use, grab the information under the anchor or link tags and save it as a text file, as done in the script above.

Save the full script above as requests_html2.py and run it through the Python interpreter to see the output (extracted data). The test_file.txt file will be saved in the same folder as the Python script.

Output:


Part 2: Using requests to download an image from a website
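
A minimal sketch of a simple (non-streamed) image download; the image URL is a hypothetical placeholder, and BeautifulSoup is imported only because the notes below list it:

    import requests
    from bs4 import BeautifulSoup
    image_url = 'https://example.com/image.jpg'
    response = requests.get(image_url)
    with open('image.jpg', 'wb') as image_object:
        image_object.write(response.content)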

From the above code:

  • Line 1: imports the requests library
  • Line 2: imports the BeautifulSoup library
  • Line 3: creates a variable containing the URL of the image
  • Line 4: creates a response object getting the URL
  • Line 5: opens image.jpg for writing in binary mode as image_object
  • Line 6: writes the downloaded content to the image file

Introduction to request methods

    import requests  # imports the requests library

    # stream=True ensures that the entire file is not read into memory at once
    # (replace the URL below with your own image URL)
    response = requests.get('https://example.com/path/to/image.jpg', stream=True)

    response.raise_for_status()  # raises an HTTPError if an error has occurred

    # 'wb' indicates that the file is opened for writing in binary mode (no changes made)
    with open('image.jpg', 'wb') as file_image:
        # chunk_size determines the number of bytes read into memory at once
        for chunk in response.iter_content(chunk_size=50000):
            print('Received a chunk')  # prints progress for each chunk of extracted content
            file_image.write(chunk)  # writes the image to the folder where the script is located

Save the full script as requests_image.py and run it through the python interpreter to see the output (extracted data). The downloaded image will be in the same folder as the python script.

Output:


Part 3: Using requests to download pdf files
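
A minimal sketch, assuming a hypothetical PDF URL; stream=True lets the file download in small chunks instead of all at once:

    import requests
    pdf_url = 'https://example.com/sample.pdf'
    response = requests.get(pdf_url, stream=True)
    with open('sample.pdf', 'wb') as pdf_object:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                pdf_object.write(chunk)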

From the above code:

  • Line 1: imports the requests library
  • Line 2: creates a variable containing the URL of the PDF
  • Line 3: creates a response object getting the URL, with streaming enabled to download in parts
  • Line 4: opens the output file in binary mode as pdf_object
  • Line 5: loops through the response and downloads the file in chunks of 1024 bytes
  • Line 6: checks that the chunk is not empty
  • Line 7: writes the chunk to the PDF file

The 'r.content' or 'response.content' attribute stores the file data in memory as a single bytes object, which is impractical for large files such as PDFs or video files. In such cases, we use the 'r.iter_content' or 'response.iter_content' method to download the file in chunks.

Save the full script as requests_pdf.py and run it through the python interpreter to see the output (extracted data). The downloaded pdf will be in the same folder as the python script.

Output:


Part 4: Using requests to download videos

Downloading videos follows the same pattern as downloading PDFs; see the sketch below.
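
A minimal sketch of the full script; the page URL is a hypothetical placeholder, the anchor hrefs are assumed to be absolute links ending in .mp4, and the code is laid out so the line numbers in the notes below line up:

    # sketch: download all .mp4 videos linked from a web page

    import requests
    from bs4 import BeautifulSoup

    url = 'https://example.com/videos/'

    def get_video_links():
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        links = soup.find_all('a')
        video_links = [link['href'] for link in links if link.get('href', '').endswith('.mp4')]
        return video_links

    def download_videos(video_links):
        for link in video_links:

            video_name = link.split('/')[-1]
            print('Downloading video:', video_name)
            response = requests.get(link, stream=True)
            with open(video_name, 'wb') as f:
                for chunk in response.iter_content(chunk_size=1024):
                    if chunk:
                        f.write(chunk)
            print(video_name)
            print('Video downloaded!')
        return

    if __name__ == '__main__':

        video_links = get_video_links()
        download_videos(video_links)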

From the above code:

  • Line 3: imports the requests library
  • Line 4: imports the BeautifulSoup library
  • Line 6: creates an object containing url link
  • Line 8: creates a function get_video_links
  • Line 9: creates a response object getting the url link
  • Line 10: creates a soup object that will output the html results as content
  • Line 11: loops through soup contents to find all anchor tags
  • Line 12: Loops through the soup contents to find the anchor tags that end with .mp4
  • Line 13: returns the video_links
  • Line 15: creates a second function, download_videos, with video_links as its argument
  • Line 16: loops through the video_links
  • Line 18: creates a video_name object from the last part of the URL
  • Line 19: prints 'Downloading video:' followed by the video_name
  • Line 20: Creates a response object that gets the link and downloads video in parts
  • Line 21: Opens the video_name as f (variable name)
  • Line 22: Loops through the file and downloads it in chunks of 1024 bytes
  • Line 23: Checks if there is a chunk
  • Line 24: Downloads the video
  • Line 25: prints the name of the downloaded video
  • Line 26: prints 'Video downloaded!'
  • Line 27: ends the downloading function
  • Line 29: checks if __name__ == '__main__' so that the functions below run only when the script is executed directly
  • Line 31: runs the get_video_links function to get the video links
  • Line 32: runs the download_videos function to download the videos

Output:


Part 5: Using Urllib3 to extract web data

urllib3 is a Python library used to fetch URLs, with support for features such as basic authentication, cookies and proxies.

Making a simple HTTP request using urllib3
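
A minimal sketch, assuming the hypothetical placeholder URL https://example.com; the line numbers in the notes below refer to this sketch:

    import urllib3

    http = urllib3.PoolManager()

    response = http.request('GET', 'https://example.com')
    print(response.data)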

From the above code:

  • Line 1: imports the urllib3 library
  • Line 3: create a PoolManager instance to send requests
  • Line 5: creates a response object getting the url link
  • Line 6: Prints the contents of response object

Output:


Making an HTTP request using any HTTP verb
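
A minimal sketch; httpbin.org is an assumed test endpoint, and the leading comment line keeps the line numbers aligned with the notes below:

    # urllib3: POST request example

    import urllib3

    http = urllib3.PoolManager()
    response = http.request('POST', 'https://httpbin.org/post', fields={'hello': 'world'})

    print(response.data)

    print(response.status)

    print(response.headers)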

From the above code:

  • Line 3: imports the urllib3 library
  • Line 5: create a PoolManager instance to send requests
  • Line 6: creates a response object posting the key-value pair {'hello': 'world'} to the URL
  • Line 8: prints the data in the response object
  • Line 10: prints the status of the response
  • Line 12: prints the header information in the response

Output:

Handling GET requests
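
A minimal sketch, assuming the httpbin.org test endpoint, which returns JSON; the line numbers in the notes below refer to this sketch:

    # urllib3: GET request returning JSON

    import json
    import urllib3

    http = urllib3.PoolManager()
    url = 'https://httpbin.org/get'
    response = http.request('GET', url)
    print(json.loads(response.data.decode('utf-8')))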

From the above code:

  • Line 3: imports the json library
  • Line 4: imports the urllib3 library
  • Line 6: create a PoolManager instance to send requests
  • Line 7: creates an instance containing the url link
  • Line 8: creates a response object getting the url link
  • Line 9: prints the response data decoded in json format

Output:


Making HEAD requests
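
A minimal sketch, assuming the httpbin.org test endpoint; a HEAD request returns only the headers, not the body:

    # urllib3: HEAD request example

    import urllib3

    http = urllib3.PoolManager()

    response = http.request('HEAD', 'https://httpbin.org/get')
    print(response.headers['Server'])
    print(response.headers['Date'])
    print(response.headers['Content-Type'])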

From the above code:

  • Line 3: imports the urllib3 library
  • Line 5: create a PoolManager instance to send requests
  • Line 7: creates a response object using the HEAD method, which returns only the response headers
  • Line 8: prints the server information
  • Line 9: prints the date
  • Line 10: prints the content type

Output:


Certificate verification
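
A minimal sketch using the certifi certificate bundle; the URL is a hypothetical placeholder:

    # urllib3: certificate verification with certifi

    import urllib3
    import certifi

    url = 'https://example.com'
    http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
    response = http.request('GET', url)
    print(response.status)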

From the above code:

  • Line 3: imports the urllib3 library
  • Line 4: imports the certifi library
  • Line 6: creates an instance containing the url link
  • Line 7: create a PoolManager instance to send requests
  • Line 8: creates a response object getting the url link
  • Line 9: prints the response status

Output:


Query Parameters
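
A minimal sketch; for GET requests, urllib3 encodes the fields dictionary into the query string. The endpoint and parameter names are assumptions:

    # urllib3: query parameters example

    import urllib3
    import certifi

    http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
    payload = {'arg1': 'value1', 'arg2': 'value2'}
    url = 'https://httpbin.org/get'
    response = http.request('GET', url, fields=payload)
    print(response.data.decode('utf-8'))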

From the above code:

  • Line 3: imports the urllib3 library
  • Line 4: imports the certifi library
  • Line 6: create a PoolManager instance to send requests
  • Line 7: creates an instance containing a key, value dictionary
  • Line 8: creates an object containing the url link
  • Line 9: creates a response object getting the url and the payload information
  • Line 10: prints decoded response data

Output:


Making POST requests
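
A minimal sketch; for POST requests, urllib3 sends the fields dictionary as form-encoded data. The endpoint is an assumed placeholder:

    # urllib3: POST request example

    import urllib3
    import certifi

    http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
    url = 'https://httpbin.org/post'
    response = http.request('POST', url, fields={'field': 'value'})
    print(response.data.decode('utf-8'))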

From the above code:

  • Line 3: imports the urllib3 library
  • Line 4: imports the certifi library
  • Line 6: create a PoolManager instance to send requests
  • Line 7: creates an object containing the url link
  • Line 8: creates a response object posting the url and the fields information
  • Line 9: prints decoded response data

Output:


Handling JSON content
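
A minimal sketch; the dictionary is JSON-encoded before sending, and the endpoint is an assumed placeholder. The line numbers in the notes below refer to this sketch:

    # urllib3: sending and receiving JSON

    import urllib3
    import certifi
    import json

    http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())

    data = {'attribute': 'value'}
    encoded_data = json.dumps(data).encode('utf-8')
    url = 'https://httpbin.org/post'

    response = http.request('POST', url, body=encoded_data, headers={'Content-Type': 'application/json'})

    decoded_data = json.loads(response.data.decode('utf-8'))
    print(decoded_data)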

From the above code:

  • Line 3: imports the urllib3 library
  • Line 4: imports the certifi library
  • Line 5: imports the json library
  • Line 7: create a PoolManager instance to send requests
  • Line 9: create an object containing a key, value pair dictionary
  • Line 10: creates a json encoded object with the key, value information
  • Line 11: creates an object containing the url link
  • Line 13: posts the JSON-encoded data to the URL with a JSON Content-Type header and stores the response
  • Line 15: creates an object with the decoded response data
  • Line 16: prints the decoded response data

Output:


Handling Binary Data
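
A minimal sketch; the favicon URL is a hypothetical placeholder, and response.data holds the raw bytes of the image:

    # urllib3: downloading binary data

    import urllib3

    http = urllib3.PoolManager()
    url = 'https://example.com/favicon.ico'
    response = http.request('GET', url)
    with open('favicon.ico', 'wb') as image_object:
        image_object.write(response.data)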

From the above code:

  • Line 3: imports the urllib3 library
  • Line 5: create a PoolManager instance to send requests
  • Line 6: creates an object containing the url link
  • Line 7: create a response object with url link
  • Line 8: opens favicon.ico for writing in binary mode as image_object
  • Line 9: writes the downloaded binary data to the favicon.ico image

Output:


Streaming Data
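
A minimal sketch; the video URL is a hypothetical placeholder. preload_content=False tells urllib3 not to read the whole body at once, so stream() can yield it in chunks. The code is laid out so the line numbers in the notes below line up:

    # urllib3: streaming a large download

    import urllib3
    import certifi

    # hypothetical video URL
    url = 'https://example.com/videos/sample.mp4'

    # take the file name from the last part of the URL
    downloaded_filename = url.split('/')[-1]

    http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())

    response = http.request('GET', url, preload_content=False)

    with open(downloaded_filename, 'wb') as f:
        for chunk in response.stream(1024):

            f.write(chunk)


    # return the connection to the pool
    response.release_conn()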

From the above code:

  • Line 3: imports the urllib3 library
  • Line 4: imports the certifi library
  • Line 7: creates an object containing the url link
  • Line 10: obtains the file name from the last part of the URL
  • Line 12: creates a PoolManager instance to send requests
  • Line 14: creates a response object getting the URL, with preload_content=False to enable streaming
  • Line 16: opens downloaded_filename for binary writing as f
  • Line 17: loops through the response stream in chunks
  • Line 19: writes each chunk to the file
  • Line 23: releases the connection back to the pool

Output:


Handling redirects
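
A minimal sketch; the httpbin.org redirect endpoint is an assumption, and geturl() shows the final URL after any redirects:

    # urllib3: handling redirects

    import urllib3
    import certifi

    http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())

    url = 'http://httpbin.org/redirect/1'
    response = http.request('GET', url)

    print(response.status)
    print(response.geturl())
    print(response.info())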

From the above code:

  • Line 3: imports the urllib3 library
  • Line 4: imports the certifi library
  • Line 6: create a PoolManager instance to send requests
  • Line 8: create an object containing the url
  • Line 9: create a response object getting the url link
  • Line 11: prints the response status
  • Line 12: prints the final URL returned by geturl() after any redirects
  • Line 13: prints response information

Output:


Conclusion

In this tutorial we have gone through the installation of Python on different operating systems (Windows, Linux and Mac OS). We have learnt how to use the Python requests library to make GET requests to access HTML content, extract URLs and save them to a file, and download images, PDFs and videos from a website. We have also learned how to make POST requests to send information to authorized websites. The tutorial has also covered how to use urllib3 to make simple HTTP requests, handle GET and POST requests, make HEAD requests, verify website certificates, pass query parameters, handle JSON and binary data, stream downloads and handle redirects. These use cases are important in machine learning applications such as natural language text processing, data extraction and analysis.
