Recommender systems recommend items to users based on many different factors. These systems predict the products a user is most likely to purchase or be interested in. Companies like Netflix and Amazon use recommender systems to help their users identify the right product or movie.
A recommender system deals with a large volume of information by filtering out the most important pieces based on the data a user provides and other factors that capture the user’s preferences and interests. It finds matches between users and items and imputes similarities between them to make recommendations.
Both users and the services providing them benefit from these kinds of systems, which also improve the quality of decision-making.
In this tutorial, we will look at collaborative filtering using K-nearest neighbours.
What is Collaborative filtering?
Collaborative filtering filters information by using the interactions and data collected by the system from other users. It’s based on the idea that people who agreed in their evaluation of certain items in the past are likely to agree again in the future.
The concept is simple: when we want to find a new movie to watch, we’ll often ask our friends for recommendations. Naturally, we have greater trust in the advice from friends who share tastes similar to our own.
Most collaborative filtering systems apply the so-called similarity index-based technique. In the neighbourhood-based approach, several users are selected based on their similarity to the active user. Inference for the active user is made by calculating a weighted average of the ratings of the selected users.
Collaborative-filtering systems focus on the relationship between users and items. The similarity of items is determined by the similarity of the ratings of those items by the users who have rated both items.
This is a step-by-step tutorial to help you develop a recommendation system.
Step 1: Import the necessary libraries.
- Line 1: Imports pandas library. Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with “relational” or “labelled” data easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.
- Line 2: Imports numpy library. NumPy is the fundamental package for scientific computing in Python. NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data. Typically, such operations are executed more efficiently and with less code than is possible using Python’s built-in sequences.
- Line 3: Imports the pyplot sub-library from matplotlib. matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., it creates a figure, creates a plotting area in a figure, plots lines in a plotting area, or decorates the plot with labels. It is therefore used for visualization.
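The three imports described above can be written as follows (the `pd`, `np` and `plt` aliases are the usual conventions):

```python
import pandas as pd              # Line 1: data structures for labelled data
import numpy as np               # Line 2: fast numerical arrays
import matplotlib.pyplot as plt  # Line 3: MATLAB-style plotting interface
```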
Step 2: Read the data
For this tutorial, there are three datasets that we need. They can be downloaded from: http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip
Once downloaded, extract the data into your project folder (the folder containing your Python script). This ensures easy access to the data and, most importantly, keeps all the necessary files in one place.
- Line 1: Imports ‘BX-Books.csv’ and stores it in a variable books
- Line 2: Imports ‘BX-Users.csv’ and stores it in a variable users
- Line 3: Imports ‘BX-Book-Ratings.csv’ and stores it in a variable ratings
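A sketch of those three reads, assuming the BX dump’s ‘;’ separator and Latin-1 encoding (adjust these if your extracted files differ). A two-row stand-in is read here so the snippet runs without the download; the commented lines show the equivalent calls on the real files:

```python
import pandas as pd
from io import StringIO

# Tiny stand-in for 'BX-Book-Ratings.csv' so the snippet is self-contained.
sample = "UserID;ISBN;bookRating\n276725;034545104X;0\n276726;0155061224;5\n"
ratings = pd.read_csv(StringIO(sample), sep=';', encoding='latin-1')

# With the downloaded files, the equivalent calls would be:
# books   = pd.read_csv('BX-Books.csv', sep=';', encoding='latin-1', on_bad_lines='skip')   # Line 1
# users   = pd.read_csv('BX-Users.csv', sep=';', encoding='latin-1', on_bad_lines='skip')   # Line 2
# ratings = pd.read_csv('BX-Book-Ratings.csv', sep=';', encoding='latin-1', on_bad_lines='skip')  # Line 3
print(ratings.shape)
```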
Step 3: Exploratory Data Analysis.
We can use the code below to look at the dimensions and the column names of ratings.
- Line 1: Shows ratings has 1,149,780 rows and 3 columns
- Line 2: Prints out the column name of the ratings dataframe
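The real ratings frame is too large to reproduce here, but a toy frame with the same column names shows the two lines in action (the output line below is from the full dataset):

```python
import pandas as pd

# Toy stand-in; the real frame has 1,149,780 rows.
ratings = pd.DataFrame({'UserID': [1, 2], 'ISBN': ['X1', 'X2'], 'bookRating': [0, 5]})
print(ratings.shape)          # Line 1: (rows, columns)
print(list(ratings.columns))  # Line 2: column names
```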
(1149780, 3) ['UserID', 'ISBN', 'bookRating']
The code below shows the count of ratings given. We can see that most books have not been rated. However, since we have 1,149,780 rows, our data is still relevant.
- Line 1 and 2: Plot ratings with their frequency
- Line 3 – 5: Names the plot’s title, x-axis and y-axis.
- Line 6: Saves the figure in the same directory.
- Line 7: Displays the plot.
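A minimal sketch of that plotting code, using a toy ratings column and an assumed output filename:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

ratings = pd.DataFrame({'bookRating': [0, 0, 0, 5, 8, 8, 10]})       # toy data
counts = ratings['bookRating'].value_counts(sort=False)              # Lines 1-2
counts.plot(kind='bar')
plt.title('Rating Distribution')                                     # Lines 3-5
plt.xlabel('Rating')
plt.ylabel('Count')
plt.savefig('rating_distribution.png', bbox_inches='tight')  # Line 6 (filename assumed)
plt.show()                                                   # Line 7
```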
The code below prints the dimensions and column names of the books and users data frame.
- Line 1: Shows books has 271,360 rows and eight columns
- Line 2: Prints out the column name of the book data frame
- Line 3: Shows users has 278,858 rows and three columns
- Line 4: Prints out the column name of the users data frame
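Using empty toy frames that carry the column names reported in the output below, the four lines look like:

```python
import pandas as pd

# Empty stand-ins with the dump's column names; the real frames have
# 271,360 and 278,858 rows respectively.
books = pd.DataFrame(columns=['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication',
                              'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL'])
users = pd.DataFrame(columns=['UserID', 'Location', 'Age'])
print(books.shape, list(books.columns))  # Lines 1-2
print(users.shape, list(users.columns))  # Lines 3-4
```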
(271360, 8) ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL'] (278858, 3) ['UserID', 'Location', 'Age']
The code below shows the age distribution of our data, i.e., of the people who have rated the books. We can see that most people are between 20 and 40 years old.
- Line 1: Creates a histogram with 10 bins from the age column in the users data frame.
- Line 2: Titles the plot.
- Line 3: Names the X-axis
- Line 4: Names the Y-axis
- Line 5: Saves the image as a png in the same directory as the python file
- Line 6: Produces the plot.
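A sketch of the histogram code, with invented toy ages and an assumed output filename:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend
import matplotlib.pyplot as plt
import pandas as pd

users = pd.DataFrame({'Age': [18, 22, 25, 31, 35, 40, 55, 62]})  # toy ages
users['Age'].hist(bins=10)             # Line 1: histogram with 10 bins
plt.title('Age Distribution')          # Line 2
plt.xlabel('Age')                      # Line 3
plt.ylabel('Count')                    # Line 4
plt.savefig('age_distribution.png', bbox_inches='tight')  # Line 5 (filename assumed)
plt.show()                             # Line 6
```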
Step 4: Statistical Analysis.
To ensure statistical significance, users with fewer than 200 ratings and books with fewer than 100 ratings are excluded. The code below shows how to do that.
- Line 1: Takes the frequency of each user who has rated the books.
- Line 2: Takes the users with ratings greater than 200 and uses the ‘user id’ to filter the relevant entries from the rating data frame.
- Line 3: Takes the frequency of each book from the rating data frame.
- Line 4: Takes the books with rating counts greater than 100 and uses ‘bookRating’ to filter the relevant entries from the rating data frame.
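The four lines can be sketched on a toy frame; the thresholds are lowered to 2 (instead of 200 and 100) so the tiny example actually filters something:

```python
import pandas as pd

# Toy ratings frame: user 2 has only one rating and will be dropped.
ratings = pd.DataFrame({'UserID': [1, 1, 1, 2, 3, 3],
                        'ISBN':   ['A', 'B', 'C', 'A', 'A', 'B'],
                        'bookRating': [5, 5, 0, 8, 5, 5]})
counts1 = ratings['UserID'].value_counts()                                # Line 1
ratings = ratings[ratings['UserID'].isin(counts1[counts1 >= 2].index)]    # Line 2
counts = ratings['bookRating'].value_counts()                             # Line 3
ratings = ratings[ratings['bookRating'].isin(counts[counts >= 2].index)]  # Line 4
```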
Collaborative filtering using K-Nearest Neighbour Method.
Before we continue further, it is important to first start by understanding what the K-nearest neighbour algorithm is and how it works.
KNN is a machine-learning algorithm that finds clusters of similar users based on common book ratings and makes predictions using the average rating of the top-k nearest neighbours. For example, we first represent the ratings in a matrix, with one row for each item (book) and one column for each user.
K-Means clustering analogy.
To best understand K-Means, we can use a laundry analogy. You have just moved out of your parents’ house into a hostel. The weekend quickly comes around, and you notice you have a bunch of clothes you need to wash. Since you were used to having this done for you, you call your mother for advice on how to do it.
Your mother shares tips she uses to simplify her work. She does so by splitting her clothes into three groups: tops, trousers and socks. By doing this, she has helped you cluster your clothes.
From the above analogy, we can see how the K-Means clustering algorithm works in the following steps:
- Cluster the data (clothes) into k groups where the value of k is predefined. In the analogy, k is three (tops, trousers and socks.)
- Select k points at random, which may or may not be from the dataset; these are known as the cluster centres or centroids. In the analogy, the three points are a top, a trouser and a sock.
- Assign the data points to their closest centroids based on any distance function. This step is called the Expectation step. According to the analogy, assign the rest of the clothes as close as possible to the three clusters.
- In a real dataset, once all points are clustered, each cluster’s mean is calculated and taken for the new centroid. Then all points are reassigned to the closest centroid. This is the Maximisation step.
- Repeat the Expectation-Maximisation steps until either no data points move from their previous clusters or the centroid does not change from the previous iteration.
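The Expectation-Maximisation loop above is exactly what scikit-learn’s KMeans runs internally; a toy sketch (the six 2-D points and k=3 are invented purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points forming three obvious groups, mirroring the k=3 analogy.
items = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]])
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(items)
print(km.labels_)  # points close together share a cluster label
```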
Understanding Cosine Similarity and Cosine Distance.
Cosine similarity is important to this topic because it is widely used in recommendation systems.
Assuming we have two points P1 and P2, similarity decreases as the distance between them increases. Conversely, as the distance decreases, similarity increases. The above reasoning can be explained using the figures below.
The figure above shows two points P1 and P2 on a cartesian plane. The angle between the points in the image on the left is 45 degrees, while on the right it is 90 degrees.
Since cos(45°) ≈ 0.71, the image on the left represents roughly 71% similarity, while on the right (cos(90°) = 0) there is 0% similarity.
Calculating the distance using the formula: (COS-Distance) = 1 – (COS-SIM), we can see that the image on the left has a smaller distance than the image on the right.
The left plot in the figure above further shows that if the two points P1 and P2 lie in the same direction, the angle between them is 0 degrees. The cosine of 0 is 1 (implying 100% similarity), and the distance between them is 0.
The plot on the right shows the cosine of the angles 0 to 360. The plot implies that the cosine similarity can only lie in the range [-1, 1]
The figure above shows that the similarity between Minions and Avengers is 0%, since the cosine of the 90-degree angle between them is 0. It follows that the distance between Minions and Avengers is 1.
On the other hand, the similarity between Iron Man and Avengers is 100%, since the cosine of the 0-degree angle between them is 1. It follows that the distance between Iron Man and Avengers is 0.
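The similarity and distance values above follow directly from the definitions; a small sketch with invented vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(angle) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# 90 degrees apart: similarity 0, so distance 1 - 0 = 1.
orthogonal = cosine_similarity(np.array([1, 0]), np.array([0, 1]))
# 45 degrees apart: similarity cos(45°) ≈ 0.71.
diagonal = cosine_similarity(np.array([1, 0]), np.array([1, 1]))
# Same direction: similarity 1, so distance 1 - 1 = 0.
sim = cosine_similarity(np.array([2, 2]), np.array([1, 1]))
print(orthogonal, diagonal, 1 - sim)
```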
Step 5: Collaborative Filtering Using k-Nearest Neighbors (kNN)
The code below merges the ratings and books data frames then drops the unnecessary columns. The only columns we need are UserID, ISBN, book rating and book title.
- Line 1: Merges the ratings and books data frames on the ‘ISBN’ column and stores the result under the variable name ‘combine_book_rating’.
- Line 2: Stores the irrelevant columns in a ‘columns’ variable.
- Line 3: Drops the columns ‘yearOfPublication’, ‘publisher’, ‘bookAuthor’, ‘imageUrlS’, ‘imageUrlM’ and ‘imageUrlL’ from the combine_book_rating data frame.
- Line 4: Displays the data frame combine_book_rating.
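On toy frames carrying the dump’s column names, the four lines can be sketched as:

```python
import pandas as pd

# Toy stand-ins for the real ratings and books frames.
ratings = pd.DataFrame({'UserID': [1, 2], 'ISBN': ['A', 'B'], 'bookRating': [5, 8]})
books = pd.DataFrame({'ISBN': ['A', 'B'], 'bookTitle': ['T1', 'T2'],
                      'bookAuthor': ['x', 'y'], 'yearOfPublication': [2000, 2001],
                      'publisher': ['p', 'q'], 'imageUrlS': ['', ''],
                      'imageUrlM': ['', ''], 'imageUrlL': ['', '']})
combine_book_rating = pd.merge(ratings, books, on='ISBN')          # Line 1
columns = ['yearOfPublication', 'publisher', 'bookAuthor',
           'imageUrlS', 'imageUrlM', 'imageUrlL']                  # Line 2
combine_book_rating = combine_book_rating.drop(columns, axis=1)    # Line 3
print(combine_book_rating.head())                                  # Line 4
```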
The code below groups by book title and creates a new column for the total rating count.
- Line 1: Drops any missing values from the combine_book_rating data frame.
- Line 3 – 9: Groups the data, counts the number of times each book has been rated, and stores the result in the variable ‘book_ratingCount’.
- Line 10: Displays the first five rows of the book_ratingCount data frame.
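On a toy frame, the groupby-and-count step might look like this:

```python
import pandas as pd

# Toy stand-in: 'T1' has two ratings, 'T2' has one.
combine_book_rating = pd.DataFrame({'UserID': [1, 2, 3],
                                    'bookTitle': ['T1', 'T1', 'T2'],
                                    'bookRating': [5, 8, 9]})
combine_book_rating = combine_book_rating.dropna(axis=0, subset=['bookTitle'])  # Line 1
book_ratingCount = (combine_book_rating
                    .groupby(by=['bookTitle'])['bookRating']   # Lines 3-9:
                    .count()                                   # ratings per title
                    .reset_index()
                    .rename(columns={'bookRating': 'totalRatingCount'}))
print(book_ratingCount.head())                                 # Line 10
```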
Using the code below, we combine the rating data with the total rating count data. This gives us exactly what we need to find out which books are popular and to filter out lesser-known books.
- Line 1: Merges combine_book_rating with the book_ratingCount data frame on the ‘bookTitle’ column. The new data frame is stored under the variable name ‘rating_with_totalRatingCount’.
- Line 2: Displays the rating_with_totalRatingCount data frame.
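On toy frames, the merge might be sketched as:

```python
import pandas as pd

combine_book_rating = pd.DataFrame({'UserID': [1, 2, 3],
                                    'bookTitle': ['T1', 'T1', 'T2'],
                                    'bookRating': [5, 8, 9]})
book_ratingCount = pd.DataFrame({'bookTitle': ['T1', 'T2'],
                                 'totalRatingCount': [2, 1]})
rating_with_totalRatingCount = combine_book_rating.merge(
    book_ratingCount, left_on='bookTitle', right_on='bookTitle', how='left')  # Line 1
print(rating_with_totalRatingCount.head())                                    # Line 2
```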
The code below takes the books rated over 50 times.
- Line 1: Assigns a threshold of 50 to the variable popularity_threshold.
- Line 2: Filters books rated over the threshold number (50) of times and stores the new data frame under the variable name ‘rating_popular_book’.
- Line 3: Displays the data frame rating_popular_book.
The code below finally shows the dimension of the rating_popular_book data frame using the in-built method shape.
From the above image, we can see that there are 62,149 entries; creating a model on all of these is computationally expensive. Therefore, for the purpose of learning, we will take a subset by filtering for rows from the USA and Canada. To get the locations (i.e., USA and Canada), we need to merge the rating_popular_book and users data frames.
- Line 1: Merges the rating_popular_book and users data frames on UserID.
- Line 3: Filters rows on a query that checks for USA or Canada in the Location column.
- Line 4: Drops the ‘Age’ column, which is unnecessary.
- Line 5: Displays the us_canada_user_rating data frame.
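On toy frames (the Location strings are invented, but lowercase city-region-country text matches the dump’s style), the filter might look like:

```python
import pandas as pd

rating_popular_book = pd.DataFrame({'UserID': [1, 2], 'bookTitle': ['T1', 'T2'],
                                    'bookRating': [5, 9]})
users = pd.DataFrame({'UserID': [1, 2],
                      'Location': ['toronto, ontario, canada', 'paris, france'],
                      'Age': [30, 40]})
combined = rating_popular_book.merge(users, left_on='UserID',
                                     right_on='UserID', how='left')  # Line 1
us_canada_user_rating = combined[
    combined['Location'].str.contains('usa|canada')]                 # Line 3
us_canada_user_rating = us_canada_user_rating.drop('Age', axis=1)    # Line 4
print(us_canada_user_rating.head())                                  # Line 5
```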
Step 6: Implementation of KNN using Cosine similarity.
We convert our table to a 2D matrix and fill the missing values with zeros (since we will calculate distances between rating vectors). We then transform the values (ratings) of the matrix data frame into a SciPy sparse matrix for more efficient calculations.
To find the nearest neighbours, we use unsupervised algorithms from sklearn.neighbors. The algorithm we use to compute the nearest neighbours is “brute”, and we specify metric=“cosine” so that the algorithm calculates the cosine similarity between rating vectors. Finally, we fit the model.
- Line 1: Imports csr_matrix from scipy.sparse, used to convert a data frame to a sparse matrix for more efficient calculations.
- Line 2: Drops duplicates from the us_canada_user_rating data frame.
- Line 3: Pivots the data frame with UserID as the columns, bookTitle as the index and bookRating as the values. The figure below shows what the pivoted data frame looks like.
- Line 4: Converts the pivot table to a matrix.
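A toy sketch of the pivot-and-convert step:

```python
import pandas as pd
from scipy.sparse import csr_matrix                                      # Line 1

# Toy stand-in: user 2 has not rated 'T2', so that cell becomes 0.
us_canada_user_rating = pd.DataFrame({'UserID': [1, 1, 2],
                                      'bookTitle': ['T1', 'T2', 'T1'],
                                      'bookRating': [5, 8, 9]})
us_canada_user_rating = us_canada_user_rating.drop_duplicates(
    ['UserID', 'bookTitle'])                                             # Line 2
us_canada_user_rating_pivot = us_canada_user_rating.pivot(
    index='bookTitle', columns='UserID', values='bookRating').fillna(0)  # Line 3
us_canada_user_rating_matrix = csr_matrix(us_canada_user_rating_pivot.values)  # Line 4
print(us_canada_user_rating_pivot)
```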
Luckily, having understood the whole logic of KNN, we do not have to hand-code the math: Python implements it for us. The code below shows the use of the nearest neighbours library from sklearn.neighbors.
- Line 1: Imports the nearest neighbours library from sklearn.neighbors.
- Line 3: Creates a model based on the cosine metric and stores it in the variable model_knn.
- Line 4: Uses the model to fit the “us_canada_user_rating_matrix” variable.
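Those lines can be sketched on a small invented matrix; fitting prints the model repr shown below:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors                     # Line 1

us_canada_user_rating_matrix = np.array([[5, 9], [8, 0], [5, 8]])  # toy matrix
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')   # Line 3
model_knn.fit(us_canada_user_rating_matrix)                        # Line 4
```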
NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine', metric_params=None, n_jobs=None, n_neighbors=5, p=2, radius=1.0)
The code below simply picks a book at random, by index, from our pivoted data frame.
The output below shows the book title that has been randomly selected by the code above.
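A toy version of the random pick (the three invented titles stand in for the real pivoted frame):

```python
import numpy as np
import pandas as pd

# Toy pivoted frame standing in for us_canada_user_rating_pivot.
us_canada_user_rating_pivot = pd.DataFrame(
    [[5, 9], [8, 0], [5, 8]], index=['T1', 'T2', 'T3'], columns=[1, 2])
query_index = np.random.choice(us_canada_user_rating_pivot.shape[0])  # random row
print(us_canada_user_rating_pivot.index[query_index])                 # selected title
```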
'Pop Goes the Weasel'
The code below gets the books most similar to our randomly selected book: it queries the six nearest neighbours (the book itself plus its five closest matches).
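A runnable sketch of the neighbour query on a toy pivoted frame (the print format mirrors the tutorial’s loop; on the real data this produces the output shown below):

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy pivoted frame; 'T3' rates most similarly to 'T1'.
us_canada_user_rating_pivot = pd.DataFrame(
    [[5, 9, 0], [8, 0, 1], [5, 8, 0], [0, 1, 7]],
    index=['T1', 'T2', 'T3', 'T4'], columns=[1, 2, 3])
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')
model_knn.fit(us_canada_user_rating_pivot.values)

query_index = 0
distances, indices = model_knn.kneighbors(
    us_canada_user_rating_pivot.iloc[query_index, :].values.reshape(1, -1),
    n_neighbors=4)  # the toy frame only has four books

for i in range(len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:'.format(
            us_canada_user_rating_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(
            i, us_canada_user_rating_pivot.index[indices.flatten()[i]],
            distances.flatten()[i]))
```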
Recommendations for Pop Goes the Weasel:
1: Roses Are Red (Alex Cross Novels), with distance of 0.7104006053761363:
2: 1st to Die: A Novel, with distance of 0.7266440211508494:
3: Vittorio the Vampire: New Tales of the Vampires, with distance of 0.7386222688936215:
4: The Woman Next Door, with distance of 0.7529436075774176:
5: Cat & Mouse (Alex Cross Novels), with distance of 0.7590802130238584: