Data Mining, Web Scraping and Predictive Analysis with R

An R analysis of YouTube tech channels, with view-count projections using machine learning.

Introduction

In this day and age of information and technology, content production has become rampant. Online channels are becoming more than ever an outlet for both learning and marketing. YouTube, as an online video website, has become an avenue of work, marketing, education, and entertainment, from part-time to full-time, for fun or purely business.

More video content is now watched online than on television, thanks to the convenience of having a device in hand. As a result, if individuals and companies want to develop and expand their reach, they should include YouTube as a marketing tool in their plans. Every day, users view over a billion hours of YouTube videos, and hundreds of hours of video material are uploaded to YouTube servers every minute. These are the actual and potential clients to whom a person, a new company, and both small and large-scale corporations are attempting to sell ideas. With YouTube offering many ways to watch videos and content, including blogs, mobile apps, and the ability to embed them on other websites, it is now simple to reach these potential clients.

It is important to remember, however, that while media companies also publish videos, the majority of content is created by individuals. Corporations now reach out to such people, who then insert advertisements for the company's products and ideas in their posts. Potential clients are reached based on the number of subscribers and viewers on the individual's channel. In addition to viewing and uploading, registered users can comment on videos, rate them, build playlists, and subscribe to other users. This versatility lets corporations gather input on their products, giving them a broader range of options for diversification. Through feedback, corporations learn how effective they have been in meeting the needs of their customers, giving them room for product development in order to reach all potential customers. It is also critical that individuals produce quality content and stay current on trends in order to gain more viewers. This also increases the likelihood of being approached by corporations to market their products and ideas. Other advantages of marketing with YouTube are as follows:

  • Marketing on YouTube will assist in increasing market penetration by allowing you to be found on Google
  • Using YouTube for business will help you re-purpose content you’ve already made without having to spend a lot of time or money
  • Your target audience will promote and buy from you
  • You will get laser-focused access to your audience with Google AdWords for Video through advertising on content that your audience is more likely to view and search for
  • Creating daily video content allows you to earn money directly from your videos through Google’s AdSense for Video service

What you will learn

In this article, we will examine three technology channels. The owners of these channels have done an excellent job of diversifying their content. As a result, a single post can garner over 30 million views. Who wouldn’t want these guys as their marketers with this kind of credibility? We will also build an R model to estimate the number of views (or the difference in views) of the videos.

Prerequisites

  1. R and RStudio (or the Radian console) installed
  2. Intermediate knowledge of R functions
  3. Basic knowledge of Google Cloud and API authentication

Let’s get started.

Setting the Environment

To set up the environment, one first needs to enable the YouTube APIs in the Google Cloud Platform, which supports the OAuth 2.0 protocol for authorizing access to private user data. Thereafter, obtain credentials (a client ID and a client secret) that will enable one to access the data for YouTube channels, videos and playlists for statistical analysis. To learn how to create and obtain the credentials, click here.

Now that you have the credentials, install and load the relevant R packages. Also save the OAuth 2.0 credentials into two objects (client_id and client_secret):
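A minimal sketch of this step, assuming the tuber package and placeholder credential strings (replace them with your own client ID and secret):

    # install the packages used throughout this article (run once)
    install.packages(c("tuber", "dplyr", "ggplot2", "lubridate", "naniar"))

    library(tuber)

    # OAuth 2.0 credentials from the Google Cloud console (placeholders)
    client_id     <- "xxxxxxxxxxxx.apps.googleusercontent.com"
    client_secret <- "xxxxxxxxxxxxxxxxxxxxxxxx"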

Now you can run tuber’s yt_oauth() function to authenticate your application. This should open your browser and ask you to sign into the Google account you set everything up with:
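The call is simply yt_oauth() with the two objects created above:

    # opens the browser for the Google sign-in and consent screen
    yt_oauth(app_id = client_id, app_secret = client_secret)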

After signing in, you’ll be asked whether the YouTube application you created can access your Google account. If you approve, click “Allow.” Google will then send you an authentication code confirming you are an authorised user; paste it after the “:” in the RStudio console, in the xxxx position. The authentication process only happens once. Now we are ready to scrape and mine the YouTube data we need for analysis.

Data

As stated earlier, we will extract data from the Marques Brownlee, Dave2D and Oliur / UltraLinx channels. The three are vloggers majoring in tech and design. For this article, we will go through the extraction of the Marques Brownlee channel data and the derivation of the relevant variables we will need for analysis. The same process is repeated for the other channels before merging the data.

The R tuber package is responsible for accessing YouTube from R. The package comes loaded with functions that give us the ability to access channel, playlist and single-video data. While there are alternatives such as youtubeAnalyticsR, SocialMediaLab and purrr, I have found tuber the easiest to work with.

Extraction of Data

From the Marques Brownlee channel, copy the URL. The tuber functions require the YouTube channel ID to access the data. To get the channel ID of Marques Brownlee, paste the URL in the Comment Picker and this will return the Channel ID.

This returns the Channel ID:  UCBJycsmduvYEL83R_U4JriQ

Channel Statistics

The channel statistics can be retrieved with the get_channel_stats() function. The channel ID is passed to the channel_id parameter. The statistics are then stored in a variable marquesbrownlee_channel_stats as shown below:
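A sketch of the call, using the channel ID obtained above:

    # channel-level statistics for the Marques Brownlee channel
    marquesbrownlee_channel_stats <- get_channel_stats(channel_id = "UCBJycsmduvYEL83R_U4JriQ")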

Output:

Marques Brownlee channel statistics

Stored in the marquesbrownlee_channel_stats variable is a list of lists. The four lists in the list are:

  • snippet
    Contains the title, description, customUrl, and publication date of the channel.
  • thumbnails
    Has the URL and picture that are displayed when you search for the channel on Google.
  • localised
    Also contains the title, description, and the country of publication.
  • statistics
    The total number of views, subscribers, hidden subscribers and videos are stored in this list.
    The content of this list is what is displayed in the Marques Brownlee channel statistics image.

To call an element, for example the count of views, from the statistics list, i.e. the first element in the fifth list:
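For example (the exact list positions can vary slightly between tuber versions, so the named form is safer):

    # count of views: first element of the statistics list
    marquesbrownlee_channel_stats$statistics$viewCount

    # equivalent positional indexing (first element of the fifth list)
    marquesbrownlee_channel_stats[[5]][[1]]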

Some of the data in the other lists is what we will need in the analysis. Later we will store it in a formatted table suitable for analysis.

Channel Videos

To get the video content from the channel:
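A sketch of the two calls described below (the empty search term and the RFC 3339 date format are assumptions):

    # videos published after 1st January 2016
    videos <- yt_search(term = "",
                        channel_id = "UCBJycsmduvYEL83R_U4JriQ",
                        published_after = "2016-01-01T00:00:00Z")

    # statistics for every video on the channel
    marquesbrownlee_video_stats <-
      tuber::get_all_channel_video_stats(channel_id = "UCBJycsmduvYEL83R_U4JriQ")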

The first function filters the videos published after 1st January 2016 and stores them in the videos variable. Since we are interested in analysing the whole channel’s content, we use the tuber::get_all_channel_video_stats() function, which fetches all the channel’s videos and stores them in the marquesbrownlee_video_stats variable.

As of today (02 May 2021), the resultant data frame has 1,348 videos with the following entries:

  • id
    This is the video id. It can be extracted from the video URL; it is always the code after the “=” in the URL (see the sketch after this list).
  • title
  • publication_date
  • description
  • channel_title
  • viewCount
  • likeCount
  • dislikeCount
  • favouriteCount
  • commentCount
  • url
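As referenced in the id entry above, a hypothetical one-liner for pulling the id out of a video URL (the example URL is made up):

    video_url <- "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
    video_id  <- sub(".*=", "", video_url)   # everything after the "="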

To get the number of views, likes, dislikes, favorite and comments:
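Since get_all_channel_video_stats() already returns these counts as columns, one way to pull them out is a simple select (column spellings such as favoriteCount may differ slightly from the list above):

    library(dplyr)

    engagement <- marquesbrownlee_video_stats %>%
      select(id, viewCount, likeCount, dislikeCount, favoriteCount, commentCount)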

Other columns

For the analysis, we would also like the channel statistics to be included in the data. These are:

  • subscriberCount
  • channelVideoCount
  • channelViewCount
  • channelcountry

We also extract the year when the video was published from publication_date. Another date column we would like to have is the year when the channel was created. The two will help us determine whether time is a factor affecting views, as you will see later. All this is done by:
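A sketch of these derived columns; the names video_year and channel_year are assumptions, and the channel-level values come from the channel statistics stored earlier:

    library(dplyr)
    library(lubridate)

    marquesbrownlee_video_stats <- marquesbrownlee_video_stats %>%
      mutate(
        # channel-level statistics repeated on every row
        subscriberCount   = marquesbrownlee_channel_stats$statistics$subscriberCount,
        channelVideoCount = marquesbrownlee_channel_stats$statistics$videoCount,
        channelViewCount  = marquesbrownlee_channel_stats$statistics$viewCount,
        channelcountry    = marquesbrownlee_channel_stats$snippet$country,
        # year the video was published and year the channel was created
        video_year   = year(as.Date(publication_date)),
        channel_year = year(as.Date(marquesbrownlee_channel_stats$snippet$publishedAt))
      )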

We would also like to know the difference in likes, dislikes, comments, publication date and title length between consecutive videos. For that, we first order the data frame by publication_date, then create Prev variables with the lead() function, whose values are the counts of the previously published video. In the analysis, for example, the difference in likes is likeCount - PrevLikeCount. This is done by:
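A sketch of this step: with the frame sorted newest-first, lead() pulls the counts of the previously published video (the Prev column names are assumptions):

    library(dplyr)

    marquesbrownlee_video_stats <- marquesbrownlee_video_stats %>%
      arrange(desc(publication_date)) %>%
      mutate(
        PrevPublishedAt  = lead(publication_date),
        PrevLikeCount    = lead(likeCount),
        PrevDislikeCount = lead(dislikeCount),
        PrevCommentCount = lead(commentCount),
        PrevTitleLength  = lead(nchar(title))
      )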

With the above process, we have the compiled data of the Marques Brownlee channel. We would also need to gather the data of the Dave2D and the Oliur / UltraLinx channel in the same way, then merge the three in the variable tech_merged_channels to have a complete file for analysis.

Now let us explore our data.

Cleaning the data & Feature Engineering

Data Structure

Let us first check the general structure of the data. Using the str() function, the structure of the data is returned with the name of columns, the type of data contained in the columns and a sample of the data in the columns:
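For the merged data:

    # column names, data types and a sample of values
    str(tech_merged_channels)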

Output Snippet:

structure snippet

As viewed above, there are variables whose datatype is not correctly formatted for our analysis. For example, you will notice most of the numerical data are categorised as characters, “chr”. This requires transforming them into the numeric data type. To do so, we first record the relevant variables by their column position, store the positions in a vector, then call them from the data frame and transform them to numeric with the as.numeric() function. The publication_date should also be transformed into date-type data with the as.Date() function, and channel_title into a categorical/factor variable with the as.factor() function:
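A sketch of these conversions; the positions in num_cols are placeholders (use the positions str() reports for your own data), and the extra publication_datetime column is an assumption kept so the time of day is not lost:

    library(lubridate)

    # positions of the count columns that came through as character (placeholders)
    num_cols <- c(6, 7, 8, 9, 10)
    tech_merged_channels[num_cols] <- lapply(tech_merged_channels[num_cols], as.numeric)

    # keep the full timestamp before reducing publication_date to a Date
    tech_merged_channels$publication_datetime <- ymd_hms(tech_merged_channels$publication_date)
    tech_merged_channels$publication_date     <- as.Date(tech_merged_channels$publication_date)

    # channel as a categorical variable
    tech_merged_channels$channel_title <- as.factor(tech_merged_channels$channel_title)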

Missing Data

To find in summary whether any of the variables has a missing entry:
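Base R’s summary() is one way to get this overview, since it reports an NA count for every column that has missing values:

    summary(tech_merged_channels)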

This also returns a count summary for character data types and brief descriptive statistics of all numerical data: measures of central tendency (mean and median) and measures of dispersion (max, min, and the interquartile range). While this gives us a comprehensive view in a nutshell, our interest is the last output for each variable, which is the count of missing values, as shown in the snippet output below:

output snippet

The gg_miss_var() function helps by representing the missing-value count for each variable in decreasing order, as follows:
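gg_miss_var() comes from the naniar package:

    library(naniar)

    gg_miss_var(tech_merged_channels)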

Missing values count

The variables for likes, dislikes and comments, as well as their Prev columns, have missing values. The high number of missing values in the Prev columns can only be attributed to missing values in the respective columns from which they were created. Let us examine the patterns in the missing values:
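One way to draw the upset plot is naniar’s gg_miss_upset(), which uses the UpSetR package under the hood:

    library(naniar)

    gg_miss_upset(tech_merged_channels)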

Output:

missing values trends

The above is an upset plot from the UpSetR package, which visualizes the patterns of missingness, or rather the combinations of missingness across cases. The story in the plot is that:

  • PrevCommentCount, CommentCount, id, PrevLikeCount, PrevDislikeCount have the highest recorded missing values.
  • PrevCommentCount has the highest number of missing values.
  • There are seven instances in which the PrevDislikeCount, PrevLikeCount and the PrevCommentCount variables have missing values together.

Missing statistics
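A sketch of the summary referenced below, using complete.cases() and naniar’s prop_miss():

    library(naniar)

    # rows that contain at least one missing value
    rows_with_na <- tech_merged_channels[!complete.cases(tech_merged_channels), ]
    nrow(rows_with_na)                              # rows with missing entries, e.g. 77
    sum(colSums(is.na(tech_merged_channels)) > 0)   # columns that contain NAs, e.g. 9

    # overall share of missing cells (roughly 0.003, i.e. 0.3%)
    prop_miss(tech_merged_channels)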

The code above returns 77 rows across 9 columns, reflecting the YouTube video stats entries with missing values. In total, 0.3 percent of the data is missing. Grouping the missing data by year and by channel:
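A sketch of the grouped summaries, assuming the video_year column created earlier:

    library(dplyr)
    library(naniar)

    # missing values grouped by publication year
    tech_merged_channels %>%
      group_by(video_year) %>%
      miss_var_summary()

    # missing values grouped by channel
    tech_merged_channels %>%
      group_by(channel_title) %>%
      miss_var_summary()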

By Year:

By Channel:

You can now see a correlation between the size of the channel (as measured by video count) and missing values: the Marques channel has the most videos in the data, and it also has the most missing data. Dave2D has the fewest missing values. For now, we can assume that Dave2D’s channel is the most recent of the three, having launched in 2015, seven years after Marques’ channel, and may have joined YouTube after it had more mature mechanisms for collecting data with fewer errors. This will be shown later.

Having established that only 0.3% of the data is missing, deleting those rows has very little effect, considering most are from one channel. To obtain a complete data set, keep only the complete rows in the data:
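A sketch of dropping the incomplete rows and re-drawing the missingness heatmap (vis_miss() is from naniar):

    library(naniar)

    # keep only the rows with no missing values
    tech_merged_channels <- tech_merged_channels[complete.cases(tech_merged_channels), ]

    # heatmap of missingness; every cell should now be present
    vis_miss(tech_merged_channels)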

Your heatmap should now look like this:

No missing values heatmap

Visualization

As we work to create a model to predict data views, it is important to understand how the variables interact with one another. That is, how does one aspect affect another. As previously demonstrated by Dave2D’s channel, there is very little missing data, which could be attributed to the channel’s recency. This is a relationship revealed by the results, and there are many more to be discovered. It is simple to find the story that the data is telling using visualizations, from which an accurate model can be drawn.

Our primary aim is to create a model that forecasts the number of views. Let’s check out the distribution of the views:
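A histogram is one way to look at the distribution:

    library(ggplot2)

    ggplot(tech_merged_channels, aes(x = viewCount)) +
      geom_histogram(bins = 50, fill = "steelblue") +
      labs(x = "Video views", y = "Number of videos",
           title = "Distribution of video view counts")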

Output:

We can see that the distribution is highly skewed, which is understandable given that only a handful of videos, even from famous YouTubers, reach such high view counts. The skew also means that the data has a large variance, which may be attributed to outliers. In our case, outliers are video views that are significantly higher than the average. Outliers in the training data have a large effect on machine learning models such as linear and logistic regression. This can be a problem if the outlier is a mistake of some kind, or if we want our model to generalize well while disregarding extreme values. Based on the YouTube data, these outlier views are caused by natural variation in the population of views a video receives. To check for outliers in the views we will use boxplots:
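A sketch of the boxplots, split by channel, with a horizontal line at the overall mean:

    library(ggplot2)

    ggplot(tech_merged_channels, aes(x = channel_title, y = viewCount, fill = channel_title)) +
      geom_boxplot(show.legend = FALSE) +
      geom_hline(yintercept = mean(tech_merged_channels$viewCount), colour = "black") +
      labs(x = "Channel", y = "Video views", title = "View counts by channel")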

Output:

There are a lot of outlier views in the results. The black line represents the average of all observations. The Marques channel has the greatest number of values that deviate from the mean. To detect outliers, the box plot uses the interquartile range. First, we evaluate the quartiles Q1 and Q3.

The interquartile range is given by IQR = Q3 - Q1

Upper limit = Q3 + 1.5 * IQR

Lower limit = Q1 - 1.5 * IQR

Anything below the lower limit and above the upper limit is considered an outlier. To check all outliers and determine if they are significant:
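A sketch that flags the outliers with the quartile rule above; the 3 * IQR cut-off used for the “major” outliers is an assumption:

    views <- tech_merged_channels$viewCount

    q1  <- quantile(views, 0.25)
    q3  <- quantile(views, 0.75)
    iqr <- IQR(views)

    upper <- q3 + 1.5 * iqr
    lower <- q1 - 1.5 * iqr

    Outliers <- views[views < lower | views > upper]
    length(Outliers)                                            # outlying view counts

    major <- views[views < q1 - 3 * iqr | views > q3 + 3 * iqr]
    length(major)                                               # extreme ("major") outliers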

There are 93 observations whose view values are outliers, stored in the Outliers object. However, only 26 of them are major outliers. Let’s view the data with and without the outliers.

Code:
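A sketch of the comparison, fitting a simple regression line with and without the outlying view counts (using likeCount on the x-axis is an assumption):

    library(dplyr)
    library(ggplot2)

    no_outliers <- tech_merged_channels %>%
      filter(viewCount >= lower, viewCount <= upper)

    ggplot(tech_merged_channels, aes(x = likeCount, y = viewCount)) +
      geom_point(alpha = 0.4) +
      geom_smooth(method = "lm", se = FALSE, colour = "red") +                       # with outliers
      geom_smooth(data = no_outliers, method = "lm", se = FALSE, colour = "blue") +  # without outliers
      labs(title = "Regression fit with (red) and without (blue) outliers")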

Output:

Dropping data is a drastic step and should be taken only in extreme cases, when it is clear that the outlier is a measurement error, which we seldom know. We lose information about data variability when we remove data. However, if we have a large number of observations but few outliers, we should consider dropping those observations. The slope of the regression line changes significantly when the extreme values are present, as seen in the output above. As a consequence, it is reasonable to drop them in favor of a better fit and a more general solution.

Correlation

The correlation metric calculates the relationship between two variables, or how they are related to one another. It expresses the strength of a relationship between two variables. Correlation is measured using a coefficient that ranges from -1 to +1. Correlation coefficients that are negative mean that as one variable increases, the other variable decreases. Positive correlation values mean that when one variable increases, so does the other.

We will use the Pearson correlation to find the linear relationship between the quantitative continuous variables and the views. We add four more quantitative variables, as sketched after this list:

  • The number of days between two posted videos: publication_date - PrevPublishedAt
  • The number of hours between two posted videos
  • The month of publication of each video
  • The age of the channel in years as at the date each video was published.
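A sketch of these four derived columns (the names are assumptions), built from the Prev columns and year columns created earlier:

    library(dplyr)
    library(lubridate)

    tech_merged_channels <- tech_merged_channels %>%
      mutate(
        days_between  = as.numeric(as.Date(publication_date) - as.Date(PrevPublishedAt)),
        hours_between = days_between * 24,   # only the date part is used here
        pub_month     = month(as.Date(publication_date)),
        channel_age   = video_year - channel_year
      )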

After adding the four, now let’s create the correlation heatmap of the factors using a plot with the geom_tile() function from the ggplot2 library:
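A sketch of the heatmap, computing the Pearson correlations of the quantitative columns first (the exact column selection is an assumption):

    library(dplyr)
    library(tidyr)
    library(ggplot2)

    num_vars <- tech_merged_channels %>%
      select(viewCount, likeCount, dislikeCount, commentCount,
             days_between, hours_between, pub_month, channel_age)

    corr_long <- as.data.frame(cor(num_vars, use = "complete.obs")) %>%
      mutate(var1 = rownames(.)) %>%
      pivot_longer(-var1, names_to = "var2", values_to = "correlation")

    ggplot(corr_long, aes(x = var1, y = var2, fill = correlation)) +
      geom_tile() +
      scale_fill_gradient2(low = "red", mid = "white", high = "blue", midpoint = 0) +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))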

Output:

To view the significance of the effect of one variable on another, use the chart.Correlation() function from the PerformanceAnalytics library:
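Using the same num_vars selection as in the heatmap sketch above:

    library(PerformanceAnalytics)

    chart.Correlation(num_vars, histogram = TRUE, method = "pearson")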

The chart above shows that, with the exception of the number of days, all of the independent variables have a positive impact on the dependent variable, view count. This means that the longer it takes for the next video to be posted, the fewer views it gets. As a result, it is important for a vlogger to upload videos as often as possible. Over the course of the COVID-19 pandemic, many consistent vloggers have seen large view counts on their content.

With a coefficient of 0.68, the channel age has the greatest influence on the view count. A new vlogger might not be able to break into the market immediately. However, with consistency in content creation, one can eventually find and meet a target audience, which is likely to grow with content diversification. Below are charts of the view trends over time.

Code:
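A sketch of the trend charts, averaging views per publication day, first overall and then facetted by channel:

    library(dplyr)
    library(ggplot2)

    daily_views <- tech_merged_channels %>%
      group_by(publication_date, channel_title) %>%
      summarise(views = mean(viewCount), .groups = "drop")

    # general daily trend
    ggplot(daily_views, aes(x = publication_date, y = views)) +
      geom_line() +
      labs(title = "General daily trend of views")

    # daily trend by channel
    ggplot(daily_views, aes(x = publication_date, y = views, colour = channel_title)) +
      geom_line() +
      facet_wrap(~ channel_title, scales = "free_y") +
      labs(title = "Daily trend of views by channel")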

General Daily Trend

Daily Trend by Channel

It’s also useful to know when videos get the most views. How easily a vlogger positions content for the target audience to consume is critical to keeping that audience and potentially attracting more. It is therefore more important for the vlogger to put himself or herself in the shoes of the viewer than the other way around. Accordingly, we will examine view-count performance in relation to the days of the week, as well as in relation to the times of day: morning, midday, evening, and night.

From the date variable, publication_date, we will extract the Time, the Time of Day (TOD) and the Day of Week (DOW):
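A sketch of the three derived columns, assuming the publication_datetime column kept earlier; the time-of-day cut points are assumptions:

    library(dplyr)
    library(lubridate)

    tech_merged_channels <- tech_merged_channels %>%
      mutate(
        Time = hour(publication_datetime),            # hour of day, 0-23
        TOD  = case_when(                             # time-of-day buckets (assumed cut points)
          Time >= 5  & Time < 12 ~ "Morning",
          Time >= 12 & Time < 17 ~ "Midday",
          Time >= 17 & Time < 21 ~ "Evening",
          TRUE                   ~ "Night"
        ),
        DOW  = wday(publication_date, label = TRUE)   # day of week
      )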

To visualize weekly trends of views by channel and the time of day:

Code:
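A sketch of the view, averaging views by channel, day of week and time of day:

    library(dplyr)
    library(ggplot2)

    dow_tod_views <- tech_merged_channels %>%
      group_by(channel_title, DOW, TOD) %>%
      summarise(views = mean(viewCount), .groups = "drop")

    ggplot(dow_tod_views, aes(x = DOW, y = views, fill = TOD)) +
      geom_col(position = "dodge") +
      facet_wrap(~ channel_title, scales = "free_y") +
      labs(x = "Day of week", y = "Average views", fill = "Time of day")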

View:

It turns out that most people watch videos at night. This may be because viewers are busy during the day and use the night to catch up with newly posted material. People watch videos the most on Tuesdays, Wednesdays, Thursdays, and Fridays. Dave2D’s and Marques’ content is mainly watched on the same days, while UltraLinx’s audience is mostly active on Fridays and Tuesdays.

The charts above illustrate how view counts change as a result of various factors. We will now use additional variables to build a model that helps predict view-count trends accurately.

Model Training

We can use the variables from the correlation map to build a model that predicts YouTube view counts. As a result, view counts become the dependent variable and the other variables become independent.

Random Forest Regression

We will use Random Forest Regression to model the YouTube views of the videos. The first question to answer is: why Random Forest Regression?

What is Random Forest?

Random forest is a Supervised Learning algorithm that performs classification and regression using the ensemble learning process. An Ensemble method is a technique that combines predictions from multiple machine learning algorithms to produce more accurate predictions than any single model. An Ensemble model is one that is made up of many models.

The Random Forest approach demonstrates the power of integrating multiple decision trees into one model. Compared with a single decision tree, it is more robust to changes in the training data and avoids the single tree’s high risk of overfitting and its tendency to find local optima.

To get started, install and load the required packages for the creation of our predictive model:
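Roughly, the packages used in this section are:

    install.packages(c("rsample", "randomForest", "ranger"))

    library(rsample)       # data splitting
    library(randomForest)  # basic random forest implementation
    library(ranger)        # faster random forest implementation
    library(dplyr)
    library(ggplot2)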

Create the training and the testing data sets. Let the training set have 70% and the testing set 30% of the YouTube data. We will use functions from the rsample package to do so:
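A sketch of the split; youtube here stands for the modelling data frame (viewCount plus the numeric and factor predictors from the correlation section), and the object names are assumptions:

    library(rsample)

    set.seed(123)  # reproducibility

    youtube_split <- initial_split(youtube, prop = 0.7)
    youtube_train <- training(youtube_split)
    youtube_test  <- testing(youtube_split)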

In the preceding code, we first call set.seed() to ensure that the results are reproducible. initial_split() then draws 70% of the row indexes of the YouTube data into the youtube_split object, and the training() and testing() functions use those indexes to generate the training and testing sets, respectively. It is important to have a broad training data set, because variability in the data helps create a model that accounts for most of the factors affecting the prediction. Now that we have the training data, we can use the randomForest() function to fit the model. Below we apply the default randomForest model using the formulaic specification:
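A sketch of the default fit, with viewCount as the response and every other modelling column as a predictor:

    library(randomForest)

    set.seed(123)

    YouTube.rf1 <- randomForest(
      formula = viewCount ~ .,
      data    = youtube_train
    )

    YouTube.rf1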

The default random forest grows 500 trees and randomly samples a subset of the predictor variables at each split. The output for YouTube.rf1 is as below:

The model is a random forest regressor that grew 500 trees with 4 randomly selected predictor variables at each split. The model explains 86.98% of the variance in the data, which is more than acceptable. plot(YouTube.rf1) plots the model. The plot shows that our error rate stabilizes at around 100 trees but continues to decrease slowly until around 300 trees. The plotted error rate is based on the OOB sample error and can be accessed directly at YouTube.rf1$mse. Thus, we can find the number of trees that provides the lowest error rate: 475 trees, with a YouTube view RMSE of 523,149.6.

model plot
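The lowest point of the OOB error curve can be read straight off the mse vector:

    plot(YouTube.rf1)                                   # error rate versus number of trees

    which.min(YouTube.rf1$mse)                          # tree count with the lowest OOB MSE, e.g. 475
    sqrt(YouTube.rf1$mse[which.min(YouTube.rf1$mse)])   # RMSE of views at that point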

We will now use a validation set, as random forest allows, to calculate predictive accuracy without relying on the OOB samples described above. Using the initial_split(), analysis(), and assessment() functions from the rsample package, we further split our training set into a training and a validation set, with 80% of the training data kept for training and the remaining 20% used for validation:
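A sketch of the approach: split the training set again, then pass the validation set to randomForest via xtest/ytest so the validation error is tracked alongside the OOB error:

    library(rsample)
    library(randomForest)

    set.seed(123)

    valid_split      <- initial_split(youtube_train, prop = 0.8)
    youtube_train_v2 <- analysis(valid_split)     # 80% of the training data
    youtube_valid    <- assessment(valid_split)   # remaining 20% for validation

    x_valid <- youtube_valid[setdiff(names(youtube_valid), "viewCount")]
    y_valid <- youtube_valid$viewCount

    rf_oob_comp <- randomForest(
      formula = viewCount ~ .,
      data    = youtube_train_v2,
      xtest   = x_valid,
      ytest   = y_valid
    )

    # OOB and validation RMSE across the number of trees
    oob_rmse   <- sqrt(rf_oob_comp$mse)
    valid_rmse <- sqrt(rf_oob_comp$test$mse)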

Output:

validation visualization

Without any tuning, we were able to achieve an RMSE of less than 523000 views during validation. This demonstrates the Random Forest’s efficiency as one of the best “out-of-the-box” machine learning algorithms. They usually perform admirably with little to no tuning needed. However, we can increase the random forest model’s accuracy by tuning it.

Tuning

Tuning is the method of improving a model’s performance without overfitting or increasing variance. In machine learning, this is achieved by choosing suitable “hyperparameters”. On a simple level, the number of candidate variables to choose from at each split can be tuned. The following are the main parameters to become acquainted with during the tuning process:

  • ntree: the number of trees. We need just enough trees to stabilize the error.
  • mtry: the number of variables to randomly sample as candidates at each split. When mtry = p the model equates to bagging; when mtry = 1 the split variable is completely random.
  • sampsize: the number of samples to train on. The default value is 63.25% of the training set, since this is the expected proportion of unique observations in a bootstrap sample. Lower sample sizes can reduce the training time but may introduce more bias than necessary. Increasing the sample size can increase performance but at the risk of overfitting because it introduces more variance. Typically, when tuning this parameter we stay near the 60-80% range.
  • nodesize: the minimum number of samples within the terminal nodes. This controls the complexity of the trees. A smaller node size allows for deeper, more complex trees, while a larger node size results in shallower trees.
  • maxnodes: the maximum number of terminal nodes. Another way to control the complexity of the trees. More nodes equate to deeper, more complex trees, while fewer nodes result in shallower trees.

Initial tuning

We will use randomForest::tuneRF for a quick and easy tuning assessment. tuneRF will start at mtryStart = 5 and increase by stepFactor = 1.5 until the OOB error stops improving by 1%:
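A sketch of the quick tuning run, separating the features from the response first:

    library(randomForest)

    features <- setdiff(names(youtube_train), "viewCount")

    set.seed(123)

    YouTube.rf2 <- tuneRF(
      x          = youtube_train[features],
      y          = youtube_train$viewCount,
      ntreeTry   = 500,
      mtryStart  = 5,
      stepFactor = 1.5,
      improve    = 0.01,
      trace      = FALSE   # hide real-time progress
    )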

The preliminary tuning results, as shown above, point to a good model with seven variables to randomly sample as candidates at each split. This corresponds to an OOB error of 269,383,762,763. This, however, is only a preliminary result; if a better model exists, we want to find it. To do so, we will run a large search grid.

Ranger Large Search Grid

We will use ranger, which is a C++ implementation of Breiman’s random forest algorithm and is over six times faster than randomForest; the limitation of randomForest here is that it does not scale well. To perform the grid search, we first construct a grid of the hyperparameters. We’re going to search across 96 different models with varying mtry, minimum node size, and sample size:
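A sketch of the grid and the search loop; 4 mtry values x 4 node sizes x 6 sample fractions gives the 96 candidate models (the exact grid values are assumptions):

    library(ranger)

    hyper_grid <- expand.grid(
      mtry        = seq(2, 8, by = 2),
      node_size   = seq(3, 9, by = 2),
      sample_size = c(.50, .55, .632, .70, .75, .80),
      OOB_RMSE    = 0
    )

    for (i in 1:nrow(hyper_grid)) {
      model <- ranger(
        formula         = viewCount ~ .,
        data            = youtube_train,
        num.trees       = 500,
        mtry            = hyper_grid$mtry[i],
        min.node.size   = hyper_grid$node_size[i],
        sample.fraction = hyper_grid$sample_size[i],
        seed            = 123
      )
      # prediction.error is the OOB MSE for regression forests
      hyper_grid$OOB_RMSE[i] <- sqrt(model$prediction.error)
    }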

To get the top results of the search grid:
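Sorting the grid by OOB RMSE gives the top candidates:

    library(dplyr)

    hyper_grid %>%
      arrange(OOB_RMSE) %>%
      head(10)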

Output:

Our top 10 performing models all have RMSE values right around 518,000 to 525,000, and the results show that models with slightly larger sample sizes (70-80%) and deeper trees (3-5 observations in a terminal node) perform best. Currently, the best random forest regressor retains the columnar categorical variables and uses mtry = 8, a terminal node size of 5 observations, and a sample size of 80%. Let’s refit this model with those specific parameters to get a better expectation of our error rate:
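A sketch of refitting the best candidate repeatedly to get a distribution of the expected OOB error (100 repetitions is an assumption):

    library(ranger)

    OOB_RMSE <- vector(mode = "numeric", length = 100)

    for (i in seq_along(OOB_RMSE)) {
      optimal_ranger <- ranger(
        formula         = viewCount ~ .,
        data            = youtube_train,
        num.trees       = 500,
        mtry            = 8,
        min.node.size   = 5,
        sample.fraction = 0.8,
        importance      = "impurity"
      )
      OOB_RMSE[i] <- sqrt(optimal_ranger$prediction.error)
    }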

To view the distribution of the OOB RMSE:
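A quick histogram of the 100 OOB RMSE values:

    hist(OOB_RMSE, breaks = 20,
         main = "Distribution of OOB RMSE (tuned ranger model)",
         xlab = "OOB RMSE (views)")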

Output:

Our predicted error ranges from 516000 to 526000, with 519000 being the most likely. In addition, we have set importance = ‘impurity’ in the model, which helps us to evaluate variable importance. The decrease in MSE each time a variable is used as a node split in a tree is used to calculate variable importance. The residual error in predictive accuracy after a node split is referred to as node impurity, and a variable that decreases this impurity is considered more important than those that do not. As a result, we add up the reduction in MSE for each variable across all trees, and the variable with the greatest cumulative impact is deemed more important.

To view these variables in order of importance:
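ranger stores the impurity-based scores in variable.importance; a quick way to plot them in order:

    library(ggplot2)

    var_imp <- sort(optimal_ranger$variable.importance, decreasing = TRUE)

    ggplot(data.frame(variable = names(var_imp), importance = var_imp),
           aes(x = reorder(variable, importance), y = importance)) +
      geom_col() +
      coord_flip() +
      labs(x = NULL, y = "Impurity-based importance")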

Output:

The likeCount, dislikeCount and commentCount are the three factors that have the highest impact on YouTube views. According to the correlation figure, their correlations with YouTube views are 0.88, 0.66, and 0.33 respectively. The three lead us to the conclusion that the higher the correlation, the higher the influence one variable has on another.

Predictions

Having established the hyperparameters that generate the optimum model for predicting YouTube views, the model we will use to predict them is therefore:
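A sketch of the final model, fitted once with the tuned hyperparameters:

    library(ranger)

    youtube_rf_final <- ranger(
      formula         = viewCount ~ .,
      data            = youtube_train,
      num.trees       = 500,
      mtry            = 8,
      min.node.size   = 5,
      sample.fraction = 0.8,
      importance      = "impurity",
      seed            = 123
    )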

For any vlogger with data values for the independent factors in colnames(data[,-1]), one can predict the views of their video(s). For our youtube_test data predictions:
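Predictions for the held-out test set, with their RMSE against the observed views:

    pred <- predict(youtube_rf_final, data = youtube_test)

    head(pred$predictions)                                     # predicted view counts
    sqrt(mean((pred$predictions - youtube_test$viewCount)^2))  # test RMSE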

References

  1. Google authentication types for R
  2. Using tuber
  3. Random Forest
  4. Ranger