Twitter Sentiment Analysis, Visualization & Classification using R

Extract Twitter data from the REST API, analyze it using R, and build a classification model on the same.

Introduction

Twitter has become one of the core social networking platforms where people can easily exercise their freedom of speech, writing and posting about anything and interacting with the latest trends and news in the form of tweets. This is a result of Twitter's diligence to its core purpose: providing a free and safe space for people to talk and share. By supporting organizations that tackle issues like bullying, abuse, and hate speech, backing initiatives that defend and respect all voices by promoting free expression and defending civil liberties, and helping people understand healthy digital habits and online safety, Twitter has become a home for expression and hence for the distribution of diverse information.

The ease with which information spreads through the platform is also a result of easy access to the Twitter website and a portable Twitter app across devices, from desktops to phones. With a "Simple is good, but straightforward is better" principle, information spread by users has been subject-oriented and less diluted by propaganda.

Twitter works with organizations promoting equal opportunity, particularly in science, technology, engineering, arts, and mathematics. Over time, information spread on Twitter has not been limited to these areas but has extended into finance and business, politics and government, lifestyle and health, programming, and any other field you can think of. Twitter has also grown into a reporting base when emergencies and natural disasters strike, offering tools and programs that help people around the world with communication and humanitarian response.

Think of all the fields one would desire to find out about and explore, all at the tip of a search hashtag (#) and the click of a button. Beyond accessing this information, one can now take a wider view through the opinions of people, whether legislators and/or recipients. As Twitter puts it, "Using Twitter is like dipping your toes into an ocean of human beings". Being an open service that is home to a world of diverse people, perspectives, ideas, and information, access to this information has been made easy. The emphasis on a subject can now be monitored through retweets, and opinions can be countered from the same platform. Through conversations in the form of responses to a tweet, conclusive decisions can be drawn for the research being done.

Centralization of information is key in this age, and like most organizations, Twitter is no exception. The market has become easier to crack just by knowing what people want and need. Start-ups and established businesses, NGOs and governments now, more than ever, have a wider knowledge base of where and when to start an operation, giving them a smaller chance of failing. With the centralization of Twitter information (tweets and conversations) from users all over the world, analyzing it and extracting useful information from which decisive conclusions can be made has also become less of a hassle.

While access to people's minds is beneficial in making decisions that improve their state of living, for example by improving services to meet the needs of residents in a place, it is imperative to prevent people with ill intent from accessing certain information, to protect people from being taken advantage of.

Prerequisites

Now that you are here, I will attempt to help you extract Twitter information and analyze it with the R programming tool. A basic understanding of how to use R and RStudio as a statistical tool will be required. I understand that some of you may prefer Radian as an alternative; that still works. I will be using R version 4.1.0. We will learn how to acquire access to the Twitter API and how to access data from it using R. To make our Twitter interaction more interesting, we will build a classification model for the tweets through machine learning in R. Do not worry if you are new to any of the above; there is no better way to learn than one step at a time. Let's take our first step.

Collection of Data

As the focus of this article, we will analyze tweets posted by people concerning a particular subject. It is important to have a subject in mind so as to determine the focus of your analysis. For example, if we chose "health" as a subject, considering the COVID-19 pandemic, and then picked a country, we would focus on analyzing how the virus has impacted different parts of that country. From the example we see that the selection of your data is the basis of your progression in analysis.

In collecting this data, accuracy, time, cost, and utility are key factors to consider when selecting a technique that yields accurate and fine-grained data. Social media research has for some time been made more convenient by social media companies, of which Twitter is one, through making data available to researchers like you and me via their application programming interfaces (APIs). Following the API's technical and data standards, we can now download bulk data at once for empirical analysis, saving the time and cost incurred by collecting data on the ground. The validity and reliability of the data are established when we find the data suitable, even in its diversity, for measuring what we intend to measure.

According to Twitter's 2020 report, the platform has 187 million daily users, while 330 million people use it at least once a month. Such a large population generates huge amounts of data for quantitative and qualitative analysis, and that is why we are here. We can extract this data through the Twitter API.

The Twitter API, like most APIs, can be used programmatically to retrieve and analyze data. The API provides access to a variety of resources including tweets, users, direct messages, lists, trends, media, and places. The flexibility and diverse content of the API allow scalability in our analysis, including geospatial analysis, which you will grow to love.

Twitter API Access

a. For a beginner, you will first need to create a Twitter account if you do not have one already. Then proceed to the Twitter Developer page, where you will use your email/username and password to log in. This will bring you to a page like the one below:

b. We will now go ahead and Create An App by clicking the button on the right. For a new user like me, a dialogue box will pop up prompting the creation of a developer account. Click Apply to continue.

c. On the Developer Portal page that you will be directed to, fill in your credentials after clicking the Get Started button. You will then answer a few questions about why you intend to acquire the API. This is for approval purposes and also helps the Twitter team improve their product.

Once you have signed the Twitter API developer agreement, a verification email will be sent to the email address you registered the Twitter account with. On clicking the activate account link, you should be directed to the page below:

d. After verification, the review of the application will begin. An email will be sent to verify how you intend to use the data, i.e. before the Twitter team can finish reviewing the developer account application, they will need more details about the use case. These are steps Twitter has taken to ensure data integrity and the protection of user data, which is commendable. In the email that I was sent, the team inquired about:

  • The core use case, intent, or business purpose for my use of the Twitter APIs.
  • Details about the analyses I planned to conduct, and the methods or techniques I would use to analyze tweets, Twitter users, or their content.
  • How I would interact with Twitter accounts, or their content.
  • How, and where, tweets and Twitter content would be displayed within my product or service, including whether tweets and Twitter content would be displayed at row level or aggregated.

After the review, if approved, you will receive an approval email, after which you can proceed to create your application.

e. Click to sign into the developer account from the approval email. This will direct you to the page where you will name your app (I have named mine Classification):

Once you click Get Keys, a page will load with the API Key, API Secret Key, and Bearer Token. Now that we have our credentials, we can set up our R environment and retrieve the relevant data for analysis.

Setting the Environment

Before importing the data, we first need to set up the environment by installing and loading the relevant libraries for the analysis. rtweet is the main package that we will use in the extraction and analysis of Twitter data; it has made interaction with Twitter's APIs more approachable and Twitter data more accessible to researchers using R for analysis. To install the packages:
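A minimal setup sketch is shown below; the exact package list is an assumption based on the libraries mentioned throughout this article:

```r
# Install the packages used in this article (run once)
install.packages(c("rtweet", "tidyverse", "tidytext", "tm",
                   "sentimentr", "wordcloud", "plotly", "rpart"))

# Load the core packages for the current session
library(rtweet)
library(tidyverse)
library(tidytext)
```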

The tidyverse and tidytext packages are two of the most important libraries we will use, for the manipulation of data and the tokenization of tweet text into words, respectively.

To link the Twitter application to R, we need to store the keys and tokens acquired while setting up our Twitter application in individual variables. These are the API Key, API Secret Key, Bearer Token, Access Token, and Access Token Secret:
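For instance, with placeholder strings standing in for your own credentials (the variable names here are my own choice):

```r
# Replace the placeholders with the credentials from your Twitter app
api_key             <- "XXXXXXXXXX"
api_secret_key      <- "XXXXXXXXXX"
bearer_token        <- "XXXXXXXXXX"
access_token        <- "XXXXXXXXXX"
access_token_secret <- "XXXXXXXXXX"
```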

Since we will also be looking to find and geocode the location of the tweets, we will need a Google Maps API key registered in our R session via the ggmap library's register_google(key = "XXX") function, where "XXX" is the API key acquired from the Google Cloud console. Read more about Google Maps API setup and geocoding with R:
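A sketch of that registration step, assuming ggmap is installed and the key string is your own:

```r
library(ggmap)

# Register the Google Maps API key acquired from the Google Cloud console
register_google(key = "XXX")
```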

To set up the authentication linking the R session and the Twitter application, run:
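Since the authentication function mentioned below comes from the twitteR library, a sketch using twitteR::setup_twitter_oauth() with the credential variables defined above would look like this:

```r
library(twitteR)

# Authenticate against the Twitter API using the stored credentials
setup_twitter_oauth(consumer_key    = api_key,
                    consumer_secret = api_secret_key,
                    access_token    = access_token,
                    access_secret   = access_token_secret)
```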

R will prompt an output as below:

The above authentication function is from the twitteR library. In case R prompts a question in your console asking: Use a local file ('.httr-oauth'), to cache OAuth access credentials between R sessions? it is recommended that you answer Yes to avoid entering your Twitter credentials every time you start a new session.

Extraction and Importation of Data in R

In this study, I was interested in the distribution of crimes in the USA over time. Fortunately, because tweets are produced by different individuals at different times and in different locations, the API captures all of this information. We can extract data relating to such incidents for analysis using the hashtags (#) that people use.

I am extracting this information using the free tier of Twitter's services. This imposes limitations on the kind of data I can access: I can only access tweets from the previous week, and I can only retrieve 18,000 tweets every 15 minutes. To extract this data:
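A sketch of the extraction step follows; the specific hashtags stored in the hashtag vector are assumptions based on the crime topics discussed later in the article:

```r
library(rtweet)

# Hypothetical list of crime-related hashtags to search for
hashtag <- c("#crime", "#murder", "#shooting", "#protest", "#arson")

# Search for each hashtag in turn and bind the results into one data frame
security_info <- purrr::map_df(hashtag, function(needle) {
  search_tweets(q = needle,
                n = 50000,
                include_rts = FALSE,
                geocode = lookup_coords("usa"),
                retryonratelimit = TRUE)
})
```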

Tweets about security incidents posted on Twitter will be searched by the hashtags stored in the hashtag variable. In the search_tweets() function from the rtweet library, the following arguments have been used:

  1. needle: allows the function to search tweets with the respective hashtags one at a time. In the function, this is passed as q, the query word that you want to look for.
  2. n: the number of tweets that you want returned. As stated earlier, you can only request up to a maximum of 18,000 tweets on the standard tier.
  3. include_rts: set to FALSE so that retweets are not returned as part of the data.
  4. geocode = lookup_coords("usa"): specifies that we only want security incidents from the United States.
  5. retryonratelimit: in order to extract the 50,000 tweets specified in my code, I added this argument to tell rtweet to retry after 15 minutes until it had gathered around 50,000 tweets. This is important because it gives us more data to work with, given the limitation of extracting 18,000 tweets every 15 minutes.

On running the above code, R will extract the data. At the 100% mark, R will have returned and stored the data in the security_info data frame:

47,915 tweets from 21,684 unique users were returned as of the day I extracted this data. This is the data we will use in our analysis.

Visuals of Tweets

Hashtag Frequencies

From the crime data extracted by hashtag, we would like to know which hashtag(s), each representing a particular crime, are reported most in the USA. Hashtags are used by many users to categorize their tweets and indicate the topics the tweets address, which is important for a target market. A word cloud is therefore useful for visualizing the most frequently used hashtags in reporting crime incidents on Twitter:
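A sketch of how such a word cloud could be built is shown below; the use of the hashtags list-column from rtweet, the frequency threshold, and the crime-related keyword list are assumptions consistent with the description in the next paragraph:

```r
library(wordcloud)

# Count how often each hashtag appears across all tweets
hashtag_freq <- table(tolower(unlist(security_info$hashtags)))

# Keep hashtags used at least four times
hashtag_freq <- hashtag_freq[hashtag_freq >= 4]

# Keep only hashtags related to crime and security (hypothetical keyword list)
crime_terms  <- c("crime", "murder", "protest", "death", "fire", "rape", "justice")
hashtag_freq <- hashtag_freq[grepl(paste(crime_terms, collapse = "|"),
                                   names(hashtag_freq))]

# Draw the word cloud, sizing each hashtag by its frequency
wordcloud(words = names(hashtag_freq), freq = as.numeric(hashtag_freq),
          colors = brewer.pal(8, "Dark2"), random.order = FALSE)
```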

From the hashtags in the extracted Twitter data, the table() function returns the count of each unique hashtag used in the posts. However, not all the unique hashtags are relevant to crime. We therefore first filter out the hashtags that appeared fewer than four times, then keep the hashtags related to crime and security in a table and make a word cloud of the same, with word size matching frequency.

Incidents Wordcloud

As noted in the word cloud, crime incidents were reported most in America, and justice is most in demand for the crimes committed; in a correlation test, the two would likely record a strong positive correlation. Murder, protests, deaths, criminal activity, fire, and rape follow closely as major incidents in the country.

Geographical Analysis of Tweets

In an analyst's quest it is imperative to identify where the tweets came from. In the data, there are four sources of geographic information. First, there is geographic information embedded within the tweet itself; in our case, we specified the geolocation to be the USA in search_tweets(). Secondly, a user posting a tweet may specify the location of an incident, often also as a hashtag. The variable containing the tweet location in our data is place_full_name.

To create a table of the security red-zone areas from which the tweets came, we use the code:
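A sketch of that step, counting the most common place names among tweets that actually carry one (the number of places shown is arbitrary):

```r
library(dplyr)

# Keep tweets that have a place name, then count the most common places
security_info %>%
  filter(is.na(place_full_name) == FALSE & place_full_name != "") %>%
  count(place_full_name, sort = TRUE) %>%
  slice_head(n = 10)
```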

The condition is.na(place_full_name) == FALSE & place_full_name != "" inside filter() keeps tweets that have a filled-in place name and removes entries with missing place names.

Output:

Extracting Tweet Geographic Coordinates

Another source of geographic information is the geotagged point coordinates of the precise location from which the tweet was sent. To find the longitudes and latitudes of the tweet locations, use:
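In rtweet this is done with the lat_lng() function, which appends coordinate columns to the data frame:

```r
# Append lat and lng columns derived from the tweets' geo fields
security_info <- lat_lng(security_info)
```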

The function above generates the lat and lng columns, which hold the latitude and longitude coordinates, respectively. It is important to note that not all tweets are geotagged. Filter the tweets with lat/lng information using filter():
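For example, keeping only the geotagged tweets in a new data frame (the name security_geo is my own):

```r
# Keep only tweets that actually carry coordinates
security_geo <- security_info %>%
  filter(!is.na(lat) & !is.na(lng))
```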

We are left with only 914 observations that have lat and lng geocodes. This is due to the freedom Twitter gives its users to choose whether or not to share their location. As it stands, the data tells you the preference of most users.

Mapping Tweets

To map the longitudes and the latitudes of the tweets we will:

  1. Convert the non-spatial data frame security_info into an sf object using the st_as_sf() command, as shown in the sketch after this list
  2. Use leaflet() to map the points. Alternatively, one can also use tm_shape() from the tmap package
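A sketch of both steps, assuming the geotagged subset from the previous step and WGS84 (EPSG:4326) coordinates:

```r
library(sf)
library(leaflet)

# 1. Convert the geotagged tweets into an sf point object
security_sf <- st_as_sf(security_geo, coords = c("lng", "lat"), crs = 4326)

# 2. Map the points with leaflet
leaflet(security_sf) %>%
  addTiles() %>%
  addCircleMarkers(radius = 3, color = "red", stroke = FALSE, fillOpacity = 0.6)
```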

Output:

Mapping Twitter User Locations

Occasionally, the initial reporters of tweets are present at the actual locations where the incidents occur, though this is not always the case. When it is, we would like to know where these reporters are stationed. This is important because corporations need stringers from all over the country to provide coverage in places that news and media firms are unable to reach. Therefore, to check where in the USA most Twitter users are located:
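One way to approximate this is to count the free-text location field from users' profiles, a sketch of which is shown below:

```r
# Count the most common user-reported locations (profile location field)
security_info %>%
  filter(!is.na(location) & location != "") %>%
  count(location, sort = TRUE) %>%
  slice_head(n = 20)
```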

Sentiment Analysis

When looking at the overall visualization of security tweets, we can only get a gist of the incidents that occurred. This, however, does not demonstrate what people commonly say about crimes. While it is likely that people dislike and condemn crime-related situations, it is useful to determine to what extent this is the case.

Sentiment analysis, also known as opinion mining, is a natural language processing methodology for determining the positivity, negativity, or neutrality of information. This type of research finds and extracts subjective information from source material, allowing organizations to better grasp the social sentiment surrounding their brand, product, or service while monitoring online conversations. In our case, we strive to comprehend what society has to say about these crime situations through tweet conversations.

The sentimentr package is one of the most commonly used libraries in sentiment analysis, and we will use it as well. There are a variety of visualizations available on the internet that clearly depict sentiments. What I will depict here is not necessarily all that can be taken from this data, but it will be sufficient to draw a valid judgment on security incidents. Visualizations help simplify the data and put a researcher in a position to understand and explain the data trends.

According to the documentation for the sentimentr package, it employs a bag-of-words methodology, which involves breaking down a sentence into its constituent words, with each word referred to as a gram. All punctuation is deleted with this methodology, with the exception of pause punctuation, and the algorithm then looks up each word in a dictionary to determine whether it is a positive or negative word. Intensifiers and de-amplifiers are also taken into account and weighted. After taking all of this into account, the algorithm assigns a score to each sentence using a sentiment scoring equation, with a positive score indicating positive sentiment and a negative score indicating negative sentiment. The greater the intensity of the reaction, the larger the magnitude of the score.

Here are the steps of the sentiment analysis:

  1. Extract the tweets from our security_info data. These are stored in the text variable.
  2. Convert security_tweets to a Simple Corpus, which is stored in memory and generally used for preprocessing and transforming texts. The tm package, with the corpus as its main data structure, will be used for the text mining, helping remove unwanted terms, numbers, stopwords (extremely common words), punctuation, and other irrelevant elements.
  3. Define custom functions to clean up Twitter handles, hashtags, and emojis, and use regular expressions to clean up URLs, both http and https.
  4. Create a function that runs the corpus through the custom and pre-defined cleanup processes and returns the data as a data frame, as sketched below.
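A minimal sketch of these four steps follows; the helper names, the exact cleaning order, and the new_tweetsdf name of the resulting data frame are assumptions consistent with how it is referenced in the next paragraph:

```r
library(tm)

# 1. Extract the tweet text
security_tweets <- security_info$text

# 2. Convert the tweets into a Simple Corpus held in memory
corp <- SimpleCorpus(VectorSource(security_tweets))

# 3. Custom cleaners for handles, hashtags, emojis, and URLs (hypothetical helpers)
remove_handles  <- content_transformer(function(x) gsub("@\\w+", "", x))
remove_hashtags <- content_transformer(function(x) gsub("#\\w+", "", x))
remove_urls     <- content_transformer(function(x) gsub("http[s]?://\\S+", "", x))
remove_emojis   <- content_transformer(function(x) iconv(x, to = "ASCII", sub = ""))

# 4. Run the corpus through the custom and pre-defined cleanup steps
clean_corpus <- function(corp) {
  corp <- tm_map(corp, remove_urls)
  corp <- tm_map(corp, remove_handles)
  corp <- tm_map(corp, remove_hashtags)
  corp <- tm_map(corp, remove_emojis)
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removeWords, stopwords("english"))
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, stripWhitespace)
  data.frame(text = sapply(corp, as.character), stringsAsFactors = FALSE)
}

new_tweetsdf <- clean_corpus(corp)
```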

With no missing values in the data, confirmed by the code sum(!complete.cases(new_tweetsdf)) returning zero, we can proceed to run the sentiment analysis algorithm:
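A sketch of that step with sentimentr, averaging sentence-level scores per tweet (the object name sentiment_scores is my own):

```r
library(sentimentr)

# Split each cleaned tweet into sentences, score them,
# and average the sentiment per tweet
sentiment_scores <- sentiment_by(get_sentences(new_tweetsdf$text))
head(sentiment_scores)
```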

Visualize sentiments

Now that we have our sentiments analysed, we can go ahead and plot the results using the plot_ly() function from the plotly library:
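For example, a simple histogram of the average sentiment scores could be drawn as follows (a sketch, not necessarily the exact chart in the original output):

```r
library(plotly)

# Distribution of average sentiment scores across tweets
plot_ly(data = as.data.frame(sentiment_scores),
        x = ~ave_sentiment,
        type = "histogram")
```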

Output:

From the plot, there is an even balance between the negative and positive sentiments. It also appears that the longer the word count, the more likely the tweet is to have a negative sentiment. However, let us test this:
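A sketch of plotting the sentiment score against the number of words in each tweet:

```r
# Sentiment score versus tweet word count
plot_ly(data = as.data.frame(sentiment_scores),
        x = ~word_count,
        y = ~ave_sentiment,
        type = "scatter",
        mode = "markers")
```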

Output:

When we plot the sentiment score against the number of words, we observe no clear pattern among either the negative or the positive tweets, even though there may be a relationship between word count and sentiment score. Compared with the preceding graph, it suggests that a tweet with a long word count has a larger chance of being negative. To test this hypothesis, we can use Bayes' theorem to determine the conditional probability of a tweet having a negative or positive sentiment, given the tweet's length:
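A sketch of that calculation, assuming (as in the text below) that a "long" tweet is one with more than 35 words:

```r
scores <- as.data.frame(sentiment_scores)

# Label each tweet's polarity and whether it is "long" (more than 35 words)
scores$polarity <- ifelse(scores$ave_sentiment < 0, "negative",
                   ifelse(scores$ave_sentiment > 0, "positive", "neutral"))
scores$long     <- scores$word_count > 35

# Priors P(polarity), likelihoods P(long | polarity), and evidence P(long)
priors      <- prop.table(table(scores$polarity))
likelihoods <- tapply(scores$long, scores$polarity, mean)
p_long      <- mean(scores$long)

# Bayes' theorem: P(polarity | long) = P(long | polarity) * P(polarity) / P(long)
posterior <- likelihoods[names(priors)] * priors / p_long
round(posterior, 2)
```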

According to the calculations:

  1. 0.7832557 of tweets have a negative sentiment and 0.1230379 are positive.
  2. Only 3,558 negative tweets have long words, constituting only 0.09754998 of all the negative tweets.
  3. Long positive tweets number 301, constituting 0.05267226 of the positive tweets.

When we apply Bayes' theorem to these figures, we find that if we choose a long tweet, there is a 60% chance that it will have a negative sentiment, a 33% probability that it will be positive, and a 7% chance that it will be neutral. This appears to back up my initial hypothesis that a tweet with more than 35 words has a higher chance of being negative.

Classification

Now that we have the average sentiment scores, we would also like to define a model that can detect tweets with negative and positive sentiments. The first thing we will do is add the sentiment scores to the security_info data:
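For instance, assuming the sentiment_scores rows line up with the tweets they were computed from (the column name sentiment is my own):

```r
# Attach the average sentiment score of each tweet to the main data frame
security_info$sentiment <- sentiment_scores$ave_sentiment
```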

Next, define a "Positive" and a "Negative" variable and add them to the security_info data:
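A sketch of that step, treating any score below zero as negative and any score above zero as positive:

```r
# Factor flags used later as the dependent variables of the two models
security_info$Negative <- as.factor(security_info$sentiment < 0)
security_info$Positive <- as.factor(security_info$sentiment > 0)
```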

The two new columns are factor variables holding Boolean TRUE and FALSE values.

Pre-processing

First, convert the tweets in the security_info data into a corpus for pre-processing:
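For example:

```r
library(tm)

# Build a corpus from the raw tweet text for pre-processing
corpus <- SimpleCorpus(VectorSource(security_info$text))
```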

Use the pre-defined functions from the sentiment analysis step to clean the corpus. Finally, stem the corpus documents with stemDocument:
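A sketch of the cleaning and stemming, reusing the hypothetical helpers defined earlier (stemming requires the SnowballC package to be installed):

```r
# Reuse the earlier cleaning helpers, then stem each document
corpus <- tm_map(corpus, remove_urls)
corpus <- tm_map(corpus, remove_handles)
corpus <- tm_map(corpus, remove_hashtags)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
```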

From the stemmed documents, extract the word frequencies to be used in our prediction problem. We will do so with DocumentTermMatrix() from the tm package, which generates a matrix whose rows correspond to documents, in our case tweets, and whose columns correspond to the words in those tweets.

The values in the matrix are the number of times each word appears in each document:
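For example:

```r
# Build the document-term matrix: one row per tweet, one column per word
dtm <- DocumentTermMatrix(corpus)
dtm
```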

Output:

From the document-term matrix, we see that the corpus contains 40,266 unique words. Slice the matrix using the inspect() function to view part of it:
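For instance, inspecting a small slice of rows and columns (the indices here are arbitrary):

```r
# View a small slice of the matrix
inspect(dtm[1:10, 100:110])
```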

Output:

This data is what we call sparse, having many zeros in the matrix. To remove the terms that are almost always zero:
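In tm this is done with removeSparseTerms(), using the threshold described below:

```r
# Keep only terms that appear in at least 0.5% of the tweets
sparse_dtm <- removeSparseTerms(dtm, 0.995)
sparse_dtm
```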

The second parameter, the sparsity threshold, signifies that only terms appearing in 0.5% or more of the tweets are kept.

Only 420 unique terms were retained, which is only about 1% of the full set. Now convert sparse_dtm to a data frame that we will use in our predictive classification model:
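For example (the name tweets_df is my own; make.names() ensures the column names are valid R variable names for the modelling step):

```r
# Convert the sparse matrix to a data frame with syntactically valid column names
tweets_df <- as.data.frame(as.matrix(sparse_dtm))
colnames(tweets_df) <- make.names(colnames(tweets_df))
```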

Add the dependent variables, i.e. the Negative and the Positive columns:
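For example:

```r
# Attach the dependent variables defined earlier
tweets_df$Negative <- security_info$Negative
tweets_df$Positive <- security_info$Positive
```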

Building Machine Learning Model

Split data in train & test sets

Before building the machine learning model, we need to split our data into a train and a test dataset. We split the data on the Positive and the Negative variable, creating and storing a train and test set for each, which will help us build their respective predictive classification models.
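A sketch using sample.split() from the caTools package; the 70/30 ratio and the train-set names are assumptions, while the negative_test and positive_test names follow the prose below:

```r
library(caTools)

set.seed(123)

# Split on the Negative variable
split_neg      <- sample.split(tweets_df$Negative, SplitRatio = 0.7)
negative_train <- subset(tweets_df, split_neg == TRUE)
negative_test  <- subset(tweets_df, split_neg == FALSE)

# Split on the Positive variable
split_pos      <- sample.split(tweets_df$Positive, SplitRatio = 0.7)
positive_train <- subset(tweets_df, split_pos == TRUE)
positive_test  <- subset(tweets_df, split_pos == FALSE)
```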

Predict Sentiments

Using the rpart() function from the rpart library, with the Negative and the Positive variables as the dependent variables in their respective models, the word-frequency columns as independent variables, and method = "class", we will now train the classification models.
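A sketch of training the two trees; excluding the other sentiment flag from each model's predictors, and the model names, are my own choices:

```r
library(rpart)

# Classification tree predicting negative sentiment from the word frequencies
negative_model <- rpart(Negative ~ . - Positive,
                        data = negative_train, method = "class")

# Classification tree predicting positive sentiment
positive_model <- rpart(Positive ~ . - Negative,
                        data = positive_train, method = "class")
```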

With the models trained, we will now use the predict() function to predict values on the positive_test and negative_test data and assess the accuracy of the trained models.
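For instance, for the negative model:

```r
# Predict class labels on the held-out negative test set
negative_pred <- predict(negative_model, newdata = negative_test, type = "class")
```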

The confusion matrix for this will be:
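A sketch of tabulating the actual labels against the predictions:

```r
# Confusion matrix: actual vs predicted
conf_matrix <- table(negative_test$Negative, negative_pred)
conf_matrix
```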

Output:

To determine the accuracy of the model:
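Accuracy is the proportion of correct predictions, i.e. the diagonal of the confusion matrix over the total:

```r
# Accuracy = correct predictions / all predictions
sum(diag(conf_matrix)) / sum(conf_matrix)
```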

This returns a model accuracy of 0.9996076.

Doing the same with the positive-predicting model:
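For example:

```r
# Repeat the prediction and accuracy calculation for the positive model
positive_pred <- predict(positive_model, newdata = positive_test, type = "class")
positive_conf <- table(positive_test$Positive, positive_pred)
sum(diag(positive_conf)) / sum(positive_conf)
```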

The positive predictive model returned an accuracy of 1, suggesting that it classifies tweets with positive sentiments with 100% accuracy. This is rare, and the result could be due to the small amount of data; with larger, more variable data, different results could be achieved.

With accuracies of 0.9996076 and 1, the trained models are fit to predict the classification of tweets. In further articles, we will deploy such models for ease of use and accessibility. The code for the above can be found on GitHub.

References

  1. Intro to rtweet: Collecting Twitter Data
  2. Tweets Classification and Sentiment Analysis for Personalized Tweets Recommendation
  3. Dependent Sentiment Classification on Twitter