How to Use Python and scikit-learn for Text Classification

Intended audience: who the tutorial is for

This tutorial is aimed at Python developers (beginner and intermediate level) who want to learn basic text classification using Python and open source machine learning and data analysis libraries. Machine learning applications include natural language text classification and image or visual recognition. This tutorial teaches readers how to use basic Python programming and libraries to perform text classification with scikit-learn. Machine learning is principally concerned with extracting knowledge from data. It draws on several fields, including statistics, artificial intelligence, and computer science, and is also known as predictive analytics or statistical learning. Machine learning is used, for example, in space science to study star formation and discover distant planets, in physics to distinguish novel particles, and in medicine to analyze DNA sequences and provide personalized cancer treatments.

What the tutorial will cover:

In this tutorial the readers will learn how to use Python and scikit-learn to perform text classification on the IMDb movie reviews dataset. The readers will also learn how to use scikit-learn to represent text data for machine learning, perform feature extraction, represent text data as a bag of words, and use Logistic Regression and cross-validation to train models and improve their accuracy.

Part 1: Introduction to Python for machine learning

Why use python in machine learning?

Python is very useful because it combines the power of general-purpose programming languages with the ease of use of domain-specific scripting languages like MATLAB or R, through libraries that can be easily installed. The Python programming language is open source and has numerous libraries or modules for loading datasets, visualization, statistical analysis, natural language processing, and image processing. Examples include the scientific Python libraries NumPy and SciPy, and data analysis and visualization libraries such as Matplotlib, pandas, and Seaborn.

These Python libraries provide data scientists and software engineers with a large collection of general- and special-purpose functionality. One of the main benefits of using Python is the ability to interact directly with the code, via the Python interpreter or tools like Jupyter Notebook (a browser-based interactive programming environment). Machine learning and data exploration are fundamentally iterative processes, in which each analysis informs the next, so this kind of interactivity matters. As a general-purpose programming language, Python also permits the creation of complex graphical user interfaces (GUIs) and web services that can be integrated into existing systems.

Important python libraries for data analysis and visualization

  • Scikit-learn (sklearn) is a valuable and robust library for machine learning in Python, with an assortment of tools for machine learning and statistical modelling, including classification, regression, clustering, and dimensionality reduction, all behind a consistent interface. Sklearn is mostly written in Python. NumPy and SciPy are installed alongside scikit-learn as core dependencies.
  • Jupyter Notebook is a browser-based interactive environment that allows users to write and execute Python code directly in the browser.
  • NumPy is one of the fundamental packages for scientific computing in Python. It contains functionality for multidimensional arrays, high-level mathematical functions such as linear algebra operations and the Fourier transform, and pseudorandom number generators.
  • SciPy is a collection of functions for scientific computing in Python. It provides, among other functionality, advanced linear algebra routines, mathematical function optimization, signal processing, special mathematical functions, and statistical distributions.
  • matplotlib is the primary scientific plotting library in Python. It provides functions for making publication-quality visualizations such as line charts, histograms, and scatter plots.
  • pandas is a Python library for data wrangling and analysis.

Installing Jupyter Notebook and Scikit-learn

On the command prompt, run the following commands to install Jupyter Notebook and scikit-learn:

  • pip install jupyter
  • pip install scikit-learn

To open Jupyter Notebook, run the command below on the command prompt:

  • jupyter notebook

The Jupyter server will display a message similar to this one (the exact file path and token will differ on your machine):

To access the notebook, open this file in a browser:
    file:///C:/Users/<your-user>/AppData/Roaming/jupyter/runtime/nbserver-<pid>-open.html

    Or copy and paste one of these URLs:
    http://localhost:8888/?token=<your-token>

  • Copy and paste any one of the displayed URLs into your browser to open Jupyter Notebook.
  • Click on New on the right side of the Jupyter Notebook and click Python (ipykernel) to create a new python 3 notebook.
  • Click File on the left side of the Jupyter Notebook menu, click Save as, and name the notebook text_classification. This is where we will be writing and executing the code.

Part 2: Representing text data for machine learning

Machine learning algorithms require datasets, e.g. spreadsheets (CSV files), images, or video. The rows in a dataset are referred to as samples (or data points), while the columns (the properties that describe the data points) are referred to as the dataset features. Feature extraction or feature engineering is the process of constructing a good representation of the data, which will be used to train the algorithm. Features can be categorical (items from a fixed list), continuous (features that describe a quantity), or text data (strings made up of characters).

Examples of continuous features: pixel brightness and size measurements of plant flowers. Examples of categorical features: product brand, product colour, department store labels (books, clothing, and electronics).

The four kinds of string data that you will encounter:

Categorical data (items from a fixed list) e.g. a fixed list of colour choices

Free strings (that can be semantically mapped to categories) e.g. an unlimited number of colour choices that can be categorized into primary, secondary and tertiary colours

Structured string data e.g. addresses, location names, people names, dates, telephone numbers and other personal identifiers.

Text data (consists of phrases or sentences) e.g. tweets, chat logs, reviews, the collected works of Shakespeare, website content from sites such as Wikipedia, or the Project Gutenberg collection of 50,000 ebooks.

Important python methods to prepare or analyse strings or text data for machine learning

  • list() : the list() method can be used to get all the individual characters from a string. This function returns all the characters and whitespaces as a list.
  • s.startswith(): this method checks if a particular string ‘s’ begins with a given substring, for example to check whether a word appears at the start of a larger text.
  • s.endswith(): This method checks if a particular string is present at the end of another string ‘s’.
  • s.isupper(): this method checks if all the characters in a string ‘s’ are in upper case or not and it returns a True or False value.
  • s.islower(): this method checks if all the characters in a string are lower case or not. It returns a True or False value.
  • t in s: where ‘t’ is a substring and ‘s’ is the string, the keyword “in” can be used to check if ‘t’ is present in ‘s’. This can be used to find some string in a larger text, or check if the word we need is present in a larger paragraph.
  • s.istitle(): This method checks if a particular text is in title format. For example, “United States”. Basically, all the first letters of all words must be capital for it to be in title format.
  • s.isalpha(): this method checks if all the characters in a string are alphabetic.
  • s.isdigit(): this method checks if all the characters in a string are digits.
  • s.isalnum(): this method checks if a string contains only letters and digits. If special characters are present, False is returned.
  • s.lower(): This method converts all the characters of the string to lowercase. This function is used when we want uniformity in our data.
  • s.upper():this method converts all lowercase characters of a string to uppercase.
  • s.title(): this method converts the first letter of each word to uppercase.
  • s.split(): this method splits a larger text into smaller pieces based on a chosen character, which serves as the split point.
  • s.join(): this method joins the strings in an iterable (such as a list) into a single string, using ‘s’ as the separator.
  • s.strip(): this method removes whitespaces around a text.
  • s.rstrip(): This method removes whitespaces, but only from the end of the string.
  • s.find(): this method can be used to find a particular substring in a larger string. The function returns the location of the search query string.
  • s.splitlines(): this method splits a large string into individual lines, for example when extracting large amounts of text from a web source.
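The methods above can be chained to clean and inspect text. A short illustration (the sample string here is just for demonstration):

```python
s = "  Machine Learning with Python  "

t = s.strip()                   # remove surrounding whitespace
print(t.startswith("Machine"))  # True
print(t.istitle())              # False: "with" is not capitalized
words = t.lower().split()       # ['machine', 'learning', 'with', 'python']
print("python" in t.lower())    # True
print("-".join(words))          # machine-learning-with-python
```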

Common methods used in Scikit-learn

Machine learning algorithms build models from input data. Most machine learning algorithms in scikit-learn expect the data to be stored in a two-dimensional array or matrix. These arrays can be NumPy arrays or, in some circumstances, scipy.sparse matrices. The expected shape of the array is [n_samples, n_features], where n_samples is the number of samples: each sample is an item to process (e.g. classify). A sample can be a text, an image, a sound recording, a video, a planetary object, a row in a database or CSV file, or any fixed set of quantitative traits.

n_features is the number of features or distinctive qualities that can be used to describe each entry in a measurable way. Features are commonly real-valued, but may be boolean or discrete-valued in some circumstances. The number of features must be fixed in advance. It can run into millions, with most values being zero for a given sample; in such a case, scipy.sparse matrices can be beneficial because they are much more memory-efficient than NumPy arrays.
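To sketch why sparse matrices help: a mostly-zero NumPy array converted to SciPy's Compressed Sparse Row (CSR) format stores only its non-zero entries.

```python
import numpy as np
from scipy import sparse

eye = np.eye(4)                         # 4x4 identity matrix, mostly zeros
sparse_matrix = sparse.csr_matrix(eye)  # stores only the 4 non-zero entries
print(repr(sparse_matrix))
```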

Irrespective of the classifier used, scikit-learn recommends the following common methods to process data:

model.fit() fit training data. For supervised learning applications, it accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, it accepts only a single argument, the data X (e.g. model.fit(X)).

model.predict() given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.

model.predict_proba() For classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().

model.score() For classification or regression problems, most estimators implement a score method. Scores are between 0 and 1; a larger score indicates a better fit.

model.transform() Given an unsupervised model, transform new data into the new basis. It also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.

model.fit_transform() Some estimators implement this method, which performs a fit and a transform on the same input data more efficiently.
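A minimal sketch of these methods, using the small iris dataset that ships with scikit-learn in place of text data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)     # X has shape [n_samples, n_features]

model = LogisticRegression(max_iter=200)
model.fit(X, y)                       # supervised: data X and labels y
pred = model.predict(X[:5])           # predicted label for each sample
proba = model.predict_proba(X[:5])    # probability of each class per sample
score = model.score(X, y)             # accuracy, between 0 and 1
print(pred, score)
```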

Part 3: Sentiment Analysis of IMDb Movie Reviews

In this example, we will use a dataset of movie reviews from the IMDb (Internet Movie Database) website collected by Stanford researcher Andrew Maas. This dataset contains the movie reviews in text format. Each of the reviews has a label that indicates whether the review is “positive” or “negative.” The IMDb website ratings are from 1 to 10.

  • First download the IMDb dataset archive (aclImdb_v1.tar.gz) from the IMDb dataset page
  • Extract the contents of the aclImdb_v1.tar.gz archive into a new folder on your desktop
  • Open the newly created folder; you will find the aclImdb_v1 archive inside
  • Copy the aclImdb_v1 archive and paste it into the scikit-learn datasets folder located in the site-packages folder where your Python is installed

For Windows, the datasets folder is typically at a path like:

C:\Users\<your-user>\AppData\Local\Programs\Python\Python3x\Lib\site-packages\sklearn\datasets
  • Extract the contents of the aclImdb_v1 zip folder
  • Open the aclImdb folder, you will see two folders labelled test and train

The dataset is a two-class classification dataset where reviews with a score of 6 or higher are labelled as positive, and those below 6 are labelled as negative. After extracting the data, the dataset is provided as text files in two separate folders, one for the training data and one for the test data. Each of these in turn has two subfolders, one called pos and one called neg. The pos folder contains all the positive reviews, each as a separate text file, and likewise the neg folder contains all the negative reviews.

Scikit-learn has a helper function to load files stored in a hierarchical folder structure, where each subfolder corresponds to a label, called load_files.

To load the IMDb dataset apply the load_files function first to the training data:

First open the command prompt and run the command below from the folder where your Python is installed

jupyter notebook

Copy the URL displayed on the command prompt window into your browser (e.g. Firefox or Chrome) to open Jupyter Notebook

Copy and paste the code below into Jupyter Notebook and click the ‘Run’ button. It might take a while to produce the output depending on your computer. Wait till you see the output.



type of text_train: <class 'list'>

length of text_train: 75000


b”Amount of disappointment I am getting these days seeing movies like Partner, Jhoom Barabar and now, Heyy Babyy is gonna end my habit of seeing first day shows.<br /><br />The movie is an utter disappointment because it had the potential to become a laugh riot only if the d\xc3\xa9butant director, Sajid Khan hadn’t tried too many things. Only saving grace in the movie were the last thirty minutes, which were seriously funny elsewhere the movie fails miserably. First half was desperately been tried to look funny but wasn’t. Next 45 minutes were emotional and looked totally artificial and illogical.<br /><br />OK, when you are out for a movie like this you don’t expect much logic but all the flaws tend to appear when you don’t enjoy the movie and thats the case with Heyy Babyy. Acting is good but thats not enough to keep one interested.<br /><br />For the positives, you can take hot actresses, last 30 minutes, some comic scenes, good acting by the lead cast and the baby. Only problem is that these things do not come together properly to make a good movie.<br /><br />Anyways, I read somewhere that It isn’t a copy of Three men and a baby but I think it would have been better if it was.”

text_train is a list of length 75,000, where each entry is a string containing a review. The printed review at index 1 contains some HTML line breaks (<br />). The breaks are unlikely to have a large effect on the machine learning models, but it is better to clean the data and remove this formatting:

Copy and paste the code below on Jupyter Notebook and click the ‘Run’ button. Wait till you see the output.


The labelled part of the dataset is balanced: there are as many positive reviews as negative reviews. (The third count in the output corresponds to the unlabelled “unsup” folder that is also included in the train directory.)

Copy and paste the code below on Jupyter Notebook and click the ‘Run’ button. Wait till you see the output.



Samples per class (training): [12500 12500 50000]

Load the test dataset in test folder

Copy and paste the code below on Jupyter Notebook and click the ‘Run’ button. Wait till you see the output.



Number of documents in test data: 25000

Samples per class (test): [12500 12500]

The machine learning task you want to solve is as follows: given a review, assign the label “positive” or “negative” based on the text content of the review. This is a typical binary classification task. However, the text data is not in a format that a machine learning model can handle: you need to convert the string representation of the text into a numeric representation that your machine learning algorithms can work with.

Part 4: Representing Text Data as a Bag of Words

The bag-of-words representation is one of the simplest and most common methods used to represent text for machine learning. In this representation, you discard the structure of the input text, such as sections, subsections, sentences, and formatting, and only count how often each word appears in each document in the corpus. Discarding the structure and counting only word occurrences is what is meant by representing text as a “bag” of words.

Calculating the bag-of-words representation for a corpus of documents entails three steps:

1. Tokenization. Divide each text into tokens, for instance by dividing them based on whitespace and punctuation.

2. Vocabulary building. Collect a vocabulary of all words that appear in any of the documents, and number them (say, in alphabetical order).

3. Encoding. For each document, count how often each word in the vocabulary appears in that document.
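These three steps can be sketched on a toy corpus of two sentences (the corpus here is illustrative; note that the default tokenizer drops single-letter words like "a"):

```python
from sklearn.feature_extraction.text import CountVectorizer

bards_words = ["The fool doth think he is wise,",
               "but the wise man knows himself to be a fool"]

vect = CountVectorizer()
vect.fit(bards_words)                       # tokenization + vocabulary building
bag_of_words = vect.transform(bards_words)  # encoding

print("Vocabulary size: {}".format(len(vect.vocabulary_)))  # 13
print(bag_of_words.toarray())
```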

Applying Bag-of-Words to the train IMDb Dataset

The bag-of-words representation is implemented in CountVectorizer, which is a transformer.

First apply it to the IMDb text_train dataset that you compiled earlier:




<75000x124255 sparse matrix of type '<class 'numpy.int64'>'

      with 10315542 stored elements in Compressed Sparse Row format>

The shape of X_train, the bag-of-words representation of the training data, is 75000x124255, indicating that the vocabulary contains 124,255 entries. The data is stored as a SciPy sparse matrix.

An alternative way to access the vocabulary is the get_feature_names method of the vectorizer (get_feature_names_out in newer versions of scikit-learn), which returns a list where each entry corresponds to one feature:



Number of features: 124255

First 20 features:

[’00’, ‘000’, ‘0000’, ‘0000000000000000000000000000000001’, ‘0000000000001’, ‘000000001’, ‘000000003’, ‘00000001’, ‘000001745’, ‘00001’, ‘0001’, ‘00015’, ‘0002’, ‘0007’, ‘00083’, ‘000ft’, ‘000s’, ‘000th’, ‘001’, ‘002’]

Features 20010 to 20030:

[‘cheapen’, ‘cheapened’, ‘cheapening’, ‘cheapens’, ‘cheaper’, ‘cheapest’, ‘cheapie’, ‘cheapies’, ‘cheapjack’, ‘cheaply’, ‘cheapness’, ‘cheapo’, ‘cheapozoid’, ‘cheapquels’, ‘cheapskate’, ‘cheapskates’, ‘cheapy’, ‘chearator’, ‘cheat’, ‘cheata’]

Every 2000th feature:

[’00’, ‘_require_’, ‘aideed’, ‘announcement’, ‘asteroid’, ‘banquière’, ‘besieged’, ‘bollwood’, ‘btvs’, ‘carboni’, ‘chcialbym’, ‘clotheth’, ‘consecration’, ‘cringeful’, ‘deadness’, ‘devagan’, ‘doberman’, ‘duvall’, ‘endocrine’, ‘existent’, ‘fetiches’, ‘formatted’, ‘garard’, ‘godlie’, ‘gumshoe’, ‘heathen’, ‘honoré’, ‘immatured’, ‘interested’, ‘jewelry’, ‘kerchner’, ‘köln’, ‘leydon’, ‘lulu’, ‘mardjono’, ‘meistersinger’, ‘misspells’, ‘mumblecore’, ‘ngah’, ‘oedpius’, ‘overwhelmingly’, ‘penned’, ‘pleading’, ‘previlage’, ‘quashed’, ‘recreating’, ‘reverent’, ‘ruediger’, ‘sceme’, ‘settling’, ‘silveira’, ‘soderberghian’, ‘stagestruck’, ‘subprime’, ‘tabloids’, ‘themself’, ‘tpf’, ‘tyzack’, ‘unrestrained’, ‘videoed’, ‘weidler’, ‘worrisomely’, ‘zombified’]

From the output you can observe that the first 20 entries in the vocabulary are all numbers. These numbers appear somewhere in the reviews, and are therefore extracted as words. Most of these numbers have no immediate semantic significance, apart from “007”, which probably refers to James Bond’s agent code name.

Features 20010 to 20030 comprise a collection of English words starting with “che”. You will notice that the singular and plural forms of “cheapie” and “cheapskate” appear in the vocabulary as separate words. These words have closely related semantic meanings, so counting them as distinct words, corresponding to different features, makes little sense.

To improve on this, first get a quantitative measure of performance by training a classifier on the training labels stored in y_train and the bag-of-words representation of the training data stored in X_train. For high-dimensional, sparse data like the IMDb dataset, a linear model such as Logistic Regression is the best choice: you are using a set of independent variables to estimate a categorical dependent variable, such as positive or negative in this tutorial, 0 or 1, yes or no, true or false. Cross-validation (cv) is used to check the accuracy of supervised models on unseen data. The simplest form of cross-validation is k-fold validation. For instance, if k = 3, the training data is separated into three segments; each of the three segments is used in turn for testing while the remaining two segments are used for training. max_iter denotes the maximum number of iterations (epochs) the solver runs; the default is 100. It is set to 200 here because our dataset is large, with 124,255 features.

Begin by applying Logistic Regression using cross-validation (cv):



Mean cross-validation accuracy: 0.70

***Warning*** Ignore any warning messages, which are repeated in every iteration; you will see the output after the code has run to completion. Depending on your computer’s processing power, the code might run for an hour or more. Be patient and do something else while you wait for the output.

The output shows a mean cross-validation accuracy of 70%, which is reasonable but not good enough to make accurate predictions, especially for a balanced binary classification task.

You can tune the performance via the Logistic Regression regularization parameter, C, using cross-validation:



Best cross-validation score: 0.71

Best parameters:  {‘C’: 0.1}

The results show a best cross-validation score of 71% using C=0.1. The accuracy of the model has increased slightly.

You can now evaluate the generalization performance of this parameter setting on the test set:




The output shows that this parameter setting (C=0.1) generalizes poorly to the test set and therefore cannot yet be used to make accurate predictions.

You can try to improve the performance by improving the extraction of words.

CountVectorizer extracts tokens using a regular expression: it finds all sequences of at least two word characters (letters or digits, \w) separated by word boundaries (\b). It does not find single-letter words, and it splits up contractions like “doesn’t”, but it recognizes a word like “h8ter” as a single token. CountVectorizer then converts all words to lowercase, so that “look”, “Look”, and “lOok” are all mapped to the same token and feature.

To remove uninformative features (like the numbers), only use tokens that appear in more than one document. Tokens that appear in only a single document are unlikely to appear in the test set and are therefore uninformative.

You can set the minimum number of documents a token needs to appear in with the min_df parameter:



X_train with min_df: <75000x44532 sparse matrix of type '<class 'numpy.int64'>'

      with 10191240 stored elements in Compressed Sparse Row format>

By requiring at least five appearances of each token (min_df=5), the number of features is reduced to 44,532, about a third of the original, as the output above shows.

Inspect some of the tokens once more:



First 50 features:

[’00’, ‘000’, ‘001’, ‘007’, ’00am’, ’00pm’, ’00s’, ’01’, ’02’, ’03’, ’04’, ’05’, ’06’, ’07’, ’08’, ’09’, ’10’, ‘100’, ‘1000’, ‘1001’, ‘100k’, ‘100th’, ‘100x’, ‘101’, ‘101st’, ‘102’, ‘103’, ‘104’, ‘105’, ‘106’, ‘107’, ‘108’, ‘109’, ’10am’, ’10pm’, ’10s’, ’10th’, ’10x’, ’11’, ‘110’, ‘1100’, ‘110th’, ‘111’, ‘112’, ‘1138’, ‘115’, ‘116’, ‘117’, ’11pm’, ’11th’]

Features 20010 to 20030:

[‘inert’, ‘inertia’, ‘inescapable’, ‘inescapably’, ‘inevitability’, ‘inevitable’, ‘inevitably’, ‘inexcusable’, ‘inexcusably’, ‘inexhaustible’, ‘inexistent’, ‘inexorable’, ‘inexorably’, ‘inexpensive’, ‘inexperience’, ‘inexperienced’, ‘inexplicable’, ‘inexplicably’, ‘inexpressive’, ‘inextricably’]

Every 700th feature:

[’00’, ‘accountability’, ‘alienate’, ‘appetite’, ‘austen’, ‘battleground’, ‘bitten’, ‘bowel’, ‘burton’, ‘cat’, ‘choreographing’, ‘collide’, ‘constipation’, ‘creatively’, ‘dashes’, ‘descended’, ‘dishing’, ‘dramatist’, ‘ejaculation’, ‘epitomize’, ‘extinguished’, ‘figment’, ‘forgot’, ‘garnished’, ‘goofy’, ‘gw’, ‘hedy’, ‘hormones’, ‘imperfect’, ‘insomniac’, ‘janitorial’, ‘keira’, ‘lansing’, ‘linfield’, ‘mackendrick’, ‘masterworks’, ‘miao’, ‘moorehead’, ‘natassia’, ‘nude’, ‘ott’, ‘particulars’, ‘phillipines’, ‘pop’, ‘profusely’, ‘raccoons’, ‘redolent’, ‘responding’, ‘ronno’, ‘satirist’, ‘seminal’, ‘shrews’, ‘smashed’, ‘spendthrift’, ‘stocked’, ‘superman’, ‘tashman’, ‘tickets’, ‘travelling’, ‘uncomfortable’, ‘uprising’, ‘vivant’, ‘whine’, ‘x2’]

You will notice that some of the obscure words and spelling mistakes have been eliminated. Evaluate the cross-validation score of the training data by performing the grid search once more:



Best cross-validation score: 0.71

The best validation accuracy of the grid search remains unchanged at 71%. Removing the obscure words and spelling mistakes didn’t improve the model, but it reduces processing time and may also make the model easier to interpret.


Aside from using min_df, an alternative way to get rid of uninformative words is to discard words that occur too frequently to be useful. The two main approaches are using a curated list of stopwords, or removing words that appear too often. Scikit-learn has a built-in list of English stopwords in the feature_extraction.text module:
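A sketch of inspecting the built-in list (it is a frozenset, so the iteration order of the sampled words may differ from run to run):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))
print("Every 10th stopword:\n{}".format(list(ENGLISH_STOP_WORDS)[::10]))
```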



Number of stop words: 318

Every 10th stopword:

[‘me’, ‘he’, ‘latter’, ‘for’, ‘eleven’, ‘take’, ‘whether’, ‘inc’, ‘must’, ‘whither’, ‘hereafter’, ‘which’, ‘none’, ‘elsewhere’, ‘ltd’, ‘front’, ‘therefore’, ‘perhaps’, ‘since’, ‘seems’, ‘alone’, ‘otherwise’, ‘herein’, ‘your’, ‘four’, ‘twenty’, ‘thru’, ‘yours’, ‘everywhere’, ‘call’, ‘nowhere’, ‘without’]

Removing the words in the stopword list can reduce the number of features by at most 318 (the length of the list), and might lead to improved performance. To use the built-in list, pass stop_words="english" to CountVectorizer. You can also pass your own list of stop words.



X_train with stop words:

<75000x44223 sparse matrix of type '<class 'numpy.int64'>'

      with 6577418 stored elements in Compressed Sparse Row format>

The number of features in the dataset decreased by 309 (44,532 - 44,223), which means that 9 of the stop words did not appear in the dataset. Run the grid search once more:



Best cross-validation score: 0.71

The grid search performance did not show any significant improvement after removing the 309 stopwords: excluding 309 features out of over 44,000 is unlikely to change performance much. The removal of stop words matters more for smaller datasets, which have less data.


In this tutorial you have learnt how to use Python and scikit-learn to perform text classification on the IMDb movie reviews dataset. You have also learnt how to use scikit-learn to represent text data for machine learning, perform feature extraction, represent text data as a bag of words, and use Logistic Regression and cross-validation to train models and improve their accuracy. To improve the accuracy of the model further, you could increase the max_iter option, try different solvers (“liblinear”, “newton-cg”, “lbfgs”, “sag” and “saga”), remove words that recur too often, or adjust the max_df and min_df options of CountVectorizer to see how increasing or decreasing the number of features affects performance.
