The inspiration behind this article is the fact that Data Science is still a relatively new field, which leaves many of those willing to venture into it a little confused. With so much information being offered online, choosing where to start is a gamble. This article should give you a general picture of what it is like to work on a Data Science project from scratch and, in turn, help you know what to look for when you go online.

In case I came off as pessimistic because of the opaque nature of this field, it is one of the highest-paying careers and hence very promising. I will be looking at a telco dataset that I recently came across. The reason for choosing this dataset is that it is industry-based and relatable. My approach will be intuitive, so you will not need any prerequisite knowledge to follow along.

Telco Receivers. 

By the end of this post, we will have covered concepts of:

  1. Basic statistics.
  2. Python's use in Data Science - Here you will see some of the tools used in Data Science.
  3. Data Science techniques - For example, Exploratory Data Analysis, cleaning, visualization and machine learning.
  4. Python as a programming language - Here you will write simple syntax in the Python programming language.

In case you wish to follow along (hands-on), you will need to:

  1. Download and install Anaconda, then open Jupyter Notebook: https://dataiseasy.com/2019/03/how-to-install-anaconda-and-jupyter-notebook/. You will need to install any missing packages yourself: https://www.tutorialspoint.com/add-packages-to-anaconda-environment-in-python
  2. Alternatively, you can use Google's Colab, which has the same setup as Jupyter Notebook: https://colab.research.google.com/. The advantage is that most packages already exist, so all you need to do is import them. To familiarize yourself with Colab, you can use the following link: https://www.youtube.com/watch?v=inN8seMm7UI

For this session, all the packages I need are already installed, so they will only be imported.

Where to get the data we will be working on:

https://www.kaggle.com/blastchar/telco-customer-churn. You will first need to log in before you can access the data under the 'Data' tab, as highlighted in blue in the figure below.

I would urge you to try to understand the process and code rather than copying blindly. Below are ten simple steps to follow to complete your project and arrive at an informative output.

STEP I: Create a project folder.

In any location on your computer that you see fit, create a folder and name it, for instance, Data Science. Inside the folder, place the dataset downloaded earlier from Kaggle.

Opening up Jupyter Notebook or Colab

For Jupyter, if you open it by searching from the Start icon at the bottom left of your computer, it will look as shown in the figure below.

Once opened, navigate to your project folder (Data Science). Once there, click New, then Python 3, to open a Jupyter notebook, as shown in the figure below.

Launching a new notebook.

A typical jupyter notebook will look like the one below.

Jupyter notebook

Setting up Colab is covered in the link given earlier. A typical Colab notebook will look like the one below.

Colab Notebook

Rename the notebook to something sensible by clicking on 'Untitled' and typing a new name. In Colab, however, do not remove the .ipynb extension while renaming.

It is important to note that code written in Jupyter also runs in Colab.

STEP II: Importing the necessary libraries.

The libraries needed are shown in the picture below.

figure 1: Importing libraries.

Python is equipped with numerous packages for statistical analysis, visualization and predictive analytics, covering both traditional and more recent techniques.

We will be using the following (a sketch of the corresponding import statements appears after the list):

· NumPy arrays: arrays are more efficient than Python lists for numerical operations because they are specialized objects with extensive optimization.

· Pandas, for data wrangling.

· Statsmodels for regression.

· Matplotlib and seaborn for visualization

· Sklearn for splitting our data and accuracy checking.
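Since figure 1 is an image, here is a minimal sketch of what those import statements typically look like (the aliases below are common conventions, not requirements from the original notebook):

import numpy as np                      # numerical arrays
import pandas as pd                     # data wrangling
import matplotlib.pyplot as plt         # plotting
import seaborn as sns                   # statistical visualization
import statsmodels.api as sm            # Logit regression
from sklearn.model_selection import train_test_split   # splitting the data
from sklearn.metrics import confusion_matrix, accuracy_score  # accuracy checking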

STEP III: Loading the data.

Python, like other statistical languages, needs data to be loaded into an Integrated Development Environment of choice before analysis can begin. This is done with the help of pandas, which was imported as pd in figure 1 above. Pandas comes with a range of functions; we will use the read_csv() function to read our dataset of choice, as seen in figure 2 below. Once loaded, the data comes in the form of a data frame. The head() function has been used to choose the number of rows to preview, in our case three rows (five is the default).

Figure 2: Loading dataset
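Figure 2 is an image, so here is a minimal sketch of the loading step, assuming the CSV keeps its original Kaggle file name and sits in the project folder. The later snippets refer to the loaded frame as both Telco_data and raw_data, so the alias below keeps both names pointing at the same data frame:

# Read the Kaggle CSV into a pandas DataFrame (file name assumed unchanged from Kaggle)
Telco_data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
raw_data = Telco_data  # alias used in the later snippets

# Preview the first three rows (head() shows five by default)
Telco_data.head(3)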

STEP IV: Data Understanding and Preparation (Cleaning).

Figure 3.
Figure 4.
Figure 5
  1. The describe() function and the dtypes attribute in figures 3, 4 and 5 have been used to help understand the data. Statistically, the first column, customerID, is an identifier variable and will not be useful to our analysis. Excluding it, there are 19 independent variables. Sixteen of them are categorical: gender, SeniorCitizen, Partner, Dependents, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling and PaymentMethod. The remaining three are continuous: tenure, MonthlyCharges and TotalCharges. The dependent variable is Churn, which is categorical.
  2. Data cleaning is an important step, as it paves the way for analysis and predictions. In the analysis, it can be noted that the TotalCharges column has empty rows and is of type object rather than float. Running Telco_data['TotalCharges'].astype(float) would result in an error because of the missing values, which cannot be seen with isnull().sum() as shown in figure 6. For this reason, the median of the column, which is about 1400, will be used as the filler for the missing values, as seen in figure 7. The median is used because it is statistically stable: it is not affected by outliers. The column can then be converted from string to float with Telco_data['TotalCharges'].astype(float).
  3. As seen in figure 3, the SeniorCitizen column has a minimum value of 0 and a maximum value of 1. Checking the column's unique values confirms that it contains only 0 and 1, as seen in figure 8. For this reason, it should be treated as a categorical variable rather than a continuous one.
Figure 6: Checking for missing values.
Figure 7: Replacing missing values.
Telco_data['SeniorCitizen'].value_counts()

# See output below.
Figure 8: Senior citizen unique values.
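Taken together, the understanding and cleaning described above can be sketched as follows. Note that this sketch uses pd.to_numeric with errors='coerce' to expose the blank entries as NaN, which is a slightly different route to the same result as the fill-then-astype approach shown in figures 6 and 7; the median filler value should be verified on your own copy of the data:

# Overview of the data: summary statistics and column types (figures 3-5)
Telco_data.describe()
Telco_data.dtypes

# TotalCharges is stored as text with some blank entries, so isnull() misses them.
# Converting to numeric turns the blanks into NaN.
Telco_data['TotalCharges'] = pd.to_numeric(Telco_data['TotalCharges'], errors='coerce')

# Fill the missing values with the column median (about 1400), which is robust to outliers
Telco_data['TotalCharges'] = Telco_data['TotalCharges'].fillna(Telco_data['TotalCharges'].median())

# SeniorCitizen only takes the values 0 and 1, so treat it as categorical (figure 8)
Telco_data['SeniorCitizen'].value_counts()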

STEP V: Data Analysis.

Data visualization is the pictorial presentation of data, used to communicate difficult concepts and trends. This, in turn, leads to more effective exploration of the data.

  1. We will visualize all 16 independent categorical variables against the dependent categorical variable, Churn. The most interesting of the 16 results are shown in figures 9-13 below.

· In figure 9, customers without partners are more likely to churn as compared to those with partners.

· In figure 10, the proportion of churning customers among those who use fibre optic is far higher than among those who use DSL or those without internet service.

· In figure 11, customers who do not have online backup are more likely to churn compared to those who do and those who have no internet access.

· In figure 12, customers who have a month-to-month contract have a higher chance of churning as compared to those with a one-year or a two-year contract.

· In figure 13, we can see that our dataset has more customers who stayed than customers who churned.

# Visual representation of all the categorical variables among the predictor variables.

cat_var = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'PaperlessBilling', 'PaymentMethod', 'Churn']

for i in cat_var:
    sns.catplot(x=i, hue='Churn', kind="count", palette="ch:.25", data=raw_data)
Figure 9
Figure 10
Figure 11
Figure 12.
Figure 13

2. We will have a look at how the three independent continuous variables plot against the Churn column. What we can see from figures 14-19 is:

· Figures 14, 15 and 16 show the distributions of the continuous variables. Most clients have low monthly and total charges. However, there are those with extreme values in the TotalCharges variable (figure 16).

· In figure 17, we see that the median monthly charge of customers who churn is quite high compared to those who do not, the median being the middle line in the box plot.

· In figure 18, we see that customers with lower tenure are much more likely to churn than those with high tenure.

· In figure 19, we see that customers with extremely high total charges are more likely to churn than those with lower total charges, the extreme charges being represented as dots.

figure 14.
figure 15.
figure 16.
figure 17.
figure 18.
figure 19.
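The plots in figures 14-19 can be reproduced with something like the sketch below. The seaborn function choices (histplot and boxplot) are assumptions on my part; histplot requires a recent seaborn version (0.11 or later), and the original figures may have been drawn slightly differently:

num_var = ['MonthlyCharges', 'tenure', 'TotalCharges']

# Distributions of the three continuous variables (figures 14-16)
for col in num_var:
    plt.figure()
    sns.histplot(raw_data[col], kde=True)
    plt.title(col)

# Box plots of each continuous variable split by Churn (figures 17-19)
for col in num_var:
    plt.figure()
    sns.boxplot(x='Churn', y=col, data=raw_data)
    plt.title(col + ' by Churn')

plt.show()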

STEP VI: Handling outliers.

Outliers play a major role in any regression: they tend to pull the model towards accommodating them, so removing them usually leads to better model performance. This can be achieved with proper identification thresholds. Figures 18 and 19 show tenure and TotalCharges having outliers on the upper side. For this reason, we will keep only the values below the 99th percentile of tenure and below the 90th percentile of TotalCharges, dropping the extreme top end of each.

The code below shows how outlier removal is done.

# Keep only rows with tenure below its 99th percentile
q = raw_data['tenure'].quantile(0.99)
raw_data = raw_data[(raw_data['tenure']<q)]

# Keep only rows with TotalCharges below its 90th percentile
q2 = raw_data['TotalCharges'].quantile(0.9)
raw_data = raw_data[(raw_data['TotalCharges']<q2)]

STEP VII: Mapping.

With most of the cleaning out of the way, mapping is an important step: it converts the categorical text values into numeric codes so that the logistic regression algorithm can work with them. The lines of code below show how it is done programmatically.

raw_data['gender'] = raw_data['gender'].map({'Male':0,"Female":1})
raw_data['Partner'] = raw_data['Partner'].map({'No':0,"Yes":1})
raw_data['Dependents'] = raw_data['Dependents'].map({'No':0,"Yes":1})
raw_data['PhoneService'] = raw_data['PhoneService'].map({'No':0,"Yes":1})
raw_data['MultipleLines'] = raw_data['MultipleLines'].map({'No':0,"Yes":1,"No phone service":2})
raw_data['InternetService'] = raw_data['InternetService'].map({'Fiber optic':0, "DSL":1, "No":2})
raw_data['OnlineSecurity'] = raw_data['OnlineSecurity'].map({'No':0, "Yes":1, "No internet service":2})
raw_data['OnlineBackup'] = raw_data['OnlineBackup'].map({'No':0, "Yes":1, "No internet service":2})
raw_data['DeviceProtection'] = raw_data['DeviceProtection'].map({'No':0, "Yes":1, "No internet service":2})
raw_data['TechSupport'] = raw_data['TechSupport'].map({'No':0, "Yes":1, "No internet service":2})
raw_data['StreamingTV'] = raw_data['StreamingTV'].map({'No':0, "Yes":1, "No internet service":2})
raw_data['StreamingMovies'] = raw_data['StreamingMovies'].map({'No':0, "Yes":1, "No internet service":2})
raw_data['Contract'] = raw_data['Contract'].map({"Month-to-month":0, 'Two year':1, "One year":2})
raw_data['PaperlessBilling'] = raw_data['PaperlessBilling'].map({'No':0,"Yes":1,})
raw_data['PaymentMethod'] = raw_data['PaymentMethod'].map({'Electronic check':0, "Mailed check":1, "Bank transfer (automatic)":2, "Credit card (automatic)":3})
raw_data['Churn'] = raw_data['Churn'].map({'No':0,"Yes":1,})
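The block above spells every column out, which keeps the encoding explicit. As a design note, the same result can be written more compactly with a single dictionary of mappings passed to pandas' replace(); this is just an equivalent sketch, not what the original notebook does:

# Equivalent, more compact encoding using one dictionary of per-column mappings
yes_no = {'No': 0, 'Yes': 1}
three_level = {'No': 0, 'Yes': 1, 'No internet service': 2}

mappings = {
    'gender': {'Male': 0, 'Female': 1},
    'Partner': yes_no, 'Dependents': yes_no, 'PhoneService': yes_no,
    'PaperlessBilling': yes_no, 'Churn': yes_no,
    'MultipleLines': {'No': 0, 'Yes': 1, 'No phone service': 2},
    'InternetService': {'Fiber optic': 0, 'DSL': 1, 'No': 2},
    'OnlineSecurity': three_level, 'OnlineBackup': three_level,
    'DeviceProtection': three_level, 'TechSupport': three_level,
    'StreamingTV': three_level, 'StreamingMovies': three_level,
    'Contract': {'Month-to-month': 0, 'Two year': 1, 'One year': 2},
    'PaymentMethod': {'Electronic check': 0, 'Mailed check': 1,
                      'Bank transfer (automatic)': 2, 'Credit card (automatic)': 3},
}

raw_data = raw_data.replace(mappings)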

A glance at the clean data frame in figure 20 below shows that all the inputs are in numeric format after mapping.

figure 20.

STEP VIII: Preparing clean dataset for regression.

  1. Declaring the inputs and target variables is important for regression. Figure 23 shows how it is done: first, drop the dependent column Churn and assign the result to the inputs variable, then assign the Churn column to the target variable.
  2. The main reason for splitting the data before training a model is to get an honest evaluation of the model's performance on the test data afterwards. From figure 13, it is clear that the 'No's in the Churn variable outweigh the 'Yes's. For this reason, it is important to split the data randomly to avoid bias towards the 'No's, as shown in figure 21 below and in the sketch that follows it. The train_test_split function is imported from sklearn.model_selection and takes arguments such as x and y, the test size, and a random state that keeps the split consistent across multiple executions of the code despite being random.
figure 21.
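Since figure 21 is an image, here is a sketch of the declaration and the split. The variable names, the 80/20 split and the random_state value are assumptions; dropping customerID follows from step IV, where it was flagged as an identifier that is not useful for the analysis:

# Inputs: everything except the identifier and the target; target: Churn
x = raw_data.drop(['customerID', 'Churn'], axis=1)
y = raw_data['Churn']

# Random split into training and testing sets; random_state keeps the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)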

STEP IX: Regression.

Logistic regression is our simple model of choice in this case. In Python, it can be achieved quite easily by calling statsmodels' Logit function on the training data and then fitting the model, as seen in figure 22.
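As a sketch of what figure 22 shows, using the statsmodels Logit class (adding an explicit constant term is a common statsmodels step; whether the original notebook did so is an assumption):

# Add an intercept column, then fit a logistic regression on the training data
x_train_const = sm.add_constant(x_train)
logit_model = sm.Logit(y_train, x_train_const)
results = logit_model.fit()

# Log-Likelihood, LLR p-value, coefficients and p-values (figures 22 and 23)
print(results.summary())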

Interpretation of the logistic model.

The result is a seven-iteration optimization, as seen in figure 22, after which no further changes are observed in the model. The model summary shows several values, the most important for us being the Log-Likelihood and the LLR p-value, which show that our model is significant. In figure 23, the weights and p-values of the different inputs are displayed. Statistically, a p-value above 0.05 suggests that a column is insignificant to the model and could be removed. However, the weights of the insignificant variables are so close to 0 that their impact on the model can be overlooked, so they are kept in the model.

figure 22.
figure 23.

STEP X: Prediction.

Prediction is done on the testing data produced by the train_test_split function. The true values are compared to the predicted values using the confusion matrix, which is a table as shown in figure 24. The confusion matrix reads as follows: actual 0's predicted as 0's are 763 entries, actual 0's predicted as 1's are 97, actual 1's predicted as 0's are 144, and actual 1's predicted as 1's are 199. Finally, the accuracy of the model is calculated from the confusion matrix. The formula for accuracy, as seen in figure 24, is actual 1's predicted as 1's plus actual 0's predicted as 0's, divided by all the entries of the confusion matrix. In this case, the accuracy is 79.96%.

figure 24.
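A sketch of the prediction and accuracy check in figure 24, assuming the statsmodels results object from the previous step. The 0.5 threshold for turning predicted probabilities into classes is a common default, not something stated in the text:

# Predict churn probabilities for the test set, then threshold at 0.5 to get classes
x_test_const = sm.add_constant(x_test)
pred_probs = results.predict(x_test_const)
pred_classes = (pred_probs >= 0.5).astype(int)

# Compare true and predicted values
cm = confusion_matrix(y_test, pred_classes)
print(cm)

# Accuracy = (true negatives + true positives) / all entries in the matrix
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
print(accuracy)  # the article reports about 79.96% on its split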

In figure 25 below, following the same procedure used to predict on the testing dataset, we can also predict on the training dataset. The accuracy is 79.16%. This step is important because it checks whether the model has been overfitted, which is not the case here: the model predicts the train and test sets with very similar accuracy, indicating that it generalizes well and can be used on new data.

figure 25.

The model should generalize well to any new data presented to it because the gap between its training and testing accuracy is negligible, showing little variance. In this model, the false positives (97) are fewer than the false negatives (144). Given that the goal is customer retention, the false negatives should ideally be the smaller of the two, since failing to flag customers who are about to churn is more damaging to the objective of the study than flagging loyal customers.

Conclusion.

The preparation (cleaning), analysis, statistical concepts, visualization and predictions shown here demonstrate the power of Data Science using statistical tools. Data Science, accompanied by domain knowledge and intuition, should be at the forefront of decision making, not only for telco companies but for all organizations looking to optimize their performance.
