# A Guide To Machine Learning Model Evaluation Metrics

Model evaluation is a key step in building machine learning projects: it helps a practitioner gauge which model is best suited to the problem at hand. To carry out model evaluation, different metrics exist to measure the performance of our models, and we can use their results to improve those models. As we work through this guide, we will see which evaluation metrics are best suited to which scenarios, and why.

REGRESSION PROBLEMS

1. R-Squared

R-squared indicates the proportion of variance in the dependent variable that is explained by the independent variable(s).
For example, if the R² is 0.70, 70% of the variation in the target can be explained by the model's inputs.

When there are too many predictors and higher-order polynomials in our model, R-squared tends to overfit the data and outputs high values. Another problem with R-squared is that its value increases whenever a predictor is added to the model, whether or not that predictor helps. In both cases, adjusted R-squared is preferred.

2. Adjusted R-Squared

This metric adjusts R-squared for the number of predictors in our machine learning model. Its value only goes up if an additional predictor improves the model by more than would be expected by chance. It is mostly used for multiple linear regression.
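Scikit-learn does not provide adjusted R-squared directly, but it can be computed from the ordinary R-squared. A minimal sketch, where `n` is the sample size and `p` the number of predictors (the example values are illustrative):

```python
def adjusted_r2(r2, n, p):
    """Penalize R-squared for the number of predictors p, given n samples."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With R-squared = 0.70, 100 samples, and 5 predictors:
print(adjusted_r2(0.70, n=100, p=5))  # ~0.684, slightly below the raw 0.70
```

Adding an uninformative predictor raises `p` without raising `r2` much, so the adjusted value falls, which is exactly the penalty we want.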

3. Mean Absolute Error(MAE)

This is the average of the absolute differences between predicted and actual values, so it measures the average error magnitude. The downside is that it does not tell us whether our model is overfitting or underfitting the data.

4. Mean Squared Error(MSE)

This metric is similar to MAE, except that here you take the average of the squared differences between predicted and actual values. Because squaring weights large errors more heavily, prefer MSE over MAE when reducing large errors is the priority. Taking the square root of MSE gives the Root Mean Squared Error (RMSE), which is in the same units as the target.

Here’s how to implement the above metrics in python:

```python
# import necessary modules
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

# generate dataset
X,y = make_regression(n_samples=500, n_features=5, random_state=0)

# split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.5, random_state=2)

# load and fit model to training set
model = Lasso()
model.fit(X_train,y_train)

# make prediction
y_pred = model.predict(X_test)

# calculate scores
print('R Square =', r2_score(y_test, y_pred))
print('MAE =', mean_absolute_error(y_test, y_pred))
print('MSE =', mean_squared_error(y_test, y_pred))
print('RMSE =', mean_squared_error(y_test, y_pred)**0.5)

> R Square = 0.9994653752036747
> MAE = 1.8102055921681555
> MSE = 5.115190416160957
> RMSE = 2.26167867217272

```

[Notebook](https://github.com/JeremiahKamama/Pima-Tutorial/blob/master/R-metrics.ipynb)

CLASSIFICATION PROBLEMS

1. Confusion Matrix

Let’s use a medical example of malaria tests (a positive case means the patient has malaria) to explain some terms.

True positive: a positive case correctly identified as positive.

True negative: a negative case correctly identified as negative.

False positive: a negative case incorrectly identified as positive.

False negative: a positive case incorrectly identified as negative.

True Negative Rate (TNR): the ratio of correctly predicted negative cases to the total number of actual negative cases.

True Positive Rate (TPR): the ratio of correctly predicted positive cases to the total number of actual positive cases.

False Positive Rate (FPR): the ratio of negative cases incorrectly predicted as positive to the total number of actual negative cases.

False Negative Rate (FNR): the ratio of positive cases incorrectly predicted as negative to the total number of actual positive cases.

Specificity: the ability of a model to correctly identify negative cases out of all actual negative cases; this is the same quantity as the TNR.

Sensitivity: the ability of a model to correctly identify positive cases out of all actual positive cases; this is the same quantity as the TPR.
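A minimal sketch of how the rates above fall out of scikit-learn's confusion matrix, on a made-up set of malaria test results (1 = infected, 0 = not infected; the labels are illustrative only):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 4 infected, 6 healthy patients
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]  # one missed case, two false alarms

# sklearn orders the binary matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # sensitivity: 3 of 4 infected patients caught
tnr = tn / (tn + fp)  # specificity: 4 of 6 healthy patients cleared
fpr = fp / (fp + tn)  # false alarms among the healthy
fnr = fn / (fn + tp)  # missed infections among the sick

print(tpr, tnr, fpr, fnr)  # 0.75, ~0.667, ~0.333, 0.25
```

Note that TPR + FNR = 1 and TNR + FPR = 1, so each pair describes the same split of the actual positives (or negatives) from two sides.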

2. Recall and Precision

This pair of metrics is used when there is a class imbalance problem (the classes are unequally distributed in the dataset). Recall is a measure of completeness: it is the percentage of actual positive samples that the model correctly identified. Precision is a measure of exactness: it is the percentage of samples predicted as positive that actually belong to the positive class.

The two are calculated as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

Here’s how to calculate precision and recall in python:

```python
# import necessary modules
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import fbeta_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score

# generate two class dataset
X,y = make_classification(n_samples=1000,n_classes=2,random_state=1)

# split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.5, random_state=2)

model = LogisticRegression()

# fit model
model.fit(X_train,y_train)

y_pred = model.predict(X_test)

# calculate scores
precision_score(y_test,y_pred)
> 0.8365019011406845

recall_score(y_test,y_pred)
> 0.8461538461538461

fbeta_score(y_test,y_pred,beta=1)
> 0.8413001912045889

```

[Notebook](https://github.com/JeremiahKamama/Pima-Tutorial/blob/master/Precision.ipynb)

3. F1 Score

To balance precision and recall, we can calculate the F1 score. This score is the harmonic mean of the two, F1 = 2 × (Precision × Recall) / (Precision + Recall), and measures how robust and precise our model is.

If we get an F1 score of 1, we have attained perfect precision and recall. This metric has a variety of applications in natural language processing as well as in information retrieval; in NLP, it is used for named entity recognition and word segmentation.

This score, however, does not take true negatives into account, so the Matthews correlation coefficient (MCC) is often preferred for measuring the performance of binary classifiers.
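An illustrative sketch of that difference, using made-up labels: a classifier that always predicts the majority class can still post a decent F1, while MCC, which also counts true negatives, stays at zero.

```python
from sklearn.metrics import f1_score, matthews_corrcoef

y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # imbalanced: 8 positives, 2 negatives
y_pred = [1] * 10                        # always predict the positive class

print(f1_score(y_true, y_pred))          # ~0.889: looks respectable
print(matthews_corrcoef(y_true, y_pred)) # 0.0: no true negatives at all
```

Because the degenerate predictor never produces a true negative, its MCC denominator collapses and scikit-learn reports 0, exposing a model that F1 alone would flatter.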

4. Area Under the Curve - Receiver Operating Characteristic (AUC-ROC Curve)

This classification performance measurement shows us how capable the model is in distinguishing between classes.

First, let’s calculate AUC-ROC in python. The example uses the Pima Indians Diabetes dataset; the CSV filename below is assumed, so adjust it to wherever your copy lives:

```python
# import libraries and necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, roc_curve

# load the Pima Indians Diabetes dataset (file path assumed)
df = pd.read_csv('diabetes.csv')
df.describe()

# drop target variable and scale training data
X = df.drop('Outcome', axis=1)
X = StandardScaler().fit_transform(X)
y = df['Outcome']

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = SVC()

parameters = [{'kernel': ['rbf'],
               'gamma': [1e-3, 1e-4],
               'C': [1, 10, 100, 1000]}]

# perform grid search
grid = GridSearchCV(estimator=model, param_grid=parameters, cv=5, scoring='roc_auc')
grid.fit(X, y)

grid.best_estimator_
> SVC(C=1000, gamma=0.001)

roc_auc = np.mean(cross_val_score(grid, X, y, cv=5, scoring='roc_auc'))
print('Score: {}'.format(roc_auc))
> Score: 0.8311460517120894

# load model with values from grid search
model = SVC(C=1000, gamma=0.001, probability=True)

# fit model
model.fit(X_train, y_train)

y_predict = model.predict(X_test)
y_prob = model.predict_proba(X_test)

# calculate roc_auc score
roc_auc_score(y_test, y_predict)
> 0.7280397022332507

# plot roc_auc curve
plt.figure(figsize=(10, 7))
fpr, tpr, thresh = roc_curve(y_test, y_prob[:, 1])
plt.plot(fpr, tpr, color='blue')
plt.plot([0, 1], [0, 1], 'r--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
```

The plot shows the ROC curve (blue), and the AUC is the area under that curve.

The ROC curve is useful for evaluating the performance of binary classifiers. It plots the TPR against the FPR at every classification threshold, and it can be used to identify the best threshold for making a decision. The AUC represents the degree of separability between the classes and is useful when deciding which classification method is better.
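One common way to pick a threshold from the ROC curve is Youden's J statistic, the point where TPR − FPR is largest. A self-contained sketch on a synthetic two-class problem (the dataset and model here are illustrative, not the Pima example above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

# synthetic binary problem and a simple probabilistic classifier
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=2)
probs = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

# scan every candidate threshold and keep the one maximizing TPR - FPR
fpr, tpr, thresh = roc_curve(y_test, probs)
best = np.argmax(tpr - fpr)  # Youden's J statistic
print('best threshold:', thresh[best])
```

The chosen threshold is the operating point on the curve farthest above the diagonal chance line.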

5. Gini Coefficient

This metric is derived from the area under the ROC curve via the relation Gini = 2 × AUC - 1. It can be used as a model selection metric to quantify the model’s performance.

6. Logarithmic/ cross-entropy Loss

Log loss is well suited to evaluating multi-class classification. It measures performance using probability estimates that lie between 0 and 1, penalizing confident predictions that turn out to be wrong. A perfect model scores a log loss of 0, while higher scores indicate poorer, less well-calibrated predictions.
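A minimal sketch of that penalty, with made-up labels and probabilities: one badly overconfident prediction is enough to blow up the average loss.

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]

confident = [0.9, 0.1, 0.8, 0.95]            # probabilities close to the truth
overconfident_wrong = [0.9, 0.1, 0.8, 0.01]  # last prediction confidently wrong

print(log_loss(y_true, confident))           # small loss
print(log_loss(y_true, overconfident_wrong)) # much larger loss
```

This is why log loss is preferred over plain accuracy when calibrated probabilities matter, such as in risk scoring.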


Feel free to leave your questions and feedback in the comments section.
