A Guide To Machine Learning Model Evaluation Metrics

Model evaluation is an important step in building machine learning projects. It helps a practitioner gauge which machine learning model is best suited to the problem at hand. Different evaluation metrics exist to measure the performance of our models, and we can use their results to improve those models. As we go through this guide, we will see which evaluation metrics suit which scenario, and why.
REGRESSION PROBLEMS

1. R-Squared

R-squared is the proportion of the variance in the dependent variable that is explained by the independent variables in the model.
For example, if R² is 0.70, then 70% of the variation in the target can be explained by the model’s inputs.

When a model has too many predictors or higher-order polynomial terms, it tends to overfit the data, and R-squared rewards this with a misleadingly high value. A related problem is that R-squared never decreases when a predictor is added to the model, even if that predictor is irrelevant. In both cases, adjusted R-squared is preferred.

2. Adjusted R-Squared

This metric adjusts R-squared for the number of predictors in our machine learning model. Its value goes up only if an additional predictor improves the model by more than would be expected by chance, and goes down otherwise. It is mostly used for multiple linear regression.
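The adjustment can be written as a small helper. This is a minimal sketch, assuming the standard formula adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of samples and p the number of predictors. The function name and the example numbers are illustrative and not part of the code further down.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjust R-squared for the number of predictors in the model."""
    if n_samples - n_predictors - 1 <= 0:
        raise ValueError("need more samples than predictors + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical numbers: the same R-squared of 0.70 on 50 samples,
# reached once with 2 predictors and once with 10 predictors.
few = adjusted_r2(0.70, n_samples=50, n_predictors=2)
many = adjusted_r2(0.70, n_samples=50, n_predictors=10)

print("2 predictors :", round(few, 4))
print("10 predictors:", round(many, 4))
```

With the same raw R-squared, the model that needed more predictors gets the lower adjusted score.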

3. Mean Absolute Error (MAE)

This is the average of the absolute differences between predicted values and actual values. The metric shows how far the predictions are from the actual values, thus measuring the average error magnitude. The downside is that, because the sign of each error is discarded, we can’t tell whether the model is over-predicting or under-predicting.

Tutorial: Understanding Linear Regression and Regression Error Metrics
source: https://www.dataquest.io/blog/understanding-regression-error-metrics/

4. Mean Squared Error (MSE)

This metric is similar to MAE, except that it averages the squared differences between the predicted and actual values. Because squaring penalizes large errors more heavily, MSE is preferred over MAE when large errors are especially undesirable.

Machine learning: an introduction to mean squared error and regression lines                                 
source: https://www.freecodecamp.org/news/machine-learning-mean-squared-error-regression-line-c7dde9a26b93/
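To see why MSE reacts more strongly to large errors than MAE, here is a small sketch in plain Python. The two prediction lists are made up for illustration: both have the same total absolute error, but one concentrates it in a single large miss.

```python
def mae(y_true, y_pred):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10, 20, 30, 40]
spread_out = [11, 19, 31, 39]  # four errors of size 1
one_big = [10, 20, 30, 44]     # a single error of size 4

print("MAE:", mae(y_true, spread_out), mae(y_true, one_big))
print("MSE:", mse(y_true, spread_out), mse(y_true, one_big))
```

MAE is identical for both sets of predictions, while MSE is four times larger for the one with the single large miss.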

Here’s how to implement the above metrics in Python:

# import necessary modules
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

# generate dataset
X,y = make_regression(n_samples=500, n_features=5, random_state=0)

# split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=2)

# load and fit model to training set
model = Lasso()
model.fit(X_train, y_train)

# make prediction
y_pred = model.predict(X_test)

# calculate scores
print('R Square =', r2_score(y_test, y_pred))
print('MAE =', mean_absolute_error(y_test, y_pred))
print('MSE =', mean_squared_error(y_test, y_pred))
print('RMSE =', mean_squared_error(y_test, y_pred) ** 0.5)

> R Square = 0.9994653752036747
> MAE = 1.8102055921681555
> MSE = 5.115190416160957
> RMSE = 2.26167867217272


CLASSIFICATION PROBLEMS

1. Confusion Matrix

Let’s use a medical example of malaria tests with the assistance of the image below to explain some terms.

source: https://en.wikipedia.org/wiki/Sensitivity_and_specificity

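True positive: a patient who has malaria and is correctly identified as positive.

True negative: a patient who does not have malaria and is correctly identified as negative.

False positive: a patient who does not have malaria but is incorrectly identified as positive.

False negative: a patient who has malaria but is incorrectly identified as negative.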

True Negative Rate (TNR): the ratio of the number of correctly predicted negative cases to the number of actual negative cases.

TNR = TN / (TN + FP)

True Positive Rate (TPR): the ratio of the number of correctly predicted positive cases to the number of actual positive cases.

TPR = TP / (TP + FN)

False Positive Rate (FPR): the ratio of the number of incorrectly predicted positive cases to the number of actual negative cases.

FPR = FP / (FP + TN)

False Negative Rate (FNR): the ratio of the number of incorrectly predicted negative cases to the number of actual positive cases.

FNR = FN / (FN + TP)

Specificity: the ability of a model to correctly identify negative cases among all the actual negative cases. It is the same quantity as the true negative rate.

Sensitivity: the ability of a model to correctly identify positive cases among all the actual positive cases. It is the same quantity as the true positive rate.
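The four counts and the rates built from them can be read straight off a confusion matrix. Below is a minimal sketch using scikit-learn’s confusion_matrix; the labels are hypothetical malaria test results invented for illustration (1 = infected, 0 = not infected).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical results: 4 infected patients followed by 6 healthy ones
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary 0/1 labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # sensitivity
tnr = tn / (tn + fp)  # specificity
fpr = fp / (fp + tn)
fnr = fn / (fn + tp)

print("TP, TN, FP, FN =", tp, tn, fp, fn)
print("TPR =", tpr, "TNR =", tnr)
print("FPR =", fpr, "FNR =", fnr)
```

Note that TPR and FNR add up to 1, as do TNR and FPR, because each pair splits the same group of actual cases.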

2. Recall and Precision

This pair of metrics is used when there is a class imbalance problem (when the classes are unequally distributed in the dataset). Recall is the measure of completeness because it calculates the percentage of actual positive samples that the model identified. Precision is the measure of exactness because it calculates the percentage of samples predicted as positive which are actually in the positive class.

The two are calculated as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

Here’s how to calculate precision and recall in Python:

# import necessary modules
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import fbeta_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score

# generate two class dataset
X,y = make_classification(n_samples=1000,n_classes=2,random_state=1)

# split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=2)

# load model
model = LogisticRegression()

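# fit model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# calculate scores
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(fbeta_score(y_test, y_pred, beta=1))
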
> 0.8365019011406845

> 0.8461538461538461

> 0.8413001912045889 


3. F1 Score

To balance precision and recall in a single number, we can calculate the F1 score. This score is the harmonic mean of precision and recall, so it is high only when both of them are high.

If we get an F1 score of 1, it means that we have attained perfect precision and recall. This metric has a variety of applications in natural language processing as well as in information retrieval. In NLP, it is used for named entity recognition and word segmentation.

This score, however, does not take true negatives into account, so the Matthews correlation coefficient (MCC) is often preferred for measuring the performance of binary classifiers.
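The difference between the two scores shows up clearly on imbalanced data. The sketch below computes both from raw counts, using the standard formulas F1 = 2PR / (P + R) and MCC = (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function names and the counts are hypothetical: a model that labels almost everything as positive on a dataset with 95 positives and 5 negatives.

```python
import math


def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts: 95 actual positives, 5 actual negatives
tp, fn = 94, 1
tn, fp = 1, 4

f1 = f1_from_counts(tp, fp, fn)
mcc = mcc_from_counts(tp, tn, fp, fn)

print("F1  =", round(f1, 4))
print("MCC =", round(mcc, 4))
```

F1 looks excellent because it ignores the four out of five negatives that were misclassified, while MCC stays low because it uses all four cells of the confusion matrix.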

4. Area Under the Curve-Receiver Operating Characteristics (AUC-ROC Curve)

This classification performance measurement shows us how capable the model is of distinguishing between classes.

First, let’s calculate AUC-ROC in Python:

# import libraries and necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

# load and read data
df = pd.read_csv("diabetes.csv")

# drop target variable and scale the features
X = df.drop('Outcome', axis=1)
X = StandardScaler().fit_transform(X)
y = df['Outcome']

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# load model
model = SVC()

# hyperparameter grid for the RBF kernel
parameters = [{'kernel': ['rbf'],
               'gamma': [1e-3, 1e-4],
               'C': [1, 10, 100, 1000]}]

# perform grid search
grid = GridSearchCV(estimator=model, param_grid=parameters, cv=5, scoring='roc_auc')
grid.fit(X, y)

> GridSearchCV(cv=5, estimator=SVC(),
>              param_grid=[{'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001],
>                           'kernel': ['rbf']}],
>              scoring='roc_auc')

# best model found by the grid search
print(grid.best_estimator_)

> SVC(C=1000, gamma=0.001)

# cross-validated ROC AUC
roc_auc = np.mean(cross_val_score(grid, X, y, cv=5, scoring='roc_auc'))
print('Score: {}'.format(roc_auc))

> Score: 0.8311460517120894

# load model with values from grid search
model = SVC(C=1000, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
            decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
            probability=True, random_state=None, shrinking=True,
            tol=0.001, verbose=False)

# fit model
model.fit(X_train, y_train)

> SVC(C=1000, gamma=0.001, probability=True)

# predict classes and class probabilities
y_predict = model.predict(X_test)
y_prob = model.predict_proba(X_test)

# calculate roc_auc score from the positive-class probabilities
roc_auc = roc_auc_score(y_test, y_prob[:, 1])
print('ROC AUC: {}'.format(roc_auc))

# plot roc_auc curve
fpr, tpr, thresh = roc_curve(y_test, y_prob[:, 1])
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, color='blue')
plt.plot([0, 1], [0, 1], 'r--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

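The plot above shows the ROC curve (blue), while the AUC is the area under the blue curve. The dashed red diagonal marks the performance of a random classifier.

The ROC curve is useful in evaluating the performance of binary classifiers. It plots the true positive rate against the false positive rate at every classification threshold, so it can be used to identify the best threshold for making a decision. AUC represents the degree of separability: a value of 1 means the model separates the classes perfectly, while 0.5 is no better than random guessing. It is useful in deciding which classification method is better.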
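One way to pick a threshold from the ROC curve is sketched below. It assumes Youden’s J statistic (TPR - FPR) as the selection criterion, which is a common choice but not the only one, and it runs on a small hypothetical set of labels and scores rather than on the diabetes data above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.65, 0.70, 0.80, 0.95])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

# Youden's J = TPR - FPR; choose the threshold where it is largest
best_idx = np.argmax(tpr - fpr)
best_threshold = thresholds[best_idx]

print("AUC =", auc)
print("best threshold =", best_threshold)
print("TPR, FPR at that threshold =", tpr[best_idx], fpr[best_idx])
```

In practice the criterion should reflect the cost of each kind of error: a screening test may accept a higher false positive rate in exchange for catching more true positives.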

5. Gini Coefficient

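This metric is derived from the area under the ROC curve: Gini = 2 × AUC - 1. A value of 0 corresponds to a model that is no better than random and a value of 1 to a perfect one. It can be used as a model selection metric to quantify the model’s performance.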
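As a rough sketch of that relationship, the snippet below computes AUC as the share of (positive, negative) pairs that the model ranks correctly and then converts it with Gini = 2 × AUC - 1. The helper names and the four-sample dataset are made up for illustration.

```python
def auc_by_pairs(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc):
    """Gini coefficient under the 2 * AUC - 1 definition."""
    return 2 * auc - 1


# Hypothetical labels and scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = auc_by_pairs(y_true, y_score)
gini = gini_from_auc(auc)

print("AUC  =", auc)
print("Gini =", gini)
```

The pairwise loop is quadratic in the number of samples, so on real data a library routine such as scikit-learn’s roc_auc_score is the better way to obtain the AUC.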

6. Logarithmic / Cross-Entropy Loss

Log loss is well suited to evaluating multi-class classification. To measure the performance of a model, it uses the predicted probabilities, which lie between 0 and 1, and it penalizes confident wrong predictions heavily. A perfect model scores a log loss of 0; the higher the log loss, the worse the model’s probability estimates.
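For the binary case, log loss can be sketched in a few lines of plain Python as the average of -[y·log(p) + (1 - y)·log(1 - p)]. The function name, the clipping constant and the example probabilities are assumptions made for illustration.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy over all samples."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]  # high probability on the true class
mostly_wrong = [0.1, 0.9, 0.2, 0.8]  # high probability on the wrong class

good = binary_log_loss(y_true, mostly_right)
bad = binary_log_loss(y_true, mostly_wrong)

print("log loss, confident and right:", round(good, 4))
print("log loss, confident and wrong:", round(bad, 4))
```

Both sets of predictions are equally confident; the loss is small only when that confidence is placed on the correct class.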

Check out the following resources for more extensive reads:

  1. The truth of the F-measure
  2. Classification: ROC Curve and AUC
  3. What is log loss
  4. Gini-coefficient
  5. ROC and AUC clearly explained

Feel free to leave your questions and feedback in the comments section.
