# A Guide To Machine Learning Model Evaluation Metrics

Model evaluation is an essential step in building machine learning projects. It helps a practitioner gauge which machine learning model is best suited to the problem at hand. To carry out model evaluation, a range of evaluation metrics exists to measure the performance of our models, and we can use their results to improve our models. As we dive deeper into this guide, we will learn which evaluation metrics are best suited for which scenario, and why that is the case.

**REGRESSION PROBLEMS**

**1. R-Squared**

R-squared indicates the proportion of the variance in the dependent variable that is explained by the independent variable(s).

For example, if the R-squared is 0.70, then 70% of the variation in the output can be explained by the model's inputs.

When there are too many predictors and higher-order polynomials in our model, R-squared tends to reward overfitting with misleadingly high values. Another problem with R-squared is that its value increases whenever a predictor is added to the model, even if that predictor adds no real explanatory power. In both cases, adjusted R-squared is preferred.

**2. Adjusted R-Squared**

This metric adjusts R-squared for the number of predictors in our machine learning model. The value of the adjusted R-squared goes up only if an additional predictor genuinely improves the performance of our model. It is mostly used for multiple linear regression.
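Scikit-learn does not ship an adjusted R-squared function, but it can be derived from the ordinary R-squared, the number of samples n, and the number of predictors p. Below is a minimal sketch on a synthetic dataset (the data and model here are illustrative, not part of the original notebook):

```
# Adjusted R-squared derived from r2_score:
# adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

r2 = r2_score(y, y_pred)
n, p = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print('R-squared =', r2)
print('Adjusted R-squared =', adj_r2)
```

Adjusted R-squared is always less than or equal to R-squared, and the gap widens as more predictors are added.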

**3. Mean Absolute Error (MAE)**

This is the average of the absolute differences between predicted values and actual values. The metric shows how far the predictions are from the actual values, thus measuring the average error magnitude. The downside is that it does not tell us whether our model is overfitting or underfitting the data provided.

**4. Mean Squared Error (MSE)**

This metric is similar to MAE, except that here we take the average of the squared differences between the predicted and actual values. Prefer MSE over MAE when working to reduce large errors, since squaring penalizes them more heavily.

**Here's how to implement the above metrics in Python:**

```
# import necessary modules
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
# generate dataset
X, y = make_regression(n_samples=500, n_features=5, random_state=0)
# split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=2)
# load and fit model to training set
model = Lasso()
model.fit(X_train, y_train)
# make predictions
y_pred = model.predict(X_test)
# calculate scores
print('R-squared =', r2_score(y_test, y_pred))
print('MAE =', mean_absolute_error(y_test, y_pred))
print('MSE =', mean_squared_error(y_test, y_pred))
print('RMSE =', mean_squared_error(y_test, y_pred)**0.5)
> R-squared = 0.9994653752036747
> MAE = 1.8102055921681555
> MSE = 5.115190416160957
> RMSE = 2.26167867217272
```

[Full notebook](https://github.com/JeremiahKamama/Pima-Tutorial/blob/master/R-metrics.ipynb)

**CLASSIFICATION PROBLEMS**

**1. Confusion Matrix**

Let’s use a medical example of malaria tests, where a positive result means the patient is infected, to explain some terms.

**True positive**: a positive case correctly identified as positive.

**True negative**: a negative case correctly identified as negative.

**False positive**: a negative case incorrectly identified as positive.

**False negative**: a positive case incorrectly identified as negative.

**True Positive Rate (TPR):** the ratio of correctly predicted positive cases to the total number of positive cases.

**True Negative Rate (TNR):** the ratio of correctly predicted negative cases to the total number of negative cases.

**False Positive Rate (FPR):** the ratio of negative cases incorrectly predicted as positive to the total number of negative cases.

**False Negative Rate (FNR):** the ratio of positive cases incorrectly predicted as negative to the total number of positive cases.

**Specificity:** the ability of a model to correctly identify negative cases out of all actual negative cases; it equals the TNR.

**Sensitivity:** the ability of a model to correctly identify positive cases out of all actual positive cases; it equals the TPR.
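All four counts and the rates above can be read off a confusion matrix with scikit-learn. A minimal sketch on hypothetical malaria-test labels (1 = infected, 0 = not infected; the data is made up for illustration):

```
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical malaria test results: 1 = infected, 0 = not infected
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# ravel() flattens the 2x2 matrix in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)   # sensitivity / recall
tnr = tn / (tn + fp)   # specificity
fpr = fp / (fp + tn)   # 1 - specificity
fnr = fn / (fn + tp)   # 1 - sensitivity

print(f'TPR={tpr:.2f} TNR={tnr:.2f} FPR={fpr:.2f} FNR={fnr:.2f}')
# → TPR=0.75 TNR=0.67 FPR=0.33 FNR=0.25
```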

**2. Recall and Precision**

This pair of metrics is used when there is a class imbalance problem (when the classes are unequally distributed in the dataset). Recall is a measure of completeness because it calculates the percentage of actual positive samples that the model identified. Precision is a measure of exactness because it calculates the percentage of predicted positive samples that actually belong to the positive class.

The two are calculated as follows: Precision = TP / (TP + FP) and Recall = TP / (TP + FN).

**Here's how to calculate precision and recall in Python:**

```
# import necessary modules
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import fbeta_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
# generate two-class dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
# split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=2)
# load model
model = LogisticRegression()
# fit model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# calculate scores
precision_score(y_test, y_pred)
> 0.8365019011406845
recall_score(y_test, y_pred)
> 0.8461538461538461
fbeta_score(y_test, y_pred, beta=1)
> 0.8413001912045889
```

[Full notebook](https://github.com/JeremiahKamama/Pima-Tutorial/blob/master/Precision.ipynb)

**3. F1 Score**

To balance precision and recall, we can calculate the F1 score. This score is the harmonic mean of precision and recall and measures how robust and precise our model is.

If we get an F1 score of 1, it means that we have attained perfect precision and recall. This metric has a variety of applications in natural language processing as well as in information retrieval. In NLP, it is used for named entity recognition and word segmentation.

This score, however, does not take true negatives into account, so the Matthews correlation coefficient (MCC) is often preferred for measuring the performance of binary classifiers.
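A small sketch of why ignoring true negatives matters, using an imbalanced toy example (the labels are made up for illustration): a degenerate classifier that predicts positive for everything gets a high F1, while MCC, which uses all four confusion-matrix cells, scores it as no better than chance.

```
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef

# imbalanced toy labels: 90 positives, 10 negatives
y_true = np.array([1] * 90 + [0] * 10)
# a degenerate classifier that predicts positive for every sample
y_pred = np.ones(100, dtype=int)

f1 = f1_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
print('F1 =', f1)    # high, because F1 never looks at true negatives
print('MCC =', mcc)  # 0, i.e. no better than chance
```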

**4. Area Under the Curve - Receiver Operating Characteristic (AUC-ROC) Curve**

This classification performance measurement shows us how capable the model is of distinguishing between classes.

First, let's calculate AUC-ROC in Python:

```
# import libraries and necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, roc_curve
# load and read data
df = pd.read_csv("diabetes.csv")
df.head()
df.describe()
# drop target variable and scale training data
X = df.drop('Outcome', axis=1)
X = StandardScaler().fit_transform(X)
y = df['Outcome']
# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
# load model
model = SVC()
parameters = [{'kernel': ['rbf'],
               'gamma': [1e-3, 1e-4],
               'C': [1, 10, 100, 1000]}]
# perform grid search
grid = GridSearchCV(estimator=model, param_grid=parameters, cv=5, scoring='roc_auc')
grid.fit(X, y)
grid.best_estimator_
> SVC(C=1000, gamma=0.001)
roc_auc = np.mean(cross_val_score(grid, X, y, cv=5, scoring='roc_auc'))
print('Score: {}'.format(roc_auc))
> Score: 0.8311460517120894
# load model with the values from the grid search
model = SVC(C=1000, gamma=0.001, probability=True)
# fit model
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
y_prob = model.predict_proba(X_test)
# calculate roc_auc score
roc_auc_score(y_test, y_predict)
> 0.7280397022332507
# plot roc curve
plt.figure(figsize=(10, 7))
fpr, tpr, thresh = roc_curve(y_test, y_prob[:, 1])
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, color='blue')
plt.plot([0, 1], [0, 1], 'r--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
```

The plot produced by the code above shows the ROC curve (blue), while the AUC is the area under that curve.

The ROC curve is useful in evaluating the performance of binary classifiers. It is a probability curve that helps identify the best threshold for making a decision. AUC represents the degree of separability and is useful in deciding which classification method is better.
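One common way to pick that threshold from the ROC curve is Youden's J statistic, which maximises TPR minus FPR. A minimal sketch on a synthetic dataset (not the diabetes data used above):

```
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=2)

model = LogisticRegression().fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_prob)
# Youden's J statistic: the threshold maximising TPR - FPR
best = np.argmax(tpr - fpr)
print('Best threshold =', thresholds[best])
```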

**5. Gini Coefficient**

This metric is derived from the area under the ROC curve: Gini = 2 × AUC − 1. It can be used as a model selection metric to quantify a model's performance.
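The Gini coefficient follows in one line from the AUC. A sketch on a synthetic dataset (the data and model are illustrative):

```
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=2)

model = LogisticRegression().fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, y_prob)
gini = 2 * auc - 1   # Gini coefficient from AUC
print('AUC =', auc, 'Gini =', gini)
```

A random classifier has AUC 0.5 and therefore Gini 0; a perfect classifier has AUC 1 and Gini 1.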

**6. Logarithmic/Cross-Entropy Loss**

Log loss is well suited to evaluating multi-class classification. To measure the performance of a model, it uses the predicted probability estimates, which lie between 0 and 1. A perfect model scores a log loss of 0, and higher values indicate worse probability estimates.
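A minimal sketch of log loss on a three-class problem with made-up probabilities (the labels and numbers are illustrative): confident, correct predictions drive the loss toward 0, while hesitant or wrong ones push it up.

```
import numpy as np
from sklearn.metrics import log_loss

# three-class problem: true labels and predicted class probabilities
y_true = [0, 1, 2, 2]
y_prob = np.array([[0.8, 0.1, 0.1],   # confident and correct
                   [0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.7],
                   [0.4, 0.4, 0.2]])  # hesitant and wrong
ll = log_loss(y_true, y_prob)
print('Log loss =', ll)

# a perfectly confident, perfectly correct model approaches 0
perfect = np.eye(3)[[0, 1, 2]]  # one-hot probabilities
ll_perfect = log_loss([0, 1, 2], perfect)
print('Perfect log loss =', ll_perfect)
```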

**Check out the following resources for more extensive reads:**

- The truth of the F-measure
- Classification: ROC Curve and AUC
- What is log loss
- Gini-coefficient
- ROC and AUC, clearly explained

**Feel free to leave your questions and feedback in the comments section.**