Model Evaluation: Confusion Matrix


The machine learning workflow usually involves several steps: getting a representation of the data, looking for the most suitable features, training the model with those features, and finally checking whether the resulting model fits the solution we were aiming for. This last step is called evaluation.

Accuracy

Most of the time I’m focused on getting the highest score when training my machine learning models, and that score is usually based on accuracy. Accuracy here is the number of samples that were correctly labeled divided by the total number of samples.

accuracy = correctly labeled samples / total samples = (TP + TN) / (TP + TN + FP + FN)
Figure 1. Accuracy formula
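
As a quick sketch with made-up labels (both y_true and y_predicted below are hypothetical), scikit-learn’s accuracy_score computes exactly this ratio:

accuracy example
from sklearn.metrics import accuracy_score

# hypothetical labels: 1 = positive class, 0 = negative class
y_true      = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# 6 out of 8 samples are labeled correctly
print(accuracy_score(y_true, y_predicted))  # 0.75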

It’s not a bad estimate to get started with, but it’s not a silver bullet. Depending on the problem, accuracy may not be the best evaluation method. For example, if we’re dealing with a health-related machine learning problem, such as classifying a certain type of cancer case, even 90% accuracy is not good enough. In this case we are OK with having false positives as long as we’re sure we don’t miss any true positive: false positives can be validated by a doctor afterwards.

Confusion Matrix

In order to choose the most suitable metric for a given problem, we need to know the goal of the solution. For example:

  • It’s a health-related app and we want to make sure we don’t miss any positive case

  • We need to store items into boxes and we need the accuracy to be as close to 100% as possible

One helpful way to visualize how the model is performing is to use a confusion matrix. A confusion matrix is a matrix where all possible outcomes of the model are classified into different quadrants. Depending on which quadrant matters most for our goal, we will choose the classifier that performs best on the quadrants we are interested in. A confusion matrix for a binary classification problem could look like the following:

Figure 2. Confusion matrix

Let’s create a theoretical example. Let’s say we have a binary classification problem: we’re classifying 100 images as either cats or dogs. The confusion matrix for my current model is the following:

Figure 3. Binary classification (TN = 20, FN = 30, FP = 19, TP = 31)

If we establish that the positive class (1) means finding a dog and the negative class (0) means finding a cat, then according to this confusion matrix of 100 samples we can make the following observations (a code sketch of how these counts are obtained follows the list):

  • 20% of the time the model classifies correctly that a given animal is not a dog (True Negative - TN)

  • 30% of the time the model classifies incorrectly that a given animal is not a dog (False Negative - FN)

  • 19% of the time the model classifies incorrectly that a given animal is a dog (False Positive - FP)

  • 31% of the time the model classifies correctly that a given animal is a dog (True Positive - TP)
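
As a minimal sketch of how these quadrants come out of the code, scikit-learn’s confusion_matrix returns the counts laid out as [[TN, FP], [FN, TP]] for binary labels (the y_true and y_pred arrays below are hypothetical, not the 100-sample example):

confusion matrix sketch
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical labels: 0 = cat, 1 = dog
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0])

# for binary labels sklearn returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2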

The confusion matrix helps us to sort all of the samples into four different boxes and then apply different metrics depending on the goal we’re aiming at. Do we want to avoid missing any dog, even if we accept classifying a cat as a dog from time to time? Do we want 100% accuracy classifying dogs? And how would 100% accuracy be represented in the confusion matrix, by the way?

Figure 4. Accuracy metric

But as we’ve mentioned so far, we may want to prioritize some other type of metric, such as precision, recall or specificity.

Precision, Recall and F1 score

Precision

If you’d like to maximize precision, that would mean (in the current example) that every time we say a given sample is a dog, it is going to be a dog. But it could also mean that some of the samples we label as cats turn out to be dogs.

precision = TP / (TP + FP)
Figure 5. Precision formula

In our example our first model was performing like this:

Figure 6. Precision (first model, same matrix as Figure 3)

Calculating the precision gives us 31 / (31 + 19) = 0.62, i.e. 62% precision. A model that improves precision would look like the following:

Figure 7. Improved precision

Which would give us a precision of 41 / (41 + 0) = 1, i.e. 100% precision. Notice we are still missing some dogs (49 dogs are classified as false negatives), but the point here is that whenever the model says it’s a dog, it’s a dog.
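
A quick numeric check of both models, using only the counts taken from the two matrices above:

precision check
def precision(tp, fp):
    # precision = TP / (TP + FP)
    return tp / (tp + fp)

print(precision(31, 19))  # 0.62 -> first model (Figure 6)
print(precision(41, 0))   # 1.0  -> improved model (Figure 7)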

Precision is good for…

In general precision could be a good metric for tasks such as:

  • search engine rankings, query suggestion

  • document classification

  • many customer-facing tasks as customers are very sensitive to errors

Recall

What if we don’t want to miss any dog, even if we end up classifying some cats as dogs? Then we want a high recall. First, the formula:

recall = TP / (TP + FN)
Figure 8. Recall formula

We start with a model that performs like this:

Figure 9. Recall (first model, same matrix as Figure 3)

Calculating the recall we end up with 31 / (31 + 30) = 0.508: the model only catches around 50% of the actual dogs, which is no better than flipping a coin. Let’s check out another model:

Figure 10. Improved recall

On this occasion we get 31 / (31 + 0) = 1, i.e. 100% recall. This means that some of the samples we label as dogs may turn out to be cats, but we are completely sure that no dogs have been classified as cats. Looking at the matrix (a numeric check of both models follows the list):

  • 50% of the samples have been correctly classified as cats (True Negative - TN)

  • 0% of the samples have been incorrectly classified as cats (False Negative - FN)

  • 19% of the samples have been incorrectly classified as dogs (False Positive - FP)

  • 31% of the samples have been correctly classified as dogs (True Positive - TP)
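
And the same kind of numeric check for recall, using the counts from the two matrices above:

recall check
def recall(tp, fn):
    # recall = TP / (TP + FN)
    return tp / (tp + fn)

print(recall(31, 30))  # ~0.508 -> first model (Figure 9)
print(recall(31, 0))   # 1.0    -> improved model (Figure 10)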

Recall is good for…

In general recall could be a good metric for tasks such as:

  • Search and information extraction in legal discovery

  • tumor detection

  • tasks with a human in the loop to filter out false positives

F1 score

F1 score combines precision and recall into a single number. The formula is:

F1 = 2 * (precision * recall) / (precision + recall)
Figure 11. F1 score formula

As mentioned in this article, the F1 score might be a better measure to use if we need to seek a balance between precision and recall AND there is an uneven class distribution (a large number of actual negatives).
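
As a small worked example, plugging in the numbers from the first dog/cat model (precision 0.62, recall 0.508):

f1 score check
def f1(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(f1(0.62, 0.508))  # ~0.558 for the first dog/cat model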

Using Scikit to show metrics

To show how the confusion matrix can help in deciding which classifier to use, I’m using a dataset from Madrid City Open Data. This dataset shows the interventions of the urgency and rescue city service (SAMUR in Spanish) during 2020.


With this dataset I’d like to find a classifier that answers the question: is this type of intervention going to take the urgency service more time than average to respond to?

First I did some work to prepare the data as best I could (a rough sketch of these steps follows the list):

  • Converting categorical data to numerical

  • Applying under-sampling to correct imbalance

  • Choosing best features
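
This is roughly what those steps could look like, assuming pandas, imbalanced-learn and scikit-learn utilities; df, the column names and k below are placeholders, not the actual ones used for the SAMUR dataset:

data preparation sketch
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_selection import SelectKBest, f_classif

# convert categorical columns to numerical ones (one-hot encoding)
X = pd.get_dummies(df.drop(columns=['above_avg']))
y = df['above_avg']

# under-sample the majority class to correct the imbalance
X_balanced, y_balanced = RandomUnderSampler(random_state=0).fit_resample(X, y)

# keep only the k most informative features
X_best = SelectKBest(f_classif, k=5).fit_transform(X_balanced, y_balanced)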

Then I chose several classifiers:

classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.dummy import DummyClassifier

# different classifiers
logistic    = LogisticRegression().fit(X_train, y_train)
svc         = LinearSVC(random_state=0).fit(X_train, y_train)
knn         = KNeighborsClassifier(5).fit(X_train, y_train)
tree        = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
rff         = RandomForestClassifier(max_depth=2, n_estimators=4, max_features=2).fit(X_train, y_train)
ada         = AdaBoostClassifier(n_estimators=100).fit(X_train, y_train)
dummy       = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)

And then I printed out their confusion matrices:

confusion matrix of every classifier
from sklearn.metrics import confusion_matrix

classifiers = [dummy, logistic, svc, knn, tree, rff, ada]

for clsf in classifiers:
    y_predicted = clsf.predict(X_test)
    print(clsf.__class__.__name__)
    print('===========================================')
    print(confusion_matrix(y_test, y_predicted))
    print()

Here’s a nicer representation of the confusion matrices.

Figure 12. Confusion matrices
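
A plot like this can be produced with scikit-learn’s ConfusionMatrixDisplay (a sketch assuming matplotlib is available; not necessarily how the figure above was generated):

plotting the confusion matrices
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# one plot per classifier, titled with the classifier's name
for clsf in classifiers:
    display = ConfusionMatrixDisplay.from_estimator(clsf, X_test, y_test)
    display.ax_.set_title(clsf.__class__.__name__)

plt.show()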

First impressions:

  • None of them is performing at a decent level

  • It’s not even possible to see a decent recall, because the dummy and linear classifiers are just two sides of always choosing the most frequent outcome… useless.

  • I’d say that the classifier that gives me the highest precision is the AdaBoostClassifier.

But to be sure, I’m producing the precision, recall and f1-score summary of each of them with sklearn’s classification_report:

summary
from sklearn.metrics import classification_report

classifiers = [dummy, logistic, svc, knn, tree, rff, ada]

for clsf in classifiers:
    y_predicted = clsf.predict(X_test)
    print(clsf.__class__.__name__)
    print('===========================================')
    print(classification_report(y_test, y_predicted, target_names = ['above_avg', 'not_above_avg']))

I’m summarizing the results in the following table:

Table 1. Highest results for the positive class

Classifier               Precision   Recall    F1
DummyClassifier          0.50        1.00      0.57
LogisticRegression       0.57        0.70      0.63
LinearSVC                0.00        0.00      0.00
KNeighborsClassifier     0.61        0.61      0.61
DecisionTreeClassifier   0.56        0.83 ←    0.67
RandomForestClassifier   0.58        0.66      0.61
AdaBoostClassifier       0.62 ←      0.70      0.66

(← marks the highest precision and the highest recall across classifiers)

The summary is telling me that the classifier that ranks best for precision is the AdaBoostClassifier and the classifier that ranks best for recall is the DecisionTreeClassifier. The take-away is:

  • If I wanted to be sure that I miss as few positive samples as possible, I’d look for the classifier with the highest recall

  • If I wanted to be sure that whenever I flag a sample as positive it really is positive, I’d look for the classifier with the highest precision (a quick way to do both programmatically is sketched below)
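
For instance, a minimal sketch of picking a classifier programmatically, reusing the classifiers list and the test split from above together with sklearn’s precision_score and recall_score:

picking a classifier by metric
from sklearn.metrics import precision_score, recall_score

# classifier with the best recall on the test set
best_recall = max(classifiers, key=lambda c: recall_score(y_test, c.predict(X_test)))

# classifier with the best precision on the test set
best_precision = max(classifiers, key=lambda c: precision_score(y_test, c.predict(X_test)))

print(best_recall.__class__.__name__, best_precision.__class__.__name__)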

And finally, this whole evaluation process has served to make me realize that I need to go back to square one, because as I mentioned earlier NONE of the classifiers is performing decently, so I need to check where I can improve the process:

  • Enrich data

  • Increase/Decrease samples

  • Tune classifiers to get the desired precision/recall/accuracy… (sketched below)
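
As an example of that last point, a sketch of tuning one of the classifiers with GridSearchCV, where scoring='recall' makes the search optimize for recall instead of the default accuracy (the parameter grid here is made up):

tuning for recall
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# made-up parameter grid; scoring='recall' optimizes the search for recall
grid = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid={'max_depth': [2, 3, 5, 10]},
    scoring='recall',
    cv=5,
)
grid.fit(X_train, y_train)

print(grid.best_params_, grid.best_score_)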

Resources