*Balanced accuracy* is a metric one can use to evaluate how good a binary classifier is. It is especially useful when the classes are imbalanced, i.e. one of the two classes appears a lot more often than the other. This happens in many settings, such as anomaly detection and detecting the presence of a disease.

As with all discussions of the performance of a binary classifier, we start with a confusion matrix:

| | Actually positive | Actually negative |
| --- | --- | --- |
| **Predicted positive** | TP (true positive) | FP (false positive) |
| **Predicted negative** | FN (false negative) | TN (true negative) |

In the above, the “positive” or “negative” in TP/FP/TN/FN refers to the prediction made, not the actual class. (Hence, a “false positive” is a case where we wrongly predicted positive.)

Balanced accuracy is based on two more commonly used metrics: **sensitivity** (also known as **recall** or **true positive rate**) and **specificity** (also known as **true negative rate**, or 1 – **false positive rate**). Sensitivity answers the question: “How many of the positive cases did I detect?” Or, to put it in a manufacturing setting: “How many (truly) defective products did I manage to recall?” Specificity answers the same question, but for the negative cases. Here are the formulas for sensitivity and specificity in terms of the confusion matrix:

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

*Balanced accuracy* is simply the arithmetic mean of the two:

Balanced accuracy = (Sensitivity + Specificity) / 2
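As a quick sketch, these definitions can be wrapped in a small R function (the function name, argument names, and the example counts below are my own, not from the original post):

```r
# Balanced accuracy from the four confusion matrix counts.
balanced_accuracy <- function(TP, FP, TN, FN) {
  sensitivity <- TP / (TP + FN)  # true positive rate
  specificity <- TN / (TN + FP)  # true negative rate
  (sensitivity + specificity) / 2
}

# Illustrative counts: a classifier that catches 5 of 15 positives
# and 10000 of 10050 negatives.
balanced_accuracy(TP = 5, FP = 50, TN = 10000, FN = 10)  # ≈ 0.664
```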

Let’s use an example to illustrate how balanced accuracy can be a better judge of performance in the imbalanced class setting. Assume that we have a binary classifier and it gave us the results in the confusion matrix below:

The accuracy of this classifier, i.e. the proportion of correct predictions, is very high. That sounds really impressive until you realize that simply by predicting all negative, we would have obtained an even higher accuracy than our classifier!

Balanced accuracy attempts to account for the imbalance in classes. Here is the computation for balanced accuracy for our classifier:

Our classifier is doing a great job at picking out the negatives but not so for the positives. Balanced accuracy still seems a little high if identifying the positives is what we care about, but it’s much lower than what accuracy suggested.

For comparison, let’s do the computation for the classifier that always predicts 0 (negative). That classifier has sensitivity 0 (it detects none of the positives) and specificity 1 (it never predicts positive for a negative case), so its balanced accuracy is (0 + 1) / 2 = 0.5.

Based on balanced accuracy, we would say that our classifier is doing a little better than the naive “all negatives” classifier, but not much better. This seems like a reasonable conclusion since our classifier is able to pick out some positives but not many of them.

Here is some R code that you can use to compute these measures:

```r
TP <- 0
TN <- 10050
FP <- 0
FN <- 15

# metrics
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
balanced_accuracy <- (sensitivity + specificity) / 2

# print out metrics
options(digits = 4)
cat("Accuracy:", accuracy, "\n",
    "Sensitivity:", sensitivity, "\n",
    "Specificity:", specificity, "\n",
    "Balanced accuracy:", balanced_accuracy)
```

This reference points out that balanced accuracy can be extended easily to the multi-class setting: there, it is simply the arithmetic mean of the recall for all the classes.
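A minimal sketch of that multi-class extension (the function and the toy vectors below are my own illustration, assuming `actual` and `predicted` are character vectors of class labels):

```r
# Multi-class balanced accuracy: the mean of per-class recall.
multiclass_balanced_accuracy <- function(actual, predicted) {
  classes <- sort(unique(actual))
  recalls <- sapply(classes, function(k) {
    # recall for class k: fraction of true k's predicted as k
    mean(predicted[actual == k] == k)
  })
  mean(recalls)
}

actual    <- c("a", "a", "a", "b", "b", "c")
predicted <- c("a", "a", "b", "b", "b", "a")
# per-class recalls are 2/3, 1, and 0, so the result is 5/9
multiclass_balanced_accuracy(actual, predicted)  # ≈ 0.556
```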

Another popular metric one can use for imbalanced datasets is the F1 score, which is the harmonic mean of precision and recall.
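For reference, here is a small sketch of that computation from confusion matrix counts (the counts are invented for illustration):

```r
# F1 score: harmonic mean of precision and recall.
TP <- 30; FP <- 20; FN <- 10
precision <- TP / (TP + FP)  # 0.6
recall    <- TP / (TP + FN)  # 0.75
f1 <- 2 * precision * recall / (precision + recall)
f1  # = 2/3
```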

Hey! I think your R-code misses the /2 for the balanced accuracy 😉


Thanks for pointing that out! I have updated the post to include it.


specificity is known as true negative rate and not false positive rate as you wrote it.


good catch! it is TNR = 1 – FPR. I have amended the post to make this clear.
