Accuracy is often chosen as the benchmark performance metric for predictive models. But is this always a sensible choice? Our keen readers can look at the length of this post and answer this themselves: “No and it’s not that simple”.
As a reminder, accuracy is the proportion of cases the model classifies correctly and can be a useful indicator of how well the model is performing. In some cases, however, accuracy can be very misleading. Take for example a model that predicts whether a patient has a certain rare disease based on some measurements. A model always predicting the patient is healthy would have an accuracy of 99.99% for a disease occurring in 1 in 10000 people. This model, albeit having a particularly high accuracy, is clearly quite useless for medical diagnostics.
This example raises the key point on measuring model performance: the choice of a performance metric must reflect the real-world problem
In our example, it might make more sense to measure the proportion of patients with the disease correctly identified as such (a.k.a. sensitivity). This metric would discard our overly simple all-are-healthy model by giving it an underwhelming sensitivity of 0%.
A sensible choice in the performance metric is crucial for selecting a model that will best fulfill its purpose and tackle the business issue at hand. It will also be important when tuning the model parameters. Here we’ll review some of the most common measures of performance and point out their uses and caveats.
The confusion matrix
One of the most widely used visualisation for classification results is the rather-oddly-named confusion matrix.
Short anecdote on the name:
As many statistical terms, this one arose from a specific experiment. Subjects would be performing a series of psychological tests and their responses were measured. The table listed whether the subjects had given the expected answer or were confused and chose the wrong one, hence the name.
Back on topic. Confusion matrices are tables showing counts with rows showing the observed (real) outcomes and columns the predictions. Here is an example for a two-class problem with possible outcomes being positive and negative (e.g., for medical diagnostics).
||True positives (TP)
||False negatives (FN)
||Total observed positives
||False positives (FP)
||True negatives (TN)
||Total observed negatives
||Total predicted positives
||Total predicted negatives
These counts are also used to compute widely used performance metrics. Here we list the most common:
- Accuracy: (TP + TN) / All observations. Measures the overall classification rate of the model.
- Sensitivity: TP / Total observed positives. Measures the capacity of the model to detect positive cases.
- Specificity: TN / Total observed negatives. Measures how well the model leaves negative cases as negatives. In a way, it checks whether the model is predicting positives only for actual positives.
- Precision: TP / Total predicted positives. Measures how trustworthy a positive prediction is, i.e. the proportion of predicted positives that are indeed positive.
- Negative predictive value (NPV): TN / Total predicted negatives. Similarly to the precision, this gives a measure of confidence on predicted negatives.
Accuracy, precision and NPV
Accuracy, precision and the negative predictive value are very dependent on the rate of occurrence of the positive cases as seen by the example on the disease. For very unbalanced cases such as this one (i.e., where the proportion of positive and negative cases in the data are very different) these indicators could prove tricky to use as they might favour the large class too much. This is especially problematic when the minority class is the main interest of the model. In balanced cases though, they could prove useful to measure either how well a model performs in general or regarding a specific class.
It’s worth noting that precision and NPV are equivalent as they both measure the amount of correct predictions amongst all predictions of a given class. These indicators are not directly applicable to problems where the outcomes do not correspond to an event happening or not such as for our positive/negative case, or where there are more than two classes. We can nevertheless always follow this logic to obtain similar indicators for each class.
Sensitivity and specificity
Sensitivity and specificity are less sensitive to the class sizes and therefore usually the preferred choice for performance measurements. In particular, they are commonly used to tune the model parameters.
Many models output not a class directly but the probability of belonging to that class. The data scientist must therefore set a threshold that will determine whether a prediction goes one way or the other.
The perfect model would lead to a probability of being “positive” of 0 for all “negative” cases and 1 for all “positives”. Here, the choice of threshold is irrelevant and any value will make both the sensitivity and specificity (as well as the accuracy) equal to 1.
Needless to say this is never the case (or you are a very, very good statistician!). In reality, there is a tradeoff between sensitivity and specificity. Indeed, in the extreme where all predictions are “positive”, the sensitivity will be 1 and the specificity will be 0. Conversely, if all cases are predicted as “negatives”, the sensitivity will be 0 and the specificity 1 (the case in our disease example). These extreme cases lead, of course, to useless models so the threshold should go somewhere in the middle.
A common measure of how well a model deals with this tradeoff is the area under the ROC curve (Receiver operating characteristic, again, don’t mind the odd name) or simply AUC. This curve plots the values of sensitivity and specificity for all values of the threshold. The area under it measures how close the model is to the fabled perfect model.
Measuring whether a model is generalisable
Setting the choice of indicators aside, remember it is crucial to always calculate performance metrics of models on a set of values NOT used in the training of the model. Indeed, the model is meant to deal with previously unseen data and we want to measure its performance in the “real-world”.
There are many ways of setting some data aside for testing a model once its been trained but the main purpose of all these methods is to prevent the model from overfitting the training data. This means we want to avoid a model that follows our data so closely that it only works for these specific points.
The problems that can arise from this are exemplified in the figure above. Here, a set of points distributed roughly along a line can be fitted by a linear regression that will best reflect the trend (red line), or by a curve that passes each point but completely misses the trend (grey line). Measuring the performance on the points used to build the model would lead us to chose the “perfect” fit rather than the actual trend.
Putting it all together
Now that we have reviewed different possibilities comes the actual task of choosing one performance metric. This metric can either be one of the above-mentioned ratios but could also be a indicator made of two or more of these values. The grouping of values can be as simple as an average and also be much more complex such as for the calculation of the AUC. Typical grouping methods include weighted averaging, geometric averaging, or harmonic means.
It’s important to mention that there is no single perfect choice but rather a preferred direction this choice can take. To identify this direction, the business interests around the model have to be clearly identified. This involves answering questions such as:
- Do the outcomes occur with similar frequency (i.e., is this a balanced problem)?
This will determine whether it is safe to use metrics that are sensitive to the proportion of the classes. If the problem is not balanced, these should be avoided. In our toy example, the problem was very unbalanced which rendered the accuracy measure useless.
- What error is preferable/less costly?
This is best understood with an example: For a fraud detection model, is it better to think an individual is fraudulent when they are not or the inverse? The former would lead to further investigation but the latter would allow the fraudsters to get away with their crime (and prevent any recovery by the authorities).
In less clean cut cases, the relative cost of each error can be used to combine indicators through weighted averages. This would lead to trying to minimise errors in general whilst encouraging the model to chose one type of mistake over the other.
- Is there a specific output the business is interested in or should we focus on overall performance?
In other words, are we choosing between equally important outcomes (e.g., market goes up or down) and only global performance of the model is needed (e.g., accuracy, AUC), or is it a prediction of whether a key event is happening or not (e.g., predicting system failures)? In the latter case, we will probably prefer indicators that give us information on specific outcomes.
General performance metrics are more commonly used in choosing one model type over the other whereas specific ones such as the weighted averages mentioned above are used to set the decision thresholds.
Choosing the right performance metric to evaluate a predictive model can make the difference between it being useful or not. It will also make sure expectations on the utility of the model are realistic. Indeed, our initial example showed how a wrong indicator could present the model as great when it is in fact useless.
It is also one of the key points of contact between the modelling work and the business interests. The involvement of the business in this decision and a clear formulation its needs is therefore crucial in order to make any modelling project a success.
by Pablo Cogis
dataroots is a data science consultancy company