###### Marketing analytics
Accuracy is often chosen as the benchmark performance metric for predictive models. But is this always a sensible choice? Our keen readers can glance at the length of this post and answer for themselves: “No, and it’s not that simple.” As a reminder, accuracy is the proportion of cases the model classifies correctly, and it can be a useful indicator of how well a model is performing. In some cases, however, it can be very misleading.

Take for example a model that predicts whether a patient has a certain rare disease based on some measurements. A model that always predicts the patient is healthy would have an accuracy of 99.99% for a disease occurring in 1 in 10000 people. Despite its particularly high accuracy, this model is clearly useless for medical diagnostics. This example raises the key point about measuring model performance: the choice of metric must reflect the real-world problem. In our example, it might make more sense to measure the proportion of patients with the disease who are correctly identified as such (a.k.a. sensitivity). This metric would discard our overly simple all-are-healthy model by giving it an underwhelming sensitivity of 0%.

A sensible choice of performance metric is crucial for selecting a model that will best fulfill its purpose and tackle the business issue at hand. It also matters when tuning the model parameters. Here we’ll review some of the most common measures of performance and point out their uses and caveats.
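The disease example can be made concrete in a few lines of Python. This is only an illustrative sketch; the counts are made up to match the 1-in-10000 scenario:

```python
# Toy illustration: a "model" that always predicts "healthy" scores a
# near-perfect accuracy but a sensitivity of 0. Counts are made up.

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

# 10000 patients, 1 with the disease; the model labels everyone "healthy".
tp, fn = 0, 1          # the single sick patient is missed
tn, fp = 9999, 0       # all healthy patients are correctly labelled

print(accuracy(tp, tn, fp, fn))   # 0.9999
print(sensitivity(tp, fn))        # 0.0
```

The accuracy looks stellar while the one metric that matters for diagnostics is zero.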

## The confusion matrix

One of the most widely used visualisations for classification results is the rather oddly named confusion matrix.
A short anecdote on the name: as with many statistical terms, this one arose from a specific experiment. Subjects performed a series of psychological tests and their responses were recorded. The table listed whether each subject had given the expected answer or had been confused and chosen the wrong one, hence the name.
Back on topic. Confusion matrices are tables of counts, with rows showing the observed (real) outcomes and columns the predictions. Here is an example for a two-class problem where the possible outcomes are positive and negative (e.g., in medical diagnostics).
| | Predicted positive | Predicted negative | Total |
| --- | --- | --- | --- |
| Observed positive | True positives (TP) | False negatives (FN) | Total observed positives |
| Observed negative | False positives (FP) | True negatives (TN) | Total observed negatives |
| Total | Total predicted positives | Total predicted negatives | Total |
These counts are also used to compute widely used performance metrics. Here we list the most common:
• Accuracy: (TP + TN) / All observations. Measures the overall classification rate of the model.
• Sensitivity: TP / Total observed positives. Measures the capacity of the model to detect positive cases.
• Specificity: TN / Total observed negatives. Measures the capacity of the model to leave negative cases as negatives. In a way, it checks whether the model predicts positives only for actual positives.
• Precision: TP / Total predicted positives. Measures how trustworthy a positive prediction is, i.e. the proportion of predicted positives that are indeed positive.
• Negative predictive value (NPV): TN / Total predicted negatives. Similarly to precision, this gives a measure of confidence in predicted negatives.
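These formulas translate directly into code. Here is a minimal sketch computing the five metrics above from the four cells of a two-class confusion matrix; the counts are made up for illustration:

```python
# Compute the common performance metrics from confusion matrix counts.
# The example counts are invented, not from a real dataset.

def confusion_metrics(tp, fp, fn, tn):
    return {
        "accuracy":    (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),   # capacity to detect positives
        "specificity": tn / (tn + fp),   # capacity to keep negatives negative
        "precision":   tp / (tp + fp),   # trust in positive predictions
        "npv":         tn / (tn + fn),   # trust in negative predictions
    }

m = confusion_metrics(tp=80, fp=10, fn=20, tn=890)
print(m)
# e.g. sensitivity = 80 / (80 + 20) = 0.8
```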

### Accuracy, precision and NPV

Accuracy, precision and the negative predictive value depend heavily on the rate of occurrence of the positive cases, as seen in the disease example. For very unbalanced cases such as that one (i.e., where the proportions of positive and negative cases in the data are very different), these indicators can prove tricky to use, as they might favour the large class too much. This is especially problematic when the minority class is the main interest of the model. In balanced cases, though, they can prove useful to measure either how well a model performs in general or with regard to a specific class. It’s worth noting that precision and NPV are analogous, as they both measure the proportion of correct predictions amongst all predictions of a given class. These indicators are not directly applicable to problems where the outcomes do not correspond to an event happening or not, as in our positive/negative case, or where there are more than two classes. We can nevertheless always follow the same logic to obtain similar indicators for each class.

### Sensitivity and specificity

Sensitivity and specificity are less sensitive to the class sizes and are therefore usually the preferred choice for performance measurement. In particular, they are commonly used to tune model parameters. Many models do not output a class directly but rather the probability of belonging to that class; the data scientist must therefore set a threshold that determines whether a prediction goes one way or the other. A perfect model would assign a probability of being “positive” of 0 to all “negative” cases and 1 to all “positives”. In that case, the choice of threshold is irrelevant, and any value will make both the sensitivity and specificity (as well as the accuracy) equal to 1. Needless to say, this is never the case (or you are a very, very good statistician!). In reality, there is a tradeoff between sensitivity and specificity. Indeed, in the extreme where all predictions are “positive”, the sensitivity will be 1 and the specificity 0. Conversely, if all cases are predicted as “negative”, the sensitivity will be 0 and the specificity 1 (the case in our disease example). These extreme cases lead, of course, to useless models, so the threshold should lie somewhere in the middle. A common measure of how well a model deals with this tradeoff is the area under the ROC curve (Receiver Operating Characteristic; again, don’t mind the odd name), or simply AUC. This curve plots the sensitivity against 1 − specificity (the false positive rate) for all values of the threshold. The area under it measures how close the model is to the fabled perfect model.
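The threshold sweep behind the AUC can be sketched in plain Python. This is only an illustrative implementation with made-up labels and probabilities; in practice a library routine such as scikit-learn’s `roc_auc_score` would be used:

```python
# Sketch of the ROC curve and AUC: for each threshold we compute the
# sensitivity (true positive rate) and 1 - specificity (false positive
# rate), then integrate the curve with the trapezoidal rule.

def roc_auc(y_true, y_prob):
    # Sweep thresholds at every observed probability, plus one above all.
    thresholds = sorted(set(y_prob)) + [float("inf")]
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
        points.append((fp / neg, tp / pos))  # (1 - specificity, sensitivity)
    points.sort()
    # Trapezoidal area under the curve.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [0, 0, 1, 1]
probs  = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, probs))   # 0.75
```

An AUC of 1 corresponds to the fabled perfect model; 0.5 is what random guessing achieves.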

## Measuring whether a model is generalisable

Setting the choice of indicators aside, remember that it is crucial to always calculate the performance metrics of a model on a set of values NOT used in its training. Indeed, the model is meant to deal with previously unseen data, and we want to measure its performance in the “real world”. There are many ways of setting some data aside for testing a model once it has been trained, but the main purpose of all these methods is to prevent the model from overfitting the training data. That is, we want to avoid a model that follows our data so closely that it only works for these specific points. The problems that can arise from this are exemplified in the figure above. Here, a set of points distributed roughly along a line can be fitted by a linear regression that best reflects the trend (red line), or by a curve that passes through each point but completely misses the trend (grey line). Measuring the performance on the points used to build the model would lead us to choose the “perfect” fit rather than the actual trend.
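The linear-versus-wiggly contrast can be reproduced in a few lines, assuming numpy is available. The data below is synthetic and the degrees are chosen purely for illustration:

```python
# Points scattered around a line, fitted with a degree-1 polynomial (the
# trend) and a degree-7 polynomial (one that passes through every point).
# The flexible fit looks perfect on the training points but tends to fall
# apart on held-out points.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 8)
y_train = 2 * x_train + rng.normal(0, 0.1, size=8)   # roughly a line
x_test = np.linspace(0.05, 0.95, 8)
y_test = 2 * x_test + rng.normal(0, 0.1, size=8)

line = np.polyfit(x_train, y_train, deg=1)     # captures the trend
wiggle = np.polyfit(x_train, y_train, deg=7)   # interpolates every point

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

print(mse(wiggle, x_train, y_train))  # ~0: "perfect" on training data
print(mse(line, x_test, y_test))      # small on unseen data
print(mse(wiggle, x_test, y_test))    # typically much larger than the line's
```

Judged only on the training points, the wiggly fit wins every time; the held-out points are what reveal the trap.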

## Putting it all together

Now that we have reviewed the different possibilities comes the actual task of choosing one performance metric. This metric can be one of the above-mentioned ratios, but it could also be an indicator built from two or more of these values. The grouping of values can be as simple as an average or much more complex, as in the calculation of the AUC. Typical grouping methods include weighted averaging, geometric averaging, and harmonic means. It’s important to mention that there is no single perfect choice, but rather a preferred direction this choice can take. To identify this direction, the business interests around the model have to be clearly identified. This involves answering questions such as:
• Do the outcomes occur with similar frequency (i.e., is this a balanced problem)?
This will determine whether it is safe to use metrics that are sensitive to the proportion of the classes. If the problem is not balanced, these should be avoided. In our toy example, the problem was very unbalanced which rendered the accuracy measure useless.
• What error is preferable/less costly?
This is best understood with an example: for a fraud detection model, is it better to think an individual is fraudulent when they are not, or the reverse? The former would lead to further investigation, but the latter would allow the fraudsters to get away with their crime (and prevent any recovery by the authorities). In less clear-cut cases, the relative cost of each error can be used to combine indicators through weighted averages. This would lead to minimising errors in general whilst encouraging the model to choose one type of mistake over the other.
• Is there a specific output the business is interested in or should we focus on overall performance?
In other words, are we choosing between equally important outcomes (e.g., market goes up or down) and only global performance of the model is needed (e.g., accuracy, AUC), or is it a prediction of whether a key event is happening or not (e.g., predicting system failures)? In the latter case, we will probably prefer indicators that give us information on specific outcomes. General performance metrics are more commonly used in choosing one model type over the other whereas specific ones such as the weighted averages mentioned above are used to set the decision thresholds.
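Two of the grouping methods mentioned above can be sketched concretely. The metric values and weights below are made-up assumptions for illustration; the harmonic mean of precision and sensitivity (recall) is the well-known F1 score:

```python
# Combine individual indicators into a single score. All numbers here
# are illustrative, not prescribed values.

def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if (a + b) else 0.0

sensitivity, precision, specificity = 0.70, 0.60, 0.95

# F1 score: harmonic mean of precision and sensitivity, a common choice
# when the positive class is what matters most.
f1 = harmonic_mean(precision, sensitivity)

# Weighted average: e.g. missing a positive is assumed twice as costly as
# a false alarm, so sensitivity gets twice the weight of specificity.
weighted = (2 * sensitivity + 1 * specificity) / 3

print(round(f1, 3))        # 0.646
print(round(weighted, 3))  # 0.783
```

The weights encode the business answer to “what error is less costly?”, which is exactly why that question must be settled first.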

## Wrapping up

Choosing the right performance metric to evaluate a predictive model can make the difference between it being useful or not. It will also ensure that expectations about the utility of the model are realistic. Indeed, our initial example showed how the wrong indicator could present a model as great when it is in fact useless. The choice of metric is also one of the key points of contact between the modelling work and the business interests. The involvement of the business in this decision, and a clear formulation of its needs, is therefore crucial to making any modelling project a success.
by Pablo Cogis (pablo@dataroots.io). dataroots is a data science consultancy company.

###### General
It is no secret that data science is on the rise. While the biggest hype is arguably behind us, data scientists are in high demand and data science seems to be here to stay. While the applications of data science are being picked up rapidly by firms in the tech industry, there seems to be a lag in other domains. That said, there is clear potential for value creation in these other areas as well. While (in my estimate) only a small portion of companies is today effectively using data science to drive decisions, most are aware of the developments and are at the very least exploring what could be in it for them.
A first difficulty to overcome is to understand what data science is. This is not the easiest thing to do, but it is essential in order to select a first relevant use case.

When someone asks me what a data scientist does for a living I generally tell them something along the lines of: a data scientist is someone who is at ease with data, programming languages, mathematics and statistics and is able to quickly build up knowledge on the business domain the project is situated in. He (or she) uses these skills to deliver analyses and models which in turn aid the decision making of the company.

The decision making power of a model is often very concrete and specialized. A model can for example aid the decision making in cases such as “Is the risk this customer is going to churn high enough for us to take preventive action?”, “Is the probability high enough that this transaction is fraudulent in order to block the transaction?” or “How much of resource X should I stock at location Y to meet the demand for the coming week?”. As you can see, the models in these examples can only give answers to very specific questions. A first take-away is therefore that a data science project will (at the very best) give you an accurate answer to a specific question. If it is able to do so, it can potentially create tremendous value. This illustrates how important it is to correctly define a data science project and, thus, also your first use case.

I will not try to delve any deeper into what data science is as whole books have been written on the topic. However it is interesting to keep in mind a sort of scale by which to measure data science projects. The two extremes of this scale are on the one hand (simple) descriptive analyses and on the other hand fully automated decision making solutions. Generally, but not always, the former has a lower business impact than the latter. Another aspect to consider is that data science projects are not IT projects.
At the start of an IT project, the objective and feasibility are often quite clear. For a data science project, by contrast, we usually cannot make an accurate estimate of the feasibility before the project starts. While the objective can (and should) be very clear, gaining a thorough understanding of the feasibility will always require an exploration phase. A clear red flag is someone who says “My model will have an accuracy of at least X percent” without having explored the actual data and built the first test models. Below you can find a step-by-step list of things we picked up over the years that are important when going for a first successful use case.

### 0. Understand what data science is

Make sure that you understand what data science is and what it could potentially mean for you and your business. A basic business-level understanding of data science is essential, as it will allow you to discuss potential projects in the context of their proposed value offering. There are quite a lot of books on the subject, one of the business-focused ones I can recommend is Data Science for Business. Another interesting page is the Awesome Data Science Ideas list, which has gathered quite a few business relevant examples on data science use cases.

### 1. Assemble the right project team

The right team is key (isn’t it always?). The right team in this case is one that is multidisciplinary. It needs someone who has a thorough understanding of the business, someone with decision power who will make the difficult decisions when they come up, someone who is able to make the project visible within the company, someone to follow-up on the project status, someone with good knowledge on the company’s data resources and last but not least a strong data scientist. While not all roles have to be distributed over separate people (starting small can be a good thing), all roles are valuable.
The data scientist will be at the center of the team and is the one who generates the main deliverable. Therefore, it is essential that an expert level data scientist is selected to work on this first use case so that there is no doubt whatsoever about his skills.

### 2. Selecting the right case

For a company to gain a good understanding of the value that data science can offer, a first use case should have a clear impact on value creation or at the very least show a potential to do so. In this phase, there are four questions which I generally use to estimate the potential success of a first use case.
• Do we expect the case to have an impact on value creation (and aid decision making)?
• Are the results translatable to other departments, allowing us to stimulate engagement and generate excitement around the topic of data science within the company?
• Do we have the basic necessities to start working on this case (availability of domain experts within the company, the right data, etc.)?
• Is the objective of the case defined in such a way that the output of the case is actionable? In other words, will it be able to solve a business problem in the short to medium term? In the long run this correlates strongly with value creation.

### 3. Invest in the case

Once a case is selected, we need to make sure that the data scientist has the basics worked out. This means that he needs easy access to the data as well as input from business domain experts. It is important to support him in such a way that he is able to focus most of his time on relevant exploration and modelling. Allow him to work with the tools he prefers, as he will perform best when using these. As it is doubtful that you would have this person in-house when you are starting out on a first project, there is a good chance that you will have to look outside of your company for data science consultants. This is a landscape that is somewhat difficult to navigate. There have been a huge number of developments in the data science world, in terms of both theory and tools. As a result, the big players in the consultancy world have had a hard time keeping up, and consultants working only with a specific proprietary tool often lag behind the state of the art. Too often, people with a very limited resume are presented as senior data scientists. Be critical about resumes, press for more detailed descriptions of the projects they’ve done and judge them based on their actual experience with data science.
Data is crucial. To be able to go for a successful first use case, make sure the data scientist has the correct data at hand. While building models on small-scale data is not impossible, the chance of mediocre results is generally higher. This is not something you want to risk on a first project. If the data does not exist in the right format, or does not exist at all but is essential to the case, invest in making it available through manual efforts or go for another first project. In this phase, you might very well get the advice from someone to invest in a “Big Data Infrastructure” like Hadoop. I’ve even heard of investing in big data infrastructure being presented as a first data science project. While big data infrastructure has some very concrete merits, it is not a show stopper if you do not have this infrastructure at the moment or simply find it too early to invest in. A data scientist is generally able to adapt rather quickly to the environment and will make do with the means he has available. While in the short run this could mean that you are not able to deploy a solution in a more stable production environment, it does not keep you from getting the first results and gaining an in-depth understanding of the feasibility and value of the project when it does scale.

### 4. Explore quickly and fail fast

As already mentioned, exploration is an essential part of any data science project. During this exploration phase, the data scientist tries to better understand whether it is possible to find a solution to the problem at hand. If these initial explorations lead you to conclude that the resulting model will most likely deliver a below-par result, you should either adjust the objective based on the new knowledge that was gained, or provide/invest in the resources that do enable you to attain the original objective. This makes for a specific environment where failure is possible and well-founded (intermediary) conclusions should be cheered on, even if those conclusions turn out to be negative.