Accuracy is often chosen as the benchmark performance metric for predictive models. But is this always a sensible choice? Our keen readers can look at the length of this post and answer this themselves: “No, and it’s not that simple”. As a reminder, accuracy is the proportion of cases the model classifies correctly and can be a useful indicator of how well the model is performing. In some cases, however, accuracy can be very misleading. Take, for example, a model that predicts whether a patient has a certain rare disease based on some measurements. A model always predicting the patient is healthy would have an accuracy of 99.99% for a disease occurring in 1 in 10000 people. This model, despite its particularly high accuracy, is clearly quite useless for medical diagnostics.

This example raises the key point on measuring model performance: the choice of a performance metric must reflect the real-world problem. In our example, it might make more sense to measure the proportion of patients with the disease correctly identified as such (a.k.a. sensitivity). This metric would discard our overly simple all-are-healthy model by giving it an underwhelming sensitivity of 0%. A sensible choice of performance metric is crucial for selecting a model that will best fulfill its purpose and tackle the business issue at hand. It will also be important when tuning the model parameters. Here we’ll review some of the most common measures of performance and point out their uses and caveats.

The confusion matrix

One of the most widely used visualisations for classification results is the rather oddly named confusion matrix.
Short anecdote on the name: as with many statistical terms, this one arose from a specific experiment. Subjects performed a series of psychological tests and their responses were measured. The table listed whether the subjects had given the expected answer or were confused and chose the wrong one, hence the name.
Back on topic. Confusion matrices are tables showing counts with rows showing the observed (real) outcomes and columns the predictions. Here is an example for a two-class problem with possible outcomes being positive and negative (e.g., for medical diagnostics).
                  | Predicted positive        | Predicted negative        | Total
Observed positive | True positives (TP)       | False negatives (FN)      | Total observed positives
Observed negative | False positives (FP)      | True negatives (TN)       | Total observed negatives
Total             | Total predicted positives | Total predicted negatives | Total
These counts are also used to compute widely used performance metrics. Here we list the most common:
  • Accuracy: (TP + TN) / All observations. Measures the overall classification rate of the model.
  • Sensitivity: TP / Total observed positives. Measures the capacity of the model to detect positive cases.
  • Specificity: TN / Total observed negatives. Measures how well the model identifies negative cases as negative. In a way, it checks whether the model predicts positives only for actual positives.
  • Precision: TP / Total predicted positives. Measures how trustworthy a positive prediction is, i.e. the proportion of predicted positives that are indeed positive.
  • Negative predictive value (NPV): TN / Total predicted negatives. Similarly to precision, this gives a measure of confidence in predicted negatives.
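The five ratios above can be computed directly from the four confusion-matrix counts. A minimal sketch, using hypothetical counts (not taken from the article):

```python
# Hypothetical confusion-matrix counts for a screening model on 1000 cases.
TP, FN, FP, TN = 90, 10, 30, 870

accuracy    = (TP + TN) / (TP + FN + FP + TN)  # overall classification rate
sensitivity = TP / (TP + FN)                   # positives correctly detected
specificity = TN / (TN + FP)                   # negatives correctly kept negative
precision   = TP / (TP + FP)                   # trust in a positive prediction
npv         = TN / (TN + FN)                   # trust in a negative prediction

print(accuracy, sensitivity, specificity, precision, npv)
```

Note how a model can score well on one ratio and poorly on another: here precision (0.75) is much lower than accuracy (0.96), because a third of the positive predictions are false alarms.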

Accuracy, precision and NPV

Accuracy, precision and the negative predictive value are very dependent on the rate of occurrence of the positive cases, as seen in the disease example. For very unbalanced cases such as this one (i.e., where the proportions of positive and negative cases in the data are very different) these indicators can prove tricky to use, as they might favour the large class too much. This is especially problematic when the minority class is the main interest of the model. In balanced cases, though, they can prove useful to measure either how well a model performs in general or regarding a specific class. It’s worth noting that precision and NPV are analogous, as both measure the proportion of correct predictions amongst all predictions of a given class. These indicators are not directly applicable to problems where the outcomes do not correspond to an event happening or not, such as our positive/negative case, or where there are more than two classes. We can nevertheless always follow this logic to obtain similar indicators for each class.

Sensitivity and specificity

Sensitivity and specificity are less sensitive to the class sizes and are therefore usually the preferred choice for performance measurements. In particular, they are commonly used to tune the model parameters. Many models output not a class directly but the probability of belonging to that class. The data scientist must therefore set a threshold that will determine whether a prediction goes one way or the other. The perfect model would lead to a probability of being “positive” of 0 for all “negative” cases and 1 for all “positives”. Here, the choice of threshold is irrelevant and any value will make both the sensitivity and specificity (as well as the accuracy) equal to 1. Needless to say, this is never the case (or you are a very, very good statistician!). In reality, there is a tradeoff between sensitivity and specificity. Indeed, in the extreme where all predictions are “positive”, the sensitivity will be 1 and the specificity will be 0. Conversely, if all cases are predicted as “negatives”, the sensitivity will be 0 and the specificity 1 (the case in our disease example). These extreme cases lead, of course, to useless models, so the threshold should go somewhere in the middle. A common measure of how well a model deals with this tradeoff is the area under the ROC curve (Receiver operating characteristic; again, don’t mind the odd name), or simply AUC. This curve plots the sensitivity against 1 - specificity for all values of the threshold. The area under it measures how close the model is to the fabled perfect model.
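To make the threshold sweep concrete, here is a small sketch that traces the ROC curve and integrates the area under it with the trapezoid rule. The scores and labels are made up for illustration; real libraries (e.g. scikit-learn) offer this out of the box.

```python
# Toy predicted probabilities and true labels (assumed for illustration).
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   1,    0,   1,   0,   1,   1]

def roc_points(scores, labels):
    """(FPR, TPR) pairs, i.e. (1 - specificity, sensitivity),
    for every possible decision threshold."""
    P = sum(labels)
    N = len(labels) - P
    pts = []
    for t in sorted(set(scores)) + [float("inf")]:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / N, tp / P))
    return sorted(pts)  # from (0, 0) up to (1, 1)

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

pts = roc_points(scores, labels)
print(auc(pts))  # 0.8125 for this toy data
```

An AUC of 0.5 corresponds to random guessing and 1.0 to the fabled perfect model; the 0.8125 here sits in between, reflecting the partly overlapping score distributions.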

Measuring whether a model is generalisable

Setting the choice of indicators aside, remember it is crucial to always calculate performance metrics of models on a set of values NOT used in the training of the model. Indeed, the model is meant to deal with previously unseen data and we want to measure its performance in the “real world”. There are many ways of setting some data aside for testing a model once it has been trained, but the main purpose of all these methods is to prevent the model from overfitting the training data. This means we want to avoid a model that follows our data so closely that it only works for these specific points. The problems that can arise from this are exemplified in the figure above. Here, a set of points distributed roughly along a line can be fitted by a linear regression that best reflects the trend (red line), or by a curve that passes through each point but completely misses the trend (grey line). Measuring the performance on the points used to build the model would lead us to choose the “perfect” fit rather than the actual trend.
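The simplest of these methods is a holdout split. A minimal sketch (function name and fractions are our own choices, not from the article):

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle the data and hold out a fraction for evaluation,
    so performance is measured on points the model never saw."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    shuffled = data[:]              # copy to keep the original order intact
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # 80 20
```

Cross-validation extends this idea by rotating which fraction is held out, so every point is used for testing exactly once.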

Putting it all together

Now that we have reviewed different possibilities comes the actual task of choosing one performance metric. This metric can be one of the above-mentioned ratios, but could also be an indicator made of two or more of these values. The grouping of values can be as simple as an average, or much more complex, as in the calculation of the AUC. Typical grouping methods include weighted averaging, geometric averaging, or harmonic means. It’s important to mention that there is no single perfect choice but rather a preferred direction this choice can take. To identify this direction, the business interests around the model have to be clearly identified. This involves answering questions such as:
  • Do the outcomes occur with similar frequency (i.e., is this a balanced problem)?
This will determine whether it is safe to use metrics that are sensitive to the proportion of the classes. If the problem is not balanced, these should be avoided. In our toy example, the problem was very unbalanced which rendered the accuracy measure useless.
  • What error is preferable/less costly?
This is best understood with an example: for a fraud detection model, is it better to flag an individual as fraudulent when they are not, or the inverse? The former would lead to further investigation, but the latter would allow the fraudsters to get away with their crime (and prevent any recovery by the authorities). In less clear-cut cases, the relative cost of each error can be used to combine indicators through weighted averages. This would lead to trying to minimise errors in general whilst encouraging the model to choose one type of mistake over the other.
  • Is there a specific output the business is interested in or should we focus on overall performance?
In other words, are we choosing between equally important outcomes (e.g., market goes up or down) and only global performance of the model is needed (e.g., accuracy, AUC), or is it a prediction of whether a key event is happening or not (e.g., predicting system failures)? In the latter case, we will probably prefer indicators that give us information on specific outcomes. General performance metrics are more commonly used in choosing one model type over the other whereas specific ones such as the weighted averages mentioned above are used to set the decision thresholds.
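One way to operationalise the error-cost question above is a cost-weighted error count. A sketch under assumed costs (the tenfold FN/FP ratio is purely illustrative):

```python
def weighted_cost(fp, fn, cost_fp=1.0, cost_fn=10.0):
    """Combined cost of errors; here a missed fraudster (FN) is
    assumed to be ten times as costly as a false alarm (FP)."""
    return cost_fp * fp + cost_fn * fn

# Two hypothetical models evaluated on the same test set:
model_a = weighted_cost(fp=50, fn=5)   # many alarms, few misses
model_b = weighted_cost(fp=5, fn=20)   # few alarms, many misses

print(model_a, model_b)  # 100.0 205.0 -> prefer model A
```

Although model B makes fewer errors in total (25 vs. 55), the cost weighting correctly favours model A, because its errors are of the cheaper kind.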

Wrapping up

Choosing the right performance metric to evaluate a predictive model can make the difference between it being useful or not. It will also make sure expectations on the utility of the model are realistic. Indeed, our initial example showed how a wrong indicator could present the model as great when it is in fact useless. It is also one of the key points of contact between the modelling work and the business interests. The involvement of the business in this decision and a clear formulation of its needs are therefore crucial in order to make any modelling project a success.
by Pablo Cogis

dataroots is a data science consultancy company

Marketing analytics

What’s Expected Customer Lifetime Value?

Expected customer lifetime value (eCLV) is a term which describes the net profit resulting from the future relationship with a customer. eCLV is an interesting concept as it shows you how much you can spend on acquiring a new customer. A calculation of eCLV generally sums the expected profits in future time periods, weighted by the estimated retention rate. Most models will also discount the cash flow. To make it a bit more concrete, you can find a simplified calculation of the eCLV below. In this example we expect $100 of net profit each year and a retention rate that drops 25 percentage points each year. We end up with an expected CLV of $250.

  $100 * 100% (year 1)
+ $100 *  75% (year 2)
+ $100 *  50% (year 3)
+ $100 *  25% (year 4)
+ $100 *   0% (year 5)
= $250
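The same calculation can be written as a small function, with an optional discount rate on future cash flow. A minimal sketch (function name and the 5% discount rate are our own choices for illustration):

```python
def expected_clv(profit_per_period, retention, discount_rate=0.0):
    """Sum of per-period profit, weighted by the probability the
    customer is still retained, optionally discounted over time."""
    clv = 0.0
    for t, p in enumerate(retention):
        clv += profit_per_period * p / (1 + discount_rate) ** t
    return clv

# Retention curve from the worked example: drops 25 points per year.
retention = [1.0, 0.75, 0.50, 0.25, 0.0]

print(expected_clv(100, retention))        # 250.0, as in the example
print(expected_clv(100, retention, 0.05))  # lower, since future profit is discounted
```

With a 5% discount rate the same cash flow is worth about $238 rather than $250, which illustrates why most models discount.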

Pitfalls in traditional customer lifetime value

While the number we end up with is actionable (i.e. we know how much we can spend on acquiring the customer), it is not necessarily correct. This traditional definition of eCLV has a number of pitfalls. The most notable ones are uncertainty and missing details. The first aspect, uncertainty, will always be present: the retention rate or the customer’s spending may end up different than expected. Uncertainty also increases sharply with the expected lifetime of a customer; we might be able to give relevant predictions for the coming year, but it becomes much more difficult to do so for the coming ten years. The second aspect, not enough detail, comes down to aggregation. Many eCLV models aggregate too much and thus do not take into account the specificity of certain customers. For example, the eCLV could be wildly different for a 35-year-old mother of two versus an 18-year-old student.

Predictive customer lifetime value

As theory in this domain has evolved, we have come to use the term predictive CLV instead of expected CLV. In predictive CLV we use machine learning techniques in combination with time series modeling to take into account all the details of a lead. So if the lead is a 40-year-old male who likes to do sports, has two kids and works a job in retail, the predictive CLV model will take all of this into account when calculating the value of this customer. By doing so, the pCLV gives a much more correct view of the value of a specific potential customer, as the value is based on all the characteristics of a person. So instead of asking yourself “how much can I spend on acquiring one customer?” you can now ask “how much can I spend on acquiring a customer who is a 40-year-old male, likes to do sports, has two kids and works a job in retail?” and end up with a much more realistic answer.

So why is it relevant to take all these aspects into account? Say, for example, you are in the sports retail industry; it seems logical that hours of sports per week are relevant in driving CLV. The graph above validates this train of thought, but also shows another interesting aspect. If we didn’t take gender into account, our model would average out the specificities of each gender, ending up with a much flatter line (the red one). This almost flat line seems to suggest that hours of sports per week are not important when it comes to driving CLV, and removes our understanding of the specificities of male vs. female behaviour. While we talk about the relationship between CLV, gender and hours of sports here, you can imagine that in a real-life case many more variables come into play. Once we reach more than three or four different characteristics and want to take into account the possible interplay between them, it becomes almost impossible to model this accurately using traditional CLV methods.
Predictive customer lifetime value (pCLV) can play an important role in making the eCLV model much more accurate. In pCLV, specific machine learning techniques are applied to understand the relationship between all these variables and build a model that uses all this information to estimate the pCLV as accurately as possible. When training such a model it is possible to estimate the level of certainty we have about a given pCLV. This information is very useful for using the output in an actionable way. Next to enabling more informed decisions, these objective measures of accuracy can also be used to further optimize the model.
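One simple way to attach a level of certainty to a CLV estimate is Monte Carlo simulation. The sketch below (entirely our own construction, not the article's method) treats each year's retention as a coin flip and reads a mean and an uncertainty band off the simulated outcomes:

```python
import random

def simulate_clv(profit, retention_probs, n_sims=10000, seed=1):
    """Monte Carlo sketch: each period, the customer stays with the
    given conditional probability; collect the resulting CLVs and
    report their mean and a central 90% interval."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_sims):
        clv, alive = 0.0, True
        for p in retention_probs:
            alive = alive and (rng.random() < p)
            if not alive:
                break
            clv += profit
        outcomes.append(clv)
    outcomes.sort()
    mean = sum(outcomes) / n_sims
    low, high = outcomes[int(0.05 * n_sims)], outcomes[int(0.95 * n_sims)]
    return mean, (low, high)

# Conditional yearly retention chosen so cumulative survival matches
# the earlier example's 100% / 75% / 50% / 25% / 0% curve.
mean, (low, high) = simulate_clv(100, [1.0, 0.75, 2 / 3, 0.5, 0.0])
print(mean, low, high)  # mean close to the analytical $250
```

The point of the exercise is the interval: a mean of roughly $250 hides the fact that individual customers land anywhere between $100 and $400, which matters when deciding how much to spend on any one acquisition.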

Linking acquisition, retention & churn to predictive CLV

Within the field of marketing analytics the state of the art has evolved quite rapidly. While topics such as acquisition, churn and retention each have their own specific best practices in terms of modeling, current techniques allow for a strong integration of these different analyses. The value of linking up these models is that it allows one to make strongly informed decisions. As the pCLV model can be made very detailed, it can give very detailed information on current and potential customers. For example, it will be able to show whether or not it is wise to invest in trying to acquire a specific customer. It will show the value of a current customer, allowing one to custom-tailor the retention strategy. Furthermore, it will show the potential value lost if a customer churns, which will help in deciding whether or not a recovery intervention should be executed.

Low hanging fruit

From experience, we can say that if no data science methods have so far been applied to optimize a company’s CLV calculation, there is almost always room to improve it using more state-of-the-art methods. It is difficult to estimate upfront what the cost of a predictive customer lifetime value model will be, as this is influenced by many factors (e.g. available data, quality of the data, business domain, volume of the customer set). However, it is generally quite fast to do a first explorative analysis to understand the business process, have a look at the available data and make an initial feasibility study on whether or not a custom-tailored pCLV model can translate into business value for a company.