# How do we know that your customers are leaving?

ML classification algorithms may save companies money by predicting churn. To build such a classifier you need access to your customers' profiles and an initial churn labelling, which makes this a typical supervised learning task.

But first of all, a business may need to predict precisely when those losses are going to happen. Knowing that, it can push early, individual promo offers in an attempt to keep the customers.

The latter seems to be the trickier challenge, considering that in real-life scenarios you often have no more than a list of orders per customer to work with. At RBC Group we have defined a solution to get (nearly) online labelling of churning customers based purely on the frequency and the sum of their previous orders.

## Data generation and the core concept

So, according to the above, the only data we will use to demonstrate our approach is a randomly generated dataset containing the sum and the date of every order made within two years by 1000 unique customers. The order frequency is defined as the number of days between two consecutive orders.
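A dataset of this shape can be simulated in a few lines. The block below is a minimal sketch, not the generator used for the experiments: the column names (`customer_id`, `order_date`, `amount`, `gap_days`), the start date, the seed and the per-customer gap and amount ranges are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)          # fixed seed (assumption) for reproducibility

N_CUSTOMERS = 1000
N_DAYS = 730                             # two years
START = pd.Timestamp("2020-01-01")       # hypothetical start of the window

rows = []
for customer_id in range(N_CUSTOMERS):
    # Hypothetical habits of one customer: typical gap (days) and typical amount.
    typical_gap = rng.uniform(10, 60)
    typical_amount = rng.uniform(20, 200)
    day = rng.uniform(0, typical_gap)    # day of the first order
    while day < N_DAYS:
        amount = max(1.0, rng.normal(typical_amount, 0.25 * typical_amount))
        rows.append((customer_id, START + pd.Timedelta(days=int(day)), round(amount, 2)))
        # Next order: noisy gap around the customer's habit, at least one day later.
        day += max(1.0, rng.normal(typical_gap, 0.3 * typical_gap))

orders = pd.DataFrame(rows, columns=["customer_id", "order_date", "amount"])
orders = orders.sort_values(["customer_id", "order_date"]).reset_index(drop=True)

# Order frequency: days between consecutive orders of the same customer
# (NaN for each customer's first order).
orders["gap_days"] = orders.groupby("customer_id")["order_date"].diff().dt.days
```

Each customer gets their own rhythm and spending level, so the per-customer gap distributions differ, which is what the rest of the approach relies on.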

Thus our final goal is to predict whether the number of days since the last known order is within the given customer’s ‘regular’ order frequency. In other words, we will use the cumulative distribution function of the customer’s order frequency (via 1 - CDF) to calculate the probability that this number of days is greater than expected. If that probability is above a certain threshold, the customer is likely leaving.
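To make the idea concrete, here is a minimal sketch for a single customer. It rests on one reading of the text, which is an assumption: 1 - CDF(d) is the chance that a regular gap lasts at least d days, so a small value means the current silence is unusual, and the churn score is taken as its complement. The function name `churn_score` is hypothetical, and the distribution parameters are simply set to the sample mean and standard deviation rather than fitted by maximum likelihood.

```python
import numpy as np
from scipy.stats import truncnorm

def churn_score(gaps_days, days_since_last):
    """Score in [0, 1]: how unusual the current silence is for this customer.

    gaps_days       -- past gaps (in days) between consecutive orders, at least two
    days_since_last -- days elapsed since the last known order
    """
    gaps = np.asarray(gaps_days, dtype=float)
    mu = gaps.mean()
    sigma = max(gaps.std(ddof=1), 1e-6)   # guard against perfectly regular customers
    # truncnorm takes its bounds in standard units; gaps cannot be negative.
    a = (0.0 - mu) / sigma
    dist = truncnorm(a, np.inf, loc=mu, scale=sigma)
    survival = dist.sf(days_since_last)   # 1 - CDF: a regular gap lasts this long or longer
    return float(1.0 - survival)

gaps = [28, 30, 31, 29, 32]               # toy customer ordering roughly monthly
score_late = churn_score(gaps, 45)        # long silence, score close to 1
score_early = churn_score(gaps, 20)       # still well within the usual rhythm, close to 0
```

A customer is then flagged once the score exceeds a chosen threshold.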

## Tweaking probability distribution function

Fitting the customer’s lifecycle (i.e. their regular order frequency) will be done further with scipy.stats.truncnorm. Yet we suggest considering not only the breaks between orders (i.e. order frequency) but also the sum of those orders. This is the second predictor available to us for modelling the behaviour of our customers. One can suppose that the average sum of orders (as a consumer pattern) is as important as their regularity. It means that we should rely more heavily on orders whose amount of money paid (hereinafter the order amount) is close to the customer’s average level of expenses.

Thus while fitting our empirical CDF we will calculate the mean order frequency for a given customer as a weighted average, with weights inversely proportional (i.e. *fading*) to the distance between the order’s amount and the average amount of all the customer’s known orders (i.e. its **zscore**).
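The weighting can be sketched as follows. The text only says the weights fade with the z-score, so the exact form `1 / (1 + |z|)` used here is an assumption, as are the function name `weighted_gap_stats` and the convention that `amounts[i]` belongs to the order that closes gap `i`.

```python
import numpy as np

def weighted_gap_stats(gaps_days, amounts):
    """Weighted mean and SD of order gaps; typical-sized orders count more."""
    gaps = np.asarray(gaps_days, dtype=float)
    amounts = np.asarray(amounts, dtype=float)

    sd = amounts.std()
    if sd > 0:
        z = np.abs(amounts - amounts.mean()) / sd   # absolute z-score of each amount
    else:
        z = np.zeros_like(amounts)                  # all amounts equal: no fading

    weights = 1.0 / (1.0 + z)                       # assumed fading form
    mean = np.average(gaps, weights=weights)
    var = np.average((gaps - mean) ** 2, weights=weights)
    return float(mean), float(np.sqrt(var))

# Toy customer: three regular orders and one unusually large, late one.
gaps = [30, 30, 30, 90]
amounts = [50, 52, 48, 400]
w_mean, w_sd = weighted_gap_stats(gaps, amounts)
```

In this toy case the atypical 400 order gets the smallest weight, so the weighted mean gap lands below the plain mean of 45 days. The resulting mean and SD can then serve as the location and scale of the fitted distribution.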

That is what we call tweaking the probability distribution function.

This is how our approach performs in practice when we pretend that each customer’s last order is a test. *(Note that all the probabilities that fall within the ‘churn risk’ zone here are false positives: the test orders were in fact made by the customers, we just clipped them out of the corresponding subsets while modelling the ECDF.)*

## Evaluating the performance

For this purpose, we need to split our generated dataset into train and test using the three-sigma rule of thumb, which means that we will calculate the duration of the test period as: *(mean order frequency (in days) for all the customers + 3 x SD)*.

All the customers that have made at least one purchase within the test period are considered to be *stay_true*, while all the rest (i.e. those who made their last purchase within the training period) are *churn_true*.

The end of the training period is the so-called *check_date*. We will calculate the difference between the customer’s last order and the *check_date*, and use the tweaked ECDF to estimate the confidence (probability) that this difference (in days) lies outside the regular order frequency.

As the baseline approach, we will use the above-mentioned three-sigma rule of thumb: if the difference between *check_date* and the date of the last order is above *(mean + 3 x SD)*, the customer is labelled as *churn_pred*. Alternatively, we will apply our ECDF method to classify the customers based on their calendar of orders, using different thresholds for the labelling function.
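The split and the ground-truth labels can be sketched on a tiny hand-made dataset. Several details are assumptions: the gaps of all customers are pooled to get the mean and SD, the test period is rounded up to whole days, and it is counted back from the last order date in the data. The three toy customers are illustrative only.

```python
import numpy as np
import pandas as pd

# Three toy customers with regular rhythms; customer 2 stops ordering early.
orders = pd.concat(
    [
        pd.DataFrame({"customer_id": 1,
                      "order_date": pd.date_range("2021-01-01", periods=10, freq="10D")}),
        pd.DataFrame({"customer_id": 2,
                      "order_date": pd.date_range("2021-01-01", periods=4, freq="12D")}),
        pd.DataFrame({"customer_id": 3,
                      "order_date": pd.date_range("2021-01-05", periods=8, freq="11D")}),
    ],
    ignore_index=True,
)

# Gaps between consecutive orders, pooled over all customers.
gaps = (
    orders.sort_values(["customer_id", "order_date"])
    .groupby("customer_id")["order_date"]
    .diff()
    .dt.days
    .dropna()
)

# Three-sigma rule: length of the test period in days.
test_days = int(np.ceil(gaps.mean() + 3 * gaps.std()))

end_date = orders["order_date"].max()
check_date = end_date - pd.Timedelta(days=test_days)   # end of the training period

# A customer with no order after check_date is a true churner.
last_order = orders.groupby("customer_id")["order_date"].max()
churn_true = last_order <= check_date
stay_true = ~churn_true
```

Here only customer 2, whose last order falls well before the test period, ends up in *churn_true*.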
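Both labelling rules can be put side by side in one small function. This is a sketch under assumptions: the mean and SD come from the customer's own gaps (plain, unweighted statistics here; the weighted ones from the tweaking step could be substituted), the distribution is a zero-truncated normal, and the name `predict_churn` and the thresholds 0.95 and 0.99 are illustrative.

```python
import numpy as np
from scipy.stats import truncnorm

def predict_churn(gaps_days, days_since_last, threshold=None):
    """Return True if the customer is labelled as churn.

    threshold=None -> three-sigma baseline: silence longer than mean + 3 * SD
    threshold=p    -> distribution-based rule: CDF(days_since_last) > p
    """
    gaps = np.asarray(gaps_days, dtype=float)
    mu = gaps.mean()
    sigma = max(gaps.std(ddof=1), 1e-6)

    if threshold is None:
        return bool(days_since_last > mu + 3 * sigma)

    # Bounds are given in standard units; gaps cannot be negative.
    dist = truncnorm((0.0 - mu) / sigma, np.inf, loc=mu, scale=sigma)
    return bool(dist.cdf(days_since_last) > threshold)

gaps = [28, 30, 31, 29, 32]      # toy customer ordering roughly monthly
days_since_last = 33             # days between the last order and check_date

baseline = predict_churn(gaps, days_since_last)                 # three-sigma rule
strict = predict_churn(gaps, days_since_last, threshold=0.95)   # lower threshold
lenient = predict_churn(gaps, days_since_last, threshold=0.99)  # higher threshold
```

For this toy customer a silence of 33 days is not enough for the three-sigma rule, is flagged at the 0.95 threshold, and is let through at 0.99, which shows how the threshold controls the sensitivity of the labelling.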

Here is the confusion matrix for all the experiments. *(Note: since the data is randomly generated, the churn rate may seem implausible.)*
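For reference, the four cells of such a matrix can be counted directly from the true and predicted labels. The sketch below treats churn as the positive class; the helper name `confusion_counts` and the six toy labels are made up for illustration and are not results of the experiments.

```python
import numpy as np

def confusion_counts(churn_true, churn_pred):
    """Count confusion-matrix cells, with churn as the positive class."""
    t = np.asarray(churn_true, dtype=bool)
    p = np.asarray(churn_pred, dtype=bool)
    return {
        "TP": int((t & p).sum()),     # churners correctly flagged
        "FP": int((~t & p).sum()),    # loyal customers flagged by mistake
        "FN": int((t & ~p).sum()),    # churners that were missed
        "TN": int((~t & ~p).sum()),   # loyal customers left alone
    }

# Toy labels for six customers (illustrative only).
churn_true = [True, True, False, False, False, True]
churn_pred = [True, False, False, True, False, True]
counts = confusion_counts(churn_true, churn_pred)
```

Running the same helper once per labelling rule and threshold gives one matrix per experiment.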

So here you see how our proposed approach outperformed the three-sigma rule by detecting leaving customers more accurately. Also note how increasing the threshold decreases the number of false-positive errors, meaning fewer customers will receive potentially annoying reminders from you (aka spam). The choice of threshold, as usual, depends on the nature of your business.