CLV Prediction in R

Jul 10, 2022 5 min read PredictiveAnalytics, Probability, R, tutorial

Introduction

In the previous posts, we talked about the theoretical part of Pareto/NBD model. This time, we will go over the CLV prediction in R.

Customer lifetime value (CLV) is a huge topic and CLV prediction is super important for the retailers. The decision maker could use CLV to determine “who” is the best customer and get ready to target them.

Roughly speaking,

$$ CLV \approx \text{future num of transactions} \cdot \text{future average spending} $$

apply the Pareto/NBD model to predict the future # of transactions
apply the Gamma-Gamma model to predict the future average spending

This approach is a more probabilistic approach. Of course, we can use machine learning (tree based models) to predict the future total spending, say during next year, using tidymodels workflow, but personally I prefer the simple but powerful probability models.

R package

I mainly use the following 3 R package to fulfill the CLV task:

CLVTools is pretty comprehensive and powerful over all 3. BTYDPlus provides lots of extensions for BTYD, for example its MBG/NBD model fixed the limitation of the BG/BD. BTYD is the oldest one but good for learning propose.

library(zetaEDA)
library(zetaclv)
library(BTYD)
library(BTYDplus)
enable_zeta_ggplot_theme()

Import the fake transactional data for this tutorial,

cohort19 <- eg_trans_data %>% 
  # add column first purchase year
  with_groups(cust, mutate, fp_yr = min(lubridate::year(date))) %>% 
  filter(fp_yr == 2019) %>% 
  select(-fp_yr)

# have a look
cohort19 %>% 
  head() %>% 
  arrange(cust, date) %>%
  print_kbl(cap = "Cohort 2019 transaction data")

Table 1: Cohort 2019 transaction data
cust	date	sales
uid0001	2019-12-02	3470
uid0001	2020-04-20	1050
uid0001	2020-04-20	124
uid0005	2019-08-08	1169
uid0006	2019-04-20	508
uid0006	2020-04-09	922

Build the Pareto/NBD Model using CLVTools 📦.

# build clv class of data
dclv <- generate_clvdata(
  data = cohort19,
  timeUnit = "week", 
  splitDate = NULL
)

## Note that: time unit is in < week >

# model object
pareto_nbd <- pnbd(dclv)

summary(pareto_nbd)

## Pareto NBD Standard  Model 
## 
## Call:
## pnbd(clv.data = dclv)
## 
## Fitting period:                                
## Estimation start  2019-01-01    
## Estimation end    2021-06-07    
## Estimation length 126.8571 Weeks
## 
## Coefficients:
##       Estimate Std. Error  z-val Pr(>|z|)    
## r      0.45945    0.04783  9.606   <2e-16 ***
## alpha 20.06906    1.71733 11.686   <2e-16 ***
## s      0.18148    0.01842  9.853   <2e-16 ***
## beta   1.01572    0.44456  2.285   0.0223 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Optimization info:                  
## LL     -12872.0449
## AIC    25752.0898 
## BIC    25775.0767 
## KKT 1  TRUE       
## KKT 2  TRUE       
## fevals 24.0000    
## Method L-BFGS-B   
## 
## Used Options:                 
## Correlation FALSE

All these packages have common modeling steps, which will be showed in the next section.

Build `cbs` data

(C)ustomer-(B)y-(S)ufficient data includes:

x : number of repeat transactions.
- Usually, multiple transactions on the same day just count once; in other words, count the how many purchase days.
- 同一天多次购买，算一次。一般来说，是计数多少个不同的购买日期。
t.x : Time between first and last transactions.
- 首次购买至最后一次购买相隔的时间。
T.cal: Time between first transaction and end of calibration period.
- 首次购买至当前观测日相隔的时间。

More details for the cbs data, check this:

# the help doc explains all above variables
?BTYDplus::elog2cbs()

We have many ways to get cbs data from transactional data, here I use my own function.

cbs_dat <- generate_cbs(
  data = cohort19,
  timeUnit = "week", 
  splitDate = NULL
) %>% 
  select(cust, x, t.x, T.cal)

## Note that: time unit is in < week >

# check
cbs_dat %>% 
  head() %>% 
  print_kbl(cap = "example: CBS data")

Table 2: example: CBS data
cust	x	t.x	T.cal
uid0001	1	20.00000	79.00000
uid0005	0	0.00000	95.57143
uid0006	1	50.71429	111.28571
uid0010	0	0.00000	120.42857
uid0011	0	0.00000	124.85714
uid0012	0	0.00000	87.00000

Predict PAlive

This is the most tough part w.r.t computation in Pareto/NBD modeling. If you forget the story, please check my previous post here.

Result from “A Note on Deriving the Pareto/NBD Model and Related Expressions” shows that:

Checking source code

If we check the source code of the following functions, we’ll find they apply the above formula to get the probability of alive.

# check the source code:

# BTYD::pnbd.PAlive()

# BTYD::pnbd.generalParams

After checking, you will find that BTYD::pnbd.PAlive() internally called BTYD::pnbd.generalParams .

Insight

The probability of alive (PALive) for a given customer is a function of:

The 4 model parameters $ (r, \alpha, s, \beta) $.
The customer’s own CBS info, $ (x, t_x, T) $.

To sum up,

PAlive is a function of 4 parameters and individual’s CBS data.

$$ \text{PAlive} = \text{function}(r, \alpha, s, \beta, x, t_x, T) $$

PAlive DOES NOT depend on the prediction horizon $t $ ! This is a common mistake when someone explains this prediction result.

Interpretation

Let’s find the PAlive for cust:uid0001 who had x=1 repeat transactions (or 2 transactions in total),

# his own cbs
cbs_dat %>% 
  filter(cust == "uid0001")

##      cust x t.x T.cal
## 1 uid0001 1  20    79

# palive is a function of 4 model params and his own cbs
palive_val <- BTYD::pnbd.PAlive(
  params = unname(coef(pareto_nbd)),
  x = 1,
  t.x = 20, 
  T.cal = 79
)

palive_val

## [1] 0.6198724

For customer uid0001, the probability that he is still “alive” at present (i.e. $ \tau > T $ ) is 62.0%. In other words, there is 38.0% chance that the customer uid0001 had already leaved the company.

What about the customers with single purchase?

single_cbs_dat <- cbs_dat %>% 
  filter(x == 0) %>% 
  arrange(T.cal)

single_cbs_dat %>% 
  head() %>% 
  print_kbl(cap = "Example of the single purchase cbs data")

Table 3: Example of the single purchase cbs data
cust	x	t.x	T.cal
uid2509	0	0	74.85714
uid3443	0	0	74.85714
uid4183	0	0	74.85714
uid0301	0	0	75.00000
uid0727	0	0	75.00000
uid0799	0	0	75.00000

single_ids <- single_cbs_dat$cust

# use predict in clvtools to get all prediction results
all_preds <- predict(pareto_nbd, prediction.end = 52)

# discover the ones with single purchase
plotdata <- all_preds %>%
  filter(Id %in% single_ids) %>% 
  select(Id, PAlive) %>% 
  left_join(single_cbs_dat, by = c("Id" = "cust")) %>% 
  arrange(T.cal)

# have a look
plotdata %>% 
  head() %>% 
  print_kbl(cap = "Example of palive for the one with single purchase")

Table 4: Example of palive for the one with single purchase
Id	PAlive	T.cal
uid2509	0.3266206	74.85714
uid3443	0.3266206	74.85714
uid4183	0.3266206	74.85714
uid0301	0.3263572	75.00000
uid0727	0.3263572	75.00000
uid0799	0.3263572	75.00000

plotdata %>% 
  ggplot(aes(x = T.cal, y = PAlive)) +
  geom_jitter(alpha = .2, size = 2.5, width = .42) +
  geom_line(color = "tomato", size = 1.4) +
  labs(
    x = "T.cal (unit: week)",
    title = "PAlive (of single purchase customers) as a function of T.cal"
  )

If the customer had only 1 purchase, the earlier the purchase made, the more likely he will drop out. Fits our intuition.

Predict Conditional Expected Transaction

Insight

From last post, we have showed the Conditional Expected Transaction, CET, is the product of PAlive and the updated mean of Pareto/NBD:

CET = PAlive * updated mean of Pareto/NBD
The value of CET has already involved the probability of alive (PAlive)!

Example

Let’s check it by manually calculate the CET for the future 52 weeks.

params <- unname(coef(pareto_nbd))

# model parameters
r = params[1]
a = params[2]
s = params[3]
b = params[4]

# the customers cbs
cbs_dat %>% 
  filter(cust == "uid0001")

##      cust x t.x T.cal
## 1 uid0001 1  20    79

Setting inputs,

# note that from above formula 't.x' is not needed for the updated mean
x = 1
Tcal = 79

# prediction time horizon
t = 52

# follow the formula to compute updated mean of pnbd
part1 <- (r+x)*(b+Tcal)
part2 <- (a+Tcal)*(s-1)
part3 <- 1 - ((b+Tcal)/(b+Tcal+t))^(s-1)

updated_mean <- part1/part2*part3

updated_mean

## [1] 0.7295102

The “updated” mean of Pareto/NBD DOES NOT involve the individual’s t.x, only involves x and Tcal of the individual.

The individual’s t.x has nothing to do with “updated” mean of Pareto/NBD. This is a interesting point!
However, individual’s t.x definitely affects the PAlive.

# CET calculated manually
updated_mean * palive_val

## [1] 0.4522032

# CET calculated using BTYD
BTYD::pnbd.ConditionalExpectedTransactions(
  params = params,
  T.star = 52, 
  x = 1, 
  t.x = 20, 
  T.cal = 79
)

## [1] 0.4522032

# CET calculated using CLVTools
all_preds %>% 
  select(Id, CET) %>% 
  filter(Id == "uid0001") %>% 
  pull(CET)

## [1] 0.4522032

Oh yeah, all the above results are the same 🙂.

What happened on “updated” mean of Pareto/NBD?

# mean of Pareto/NBD
# without the individual level information
mean_pnbd <- BTYD::pnbd.Expectation(params = params, t = 52)
mean_pnbd

## [1] 0.6949746

# updated mean of Pareto/NBD
# for customer uid0001
updated_mean

## [1] 0.7295102

After updating the parameter with the individual’s behavior, we can see the “new mean” increase a little bit from 0.695 to 0.73.

Interpreation

Under customer uid0001’s past behavior, i.e. given his cbs data, the model expects he will make 0.45 number of transactions in the future 52 weeks.

Predict Average Spending

Insight

We will predict the customer’s future average spending using Gamma-Gamma-Model (if you don’t remember the story, click it). We have showed that ,

The moral of the story is that:

If the customer has many transactions, then his own average spending “will say more”, and the population mean “will say less”.
On the other hand, if the customer has a small number of transactions, then the population mean “will say more” on his future average spending.

Parameter Estimation (including the 1st trans.)

spend_cbs_data <- generate_cbs(
  data = cohort19,
  timeUnit = "week",
  splitDate = NULL
) %>% 
  # total number of tran
  mutate(num_trans = x + 1) %>% 
  # average spending
  mutate(avg_spend = sales/num_trans) %>% 
  mutate(is_single = x == 0) %>%
  as_tibble()

## Note that: time unit is in < week >

Same results obtained from two packages:

# check the function in BTYD
?BTYD::spend.EstimateParameters

params <- BTYD::spend.EstimateParameters(
  m.x.vector = spend_cbs_data$avg_spend,
  x.vector = spend_cbs_data$num_trans
)

names(params) <- c("p", "q", "r")

params

##           p           q           r 
##    1.749611    2.686046 2939.254721

# check the gg function in CLVTools
?CLVTools::gg

ggamma <- CLVTools::gg(clv.data = dclv, remove.first.transaction = FALSE)

ggamma

## Gamma-Gamma Model
## 
## Call:
## CLVTools::gg(clv.data = dclv, remove.first.transaction = FALSE)
## 
## Coefficients:
##        p         q     gamma  
##    1.750     2.686  2939.255  
## KKT1: TRUE 
## KKT2: TRUE

Understand weighted average

Calculate the weight for the given individual’s observed average,

# calculate the weight for the given individual's observed average
cal_ind_wt <- function(param_vec, num_trans) {
  p = param_vec[1]
  q = param_vec[2]
  r = param_vec[3]
  x = num_trans
  
  unname(
    (p*x) / (p*x + q - 1)
  )
}

Calculate the population mean,

cal_pop_mean <- function(param_vec) {
  p <- param_vec[1]
  q <- param_vec[2]
  r <- param_vec[3]

  if (q < 1) {
    stop("q must be greater than 1!")
  }

  # return
  res <- (p * r) / (q - 1)
  
  unname(res)
}

# what is the pop mean
pop_mean_spend <- cal_pop_mean(params)

pop_mean_spend

## [1] 3050.065

Now let’s compute the weight for each customer,

plotdata <- spend_cbs_data %>% 
  select(cust, is_single, num_trans, avg_spend) %>%
  mutate(ind_wt = cal_ind_wt(params, num_trans)) %>%
  mutate(pop_wt = 1 - ind_wt)

set.seed(123)

plotdata %>% 
  slice_sample(n = 10) %>% 
  mutate(across(ends_with("wt"), fmt_pct)) %>% 
  mutate(avg_spend = fmt_comma(avg_spend)) %>%
  print_kbl(cap = "weight for individual vs. weight for pop")

Table 5: weight for individual vs. weight for pop
cust	is_single	num_trans	avg_spend	ind_wt	pop_wt
uid4895	FALSE	3	1,574	76%	24%
uid1199	TRUE	1	6,809	51%	49%
uid0433	FALSE	4	1,971	81%	19%
uid4051	FALSE	12	2,022	93%	7%
uid2552	TRUE	1	3,132	51%	49%
uid2792	TRUE	1	280	51%	49%
uid2829	TRUE	1	4,485	51%	49%
uid2348	TRUE	1	1,730	51%	49%
uid1525	FALSE	4	2,005	81%	19%
uid3588	TRUE	1	1,706	51%	49%

plotdata %>% 
  distinct(num_trans, .keep_all = TRUE) %>% 
  select(num_trans, ind_wt) %>% 
  ggplot(aes(x = num_trans, y = ind_wt)) +
  geom_point() +
  geom_line() + 
  scale_y_continuous(labels = fmt_pct) +
  labs(title = "Individual Weight as a function of Num. of Trans.")

Parameter Estimation (without the 1st trans.)

params2 <- CLVTools::gg(clv.data = dclv, remove.first.transaction = TRUE) %>% coef

params2

##           p           q       gamma 
##    1.309883    2.542147 3975.335041

plotdata <- spend_cbs_data %>% 
  select(cust, is_single, num_trans, avg_spend) %>%
  mutate(ind_wt = cal_ind_wt(params2, num_trans)) %>%
  mutate(pop_wt = 1 - ind_wt)

plotdata %>% 
  distinct(num_trans, .keep_all = TRUE) %>% 
  select(num_trans, ind_wt) %>% 
  ggplot(aes(x = num_trans, y = ind_wt)) +
  geom_point() +
  geom_line() + 
  scale_y_continuous(labels = fmt_pct) +
  labs(title = "Individual Weight as a function of Num. of Trans.")

Probability Rprogramming PNBD CLV CLVTools BTYD BTYDPlus marketing zeta.clv

Chen Xing

Founder & Data Scientist

Enjoy Life & Enjoy Work!

CLV Prediction in R

Introduction

R package

Build cbs data

Predict PAlive

Checking source code

Insight

Interpretation

What about the customers with single purchase?

Predict Conditional Expected Transaction

Insight

Example

What happened on “updated” mean of Pareto/NBD?

Interpreation

Predict Average Spending

Insight

Parameter Estimation (including the 1st trans.)

Understand weighted average

Parameter Estimation (without the 1st trans.)

Chen Xing

Founder & Data Scientist

Related

Build `cbs` data