Incorporating Time-Invariant Covariates into PNBD II

Introduction

In this post, we’ll continue discover the difference between the Cohort Modeling and the Extended PNBD Model with time-invariant covariates. Check here to see the previous post.

Data Prepare

First let’s import the transactional data, and select the customers who first purchase happened in 2017-2018.

trans_dat <- read_rds("~/Desktop/trans_dat.rds")

# select the customers who first purchase happened in 2017-2018
trans_dat <- trans_dat %>%
  filter(date <= as.Date("2019-12-31")) %>%
  filter(fp_yr <= 2018)

# glimpse
head(trans_dat)
## # A tibble: 6 × 4
##   cust  date        sales fp_yr
##   <chr> <date>      <dbl> <dbl>
## 1 10017 2018-03-08   674.  2018
## 2 10021 2017-08-27   495.  2017
## 3 10040 2017-05-13  2781   2017
## 4 10041 2018-07-21    72   2018
## 5 10048 2018-10-10  4472.  2018
## 6 10049 2017-02-05 18780   2017
summary(trans_dat)
##      cust                date                sales              fp_yr     
##  Length:14943       Min.   :2017-01-01   Min.   :     0.5   Min.   :2017  
##  Class :character   1st Qu.:2017-06-11   1st Qu.:   414.2   1st Qu.:2017  
##  Mode  :character   Median :2018-02-16   Median :  1003.8   Median :2017  
##                     Mean   :2018-04-08   Mean   :  2147.1   Mean   :2017  
##                     3rd Qu.:2018-12-28   3rd Qu.:  2168.5   3rd Qu.:2018  
##                     Max.   :2019-12-31   Max.   :476659.0   Max.   :2018

How many new customers joined the enterprise in 2017 and 2018?

trans_dat %>%
  group_by(fp_yr) %>%
  summarize(new_cust = n_distinct(cust))
## # A tibble: 2 × 2
##   fp_yr new_cust
##   <dbl>    <int>
## 1  2017     5114
## 2  2018     2767

Train/Test

  • Training: Two years, 2017-2018, transactional data.

  • Testing: One year, 2019, transactional data.

The Cohort Models

Cohort 2017

Prepare CLV data,

ch17_dat <- trans_dat %>%
  filter(fp_yr == 2017)

# build clv type of data
dclv_17 <- generate_clvdata(
  data = ch17_dat,
  splitDate = "2018-12-31"
)
## Note that: time unit is in < week >

Build PNBD Model,

m17 <- standard_PNBD_model(dclv_17)

# check
summary(m17)
## Pareto NBD Standard  Model 
## 
## Call:
## pnbd(clv.data = clvData, optimx.args = optimx.args)
## 
## Fitting period:                                
## Estimation start  2017-01-01    
## Estimation end    2018-12-31    
## Estimation length 104.1429 Weeks
## 
## Coefficients:
##       Estimate Std. Error  z-val Pr(>|z|)    
## r      0.84155    0.08460  9.947  < 2e-16 ***
## alpha 54.31617    4.38954 12.374  < 2e-16 ***
## s      0.32609    0.03272  9.967  < 2e-16 ***
## beta   3.20325    1.06076  3.020  0.00253 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Optimization info:                  
## LL     -16587.4554
## AIC    33182.9107 
## BIC    33209.0697 
## KKT 1  TRUE       
## KKT 2  TRUE       
## fevals 409.0000   
## Method Nelder-Mead
## 
## Used Options:                 
## Correlation FALSE

Check the tracking plots,

plot(m17)
plot(m17, cumulative = TRUE)

On the testing period, the model curve is under the actual number of repeat transactions curve.

Cohort 2018

Similarly, we’ll get,

dclv_18 <- trans_dat %>%
  filter(fp_yr == 2018) %>%
  generate_clvdata(splitDate = "2018-12-31")
## Note that: time unit is in < week >
# model
m18 <- standard_PNBD_model(dclv_18)

# check model coef
summary(m18)
## Pareto NBD Standard  Model 
## 
## Call:
## pnbd(clv.data = clvData, optimx.args = optimx.args)
## 
## Fitting period:                               
## Estimation start  2018-01-01   
## Estimation end    2018-12-31   
## Estimation length 52.0000 Weeks
## 
## Coefficients:
##       Estimate Std. Error z-val Pr(>|z|)    
## r      1.05005    0.30318 3.463 0.000533 ***
## alpha 40.98040   10.09464 4.060 4.92e-05 ***
## s      0.31008    0.04534 6.839 7.95e-12 ***
## beta   0.42186    0.22740 1.855 0.063572 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Optimization info:                  
## LL     -2529.2006 
## AIC    5066.4012  
## BIC    5090.1032  
## KKT 1  TRUE       
## KKT 2  TRUE       
## fevals 285.0000   
## Method Nelder-Mead
## 
## Used Options:                 
## Correlation FALSE

Again, checking the tracking plots

plot(m18)
plot(m18, cumulative = TRUE)

On the testing period, the model curve is under the actual number of repeat transactions curve.

Interpret Model Coef

  • The 2017 cohort has lower transaction rate and lower dropout rate comparing with the 2018-cohort.
Table 1: cohort model coefs
cohort r alpha s beta lambda mu
2017 0.842 54.316 0.326 3.203 0.015 0.102
2018 1.050 40.980 0.310 0.422 0.026 0.735

Accuracy of Predictions

Let’s find the MAE and RMSE of the conditional expected transactions (CET) on the testing set.

Table 2: Cohort Model Error Measure on cohort 2017, 2018 and all together
cohort MAE RMSE Corr
2017 0.431 0.924 0.517
2018 0.615 1.089 0.424
All 0.496 0.985 0.484

Let’s check the error distribution for cohort 2017,

Why the customers purchased more in 2019, i.e. the actual transactions are larger than the model predictions?

After we talked with the company’s manger, we knew that the company started to perform a new membership service plan in 2019. The new service mode worked pretty well and leaded to the increasing number of repeat transactions.

The Extended PNBD Model

# set the fp_yr as a factor variable
trans_dat <- trans_dat %>%
  mutate(fp_yr = factor(fp_yr))

# check the factor level
levels(trans_dat$fp_yr)
## [1] "2017" "2018"

Note that, the model will use 2017 cohort as baseline or reference group.

# prepare static covariates data
static_cov_dat <- trans_dat %>%
  distinct(cust, fp_yr)

# clv data
dclv_all <- generate_clvdata(trans_dat, splitDate = "2018-12-31")
## Note that: time unit is in < week >
# make static covariates data as clv type of data
dclv_stat <- SetStaticCovariates(
  clv.data = dclv_all,
  data.cov.life = static_cov_dat,
  data.cov.trans = static_cov_dat,
  names.cov.life = "fp_yr",
  names.cov.trans = "fp_yr",
  name.id = "cust"
)

Now, we build the extended PNBD model with time-invariant covariates, which are fp_yr in this case.

ext_m <- standard_PNBD_model(
  dclv_stat,
  start.params.model = c(r = .88, alpha = 56.16, s = .29, beta = 2.25),
  start.params.life = c(fp_yr2018 = 2.01),
  start.params.trans = c(fp_yr2018 = .55)
)

# check the model
summary(ext_m)
## Pareto NBD with Static Covariates  Model 
## 
## Call:
## pnbd(clv.data = clvData, optimx.args = optimx.args, start.params.life = ..1, 
##     start.params.trans = ..2)
## 
## Fitting period:                                
## Estimation start  2017-01-01    
## Estimation end    2018-12-31    
## Estimation length 104.1429 Weeks
## 
## Coefficients:
##                 Estimate Std. Error  z-val Pr(>|z|)    
## r                0.86825    0.08145 10.660  < 2e-16 ***
## alpha           55.50864    4.28397 12.957  < 2e-16 ***
## s                0.31854    0.02583 12.332  < 2e-16 ***
## beta             2.90289    0.80920  3.587 0.000334 ***
## life.fp_yr2018   1.67367    0.29780  5.620 1.91e-08 ***
## trans.fp_yr2018  0.45153    0.10167  4.441 8.95e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Optimization info:                  
## LL     -19116.9873
## AIC    38245.9746 
## BIC    38287.8079 
## KKT 1  TRUE       
## KKT 2  TRUE       
## fevals 581.0000   
## Method Nelder-Mead
## 
## Used Options:                     
## Correlation     FALSE
## Regularization  FALSE
## Constraint covs FALSE

After checking the summary of the model, we know that the fp_yr affects both purchase process and attrition process (P-values are significant).

Now let’s check the tracking plot,

plot(ext_m)
plot(ext_m, cumulative = TRUE)

Interpret Model Coef

Since the 2017 cohort as the reference group, we already know its corresponding r, alpha, s, beta. However, we still need to calculate the alpha and beta for the 2018 cohort.

Plug into the formula: alpha = alpha0 * exp(-gamma1 * z1) and beta = beta0 * exp(-gamma2 * z2)

Table 3: Extended PNBD with time-invariate covariates model coefficients
cohort r alpha s beta lambda mu
2017 0.868 55.509 0.319 2.903 0.016 0.110
2018 0.868 35.340 0.319 0.544 0.025 0.585

If we compare this model’s coefficients with the cohort models’, we’ll find the purchase/dropout rate estimates are quite similar for both models.

Accuracy of Predictions

Again, calculate the MAE and RMSE on CET.

ext_m_pred <- predict(ext_m) %>%
  select(cust = Id, actual.x, CET) %>%
  left_join(static_cov_dat)

ext_m_17_acc <- ext_m_pred %>%
  filter(fp_yr == 2017) %>%
  cal_err()

ext_m_18_acc <- ext_m_pred %>%
  filter(fp_yr == 2018) %>%
  cal_err()

ext_m_all_acc <- ext_m_pred %>%
  cal_err()

# result
bind_rows(ext_m_17_acc, ext_m_18_acc, ext_m_all_acc) %>%
  mutate(cohort = c(2017:2018, "All"), .before = 1) %>%
  mutate(across(where(is.numeric), round, digits = 3)) %>%
  print_kbl(
    cap = "Extended PNBD Model Error Measure on cohort 2017, 2018 and all together"
  )
Table 4: Extended PNBD Model Error Measure on cohort 2017, 2018 and all together
cohort MAE RMSE Corr
2017 0.432 0.924 0.517
2018 0.615 1.085 0.429
All 0.496 0.983 0.486

Conclusion

Comparing the prediction error of two models, we find that two models prediction behaviors are very close. I prefer the extended PNBD model for this data.

Chen Xing
Chen Xing
Founder & Data Scientist

Enjoy Life & Enjoy Work!

Related