Santander Case — Part B: Net Promoter Score (NPS)

Here you will find: a system to score the satisfaction level of your customers.

The Problem

NPS is a management tool used as a measure of customer satisfaction and has been shown to correlate with revenue growth relative to competitors. NPS has been widely adopted by Fortune 500 companies and other organizations.

The metric was developed by (and is a registered trademark of) Fred Reichheld, Bain & Company and Satmetrix. It was introduced by Reichheld in his 2003 Harvard Business Review article, “The One Number You Need to Grow”. Its popularity and broad use have been attributed to its simplicity and its openly available methodology.

In this task, we need to assign each customer in the test set a rating from 1 to 5 that represents their level of satisfaction, based on the ‘TARGET’ feature. The following points guide the scoring system:

  • 1 represents the most dissatisfied and 5 the most satisfied;
  • The retention program should only be applied to customers with a satisfaction score of 1.

You can check the complete notebook with this solution on my Github.

This case was made as part of the prize for winning the Santander Data Masters Competition. I explain more about the competition itself, the hard skills I learned, and the soft skills I used on my way to winning it in this article.

1 Loading Data and Packages

# Loading packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

%matplotlib inline

# Loading the train and test datasets
df_train = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")

The data can be found in this old Santander competition.

2 The Classification Model

Knowing the satisfaction score allows us to take the actions that maximize profit, as well as to understand the behaviour and satisfaction of each customer.

2.1 Method

The classification model that we built in Part A can also output the probability of a customer being unsatisfied.

By using this type of output, we can then create 5 intervals, one for each level of satisfaction. The customer will receive a satisfaction label according to the interval in which the outputted probability fits.

So let’s first rebuild the model of Part A.

2.2 Dataset Split (train — test)

As said in Part A, section 3, the train_test_split method splits the data randomly. With an extremely unbalanced dataset, the split should ensure that both the training and the testing sets have the same proportion of unsatisfied customers.
However, because a purely random split cannot guarantee this, we make a stratified split based on the TARGET variable, which keeps the proportion as close as possible in both datasets.

from sklearn.model_selection import train_test_split

# Splitting the dataset in a proportion of 80% for train and 20% for test.
X_train, X_test, y_train, y_test = train_test_split(
    df_train.drop('TARGET', axis = 1), df_train.TARGET,
    train_size = 0.8, stratify = df_train.TARGET, random_state = 42)

# Checking the split
X_train.shape, y_train.shape[0], X_test.shape, y_test.shape[0]
Output of the code above.
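
To make the effect of stratify concrete, here is a minimal sketch on toy data (not the Santander data). The 4% share of unsatisfied customers and the variable names are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption, not the Santander data):
# 1000 customers, 40 of them (4%) unsatisfied.
y = np.array([1] * 40 + [0] * 960)
X = np.arange(len(y)).reshape(-1, 1)

# Stratified 80/20 split, mirroring the call used in this section.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42)

# Both splits keep the same share of unsatisfied customers.
print(y_tr.mean(), y_te.mean())
```

Without stratify, the share in each split would depend on the random draw, which matters when the positive class is this rare.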

2.3 Rebuilding the selected dataset

Here we need to:

  • Remove constant / semi-constant features;
  • Remove duplicate features;
  • Select only the best 96 features we found in Part-A.

Removing constant and semi-constant features:

# Investigating if there are constant or semi-constant features in X_train
from sklearn.feature_selection import VarianceThreshold
# Removing all features that have variance under 0.01
selector = VarianceThreshold(threshold = 0.01)
selector.fit(X_train)
mask_clean = selector.get_support()
X_train = X_train[X_train.columns[mask_clean]]
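
As a quick illustration of what this filter does, the sketch below applies the same VarianceThreshold call to a toy dataframe. The column names and values are made up for the example.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy dataframe (hypothetical columns): one constant feature, one that
# barely changes, and one that varies a lot.
toy = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1, 1],
    "almost_constant": [0, 0, 0, 0, 0, 0.1],
    "informative": [0, 1, 0, 1, 0, 1],
})

# Same threshold as above: keep only features with variance above 0.01.
selector = VarianceThreshold(threshold=0.01)
selector.fit(toy)
kept = toy.columns[selector.get_support()].tolist()
print(kept)
```

Note that variance depends on the scale of each feature, so a fixed threshold such as 0.01 treats features measured in different units differently.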

Removing duplicate features:

# Checking if there is any duplicated column
remove = []
cols = X_train.columns
for i in range(len(cols)-1):
    column = X_train[cols[i]].values
    for j in range(i+1, len(cols)):
        if np.array_equal(column, X_train[cols[j]].values):
            remove.append(cols[j])
# If yes, then they are dropped here
X_train = X_train.drop(remove, axis = 1)
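
A more compact alternative to the nested loop, sketched here on a toy dataframe with made-up columns, is to transpose the dataframe and let pandas flag the repeated rows. This is a different technique from the loop above and is shown only as an option.

```python
import pandas as pd

# Toy dataframe (hypothetical): "b" and "d" are exact copies of "a".
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],
    "c": [3, 2, 1],
    "d": [1, 2, 3],
})

# After transposing, each original column is a row, so duplicated()
# marks every column that repeats an earlier one.
is_duplicate = toy.T.duplicated()
deduplicated = toy.loc[:, ~is_duplicate.values]
print(deduplicated.columns.tolist())
```

Transposing copies the data, so on a wide dataset with many rows the explicit loop may use less memory.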

Selecting the 96 best features:

# Selecting the 96 best features according to f_classif
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
selector_fc = SelectKBest(score_func = f_classif, k = 96)
selector_fc.fit(X_train, y_train)
mask_selected = selector_fc.get_support()
# Saving the selected columns in a list
selected_col = X_train.columns[mask_selected]
# Creating datasets that include only the 96 selected features
X_train_selected = X_train[selected_col]
X_test_selected = X_test[selected_col]

Now that we have successfully rebuilt the selected datasets, we can move forward to the next steps.

2.4 Retraining the model

Because we want the model developed in Part A to output the probability of each customer being unsatisfied, we need to retrain it as we did in Part A. The good news is that we already have the optimal hyperparameters, which can be used directly in training. Let’s recap the best hyperparameters:

  • learning_rate: 0.007961566078062952;
  • n_estimators: 1397;
  • max_depth: 4;
  • min_child_weight: 5.711008778424264;
  • gamma: 0.2816441089227697;
  • subsample: 0.692708251269958;
  • colsample_bytree: 0.5079831261101071.

So let’s train the model.

import xgboost as xgb

# Generating the model with the optimized hyperparameters
clf_optimized = xgb.XGBClassifier(learning_rate = 0.007961566078062952, n_estimators = 1397,
                                  max_depth = 4, min_child_weight = 5.711008778424264,
                                  gamma = 0.2816441089227697, subsample = 0.692708251269958,
                                  colsample_bytree = 0.507983126110107, seed = 42)

# Fitting the model to the X_train_selected dataset
clf_optimized.fit(X_train_selected, y_train)

Now that we have a trained model, we can check if its performance is the same as in Part A, using the test split (X_test_selected).

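# Evaluating the performance of the model on the test data (which has not been used so far).
from sklearn.metrics import roc_auc_score
y_predicted = clf_optimized.predict_proba(X_test_selected)[:,1]
# roc_auc_score computes the ROC AUC directly from the true labels and the predicted scores
roc_auc_score(y_test, y_predicted)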
AUC of the trained model on test split data (X_test_selected)

As in Part A, the model scored an AUC of 0.8477! This means the model behaves as expected and we can continue to the next steps. But first, let’s take a look at what the model’s output looks like in probability format.

# checking the output in probability format
clf_optimized.predict_proba(X_test_selected)[:,1]
The output of the code above in probability format.

As we can see, the output is an array of probabilities between 0 and 1, where values close to 0 indicate a satisfied customer and values close to 1 an unsatisfied customer.

Now that we have a model and its output in the form we need to create the NPS system, let’s move forward.

3 Strategy & Method

3.1 Threshold selection

Now that we have a probability output that lies within the range 0 to 1, we can split this range into 5 intervals. Each interval corresponds to a satisfaction score, so knowing the probability output for a specific customer lets us assign them a satisfaction label. The question is how to split this range in a way that gives us the best NPS system. To answer it, let’s plot the distribution of probabilities for the test split data (X_test_selected).

# Plotting the distribution of probabilities for X_test_selected
fig, ax = plt.subplots(figsize = (18, 8))
ax.hist(clf_optimized.predict_proba(X_test_selected)[:,1], bins = 20);
ax.set_xlim(0, 1);
plt.xticks(np.arange(0, 1, 0.1))
plt.title('Probability distribution for unsatisfied classification', fontsize=18);
plt.ylabel('Frequency', fontsize=16);
plt.xlabel('Probability', fontsize=16);
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
Distribution of probabilities output of the test split dataset (X_test_selected).

As we can see above, most of the probabilities are below 0.1. This behaviour meets our expectations because only about 4.11% of the customers are unsatisfied.

Now let us define the threshold to classify the customer as unsatisfied. The idea is the following:

  • If probability output < threshold -> SATISFIED CUSTOMER
  • If probability output >= threshold -> UNSATISFIED CUSTOMER

So let’s plot the ROC curve in order to choose the best threshold.

# Code based on this post: https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python
import sklearn.metrics as metrics
# Calculate FPR and TPR for all thresholds
fpr, tpr, threshold = metrics.roc_curve(y_test, y_predicted)
roc_auc = metrics.auc(fpr, tpr)
# Plotting the ROC curve
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize = (20, 8))
plt.title('Receiver Operating Characteristic', fontsize=18)
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.4f' % roc_auc)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.legend(loc = 'upper left', fontsize = 16, frameon = False)
plt.plot([0, 1], [0, 1], color = 'grey', linestyle = '--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate', fontsize = 16)
plt.xlabel('False Positive Rate', fontsize = 16)
ax.plot([0.12, 0.12], [0, 0.61], color='green', linestyle='dashed', label='FPR = 0.12')
ax.plot([0, 0.12], [0.61, 0.61], color='green', linestyle='dashed', label='FPR = 0.12')
plt.text(0.13, 0.59, 'FPR = 0.12 | TPR = 0.61', fontsize = 12, color = 'green')
plt.show()
ROC curve with the best threshold marked by the green lines.

As is well known, the best threshold is the one closest to the upper-left corner of the ROC curve. In our case, this point is marked by the dashed green lines, where the False Positive Rate (FPR) = 0.12 and the True Positive Rate (TPR) = 0.61. Now, let’s use the FPR of 0.12 to find the threshold that we will use.

# Finding the threshold value for FPR = 0.12.
threshold[np.where(np.around(fpr, decimals = 2) == 0.12)[0][0]]
Threshold output of the code above.
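
One common way to formalize "closest to the upper-left corner" is to minimize the Euclidean distance from each ROC point to (FPR = 0, TPR = 1). The sketch below does this with plain NumPy on hypothetical ROC points. The helper name closest_to_corner and the toy arrays are assumptions for illustration, and the result on the real data may differ slightly from the point picked above.

```python
import numpy as np

def closest_to_corner(fpr, tpr, thresholds):
    """Return the index and threshold of the ROC point nearest to (0, 1)."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    # Euclidean distance from each (fpr, tpr) point to the ideal corner.
    distance = np.sqrt(fpr ** 2 + (1.0 - tpr) ** 2)
    best = int(np.argmin(distance))
    return best, thresholds[best]

# Hypothetical ROC points, only to show the mechanics.
toy_fpr = [0.0, 0.05, 0.12, 0.40, 1.0]
toy_tpr = [0.0, 0.30, 0.61, 0.80, 1.0]
toy_thresholds = [1.0, 0.50, 0.09, 0.03, 0.0]

idx, thr = closest_to_corner(toy_fpr, toy_tpr, toy_thresholds)
print(idx, thr)
```

With the real arrays, the call would be closest_to_corner(fpr, tpr, threshold), using the outputs of metrics.roc_curve.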

Since a customer whose output is greater than or equal to this threshold is classified as unsatisfied, we can also use this value to define the first interval, where the satisfaction score (NPS) is 1.

In summary, if the output probability is greater than or equal to 0.09166643, it represents an unsatisfied customer, who will receive label 1 in our NPS system. Conversely, if the output probability is less than 0.09166643, the customer is considered satisfied.

Now we want to label the satisfied customers. They can receive label 2, 3, 4 or 5, where 5 means most satisfied. To label these customers we will split the interval from 0 to 0.09166643 into 4 subintervals. The closer the output is to 0, the higher the NPS label. So let’s divide this interval into 4 equal parts.

# Defining the subintervals
increment = threshold[np.where(np.around(fpr, decimals = 2) == 0.12)[0][0]] / 4
label_5_lower = 1 * increment
label_4_lower = 2 * increment
label_3_lower = 3 * increment
label_2_lower = 4 * increment
# Printing the intervals
print('0 to {} - label 5'.format(label_5_lower))
print('{} to {} - label 4'.format(label_5_lower, label_4_lower))
print('{} to {} - label 3'.format(label_4_lower, label_3_lower))
print('{} to {} - label 2'.format(label_3_lower, label_2_lower))
Subintervals generated by the code above.
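
The same cut points can also be obtained in one call with np.linspace. The sketch below hard-codes the threshold value reported above instead of reading it from the ROC arrays, and the variable names are chosen only for this example.

```python
import numpy as np

threshold_value = 0.09166643  # threshold reported above

# Five edges delimit four equal-width subintervals between 0 and the threshold.
edges = np.linspace(0.0, threshold_value, 5)

# The lowest subinterval gets label 5, the one just below the threshold label 2.
for label, lower, upper in zip([5, 4, 3, 2], edges[:-1], edges[1:]):
    print('{:.8f} to {:.8f} - label {}'.format(lower, upper, label))
```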

Now that the subintervals are defined for labelling the satisfied and the unsatisfied customers, we can develop a function that will label just one customer or even the entire dataset.

3.2 Function to label the NPS

def NPS(df, predicted_probabilities):
    """Label customers with their satisfaction level by fitting the predicted
    probability of the customer being unsatisfied (Target = 1) to an interval.

    Parameters:
    df: a single instance that should receive a score, or an entire dataframe with multiple instances.
    predicted_probabilities: an element or an entire array with the probabilities generated by the model for this instance or dataset.

    Return:
    The input instance or dataframe with the 'NPS' column appended, which contains the NPS for each customer.
    """
    label_list = []
    for i in range(0, len(predicted_probabilities)):
        if predicted_probabilities[i] >= label_2_lower:
            label_list.append(1)
        elif label_3_lower <= predicted_probabilities[i] < label_2_lower:
            label_list.append(2)
        elif label_4_lower <= predicted_probabilities[i] < label_3_lower:
            label_list.append(3)
        elif label_5_lower <= predicted_probabilities[i] < label_4_lower:
            label_list.append(4)
        elif 0 <= predicted_probabilities[i] < label_5_lower:
            label_list.append(5)

    df['NPS'] = label_list
    return df
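
For large datasets, the same interval logic can be sketched without a Python loop by using np.digitize. The function name nps_labels and the toy probabilities below are assumptions for illustration. The interval boundaries follow the rule described above: four equal parts below the threshold, and label 1 at or above it.

```python
import numpy as np

def nps_labels(probabilities, unsatisfied_threshold):
    """Vectorized sketch: map probabilities of being unsatisfied to labels 1-5."""
    # Upper edges of the four "satisfied" subintervals (labels 5, 4, 3, 2).
    edges = np.linspace(0.0, unsatisfied_threshold, 5)[1:]
    # digitize returns 0 below the first edge and 4 at or above the last one.
    bins = np.digitize(probabilities, edges)
    return 5 - bins

# Toy probabilities (not model output), one per interval plus the lower bound.
toy_probs = np.array([0.0, 0.01, 0.03, 0.05, 0.08, 0.5])
print(nps_labels(toy_probs, 0.09166643))  # [5 5 4 3 2 1]
```

The returned array could be assigned to a dataframe column in the same way the NPS function does with label_list.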

4 Results Analysis

With the customer labelling function ready, we can test it on the X_test_selected dataset created in section 2.

4.1 Results on X_test (known Target)

It is important to start the analysis here because this dataset includes the target. That means we first test on data whose true labels we know, which gives us the opportunity to evaluate and understand the NPS system we developed.

# Re-adding the target feature to the test dataset
X_test['target'] = y_test
# Predicting the probabilities for the X_test_selected dataframe
y_predicted_prob = clf_optimized.predict_proba(X_test_selected)[:,1]

We now have the test split dataset with the target column re-added, so that we can check how many unsatisfied customers receive NPS 1 (a high rate is desired), as well as an array with the predicted probabilities of all customers in the test split to pass to the NPS function. With these, we can finally label all the customers and then investigate the results.

# Scoring all customers in the test split dataset
NPS(X_test, y_predicted_prob)
The output of the NPS function.

As we can see, the NPS column was added to the dataset. Since the target values are known, we can now evaluate how well the NPS labelling performed: the more customers with target = 1 receive an NPS of 1, the better the model’s performance.

# Checking the NPS distribution for unsatisfied customers.
X_test[X_test['target'] == 1].NPS.value_counts()
Distribution of NPS for unsatisfied customers.

Let’s check the number of unsatisfied customers for this dataset.

# Number of unsatisfied customers
X_test.target.sum()

And finally, we can check the rate of unsatisfied customers with NPS 1.

# Proportion of unsatisfied customers that were labelled with NPS 1.
(357 / 602) * 100
The proportion of unsatisfied customers with NPS 1.

It can be seen that, of a total of 602 unsatisfied customers, 357 received an NPS of 1, which is approximately 59.3%. Although not every unsatisfied customer receives an NPS of 1, capturing about 59% of them is already a good result.
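
Instead of typing the counts by hand, the same proportion can be derived from the dataframe itself. The sketch below uses a small toy dataframe (the values are made up) with the same 'target' and 'NPS' column names as above. It also computes the complementary view: how many of the customers labelled NPS 1 are truly unsatisfied.

```python
import pandas as pd

# Toy data (hypothetical), with the same column names used in this section.
toy = pd.DataFrame({
    "target": [1, 1, 1, 0, 0, 0, 0, 1],
    "NPS":    [1, 1, 2, 5, 1, 1, 5, 1],
})

unsatisfied = toy["target"] == 1
flagged = toy["NPS"] == 1

# Share of unsatisfied customers that received NPS 1 (the rate computed above).
share_caught = (unsatisfied & flagged).sum() / unsatisfied.sum()
# Share of NPS 1 customers that are truly unsatisfied.
share_correct = (unsatisfied & flagged).sum() / flagged.sum()
print(share_caught, share_correct)

# The figures reported in this section: 357 of 602 unsatisfied customers.
print(round(357 / 602 * 100, 1))
```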

Now let’s check the NPS distribution for the entire test split dataset.

# Checking the amount of each label in the dataset
X_test.NPS.value_counts().sort_index()
NPS distribution for the entire test split dataset.

Let’s visualize this distribution.

# Plotting the distribution of NPS
index = X_test.NPS.value_counts().sort_index().index.tolist()
values = X_test.NPS.value_counts().sort_index().values.tolist()
fig, ax = plt.subplots(figsize = (20, 8))
plt.bar(index, values);
plt.title('NPS distribution X_test', fontsize = 18);
plt.ylabel('Frequency', fontsize = 16);
plt.xlabel('Label', fontsize = 16);
NPS distribution for the X_test dataset.

Let’s calculate the proportion of NPS 1 for all the data.

# Calculating the proportion of label 1 among all customers (satisfied and unsatisfied)
print('{} %'.format((X_test.NPS.value_counts().
sort_index().values[0] / X_test.shape[0]) * 100))
Proportion of NPS 1 in the whole X_test dataset

Therefore, label 1 accounts for approximately 13.46% of the distribution, a relevant share that works well as a condition for applying the retention program to dissatisfied customers.

We can also use this proportion to guide the use of the NPS score on the test data (df_test), which our model has never seen before and which has no target label to guide the scoring!

4.2 Results on df_test (unknown data)

Finally, we can apply the NPS function to df_test and analyse it in a similar way as before.

# Predicting the probabilities for the df_test dataframe
y_predicted_prob = clf_optimized.predict_proba(df_test[selected_col])[:,1]
NPS(df_test, y_predicted_prob)
Labelled test dataset (df_test).

As mentioned above, there is no Target column for this dataset, as can be seen in the image above. This means we are working with real, unlabelled data, and that’s why we rely on the proportion found in the previous section.

Therefore, we can analyse the distribution of labels and how it compares with the NPS distribution above. We can also analyse the proportion of label 1 for the whole dataset.

# Checking the amount of each label in the dataset
df_test.NPS.value_counts().sort_index()
NPS distribution for the entire df_test dataset.

Let’s see what this distribution looks like.

NPS distribution for the df_test dataset.

Let’s calculate the proportion of NPS 1 for all the test data (df_test).

# Calculating the proportion of label 1 among all customers in df_test
print('{} %'.format((df_test.NPS.value_counts().
sort_index().values[0] / df_test.shape[0]) * 100))

Analysing the distribution of the df_test dataset, it is clear that its behaviour is very similar to that of X_test presented in the previous section. This finding is corroborated by the proportion of label 1 being approximately 12.82%, close to the 13.46% observed in the X_test dataframe.

Thus, it can be concluded that the NPS system performs satisfactorily on known data (X_test) and generalizes well to unknown data (df_test). It can be used to decide which customers should be targeted by the retention program, as well as to better understand the level of satisfaction of each customer!

5 Next steps

For further iterations on this project, in order to improve the analysis and the results, I would suggest the following points:

  • As Caio Martins (https://github.com/CaioMar/) did and suggested to me, a nice improvement would be to create a function that calculates the total profit. This is possible once we have values for TP and FP. We could then quantify the amount of money saved by using this NPS system as a double-check before applying the retention program.
  • Further improve the NPS system by researching modern methods for NPS.
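
A minimal sketch of the profit function suggested in the first point could look like the one below. The function name, its parameters and every monetary value are placeholders chosen for illustration. They do not come from this case and would have to be replaced with the real business figures.

```python
def retention_profit(tp, fp, gain_per_tp, cost_per_fp):
    """Total profit of a retention program under a simple linear assumption.

    tp: unsatisfied customers correctly targeted (true positives)
    fp: satisfied customers targeted by mistake (false positives)
    gain_per_tp: net gain for each correctly targeted customer (placeholder)
    cost_per_fp: cost wasted on each wrongly targeted customer (placeholder)
    """
    return tp * gain_per_tp - fp * cost_per_fp

# Purely illustrative numbers, not results from this case.
example_profit = retention_profit(tp=10, fp=40, gain_per_tp=90.0, cost_per_fp=10.0)
print(example_profit)
```

Evaluating this function for different thresholds would show which cut-off for NPS 1 gives the highest profit under the chosen assumptions.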

6 References

[1] P. Banerjee, Comprehensive Guide on Feature Selection, https://www.kaggle.com/prashant111/comprehensive-guide-on-feature-selection
[2] D. Beniaguev, Advanced Feature Exploration, https://www.kaggle.com/selfishgene/advanced-feature-exploration
[3] M. Filho, A forma mais simples de selecionar as melhores variáveis usando Scikit-learn, https://www.youtube.com/watch?v=Bcn5e7LYMhg&t=2027s
[4] M. Filho, Como Remover Variáveis Irrelevantes de um Modelo de Machine Learning, https://www.youtube.com/watch?v=6-mKATDSQmk&t=1454s
[5] M. Filho, Como Tunar Hiperparâmetros de Machine Learning Sem Perder Tempo, https://www.youtube.com/watch?v=WhnkeasZNHI
[6] G. Caponetto, Random Search vs Grid Search for hyperparameter optimization, https://towardsdatascience.com/random-search-vs-grid-search-for-hyperparameter-optimization-345e1422899d
[7] A. Jain, Complete Guide to Parameter Tuning in XGBoost with codes in Python, https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
[8] How to plot ROC curve in Python, https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python
[9] F. Santana, Algoritmo K-means: Aprenda essa Técnica Essêncial através de Exemplos Passo a Passo com Python, https://minerandodados.com.br/algoritmo-k-means-python-passo-passo/
[10] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, Alta Books, Rio de Janeiro, 2019, 516 p.
[11] W. McKinney, Python for data analysis, Novatec Editora Ltda, São Paulo, 2019, 613 p.

Data Science Enthusiast — linkedin.com/in/couto-pdo/