Santander Case — Part C: Clustering customers

Here you will find: Profit Optimization, Feature Scaling, Elbow Method, Silhouette Method and Clustering (KMeans).

--

The Problem

The importance of customer clustering is already widely known, and it may be even greater for a bank that wants to maximize its profits. With clustering, we can better understand the behaviour and profile of the customers. By clustering customers, we want to find out which groups contain the largest number of unsatisfied customers, so that the bank can apply the retention program to them, maximize profits and know where to focus its efforts.

The third task is to find the three natural groups with the highest expected profit, that is, the three groups with the largest number of unsatisfied customers.

1 Loading Data and Packages

# Loading packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
%matplotlib inline

# Loading the Train and Test datasets
df_train = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")

2 Data Preparation

2.1 Dataset Split (train/test)

As explained in Part A, section 3, the train_test_split method performs the split at random. Even with an extremely unbalanced dataset, the split should leave both the training and the test sets with roughly the same proportion of unsatisfied customers.
However, since a purely random split does not guarantee this exactly, we can make a stratified split based on the TARGET variable, ensuring that the proportion is the same in both datasets.

from sklearn.model_selection import train_test_split

# Splitting the dataset into 80% for train and 20% for test
X_train, X_test, y_train, y_test = train_test_split(df_train.drop('TARGET', axis = 1), df_train.TARGET,
                                                    train_size = 0.8, stratify = df_train.TARGET,
                                                    random_state = 42)
# Checking the result of the split
X_train.shape, y_train.shape[0], X_test.shape, y_test.shape[0]
Output of the code above.
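
To confirm that the stratification worked, we can compare the proportion of unsatisfied customers in each split. This quick check is not part of the original notebook, but the two values should be practically identical:

# Proportion of TARGET == 1 in each split; thanks to the stratified
# split, both values should match almost exactly
print(y_train.mean(), y_test.mean())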

2.2 Rebuilding the selected dataset

Even though this is a clustering problem, we can reuse the dataset selected for the classification task: it is much smaller than the full dataset and still robust and informative enough to be useful for clustering.

Here we need to:

  • Remove constant / semi-constant features;
  • Remove duplicate features;
  • Select only the best 96 features found in Part A.

Removing constant and semi-constant features:

# Investigating if there are constant or semi-constant features in X_train
from sklearn.feature_selection import VarianceThreshold
# Removing all features that have variance under 0.01
selector = VarianceThreshold(threshold = 0.01)
selector.fit(X_train)
mask_clean = selector.get_support()
X_train = X_train[X_train.columns[mask_clean]]

Removing duplicate features:

# Checking if there is any duplicated column
remove = []
cols = X_train.columns
for i in range(len(cols)-1):
    column = X_train[cols[i]].values
    for j in range(i+1, len(cols)):
        if np.array_equal(column, X_train[cols[j]].values):
            remove.append(cols[j])
# If so, they are dropped here
X_train.drop(remove, axis = 1, inplace=True)

Selecting the 96 best features:

# Selecting the 96 best features according to f_classif
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
selector_fc = SelectKBest(score_func = f_classif, k = 96)
selector_fc.fit(X_train, y_train)
mask_selected = selector_fc.get_support()
# Saving the selected columns in a list
selected_col = X_train.columns[mask_selected]
# Creating datasets that contain only the 96 selected features
X_train_selected = X_train[selected_col]
X_test_selected = X_test[selected_col]
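
As a quick sanity check (not in the original code), both splits should now contain exactly the 96 selected features:

# Both selected datasets should have the same 96 columns
print(X_train_selected.shape, X_test_selected.shape)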

2.3 Feature Scaling

Because the KMeans algorithm will be used, the difference in scale between the features is a big problem for the model. The algorithm is based on distance calculations, and therefore the data must be transformed to a standardised scale.

First, let’s import some important packages for the next steps.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

To scale the features, we will use standardization (StandardScaler): each feature is rescaled to have zero mean and unit variance.

# Scaling the features using standardization (fitted on the training data)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(X_train_selected)
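
As a quick sanity check (an addition, not in the original notebook), the standardized features should now have roughly zero mean and unit variance:

# Each standardized column should have mean ~0 and standard deviation ~1
print(scaled_features.mean(axis=0).round(2))
print(scaled_features.std(axis=0).round(2))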

Now that we have the data ready, we can move forward to the next step.

3 Clusters Number

To work with the KMeans algorithm, the most important part is to find the right value for K, that is, the number of natural clusters in the data. To find the best value for K, we will use two methods:

  • Elbow Method;
  • Silhouette Method;

3.1 Elbow Method

To find the number of natural groups, that is, the K value, we can use the Elbow method.

The method consists of trying out different values of K and recording the Within Cluster Sum of Squares (WCSS). We choose the value of K at the threshold where the WCSS drops sharply for smaller values of K and only slightly for larger ones. In practice, we look for the value of K where the curve forms an elbow.

Since we need to find the three groups with the highest expected profit, let's start the analysis assuming that there are at least 6 clusters in the data.

# Applying the Elbow Method to find the best K value
wcss = []

for i in range(6, 50):
    kmeans = KMeans(n_clusters = i, init = 'random')
    kmeans.fit(scaled_features)
    print(i, kmeans.inertia_)
    wcss.append(kmeans.inertia_)
# Plotting the Elbow Curve
fig, ax = plt.subplots(figsize = (20, 8))
plt.plot(range(6, 50), wcss)
plt.title('Elbow Curve', fontsize = 18)
plt.xlabel('Number of Clusters', fontsize = 16)
plt.xticks(np.arange(6, 50, 1))
plt.ylabel('Within Cluster Sum of Squares', fontsize = 16)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Elbow curve for identifying the best value of K.

The weakness of this method is that sometimes there is no evident elbow, and then we are not able to identify the best value of K. Unfortunately, that is the case here: there is no evident elbow in the curve above, which means we cannot determine the best value of K using the Elbow Method.

Let’s try out the Silhouette method.

3.2 Silhouette Method

The silhouette method computes the silhouette coefficient for all samples. This coefficient is calculated from the mean intra-cluster distance (x) and the mean nearest-cluster distance (y) of each sample:

Silhouette Coefficient = (y - x) / max(x, y)

Where:

  • x is the mean intra-cluster distance: the mean distance between the sample and the other points of its own cluster;
  • y is the mean nearest-cluster distance: the mean distance between the sample and the instances of the next closest cluster.

The coefficient varies between -1 and 1. A value close to 1 implies that the instance is well inside its own cluster and has been assigned to the right cluster, whereas a value close to -1 means that the instance has been assigned to the wrong cluster.
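
As a small illustration with made-up distances: if a sample has a mean intra-cluster distance x = 2.0 and a mean nearest-cluster distance y = 5.0, its coefficient is (5.0 - 2.0) / 5.0 = 0.6, indicating that it sits comfortably inside its own cluster.

# Illustrative calculation with hypothetical distances
x, y = 2.0, 5.0
coefficient = (y - x) / max(x, y)
print(coefficient)  # 0.6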

For this method, we want to compute the silhouette score for each value of K and pick the highest value.

from sklearn.metrics import silhouette_score

coefficients = []

for i in range(6, 50):
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(scaled_features)
    score = silhouette_score(scaled_features, kmeans.labels_)
    coefficients.append(score)
    print("For K = {}, silhouette score is {}".format(i, score))
# Plotting the Silhouette Scores
fig, ax = plt.subplots(figsize = (20, 8))
plt.plot(range(6, 50), coefficients)
plt.title('Silhouette Score for different values of K', fontsize = 18)
plt.xlabel('Value of K', fontsize = 16)
plt.xticks(np.arange(6, 50, 1))
plt.ylabel('Score', fontsize = 16)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Silhouette curve for identifying the best value of K.

As we can see from the graph, the highest coefficient value (0.5186710574329028) occurs at K = 25. Since this is the highest score, we take 25 as the number of natural clusters in this data and can now move forward to the next step.
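
If we prefer not to read the value off the graph, the best K can also be picked programmatically from the scores computed above (a small sketch using the coefficients list from the previous loop):

# Index of the highest silhouette score, mapped back to its K value
best_k = range(6, 50)[int(np.argmax(coefficients))]
print(best_k, max(coefficients))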

4 Results Analysis

Having defined that there are 25 clusters, the 3 groups that generate the highest profit are those with the largest number of unsatisfied customers, since the retention program would then be applied to a greater number of truly dissatisfied customers and generate more profit.

# Creating and training the KMeans model
kmeans = KMeans(n_clusters = 25).fit(scaled_features)
# Checking the labels
kmeans.labels_
Array with the labels for each instance of the training dataset.

Now that we have a trained model and labels for our train data, let’s create a dataframe only with the Target and the Label columns to analyse the results in a clearer way.

# Creating a DataFrame for result analysis
result_train = pd.DataFrame({'target': y_train, 'labels': kmeans.labels_})
result_train.head()
Result dataframe with train split data.

As mentioned before, the 3 groups with the highest expected profit are those that have the largest number of unsatisfied customers, because we could apply the retention program to them and earn a profit of $90 for each unsatisfied customer. So let's filter the data so that the dataframe holds only the unsatisfied customers and their labels.

# Get the distribution only for unsatisfied customers
unsatisfied_dist = result_train[result_train['target'] == 1].labels.value_counts().sort_index()

And finally, we can plot the distribution of unsatisfied customers over the clusters and find the 3 ones with the highest values.

# Plotting the unsatisfied customers distribution over the clusters
fig, ax = plt.subplots(figsize = (20, 8))
plt.bar(unsatisfied_dist.index + 1, unsatisfied_dist.values)
plt.title('Unsatisfied customer amount for each cluster', fontsize = 18)
plt.xlabel('Cluster', fontsize = 16)
plt.ylabel('Amount', fontsize = 16)
plt.xticks(range(1, 26, 1))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
Distribution of unsatisfied customers over the clusters for the train split data (X_train).

By analysing the graph above, we conclude that clusters 1, 4 and 10 clearly have the largest numbers of unsatisfied customers.
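
This visual reading can also be double-checked numerically (a quick addition; note that the raw KMeans labels are 0-based, while the bars above are shifted by one for display):

# The three clusters with the most unsatisfied customers;
# the indices shown are the raw 0-based KMeans labels
print(unsatisfied_dist.nlargest(3))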

Based on this, we can infer that these three clusters have the highest expected profit, and we should start with them in order to make the best use of our efforts and maximize profit.

This analysis was made on the train split, but we can also cluster the test split (X_test) and check whether the results match, as a way of double-checking the model on data it has never seen before.

# Scaling the test features with the scaler fitted on the training data
scaled_features_test = scaler.transform(X_test_selected)
result_test = pd.DataFrame({'target': y_test, 'labels': kmeans.predict(scaled_features_test)})
result_test.head()
Result dataframe with test split data.

Now let’s get the distribution of unsatisfied customers over the clusters for the test split data (unknown data).

# Get the distribution only for unsatisfied customers on X_test_selected
unsatisfied_dist_test = result_test[result_test['target'] == 1].labels.value_counts().sort_index()
# Plotting the unsatisfied customers distribution over the clusters
fig, ax = plt.subplots(figsize = (20, 8))
plt.bar(unsatisfied_dist_test.index + 1, unsatisfied_dist_test.values)
plt.title('Unsatisfied customer amount for each cluster', fontsize = 18)
plt.xlabel('Cluster', fontsize = 16)
plt.ylabel('Amount', fontsize = 16)
plt.xticks(range(1, 26, 1))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
Distribution of unsatisfied customers over the clusters for the test split data (X_test).

As expected, clusters 1, 4 and 10 again concentrate the largest numbers of unsatisfied customers. This is clear evidence that our model performs well on new instances!

This is the end of the case problem developed together with Santander. We have optimized models for classification and for NPS, and clustered the unsatisfied customers; with all these resources we can maximize profit as well as gain a deeper understanding of each customer's level of satisfaction!

5 Next steps

For further iterations on this project, in order to improve the analysis and the results, I would suggest two main points:

  • As Caio Martins (https://github.com/CaioMar/) did and suggested to me, a nice improvement would be to create a function that calculates the total profit. This is possible once we have values for TP and FP, so we could quantify how much money we save by using this NPS system as a double-check before applying the retention program (a minimal sketch is given after this list).
  • Another improvement is to try out other clustering algorithms, such as Agglomerative Hierarchical Clustering.
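
Below is a minimal sketch of such a profit function, assuming the $90 gain per truly unsatisfied customer retained mentioned earlier and a hypothetical cost per wrongly targeted customer (the exact cost figure would come from the previous parts of the case):

# Sketch of a total-profit function; gain_per_tp comes from the case
# statement ($90 per retained unsatisfied customer), while cost_per_fp is
# a hypothetical placeholder for the cost of contacting a satisfied customer
def total_profit(tp, fp, gain_per_tp = 90.0, cost_per_fp = 10.0):
    return tp * gain_per_tp - fp * cost_per_fp

# Example usage with made-up counts
print(total_profit(tp = 500, fp = 200))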

