Customer Data Analysis and Clustering

This notebook aims to analyze and cluster data from clients of a wholesale distributor using techniques such as Principal Component Analysis and k-means clustering.

The data is fetched from the UCI Machine Learning Repository. Visit the project GitHub repository or its Kaggle notebook.

Author: Bruce Nguyen

Table of Contents

  1. Data Cleaning
  2. Exploratory Data Analysis
  3. Principal Component Analysis (PCA)
  4. Kernel Principal Component Analysis (KPCA)
  5. Clustering

Data Cleaning

We start by reading the data and trying to get familiar with it.

To make the categorical variables more informative, the numeric codes in the Channel and Region variables were replaced with their actual meanings from the dataset description.
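As a sketch of that substitution, using `pandas.Series.map` - the code-to-label mappings below follow the UCI dataset description, while the small frame is a hypothetical stand-in for the real data:

```python
import pandas as pd

# Hypothetical sample rows mirroring the UCI wholesale customers schema
df = pd.DataFrame({
    "Channel": [1, 2, 1],
    "Region": [3, 1, 2],
    "Fresh": [12669, 7057, 6353],
})

# Replace the numeric codes with the meanings from the dataset description:
# Channel: 1 = HoReCa, 2 = Retail; Region: 1 = Lisbon, 2 = Oporto, 3 = Other
df["Channel"] = df["Channel"].map({1: "HoReCa", 2: "Retail"})
df["Region"] = df["Region"].map({1: "Lisbon", 2: "Oporto", 3: "Other"})
```

Any code not covered by the mapping dictionaries would become `NaN`, which doubles as a check that no unexpected values slipped through.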

Well, it looks like the data is clean enough, with no missing values! All that's left to do is save it for future use.

Exploratory Data Analysis

We can see that there are two types of features in this dataset: categorical and continuous. Let's create lists to keep track of them.
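One way to build those lists is to split the columns by dtype; a minimal sketch, with a hypothetical one-row frame standing in for the cleaned data:

```python
import pandas as pd

# Hypothetical frame with the dataset's columns; string-typed columns are
# categorical, numeric ones are continuous
df = pd.DataFrame({
    "Channel": ["HoReCa"], "Region": ["Other"],
    "Fresh": [12669], "Milk": [9656], "Grocery": [7561],
    "Frozen": [214], "Detergents_Paper": [2674], "Delicatessen": [1338],
})

# select_dtypes keeps the original column order
categorical_features = df.select_dtypes(include="object").columns.tolist()
continuous_features = df.select_dtypes(include="number").columns.tolist()
```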

Now, we can start exploring! We can begin by looking into the categorical features first, as they are the simplest.

A quick sanity check shows a healthy number of counts in all the categories. Looking at the first plot more closely, it is clear that the HoReCa (Hotels/Restaurants/Cafes) channel is used by far more customers than the other one. As for the regions, a large number of orders do not record a specific region, leaving us with a substantial 'Other' category. Moving on, how about we combine the two features?

In the new plot, not much has changed about the relative sizes of the categories. That's it for those two features. We now move on to the more important part - analysing the numbers! A good start is to plot them all at once for a preliminary check of their distributions.

We can see that all of the quantities follow a shape akin to a chi-squared distribution, with most of the values concentrated near 0. That said, to better understand the relative differences, we should put them in the same plot on a single scale. This can be done with violin plots.
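The key step for a single-scale plot is reshaping the data to long format; a minimal sketch with hypothetical spending rows (in the notebook, the full cleaned frame would be melted instead):

```python
import pandas as pd

# Hypothetical spending rows standing in for the cleaned dataset
df = pd.DataFrame({
    "Fresh": [12669, 7057],
    "Milk": [9656, 9810],
    "Delicatessen": [1338, 1776],
})

# Reshape to long format so every feature shares one value axis
long_df = df.melt(var_name="Feature", value_name="Spending")

# seaborn can then draw all distributions on a single scale, e.g.:
# sns.violinplot(data=long_df, x="Feature", y="Spending")
```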

Looks great! Now we know that Delicatessen has the sharpest peak of all the distributions, while Fresh has fatter tails than the rest of the features. To investigate the relationships between the continuous features, we can draw pair plots between each of them. This way, we might even get a hint of what kind of clusters we are going to get.

Overall, there is not much correlation between the features, with the exception of Grocery and Detergents_Paper. However, we can go one step further by plotting the pairs along with the labels from the categorical variables. All it takes is a single extra line of code.

We can immediately see a lot of interesting details, with clusters forming in many of the scatter plots. Notably, Retail customers buy far more Detergents_Paper (in dollars) than HoReCa ones, so this feature segments the two categories nicely. The same can be said of Grocery, though to a lesser extent. Overall, there are many differences between the two types of customers, which is worth noting from a business perspective.

Next, we also plot the continuous features against the categories within Region.

Unfortunately, there is apparently not much variability between the different categories in Region along any of the features. However, we need to dive further into the analysis before coming to any conclusion.

With the exploratory analysis out of the way, we move on to actually clustering the data.

Principal Component Analysis (PCA)

Before applying any clustering algorithm, we first need to reduce the number of dimensions in the data. This is because in high dimensions, the performance of many classical machine learning algorithms drops significantly, while the need for computational resources increases exponentially (the "curse of dimensionality").

The PCA process reduces the number of dimensions by computing the principal components that explain the most variance in the data. Think of it as a photographer trying to find the best angle for a picture of a group of people, reducing the number of dimensions while minimizing the loss of information (of course the best angle is in front of the group!).

We need to scale the data beforehand so that every feature has an equal effect on the analysis.
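A minimal sketch of that standardization with scikit-learn's `StandardScaler` (the array below is a hypothetical stand-in for the continuous features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical spending values standing in for the continuous features
X = np.array([[12669.0, 9656.0],
              [7057.0, 9810.0],
              [6353.0, 8808.0]])

# Standardize so every feature has mean 0 and unit variance,
# giving each one equal weight in the PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```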

An ad hoc test is then used to find the smallest number of components that still captures most of the variance. This works by plotting the cumulative amount of variance explained against a progressively higher number of components.
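The quantity being plotted can be computed like this - a sketch on synthetic correlated data, since the real scaled frame isn't reproduced here:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 6 features, 3 of which nearly duplicate the others,
# so only a few components carry most of the variance
base = rng.normal(size=(200, 3))
X = np.hstack([base, base + 0.1 * rng.normal(size=(200, 3))])

# Fit a full PCA and accumulate the per-component explained variance
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Plotting `cumulative` against the component count reveals the "elbow"
# where adding further components stops paying off
```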

Here, we choose to use 4 components, as, according to the plot, there is not much more to gain by using more. After this, it is simply a matter of fitting the data and computing the components!
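The fit itself is a one-liner; a sketch with a random placeholder array (the real notebook would pass the scaled wholesale features, which have 440 rows and 6 columns):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Placeholder for the scaled data: 440 customers, 6 continuous features
X_scaled = rng.normal(size=(440, 6))

# Project onto the 4 directions of greatest variance
pca = PCA(n_components=4)
components = pca.fit_transform(X_scaled)
```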

We can see that our components explain about 95% of the variance in the data, which means we have retained most of the information. Next, the relationship between the new components and the original features can be visualized using scatter plots. For convenience, I only use the first 2 principal components.

Looking at the plots, we can see some very interesting 'directions' of change in the data. Take the Fresh and Frozen plots as an example: the colors and sizes here change mostly along the y-axis, showing that the 2nd principal component explains more of the variance in these 2 features. On the other hand, the Detergents_Paper and Grocery plots see most of their changes along the x-axis, showing that they are better described by the first component.

We can also check how the components relate to the Channel and Region categorical variables. In a way, PCA is also a clustering method, as the most dominant patterns that separate groups are usually captured by PCA, especially within the first component. Nevertheless, we need to plot these features against the values of the components to find out.

In the first chart, it can already be seen that the first principal component manages to separate the values labeled Retail from those labeled HoReCa. In the second plot, things do not go as smoothly with the regions, with values from the 3 categories still clustering together.

For even more information, we add another dimension to our plot - the 3rd component - and see how much information the new component can give us visually.

The 3D space surely looks more intuitive and fun to play with! However, there is not much improvement in the new plots over the 2D ones regarding the clustering task itself.

From the 4 plots, we can see that there is a significant overall difference in customer spending between the categories of Channel, further confirming our initial assumption. On the other hand, there is, unsurprisingly, not much variability between customers from different regions, contradicting a widespread assumption in the business world.

That said, I believe that we can go even further by employing a more advanced technique: Kernel Principal Component Analysis.

Kernel Principal Component Analysis (KPCA)

KPCA extends PCA by using a kernel function, giving us access to higher-dimensional feature spaces without the computational complications that would otherwise follow. Aside from the process of computing the components itself, this section repeats many of the steps from the last one.
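A minimal sketch with scikit-learn's `KernelPCA` - the RBF kernel and `gamma` value below are illustrative assumptions, and the array is a placeholder for the scaled features:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
# Placeholder for the scaled continuous features
X_scaled = rng.normal(size=(100, 6))

# The kernel implicitly maps the data into a higher-dimensional space;
# an RBF kernel is a common default, though the best choice is data-dependent
kpca = KernelPCA(n_components=4, kernel="rbf", gamma=0.1)
X_kpca = kpca.fit_transform(X_scaled)
```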

We can already see that the direction of changes in value has become very non-linear, and the data has become much more scattered.

In this new space, the data looks comparatively more separated by category, especially with regard to Channel. However, the separation is still not very satisfactory, and I can feel the need for an extra dimension. For the other feature, the situation remains the same.

Both of these plots look much better than the ones before. Thus, I choose to perform clustering on the KPCA values in order to generate the best results. This way, the principal components become the latent variables that are fed into the clustering algorithms. All we need to do now is save the values into the data frame and create a new .csv file - for good measure.

Clustering

In this project, we will use the k-means clustering algorithm. The algorithm finds $k$ centers, one for each cluster, by minimizing the within-cluster variances. However, before actually using the algorithm, we need to find the right number of clusters - the right $k$ - by using another elbow test.
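The elbow test can be sketched as follows - synthetic blob data stands in for the saved KPCA values, and the range of candidate $k$ is an assumption:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the latent values, with 5 planted clusters
X, _ = make_blobs(n_samples=300, centers=5, n_features=4, random_state=0)

# Within-cluster variance (inertia) for an increasing number of clusters;
# inertia keeps shrinking with k, and the "elbow" is where the
# improvement starts flattening out
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 11)
]
```

Plotting `inertias` against `range(1, 11)` gives the elbow curve used to pick $k$.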

The diagnostics point us towards using 5 clusters, and our task now is simply to cluster the data with that parameter.
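A sketch of the final fit with $k=5$, again on synthetic stand-in data rather than the notebook's saved KPCA components:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in for the latent values fed into the clustering step
X, _ = make_blobs(n_samples=300, centers=5, n_features=4, random_state=0)

# Fit k-means with the k suggested by the elbow test and
# assign each customer a cluster label
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
```

The `labels` array can then be appended to the data frame as the cluster assignment for each customer.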

All is good so far! Now is the moment the numbers turn into geometric shapes. First, we plot the clusters against the Fresh, Milk, and Grocery values, as well as the Channel feature.

Even though the clusters separate the values along these variables to some extent, the segmentation is not exactly clear-cut. Next comes the same plot but with the features Frozen, Detergents_Paper, and Delicatessen.

We can see the same pattern in this plot. Nevertheless, I proceed with creating more visualizations to give us a better perspective and, in turn, insights.

This plot shows that there are different interesting patterns within each group, especially in group 1. The amount purchased in this group is significant across almost all the continuous features. To see more details, we can expand on the graph with categorical information.

We can see that some clusters have their values concentrated around the categories within either Channel or Region, such as cluster number 1. Meanwhile, other clusters have their values distributed more evenly across those features.

Now that all the auxiliary visualizations are done, we proceed to arguably the most important graph in this project: the polar plot. This plot is crucial for seeing the characteristics of each cluster with regard to how much money its customers spend on each category. To make the chart, we need to min-max scale the data first.
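A minimal sketch of that min-max scaling, applied to hypothetical per-cluster mean spending (the real notebook would scale the aggregated cluster profile instead):

```python
import pandas as pd

# Hypothetical mean spending per cluster for three features
cluster_means = pd.DataFrame(
    {"Fresh": [35000.0, 8000.0],
     "Milk": [5000.0, 12000.0],
     "Delicatessen": [1500.0, 3000.0]},
    index=["Cluster 0", "Cluster 1"],
)

# Min-max scale each feature to [0, 1] so all polar axes are comparable
scaled = (cluster_means - cluster_means.min()) / (
    cluster_means.max() - cluster_means.min()
)
```

Without this step, high-magnitude features like Fresh would dominate the polar axes and drown out the smaller ones.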

Finally, all we need to do now is plot the data!

Voilà! We can now see the characteristics of each cluster. For example, the first cluster is a group of customers who buy a lot of Fresh products - much more than any other group, in fact. Cluster 1 is another interesting grouping, with its members consuming considerably more Delicatessen. We can keep going with this type of analysis for the rest of the clusters.

All the information extracted here will help the business target its offerings and promotions, as well as develop customized marketing campaigns. The clusters formed above act as customer personae that help guide business decisions.