2 A Case Study

2.1 Introduction

k-means is an unsupervised machine learning algorithm used to find groups of observations (clusters) that share similar characteristics. What does unsupervised learning mean? It means that the observations in the data set come with no labels: there is no outcome to be predicted. We are going to use the Wine data set to cluster different types of wines. This data set contains the results of a chemical analysis of wines grown in a specific area of Italy.
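
Formally, given a fixed number of clusters $k$, the algorithm assigns each observation $x_i$ to one of $k$ clusters $C_1, \dots, C_k$ so as to minimize the total within-cluster sum of squares,

$$\min_{C_1, \dots, C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,$$

where $\mu_j$ is the centroid (mean) of cluster $C_j$.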

2.2 Loading data

First we need to load some libraries and read the data set.

# Load libraries
library(tidyverse)
library(corrplot)
library(gridExtra)
library(GGally)
library(knitr)

# Read the data set
wines <- read.csv("wine.csv")

We don’t need the Customer_Segment column: as we said before, k-means is an unsupervised machine learning algorithm and works with unlabeled data.

# Remove the Customer_Segment column
wines <- wines[, -14]
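
We can quickly confirm that 13 columns remain:

# Confirm the remaining dimensions
dim(wines)
## [1] 178  13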

Let’s get an idea of what we’re working with.

2.2.1 First rows

# First rows 
kable(head(wines))
| Wine | Alcohol | Malic.acid | Ash | Acl | Mg | Phenols | Flavanoids | Nonflavanoid.phenols | Proanth | Color.int | Hue | OD |
|------|---------|------------|------|------|-----|---------|------------|----------------------|---------|-----------|------|------|
| 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 |
| 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 |
| 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 |
| 1 | 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 |
| 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 |
| 1 | 14.20 | 1.76 | 2.45 | 15.2 | 112 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 |

2.2.2 Last rows

# Last rows 
kable(tail(wines))
|     | Wine | Alcohol | Malic.acid | Ash | Acl | Mg | Phenols | Flavanoids | Nonflavanoid.phenols | Proanth | Color.int | Hue | OD |
|-----|------|---------|------------|------|------|-----|---------|------------|----------------------|---------|-----------|------|------|
| 173 | 3 | 14.16 | 2.51 | 2.48 | 20.0 | 91 | 1.68 | 0.70 | 0.44 | 1.24 | 9.7 | 0.62 | 1.71 |
| 174 | 3 | 13.71 | 5.65 | 2.45 | 20.5 | 95 | 1.68 | 0.61 | 0.52 | 1.06 | 7.7 | 0.64 | 1.74 |
| 175 | 3 | 13.40 | 3.91 | 2.48 | 23.0 | 102 | 1.80 | 0.75 | 0.43 | 1.41 | 7.3 | 0.70 | 1.56 |
| 176 | 3 | 13.27 | 4.28 | 2.26 | 20.0 | 120 | 1.59 | 0.69 | 0.43 | 1.35 | 10.2 | 0.59 | 1.56 |
| 177 | 3 | 13.17 | 2.59 | 2.37 | 20.0 | 120 | 1.65 | 0.68 | 0.53 | 1.46 | 9.3 | 0.60 | 1.62 |
| 178 | 3 | 14.13 | 4.10 | 2.74 | 24.5 | 96 | 2.05 | 0.76 | 0.56 | 1.35 | 9.2 | 0.61 | 1.60 |

2.2.3 Summary

# Summary
kable(summary(wines))
| Wine | Alcohol | Malic.acid | Ash | Acl | Mg | Phenols | Flavanoids | Nonflavanoid.phenols | Proanth | Color.int | Hue | OD |
|------|---------|------------|-----|-----|----|---------|------------|----------------------|---------|-----------|-----|-----|
| Min. :1.000 | Min. :11.03 | Min. :0.740 | Min. :1.360 | Min. :10.60 | Min. : 70.00 | Min. :0.980 | Min. :0.340 | Min. :0.1300 | Min. :0.410 | Min. : 1.280 | Min. :0.4800 | Min. :1.270 |
| 1st Qu.:1.000 | 1st Qu.:12.36 | 1st Qu.:1.603 | 1st Qu.:2.210 | 1st Qu.:17.20 | 1st Qu.: 88.00 | 1st Qu.:1.742 | 1st Qu.:1.205 | 1st Qu.:0.2700 | 1st Qu.:1.250 | 1st Qu.: 3.220 | 1st Qu.:0.7825 | 1st Qu.:1.938 |
| Median :2.000 | Median :13.05 | Median :1.865 | Median :2.360 | Median :19.50 | Median : 98.00 | Median :2.355 | Median :2.135 | Median :0.3400 | Median :1.555 | Median : 4.690 | Median :0.9650 | Median :2.780 |
| Mean :1.938 | Mean :13.00 | Mean :2.336 | Mean :2.367 | Mean :19.49 | Mean : 99.74 | Mean :2.295 | Mean :2.029 | Mean :0.3619 | Mean :1.591 | Mean : 5.058 | Mean :0.9574 | Mean :2.612 |
| 3rd Qu.:3.000 | 3rd Qu.:13.68 | 3rd Qu.:3.083 | 3rd Qu.:2.558 | 3rd Qu.:21.50 | 3rd Qu.:107.00 | 3rd Qu.:2.800 | 3rd Qu.:2.875 | 3rd Qu.:0.4375 | 3rd Qu.:1.950 | 3rd Qu.: 6.200 | 3rd Qu.:1.1200 | 3rd Qu.:3.170 |
| Max. :3.000 | Max. :14.83 | Max. :5.800 | Max. :3.230 | Max. :30.00 | Max. :162.00 | Max. :3.880 | Max. :5.080 | Max. :0.6600 | Max. :3.580 | Max. :13.000 | Max. :1.7100 | Max. :4.000 |

2.2.4 Structure

# Structure 
str(wines)
## 'data.frame':    178 obs. of  13 variables:
##  $ Wine                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Alcohol             : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ Malic.acid          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ Ash                 : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ Acl                 : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ Mg                  : int  127 100 101 113 118 112 96 121 97 98 ...
##  $ Phenols             : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ Flavanoids          : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ Nonflavanoid.phenols: num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ Proanth             : num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ Color.int           : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ Hue                 : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ OD                  : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...

2.3 Data analysis

First we have to explore and visualize the data.

# Histogram for each Attribute
wines %>%
  gather(Attributes, value, 1:13) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_histogram(colour="black", show.legend=FALSE) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Frequency",
       title="Wines Attributes - Histograms") +
  theme_bw()

# Density plot for each Attribute
wines %>%
  gather(Attributes, value, 1:13) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_density(colour="black", alpha=0.5, show.legend=FALSE) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Density",
       title="Wines Attributes - Density plots") +
  theme_bw()

# Boxplot for each Attribute  
wines %>%
  gather(Attributes, values, c(1:5, 7:13)) %>%
  ggplot(aes(x=reorder(Attributes, values, FUN=median), y=values, fill=Attributes)) +
  geom_boxplot(show.legend=FALSE) +
  labs(title="Wines Attributes - Boxplots") +
  theme_bw() +
  theme(axis.title.y=element_blank(),
        axis.title.x=element_blank()) +
  ylim(0, 35) +
  coord_flip()

We haven’t included magnesium, since its values are much larger than those of the other attributes and would distort the visualization.

What is the relationship between the different attributes? We can use the corrplot() function to create a graphical display of a correlation matrix.

# Correlation matrix 
corrplot(cor(wines), type="upper", method="ellipse", tl.cex=0.9)

There is a strong positive linear correlation between Phenols and Flavanoids. We can model the relationship between these two variables by fitting a linear equation.

# Relationship between Phenols and Flavanoids
ggplot(wines, aes(x=Phenols, y=Flavanoids)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(title="Wines Attributes",
       subtitle="Relationship between Phenols and Flavanoids") +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'
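
The plot only draws the fitted line; to inspect the equation itself we can fit the model explicitly. A minimal sketch (the linearFit name is just for illustration):

# Fit the linear model explicitly to inspect its coefficients
linearFit <- lm(Flavanoids ~ Phenols, data=wines)
coef(linearFit)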

Now that we have carried out an exploratory data analysis, we can prepare the data in order to execute the k-means algorithm.

2.4 Data preparation

We have to normalize the variables so that they are expressed on a common scale. k-means relies on Euclidean distances between observations, so attributes measured on larger scales (such as magnesium) would otherwise dominate the distance computation.

# Normalization
winesNorm <- as.data.frame(scale(wines))
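
The scale() function computes z-scores: it subtracts each column’s mean and divides by its standard deviation. A manual equivalent, just to make the transformation explicit (winesManualNorm is an illustrative name):

# Equivalent z-score normalization by hand: (x - mean(x)) / sd(x) per column
winesManualNorm <- as.data.frame(lapply(wines, function(x) (x - mean(x)) / sd(x)))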

# Original data
p1 <- ggplot(wines, aes(x=Alcohol, y=Malic.acid)) +
  geom_point() +
  labs(title="Original data") +
  theme_bw()

# Normalized data 
p2 <- ggplot(winesNorm, aes(x=Alcohol, y=Malic.acid)) +
  geom_point() +
  labs(title="Normalized data") +
  theme_bw()

# Subplot
grid.arrange(p1, p2, ncol=2)

The points in the normalized data are arranged just as in the original data; only the scale of the axes changes.
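
Since normalization is a linear transformation of each variable, relationships between variables are preserved; for instance, the correlation matrix is unchanged:

# Correlations are invariant to centering and rescaling
all.equal(cor(wines), cor(winesNorm), check.attributes=FALSE)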

2.5 k-means execution

In this section we are going to execute the k-means algorithm and analyze the main components that the function returns.

# Execution of k-means with k=2
set.seed(1234)
wines_k2 <- kmeans(winesNorm, centers=2)
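
Note that kmeans() starts from randomly chosen centroids, so results can vary between runs (hence the set.seed() call). A common safeguard, not used for the outputs shown below, is to run several random starts and keep the best solution:

# Several random starts; kmeans() keeps the run with the lowest tot.withinss
wines_k2_stable <- kmeans(winesNorm, centers=2, nstart=25)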

The kmeans() function returns an object of class “kmeans” with information about the partition:

  • cluster. A vector of integers indicating the cluster to which each point is allocated.

  • centers. A matrix of cluster centers.

  • size. The number of points in each cluster.

# Cluster to which each point is allocated
wines_k2$cluster
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 1 2 1 1 1
##  [75] 1 1 1 2 1 1 1 1 1 2 1 1 1 1 2 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1
## [112] 1 2 1 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# Cluster centers
wines_k2$centers
##         Wine     Alcohol Malic.acid        Ash        Acl          Mg
## 1 -0.5939424  0.04186090 -0.3811653 -0.1089132 -0.2963363  0.08872942
## 2  1.0325460 -0.07277357  0.6626412  0.1893414  0.5151693 -0.15425269
##      Phenols Flavanoids Nonflavanoid.phenols    Proanth  Color.int        Hue
## 1  0.5413132  0.6003536           -0.4806503  0.4107892 -0.3117353  0.5059593
## 2 -0.9410522 -1.0436916            0.8355920 -0.7141412  0.5419399 -0.8795908
##           OD
## 1  0.6133643
## 2 -1.0663102
# Cluster size
wines_k2$size
## [1] 113  65

Additionally, the kmeans() function returns some quantities that tell us how compact each cluster is and how well separated the clusters are from each other.

  • betweenss. The between-cluster sum of squares. In an optimal segmentation, one expects this value to be as high as possible, since we would like the clusters to be well separated from each other.

  • withinss. Vector of within-cluster sum of squares, one component per cluster. In an optimal segmentation, one expects these values to be as low as possible, since we would like to have homogeneity within the clusters.

  • tot.withinss. Total within-cluster sum of squares.

  • totss. The total sum of squares.

# Between-cluster sum of squares
wines_k2$betweenss
## [1] 739.565
# Within-cluster sum of squares
wines_k2$withinss
## [1] 1039.824  521.611
# Total within-cluster sum of squares 
wines_k2$tot.withinss
## [1] 1561.435
# Total sum of squares
wines_k2$totss
## [1] 2301
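
These quantities satisfy the decomposition totss = tot.withinss + betweenss (here, 1561.435 + 739.565 = 2301), which we can verify directly:

# Verify that totss = tot.withinss + betweenss
all.equal(wines_k2$tot.withinss + wines_k2$betweenss, wines_k2$totss)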

2.6 How many clusters?

To study graphically which value of k gives us the best partition, we can plot betweenss and tot.withinss against the number of clusters k.

bss <- numeric()
wss <- numeric()

# Run the algorithm for different values of k 
set.seed(1234)

for(i in 1:10){

  # For each k, run k-means once and store betweenss and tot.withinss
  fit <- kmeans(winesNorm, centers=i)
  bss[i] <- fit$betweenss
  wss[i] <- fit$tot.withinss

}

# Between-cluster sum of squares vs Choice of k
p3 <- qplot(1:10, bss, geom=c("point", "line"), 
            xlab="Number of clusters", ylab="Between-cluster sum of squares") +
  scale_x_continuous(breaks=seq(0, 10, 1)) +
  theme_bw()

# Total within-cluster sum of squares vs Choice of k
p4 <- qplot(1:10, wss, geom=c("point", "line"),
            xlab="Number of clusters", ylab="Total within-cluster sum of squares") +
  scale_x_continuous(breaks=seq(0, 10, 1)) +
  theme_bw()

# Subplot
grid.arrange(p3, p4, ncol=2)

Which is the optimal value for k? One should choose a number of clusters such that adding another cluster doesn’t yield a much better partition of the data. At some point the gain drops off, producing an angle in the graph (the elbow criterion), and the number of clusters is chosen at that point. In our case, 3 is clearly the appropriate value for k.
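
As a numeric companion to the plots, we can compute the proportion of the total sum of squares explained between clusters for each k. Since bss and wss always add up to totss, this ratio grows with k, and its gains should flatten after the elbow at k = 3:

# Proportion of the total sum of squares explained between clusters
round(bss / (bss + wss), 3)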

2.7 Results

# Execution of k-means with k=3
set.seed(1234)

wines_k3 <- kmeans(winesNorm, centers=3)

# Mean values of each cluster
aggregate(wines, by=list(wines_k3$cluster), mean)
##   Group.1     Wine  Alcohol Malic.acid      Ash      Acl        Mg  Phenols
## 1       1 1.065574 13.69590   2.002623 2.480984 17.62623 108.00000 2.863934
## 2       2 1.970588 12.26809   1.909265 2.214706 19.77059  92.85294 2.229412
## 3       3 2.979592 13.15163   3.344490 2.434694 21.43878  99.02041 1.678163
##   Flavanoids Nonflavanoid.phenols  Proanth Color.int       Hue       OD
## 1  3.0201639            0.2913115 1.939672  5.472951 1.0691803 3.177541
## 2  2.0276471            0.3610294 1.586324  3.039265 1.0528824 2.768088
## 3  0.7979592            0.4508163 1.163061  7.343265 0.6859184 1.690204
# Clustering 
ggpairs(cbind(wines, Cluster=as.factor(wines_k3$cluster)),
        columns=1:6, aes(colour=Cluster, alpha=0.5),
        lower=list(continuous="points"),
        upper=list(continuous="blank"),
        axisLabels="none", switch="both") +
        theme_bw()

2.8 Summary

In this entry we have learned about the k-means algorithm, including data normalization before executing it, choosing the optimal number of clusters (elbow criterion), and visualizing the resulting clusters.

It has been a pleasure to write this post; I have learned a lot! Thank you for reading, and if you liked it, please upvote it.

2.9 Citations for used packages

Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Hadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse

Taiyun Wei and Viliam Simko (2017). R package “corrplot”: Visualization of a Correlation Matrix (Version 0.84). Available from https://github.com/taiyun/corrplot

Baptiste Auguie (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version 2.3. https://CRAN.R-project.org/package=gridExtra

Barret Schloerke, Jason Crowley, Di Cook, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg and Joseph Larmarange (2017). GGally: Extension to ‘ggplot2’. R package version 1.3.2. https://CRAN.R-project.org/package=GGally

Yihui Xie (2018). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.20. https://CRAN.R-project.org/package=knitr

Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963

Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595