2 A Case Study

2.1 Introduction

k-means is an unsupervised machine learning algorithm used to find groups of observations (clusters) that share similar characteristics. What does unsupervised learning mean? It means that the observations in the data set come with no labels: there is no outcome to be predicted. We are going to use the Wine data set to cluster different types of wines. This data set contains the results of a chemical analysis of wines grown in a specific area of Italy.
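
Formally, given a fixed number of clusters $k$, the algorithm assigns each observation $x_i$ to one of $k$ clusters $C_1, \dots, C_k$ so as to minimize the total within-cluster sum of squares,

$$\min_{C_1, \dots, C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,$$

where $\mu_j$ is the centroid (mean) of cluster $C_j$.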

2.2 Loading data

First we need to load some libraries and read the data set.

# Load libraries
library(tidyverse)
library(corrplot)
library(gridExtra)
library(GGally)
library(knitr)

# Read the data set
wines <- read.csv("wine.csv")

We don’t need the Customer_Segment column: as we said before, k-means is an unsupervised machine learning algorithm and works with unlabeled data.

# Remove the Customer_Segment column
wines <- wines[, -14]
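
We can quickly confirm that 13 columns remain:

# Confirm the remaining dimensions
dim(wines)
## [1] 178  13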

Let’s get an idea of what we’re working with.

2.2.1 First rows

# First rows 
kable(head(wines))
| Wine | Alcohol | Malic.acid | Ash | Acl | Mg | Phenols | Flavanoids | Nonflavanoid.phenols | Proanth | Color.int | Hue | OD |
|------|---------|------------|------|------|-----|---------|------------|----------------------|---------|-----------|------|------|
| 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 |
| 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 |
| 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 |
| 1 | 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 |
| 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 |
| 1 | 14.20 | 1.76 | 2.45 | 15.2 | 112 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 |

2.2.2 Last rows

# Last rows 
kable(tail(wines))
|     | Wine | Alcohol | Malic.acid | Ash | Acl | Mg | Phenols | Flavanoids | Nonflavanoid.phenols | Proanth | Color.int | Hue | OD |
|-----|------|---------|------------|------|------|-----|---------|------------|----------------------|---------|-----------|------|------|
| 173 | 3 | 14.16 | 2.51 | 2.48 | 20.0 | 91 | 1.68 | 0.70 | 0.44 | 1.24 | 9.7 | 0.62 | 1.71 |
| 174 | 3 | 13.71 | 5.65 | 2.45 | 20.5 | 95 | 1.68 | 0.61 | 0.52 | 1.06 | 7.7 | 0.64 | 1.74 |
| 175 | 3 | 13.40 | 3.91 | 2.48 | 23.0 | 102 | 1.80 | 0.75 | 0.43 | 1.41 | 7.3 | 0.70 | 1.56 |
| 176 | 3 | 13.27 | 4.28 | 2.26 | 20.0 | 120 | 1.59 | 0.69 | 0.43 | 1.35 | 10.2 | 0.59 | 1.56 |
| 177 | 3 | 13.17 | 2.59 | 2.37 | 20.0 | 120 | 1.65 | 0.68 | 0.53 | 1.46 | 9.3 | 0.60 | 1.62 |
| 178 | 3 | 14.13 | 4.10 | 2.74 | 24.5 | 96 | 2.05 | 0.76 | 0.56 | 1.35 | 9.2 | 0.61 | 1.60 |

2.2.3 Summary

# Summary
kable(summary(wines))
| Wine | Alcohol | Malic.acid | Ash | Acl | Mg | Phenols | Flavanoids | Nonflavanoid.phenols | Proanth | Color.int | Hue | OD |
|------|---------|------------|-----|-----|----|---------|------------|----------------------|---------|-----------|-----|-----|
| Min. :1.000 | Min. :11.03 | Min. :0.740 | Min. :1.360 | Min. :10.60 | Min. : 70.00 | Min. :0.980 | Min. :0.340 | Min. :0.1300 | Min. :0.410 | Min. : 1.280 | Min. :0.4800 | Min. :1.270 |
| 1st Qu.:1.000 | 1st Qu.:12.36 | 1st Qu.:1.603 | 1st Qu.:2.210 | 1st Qu.:17.20 | 1st Qu.: 88.00 | 1st Qu.:1.742 | 1st Qu.:1.205 | 1st Qu.:0.2700 | 1st Qu.:1.250 | 1st Qu.: 3.220 | 1st Qu.:0.7825 | 1st Qu.:1.938 |
| Median :2.000 | Median :13.05 | Median :1.865 | Median :2.360 | Median :19.50 | Median : 98.00 | Median :2.355 | Median :2.135 | Median :0.3400 | Median :1.555 | Median : 4.690 | Median :0.9650 | Median :2.780 |
| Mean :1.938 | Mean :13.00 | Mean :2.336 | Mean :2.367 | Mean :19.49 | Mean : 99.74 | Mean :2.295 | Mean :2.029 | Mean :0.3619 | Mean :1.591 | Mean : 5.058 | Mean :0.9574 | Mean :2.612 |
| 3rd Qu.:3.000 | 3rd Qu.:13.68 | 3rd Qu.:3.083 | 3rd Qu.:2.558 | 3rd Qu.:21.50 | 3rd Qu.:107.00 | 3rd Qu.:2.800 | 3rd Qu.:2.875 | 3rd Qu.:0.4375 | 3rd Qu.:1.950 | 3rd Qu.: 6.200 | 3rd Qu.:1.1200 | 3rd Qu.:3.170 |
| Max. :3.000 | Max. :14.83 | Max. :5.800 | Max. :3.230 | Max. :30.00 | Max. :162.00 | Max. :3.880 | Max. :5.080 | Max. :0.6600 | Max. :3.580 | Max. :13.000 | Max. :1.7100 | Max. :4.000 |

2.2.4 Structure

# Structure 
str(wines)
## 'data.frame':    178 obs. of  13 variables:
##  $ Wine                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Alcohol             : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ Malic.acid          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ Ash                 : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ Acl                 : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ Mg                  : int  127 100 101 113 118 112 96 121 97 98 ...
##  $ Phenols             : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ Flavanoids          : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ Nonflavanoid.phenols: num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ Proanth             : num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ Color.int           : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ Hue                 : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ OD                  : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...

2.3 Data analysis

First we have to explore and visualize the data.

# Histogram for each Attribute
wines %>%
  gather(Attributes, value, 1:13) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_histogram(colour="black", show.legend=FALSE) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Frequency",
       title="Wines Attributes - Histograms") +
  theme_bw()

# Density plot for each Attribute
wines %>%
  gather(Attributes, value, 1:13) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_density(colour="black", alpha=0.5, show.legend=FALSE) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Density",
       title="Wines Attributes - Density plots") +
  theme_bw()

# Boxplot for each Attribute  
wines %>%
  gather(Attributes, values, c(1:5, 7:13)) %>%
  ggplot(aes(x=reorder(Attributes, values, FUN=median), y=values, fill=Attributes)) +
  geom_boxplot(show.legend=FALSE) +
  labs(title="Wines Attributes - Boxplots") +
  theme_bw() +
  theme(axis.title.y=element_blank(),
        axis.title.x=element_blank()) +
  ylim(0, 35) +
  coord_flip()

We haven’t included magnesium, since its values are much larger than those of the other attributes and would distort the visualization.

What is the relationship between the different attributes? We can use the corrplot() function to create a graphical display of a correlation matrix.

# Correlation matrix 
corrplot(cor(wines), type="upper", method="ellipse", tl.cex=0.9)

There is a strong positive linear correlation between Phenols and Flavanoids. We can model the relationship between these two variables by fitting a linear equation.

# Relationship between Phenols and Flavanoids
ggplot(wines, aes(x=Phenols, y=Flavanoids)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(title="Wines Attributes",
       subtitle="Relationship between Phenols and Flavanoids") +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'
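
The plot only draws the fitted line; to inspect the equation itself we can fit the model explicitly. A minimal sketch (the linearFit name is just for illustration):

# Fit the linear model explicitly to inspect its coefficients
linearFit <- lm(Flavanoids ~ Phenols, data=wines)
coef(linearFit)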

Now that we have carried out an exploratory data analysis, we can prepare the data in order to execute the k-means algorithm.

2.4 Data preparation

We have to normalize the variables so that they are expressed on a common scale. k-means relies on Euclidean distances between observations, so attributes measured on larger scales (such as magnesium) would otherwise dominate the distance computation.

# Normalization
winesNorm <- as.data.frame(scale(wines))
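
The scale() function computes z-scores: it subtracts each column’s mean and divides by its standard deviation. A manual equivalent, just to make the transformation explicit (winesManualNorm is an illustrative name):

# Equivalent z-score normalization by hand: (x - mean(x)) / sd(x) per column
winesManualNorm <- as.data.frame(lapply(wines, function(x) (x - mean(x)) / sd(x)))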

# Original data
p1 <- ggplot(wines, aes(x=Alcohol, y=Malic.acid)) +
  geom_point() +
  labs(title="Original data") +
  theme_bw()

# Normalized data 
p2 <- ggplot(winesNorm, aes(x=Alcohol, y=Malic.acid)) +
  geom_point() +
  labs(title="Normalized data") +
  theme_bw()

# Subplot
grid.arrange(p1, p2, ncol=2)

The points in the normalized data are arranged just as in the original data; only the scale of the axes changes.
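
Since normalization is a linear transformation of each variable, relationships between variables are preserved; for instance, the correlation matrix is unchanged:

# Correlations are invariant to centering and rescaling
all.equal(cor(wines), cor(winesNorm), check.attributes=FALSE)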

2.5 k-means execution

In this section we are going to execute the k-means algorithm and analyze the main components that the function returns.

# Execution of k-means with k=2
set.seed(1234)
wines_k2 <- kmeans(winesNorm, centers=2)
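
Note that kmeans() starts from randomly chosen centroids, so results can vary between runs (hence the set.seed() call). A common safeguard, not used for the outputs shown below, is to run several random starts and keep the best solution:

# Several random starts; kmeans() keeps the run with the lowest tot.withinss
wines_k2_stable <- kmeans(winesNorm, centers=2, nstart=25)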

The kmeans() function returns an object of class “kmeans” with information about the partition:

  • cluster. A vector of integers indicating the cluster to which each point is allocated.

  • centers. A matrix of cluster centers.

  • size. The number of points in each cluster.

# Cluster to which each point is allocated
wines_k2$cluster
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 1 2 1 1 1
##  [75] 1 1 1 2 1 1 1 1 1 2 1 1 1 1 2 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1
## [112] 1 2 1 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# Cluster centers
wines_k2$centers
##         Wine     Alcohol Malic.acid        Ash        Acl          Mg
## 1 -0.5939424  0.04186090 -0.3811653 -0.1089132 -0.2963363  0.08872942
## 2  1.0325460 -0.07277357  0.6626412  0.1893414  0.5151693 -0.15425269
##      Phenols Flavanoids Nonflavanoid.phenols    Proanth  Color.int        Hue
## 1  0.5413132  0.6003536           -0.4806503  0.4107892 -0.3117353  0.5059593
## 2 -0.9410522 -1.0436916            0.8355920 -0.7141412  0.5419399 -0.8795908
##           OD
## 1  0.6133643
## 2 -1.0663102
# Cluster size
wines_k2$size
## [1] 113  65

Additionally, the kmeans() function returns some quantities that tell us how compact each cluster is and how well separated the clusters are from each other.

  • betweenss. The between-cluster sum of squares. In an optimal segmentation, one expects this value to be as high as possible, since we would like the clusters to be well separated from each other.

  • withinss. Vector of within-cluster sum of squares, one component per cluster. In an optimal segmentation, one expects these values to be as low as possible, since we would like to have homogeneity within the clusters.

  • tot.withinss. Total within-cluster sum of squares.

  • totss. The total sum of squares.

# Between-cluster sum of squares
wines_k2$betweenss
## [1] 739.565
# Within-cluster sum of squares
wines_k2$withinss
## [1] 1039.824  521.611
# Total within-cluster sum of squares 
wines_k2$tot.withinss
## [1] 1561.435
# Total sum of squares
wines_k2$totss
## [1] 2301
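
These quantities satisfy the decomposition totss = tot.withinss + betweenss (here, 1561.435 + 739.565 = 2301), which we can verify directly:

# Verify that totss = tot.withinss + betweenss
all.equal(wines_k2$tot.withinss + wines_k2$betweenss, wines_k2$totss)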

2.6 How many clusters?

To study graphically which value of k gives us the best partition, we can plot betweenss and tot.withinss against the number of clusters k.

bss <- numeric()
wss <- numeric()

# Run the algorithm for different values of k 
set.seed(1234)

for(i in 1:10){

  # For each k, run k-means once and store betweenss and tot.withinss
  fit <- kmeans(winesNorm, centers=i)
  bss[i] <- fit$betweenss
  wss[i] <- fit$tot.withinss

}

# Between-cluster sum of squares vs Choice of k
p3 <- qplot(1:10, bss, geom=c("point", "line"), 
            xlab="Number of clusters", ylab="Between-cluster sum of squares") +
  scale_x_continuous(breaks=seq(0, 10, 1)) +
  theme_bw()

# Total within-cluster sum of squares vs Choice of k
p4 <- qplot(1:10, wss, geom=c("point", "line"),
            xlab="Number of clusters", ylab="Total within-cluster sum of squares") +
  scale_x_continuous(breaks=seq(0, 10, 1)) +
  theme_bw()

# Subplot
grid.arrange(p3, p4, ncol=2)

Which is the optimal value for k? One should choose a number of clusters such that adding another cluster doesn’t yield a much better partition of the data. At some point the gain drops off, producing an angle in the graph (the elbow criterion), and the number of clusters is chosen at that point. In our case, 3 is clearly the appropriate value for k.
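
As a numeric companion to the plots, we can compute the proportion of the total sum of squares explained between clusters for each k. Since bss and wss always add up to totss, this ratio grows with k, and its gains should flatten after the elbow at k = 3:

# Proportion of the total sum of squares explained between clusters
round(bss / (bss + wss), 3)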

2.7 Results

# Execution of k-means with k=3
set.seed(1234)

wines_k3 <- kmeans(winesNorm, centers=3)

# Mean values of each cluster
aggregate(wines, by=list(wines_k3$cluster), mean)
##   Group.1     Wine  Alcohol Malic.acid      Ash      Acl        Mg  Phenols
## 1       1 1.065574 13.69590   2.002623 2.480984 17.62623 108.00000 2.863934
## 2       2 1.970588 12.26809   1.909265 2.214706 19.77059  92.85294 2.229412
## 3       3 2.979592 13.15163   3.344490 2.434694 21.43878  99.02041 1.678163
##   Flavanoids Nonflavanoid.phenols  Proanth Color.int       Hue       OD
## 1  3.0201639            0.2913115 1.939672  5.472951 1.0691803 3.177541
## 2  2.0276471            0.3610294 1.586324  3.039265 1.0528824 2.768088
## 3  0.7979592            0.4508163 1.163061  7.343265 0.6859184 1.690204
# Clustering 
ggpairs(cbind(wines, Cluster=as.factor(wines_k3$cluster)),
        columns=1:6, aes(colour=Cluster, alpha=0.5),
        lower=list(continuous="points"),
        upper=list(continuous="blank"),
        axisLabels="none", switch="both") +
        theme_bw()

2.8 Summary

In this entry we have learned about the k-means algorithm, including data normalization before executing it, choosing the optimal number of clusters (elbow criterion), and visualizing the resulting clusters.

It has been a pleasure to write this post; I have learned a lot! Thank you for reading, and if you liked it, please upvote it.

2.9 Citations for used packages

Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Hadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse

Taiyun Wei and Viliam Simko (2017). R package “corrplot”: Visualization of a Correlation Matrix (Version 0.84). Available from https://github.com/taiyun/corrplot

Baptiste Auguie (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version 2.3. https://CRAN.R-project.org/package=gridExtra

Barret Schloerke, Jason Crowley, Di Cook, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg and Joseph Larmarange (2017). GGally: Extension to ‘ggplot2’. R package version 1.3.2. https://CRAN.R-project.org/package=GGally

Yihui Xie (2018). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.20. https://CRAN.R-project.org/package=knitr

Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963

Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595