Principal Component Analysis with R
1 PREPARING THE DATA
1.1 Load the data
Load the heart disease dataset.
1.2 Read background information and variable descriptions
1.3 Data preprocessing
Data preprocessing is an integral first step in machine learning workflows. Because different algorithms sometimes require the moving parts to be coded in slightly different ways, always research the algorithm you want to implement so that you properly set up your \(y\) and \(x\) variables and split your data appropriately.
NOTE: also, use the save function to save your variables of interest. In the remaining walkthroughs, we will use the load function to load the relevant variables.
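For example, a minimal save/load round trip (the temporary file path and toy data frame here are just for illustration; in the walkthroughs you would save your actual preprocessed variables to a project file):

```r
# Save an object to a file, then restore it later with load().
path = tempfile(fileext = ".RData")  # illustrative path; use a project path in practice
df = data.frame(x = 1:3)
save(df, file = path)

rm(df)      # simulate a fresh session
load(path)  # restores `df` into the workspace under its original name
```
Note that load() restores objects under the names they were saved with, which is why the walkthroughs can refer to the same variable names after loading.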
1.3.1 What is one-hot encoding?
One additional preprocessing aspect to consider: datasets that contain factor (categorical) features should typically be expanded into numeric indicator columns (this is often referred to as one-hot encoding). You can do this manually with the model.matrix function in R. This makes it easier to apply a variety of algorithms to a dataset, since many algorithms handle factors poorly (decision trees being the main exception). Doing this manually is always good practice; in general, however, functions like lm will do it for you automatically.
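As a minimal sketch of what model.matrix does (the toy cp column below is hypothetical, standing in for a categorical feature like the one in this dataset):

```r
# Toy factor column standing in for a categorical feature.
df = data.frame(cp = factor(c(0, 2, 1, 3)))

# model.matrix() expands the factor into 0/1 indicator columns.
# The first factor level becomes the reference category, and we drop
# the intercept column with [, -1] to keep only the indicators.
indicators = model.matrix(~ cp, data = df)[, -1]
colnames(indicators)  # "cp1" "cp2" "cp3"
```
A row whose cp value equals the reference level ("0") gets a 0 in every indicator column, which is why a k-level factor needs only k - 1 indicators.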
1.4 Handling missing data
Missing values need to be handled somehow. Listwise deletion (deleting any row with at least one missing value) is common, but this method throws out a lot of useful information. Many advocate for mean imputation, but arithmetic means are sensitive to outliers. Still others advocate for chained-equation, Bayesian, or expectation-maximization imputation (e.g., the mice and Amelia II R packages). K-nearest neighbor imputation can also be useful, but median imputation is demonstrated below.
However, you will also want to learn about generalized low rank models for missing data imputation in your research. See the impute_missing_values function from the ck37r package to learn more (you might need to install an h2o dependency).
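The core idea behind the median imputation demonstrated below can be sketched in a few lines of base R (the vector here is made up for illustration):

```r
# Made-up numeric vector with two missing values.
x = c(2.3, NA, 1.4, 0.8, NA)

# Missingness indicator: 1 where x was missing, 0 otherwise.
# Computed BEFORE imputation, so the locations are preserved.
miss_x = as.integer(is.na(x))

# Replace the NAs with the median of the observed values (1.4 here).
x[is.na(x)] = median(x, na.rm = TRUE)
```
The indicator column records where values were imputed, so downstream models can still learn from the missingness pattern itself.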
First, count the number of missing values across variables in our dataset.
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
We have no missing values, so let’s introduce a few to the “oldpeak” feature for this example to see how it works:
# Introduce five missing values into oldpeak, at row numbers 50, 100, 150, 200, 250
data$oldpeak[c(50, 100, 150, 200, 250)] = NA
colSums(is.na(data))
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 5 0 0 0 0
The proportion of missing values in each feature (e.g., via colMeans(is.na(data))) tells the same story:
## age sex cp trestbps chol fbs restecg
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## thalach exang oldpeak slope ca thal target
## 0.00000000 0.00000000 0.01650165 0.00000000 0.00000000 0.00000000 0.00000000
There are now 5 missing values in the “oldpeak” feature. Now, median-impute the missing values! We also want to create missingness indicators to record where the missing data were located. These are additional columns added to the data frame, one per imputed feature: 0 means the value was present, 1 means it was missing (and subsequently imputed).
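The imputation call itself is not echoed in this document; based on the messages and the result components printed below, it was presumably along these lines (the exact arguments are assumptions):

```r
# Median-impute numeric features and generate missingness indicators.
# Argument names are assumptions inferred from the printed output below.
result = ck37r::impute_missing_values(data, add_indicators = TRUE, verbose = TRUE)

# Inspect the components of the returned list.
names(result)
```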
## Found 1 variables with NAs.
## Running standard imputation.
## Imputing oldpeak (10 numeric) with 5 NAs. Impute value: 0.8
## Generating missingness indicators.
## Generating 1 missingness indicators.
## Checking for collinearity of indicators.
## Final number of indicators: 1
## Indicators added (1): miss_oldpeak
## [1] "type" "add_indicators" "skip_vars" "prefix"
## [5] "impute_values" "indicators_added" "data"
# Use the imputed dataframe.
data = result$data
# View new columns. Note that the indicator feature "miss_oldpeak" has been added as the last column of our data frame.
str(data)
## 'data.frame': 303 obs. of 15 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps : int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
## $ miss_oldpeak: int 0 0 0 0 0 0 0 0 0 0 ...
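Finally, re-count the missing values; the output below was presumably produced by the same check used earlier:

```r
# Confirm no missing values remain after imputation.
colSums(is.na(data))
```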
## age sex cp trestbps chol fbs
## 0 0 0 0 0 0
## restecg thalach exang oldpeak slope ca
## 0 0 0 0 0 0
## thal target miss_oldpeak
## 0 0 0
Since the “sex”, “ca”, “cp”, “slope”, and “thal” features are currently stored as integers, convert them to factors. The other relevant variables are either continuous or are already indicators (just 1’s and 0’s).
data = ck37r::categoricals_to_factors(data,
categoricals = c("sex", "ca", "cp", "slope", "thal"),
verbose = TRUE)
## Converting sex from integer to factor. Unique vals: 2
## Converting ca from integer to factor. Unique vals: 5
## Converting cp from integer to factor. Unique vals: 4
## Converting slope from integer to factor. Unique vals: 3
## Converting thal from integer to factor. Unique vals: 4
## 'data.frame': 303 obs. of 15 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 2 2 2 ...
## $ cp : Factor w/ 4 levels "0","1","2","3": 4 3 2 2 1 1 2 2 3 3 ...
## $ trestbps : int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : Factor w/ 3 levels "0","1","2": 1 1 3 3 3 2 2 3 3 3 ...
## $ ca : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ thal : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
## $ miss_oldpeak: int 0 0 0 0 0 0 0 0 0 0 ...
1.5 Defining y outcome vectors and x feature dataframes
1.5.1 Convert factors to indicators
Now expand “sex”, “ca”, “cp”, “slope”, and “thal” features out into indicators.
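The conversion call is not echoed here; judging from the output below, it presumably resembled the following (a sketch, assuming ck37r's factors_to_indicators returns its expanded data frame in a $data element, as the earlier imputation step did):

```r
# Expand all factor columns into indicator (one-hot) columns.
result = ck37r::factors_to_indicators(data, verbose = TRUE)

# Use the expanded data frame.
data = result$data
str(data)
```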
## Converting factors (5): sex, cp, slope, ca, thal
## Converting sex from a factor to a matrix (2 levels).
## : sex_1
## Converting cp from a factor to a matrix (4 levels).
## : cp_1 cp_2 cp_3
## Converting slope from a factor to a matrix (3 levels).
## : slope_1 slope_2
## Converting ca from a factor to a matrix (5 levels).
## : ca_1 ca_2 ca_3 ca_4
## Converting thal from a factor to a matrix (4 levels).
## : thal_1 thal_2 thal_3
## Combining factor matrices into a data frame.
## 'data.frame': 303 obs. of 23 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ trestbps : int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
## $ miss_oldpeak: int 0 0 0 0 0 0 0 0 0 0 ...
## $ sex_1 : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp_1 : int 0 0 1 1 0 0 1 1 0 0 ...
## $ cp_2 : int 0 1 0 0 0 0 0 0 1 1 ...
## $ cp_3 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ slope_1 : int 0 0 0 0 0 1 1 0 0 0 ...
## $ slope_2 : int 0 0 1 1 1 0 0 1 1 1 ...
## $ ca_1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ca_2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ca_3 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ca_4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal_1 : int 1 0 0 0 0 1 0 0 0 0 ...
## $ thal_2 : int 0 1 1 1 1 0 1 0 0 1 ...
## $ thal_3 : int 0 0 0 0 0 0 0 1 1 0 ...
What happened? Each factor was expanded into one indicator column per level, minus the first (reference) level: “sex” (2 levels) became sex_1, “cp” (4 levels) became cp_1 through cp_3, “slope” (3 levels) became slope_1 and slope_2, and so on. Dropping the reference level avoids perfectly collinear indicator columns: a row belonging to the reference level is simply 0 in all of that factor’s indicators.