1 PREPARING THE DATA

library(dplyr)
library(PCAmixdata)
library(rio)
library(ck37r)
library(ggplot2)

1.1 Load the data

Load the heart disease dataset.

# Load the heart disease dataset using import() from the rio package.
data_original = import("data-raw/heart.csv")

# Preserve the original copy
data = data_original
str(data)

1.2 Read background information and variable descriptions

https://archive.ics.uci.edu/ml/datasets/heart+Disease

1.3 Data preprocessing

Data preprocessing is an integral first step in machine learning workflows. Because different algorithms sometimes require the inputs to be coded in slightly different ways, always research the algorithm you want to implement so that you properly set up your \(y\) and \(x\) variables and split your data appropriately.

NOTE: use the save function to save your variables of interest. In the remaining walkthroughs, we will use the load function to load the relevant variables.

1.3.1 What is one-hot encoding?

One additional preprocessing aspect to consider: datasets that contain factor (categorical) features should typically be expanded out into numeric indicator variables (this is often referred to as one-hot encoding). You can do this manually with the model.matrix R function. Converting factors to indicators makes it easier to apply a variety of algorithms to a dataset, since many algorithms handle factors poorly (decision trees being the main exception). Doing this manually is good practice. In general, however, functions like lm will do it for you automatically.
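As a minimal, self-contained sketch (using a mock factor rather than the full heart dataset), model.matrix expands a factor into indicator columns:

```r
# Mock categorical feature standing in for e.g. the "cp" column.
cp = factor(c(0, 2, 1, 1, 3))

# model.matrix() builds a design matrix; dropping the intercept
# column leaves one indicator per non-reference level.
indicators = model.matrix(~ cp)[, -1]
indicators
```

Note that the reference level (here "0") gets no column of its own; a row of all zeros encodes it.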

1.4 Handling missing data

Missing values need to be handled somehow. Listwise deletion (deleting any row with at least one missing value) is common, but it throws out a lot of useful information. Many advocate for mean imputation, but arithmetic means are sensitive to outliers. Still others advocate for chained equation, Bayesian, or expectation-maximization imputation (e.g., the mice and Amelia II R packages). K-nearest neighbors imputation can also be useful, but median imputation is demonstrated below.
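As a toy base-R sketch of the tradeoff (mock data, not the heart dataset): listwise deletion shrinks the data, while median imputation retains every row:

```r
# Toy data frame with one missing value in each column.
df = data.frame(a = c(1, NA, 3, 4),
                b = c(10, 20, NA, 40))

# Listwise deletion: only the fully observed rows survive.
nrow(na.omit(df))  # 2 of 4 rows remain

# Median imputation: fill each NA with that column's observed median.
df$a[is.na(df$a)] = median(df$a, na.rm = TRUE)
df$b[is.na(df$b)] = median(df$b, na.rm = TRUE)
nrow(df)           # all 4 rows retained, no NAs left
```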

You will also want to learn about Generalized Low Rank Models for missing data imputation in your research. See the impute_missing_values function from the ck37r package to learn more; note that you might need to install an h2o dependency.

First, count the number of missing values across variables in our dataset.

colSums(is.na(data))
##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        0        0        0

We have no missing values, so let’s introduce a few into the “oldpeak” feature to see how imputation works:

# Introduce five missing values into oldpeak at rows 50, 100, 150, 200, 250
data$oldpeak[c(50, 100, 150, 200, 250)] = NA

colSums(is.na(data))
##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        5        0        0        0        0
colMeans(is.na(data))
##        age        sex         cp   trestbps       chol        fbs    restecg 
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 
##    thalach      exang    oldpeak      slope         ca       thal     target 
## 0.00000000 0.00000000 0.01650165 0.00000000 0.00000000 0.00000000 0.00000000

There are now 5 missing values in the “oldpeak” feature. Now, median impute the missing values! We also want to create missingness indicators to record the locations of the missing data. These are additional columns added to the data frame, one per feature with missingness: 0 means the value was present, 1 means it was missing (and subsequently imputed).
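The same two steps can be sketched in base R on a toy vector (the ck37r helper below does this for every column at once):

```r
# Toy vector with two missing values.
x = c(2.3, NA, 1.4, 0.8, NA)

# Step 1: record where the values were missing (1 = missing).
miss_x = as.integer(is.na(x))

# Step 2: impute the median of the observed values.
x[is.na(x)] = median(x, na.rm = TRUE)

cbind(x, miss_x)
```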

result = ck37r::impute_missing_values(data, verbose = TRUE, type = "standard")
## Found 1 variables with NAs.
## Running standard imputation.
## Imputing oldpeak (10 numeric) with 5 NAs. Impute value: 0.8 
## Generating missingness indicators.
## Generating 1 missingness indicators.
## Checking for collinearity of indicators.
## Final number of indicators: 1 
## Indicators added (1): miss_oldpeak
names(result)
## [1] "type"             "add_indicators"   "skip_vars"        "prefix"          
## [5] "impute_values"    "indicators_added" "data"
# Use the imputed dataframe.
data = result$data

# View new columns. Note that the indicator feature "miss_oldpeak" has been added as the last column of our data frame. 
str(data)
## 'data.frame':    303 obs. of  15 variables:
##  $ age         : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex         : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp          : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps    : int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol        : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs         : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg     : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach     : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang       : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak     : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope       : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal        : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ miss_oldpeak: int  0 0 0 0 0 0 0 0 0 0 ...
# No more missing data!
colSums(is.na(data))
##          age          sex           cp     trestbps         chol          fbs 
##            0            0            0            0            0            0 
##      restecg      thalach        exang      oldpeak        slope           ca 
##            0            0            0            0            0            0 
##         thal       target miss_oldpeak 
##            0            0            0

Since the “sex”, “ca”, “cp”, “slope”, and “thal” features are currently integer type, convert them to factors. The other relevant variables are either continuous or are already indicators (just 1’s and 0’s).

data = ck37r::categoricals_to_factors(data,
              categoricals = c("sex", "ca", "cp", "slope", "thal"),
              verbose = TRUE)
## Converting sex from integer to factor. Unique vals: 2 
## Converting ca from integer to factor. Unique vals: 5 
## Converting cp from integer to factor. Unique vals: 4 
## Converting slope from integer to factor. Unique vals: 3 
## Converting thal from integer to factor. Unique vals: 4
# Inspect the updated data frame
str(data)
## 'data.frame':    303 obs. of  15 variables:
##  $ age         : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex         : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 2 2 2 ...
##  $ cp          : Factor w/ 4 levels "0","1","2","3": 4 3 2 2 1 1 2 2 3 3 ...
##  $ trestbps    : int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol        : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs         : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg     : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach     : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang       : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak     : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope       : Factor w/ 3 levels "0","1","2": 1 1 3 3 3 2 2 3 3 3 ...
##  $ ca          : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ thal        : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
##  $ target      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ miss_oldpeak: int  0 0 0 0 0 0 0 0 0 0 ...
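The same conversion can be sketched in base R with lapply (using a mock two-column data frame so the example is self-contained):

```r
# Mock data frame with integer-coded categorical features.
df = data.frame(cp = c(3L, 2L, 1L, 1L),
                slope = c(0L, 0L, 2L, 2L))

# Coerce the selected columns to factors in one pass.
vars = c("cp", "slope")
df[vars] = lapply(df[vars], factor)

sapply(df, class)  # both columns are now "factor"
```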

1.5 Defining y outcome vectors and x feature dataframes

1.5.1 Convert factors to indicators

Now expand “sex”, “ca”, “cp”, “slope”, and “thal” features out into indicators.

result = ck37r::factors_to_indicators(data, verbose = TRUE)
## Converting factors (5): sex, cp, slope, ca, thal
## Converting sex from a factor to a matrix (2 levels).
## : sex_1 
## Converting cp from a factor to a matrix (4 levels).
## : cp_1 cp_2 cp_3 
## Converting slope from a factor to a matrix (3 levels).
## : slope_1 slope_2 
## Converting ca from a factor to a matrix (5 levels).
## : ca_1 ca_2 ca_3 ca_4 
## Converting thal from a factor to a matrix (4 levels).
## : thal_1 thal_2 thal_3 
## Combining factor matrices into a data frame.
data = result$data

str(data)
## 'data.frame':    303 obs. of  23 variables:
##  $ age         : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ trestbps    : int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol        : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs         : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg     : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach     : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang       : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak     : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ target      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ miss_oldpeak: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sex_1       : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp_1        : int  0 0 1 1 0 0 1 1 0 0 ...
##  $ cp_2        : int  0 1 0 0 0 0 0 0 1 1 ...
##  $ cp_3        : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ slope_1     : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ slope_2     : int  0 0 1 1 1 0 0 1 1 1 ...
##  $ ca_1        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_2        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_3        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_4        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal_1      : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ thal_2      : int  0 1 1 1 1 0 1 0 0 1 ...
##  $ thal_3      : int  0 0 0 0 0 0 0 1 1 0 ...
#dim(data)

What happened? Each factor was expanded into one indicator column per level, minus a dropped reference level (e.g., “cp” with 4 levels became cp_1, cp_2, and cp_3), growing the data frame from 15 to 23 variables.

1.5.2 Save our preprocessed data

We save our preprocessed data into an RData file so that we can easily load it in the later files.

save(data, data_original,
     file = "data/preprocessed.RData")