1 Sparse Matrix

Sparse matrices arise natrually in many problems. For example, if we want to predict the price of an item on craigslist using the post’s text, we could build a matrix where each row represents a craigslist post, each column represents a keyword {bad, boat, car, good, new, shoes, used}, and element $ (i,j) $ represents the number of times keyword $ j $ appears in post $ i $.

In practice, such a matrix can have millions of rows and columns, and storing every single element of that matrix is expensive. Compressed matrix formats take advantage of the fact that sparse matrices are comprised mostly of 0s, and so they only store the non-zero entries. In this post, we’ll look at common techniques for creating sparse matrices in R.

1.1 Sparse Matrix From Base R Matrix

library(Matrix)

# Build a base R matrix from scratch, comprised of 
#   0s with probability 0.80
#   1s with probability 0.10
#   2s with probability 0.10

set.seed(0)
nrows <- 4L
ncols <- 6L
vals <- sample(
  x = c(0, 1, 2), 
  prob = c(0.8, 0.1, 0.1), 
  size = nrows * ncols, 
  replace = TRUE
)
matBaseR <- matrix(vals, nrow = nrows)
matBaseR
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    2    1    0    0    0    0
## [2,]    0    0    0    0    0    1
## [3,]    0    2    0    0    1    0
## [4,]    0    1    0    0    0    0

# Convert to sparse format
matSparse <- as(matBaseR, "sparseMatrix")
matSparse
## 4 x 6 sparse Matrix of class "dgCMatrix"
##                 
## [1,] 2 1 . . . .
## [2,] . . . . . 1
## [3,] . 2 . . 1 .
## [4,] . 1 . . . .

Notice how the snippet as(matBaseR, "sparseMatrix") creates a dgCMatrix by default. This is a matrix in compressed sparse column (CSC) format. Instead of letting the Matrix package make this decision for you, you can be explicit about the storage format you want. Granted, this is usually going to be CSC.

# compressed sparse column format
matCSC <- as(matBaseR, "dgCMatrix")
matCSC
## 4 x 6 sparse Matrix of class "dgCMatrix"
##                 
## [1,] 2 1 . . . .
## [2,] . . . . . 1
## [3,] . 2 . . 1 .
## [4,] . 1 . . . .

# compressed sparse row format
matCSR <- as(matBaseR, "dgRMatrix")
matCSR
## 4 x 6 sparse Matrix of class "dgRMatrix"
##                 
## [1,] 2 1 . . . .
## [2,] . . . . . 1
## [3,] . 2 . . 1 .
## [4,] . 1 . . . .

Before moving on, the following exercise is worth doing.

  • Modify the parameters defining the size of the matrix to nrows = 10^4 and ncols = 10^4
  • Re-build matBaseR, matCSC and matCSR
  • Compare their sizes in memory

1.2 Sparse Matrix From (Row, Column) Pairs

We start by building a table of customers, products, and orders.

# Build a table of customers
custs <- data.frame(Customer = c("Bob", "Sue", "Ralf", "John", "Mary"))
custs
##   Customer
## 1      Bob
## 2      Sue
## 3     Ralf
## 4     John
## 5     Mary

# Build a table of products
prods <- data.frame(Product = c("AAA", "BBB", "CCC", "DDD", "EEE", "FFF"))
prods
##   Product
## 1     AAA
## 2     BBB
## 3     CCC
## 4     DDD
## 5     EEE
## 6     FFF

# Build a table of orders: (customer, product, units) triplets
orders <- data.frame(
  Customer = c("Bob", "Bob", "Sue", "Sue", "Bob", "Ralf", "John", "John"),
  Product = c("AAA", "AAA", "FFF", "FFF", "EEE", "EEE", "AAA", "FFF")
)

# Insert RowIdx column, identifying the row index to assign each customer
orders$RowIdx <- match(orders$Customer, custs$Customer)

# Insert ColIdx column, identifying the column index to assign each product
orders$ColIdx <- match(orders$Product, prods$Product)

# Inspect
orders
##   Customer Product RowIdx ColIdx
## 1      Bob     AAA      1      1
## 2      Bob     AAA      1      1
## 3      Sue     FFF      2      6
## 4      Sue     FFF      2      6
## 5      Bob     EEE      1      5
## 6     Ralf     EEE      3      5
## 7     John     AAA      4      1
## 8     John     FFF      4      6

RowIdx and ColIdx in the orders table will be the (i,j) location of each record in the resulting matrix. Let’s start by constructing a basic sparse matrix where element (i,j) = TRUE if customer i purchased product j, and FALSE otherwise.

# Build sparse matrix indicating if customer i ever purchased project j
matSparse1 <- sparseMatrix(
  i = orders$RowIdx, 
  j = orders$ColIdx
)
matSparse1
## 4 x 6 sparse Matrix of class "ngCMatrix"
##                 
## [1,] | . . . | .
## [2,] . . . . . |
## [3,] . . . . | .
## [4,] | . . . . |

Notice that the resulting matrix has 4 rows, but we actually have 5 customers. The 5th customer, Mary, doesn’t have any orders. If we want to make sure every customer (and every product) appears in our matrix, we can set the dimensions explicitly. Furthermore, we can provide row names and column names to sparseMatrix().

matSparse2 <- sparseMatrix(
  i = orders$RowIdx, 
  j = orders$ColIdx, 
  dims = c(nrow(custs), nrow(prods)), 
  dimnames = list(custs$Customer, prods$Product)
)
matSparse2
## 5 x 6 sparse Matrix of class "ngCMatrix"
##      AAA BBB CCC DDD EEE FFF
## Bob    |   .   .   .   |   .
## Sue    .   .   .   .   .   |
## Ralf   .   .   .   .   |   .
## John   |   .   .   .   .   |
## Mary   .   .   .   .   .   .

Instead of building a matrix where element (i,j) indicates if customer i purchased product j, let’s build one where element (i,j) shows the number of times customer i purchased product j.

matSparse3 <- sparseMatrix(
  i = orders$RowIdx, 
  j = orders$ColIdx, 
  x = 1L,
  dims = c(nrow(custs), nrow(prods)), 
  dimnames = list(custs$Customer, prods$Product)
)
matSparse3
## 5 x 6 sparse Matrix of class "dgCMatrix"
##      AAA BBB CCC DDD EEE FFF
## Bob    2   .   .   .   1   .
## Sue    .   .   .   .   .   2
## Ralf   .   .   .   .   1   .
## John   1   .   .   .   .   1
## Mary   .   .   .   .   .   .

Here, we use x = 1L to “add 1” to element (i,j) every time it is seen in the input. If you replaced x = 1L with x = 2L, we’d get a matrix that’s twice the one we just created.

1.3 Sparse Matrix From (Row, Column, Value) Triplets

Now let’s assign a Units column to our orders table that indicates the number of units purchased for each order.

orders$Units <- c(1L, 1L, 3L, 1L, 1L, 2L, 1L, 1L)
orders
##   Customer Product RowIdx ColIdx Units
## 1      Bob     AAA      1      1     1
## 2      Bob     AAA      1      1     1
## 3      Sue     FFF      2      6     3
## 4      Sue     FFF      2      6     1
## 5      Bob     EEE      1      5     1
## 6     Ralf     EEE      3      5     2
## 7     John     AAA      4      1     1
## 8     John     FFF      4      6     1

We can build a sparse matrix where element (i,j) shows the number of units of product j purchased by customer i.

matSparse4 <- sparseMatrix(
  i = orders$RowIdx, 
  j = orders$ColIdx, 
  x = orders$Units,
  dims = c(nrow(custs), nrow(prods)), 
  dimnames = list(custs$Customer, prods$Product)
)
matSparse4
## 5 x 6 sparse Matrix of class "dgCMatrix"
##      AAA BBB CCC DDD EEE FFF
## Bob    2   .   .   .   1   .
## Sue    .   .   .   .   .   4
## Ralf   .   .   .   .   2   .
## John   1   .   .   .   .   1
## Mary   .   .   .   .   .   .