Use Sparse Matrix with R
Anirban Ghatak
1 Sparse Matrix
Sparse matrices arise natrually in many problems. For example, if we want to predict the price of an item on craigslist using the post’s text, we could build a matrix where each row represents a craigslist post, each column represents a keyword {bad, boat, car, good, new, shoes, used}, and element $ (i,j) $ represents the number of times keyword $ j $ appears in post $ i $.
In practice, such a matrix can have millions of rows and columns, and storing every single element of that matrix is expensive. Compressed matrix formats take advantage of the fact that sparse matrices are comprised mostly of 0s, and so they only store the non-zero entries. In this post, we’ll look at common techniques for creating sparse matrices in R.
1.1 Sparse Matrix From Base R Matrix
library(Matrix)
# Build a base R matrix from scratch, comprised of
# 0s with probability 0.80
# 1s with probability 0.10
# 2s with probability 0.10
set.seed(0)
nrows <- 4L
ncols <- 6L
vals <- sample(
x = c(0, 1, 2),
prob = c(0.8, 0.1, 0.1),
size = nrows * ncols,
replace = TRUE
)
matBaseR <- matrix(vals, nrow = nrows)
matBaseR
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 2 1 0 0 0 0
## [2,] 0 0 0 0 0 1
## [3,] 0 2 0 0 1 0
## [4,] 0 1 0 0 0 0
# Convert to sparse format
matSparse <- as(matBaseR, "sparseMatrix")
matSparse
## 4 x 6 sparse Matrix of class "dgCMatrix"
##
## [1,] 2 1 . . . .
## [2,] . . . . . 1
## [3,] . 2 . . 1 .
## [4,] . 1 . . . .
Notice how the snippet as(matBaseR, "sparseMatrix")
creates a dgCMatrix
by default. This is a matrix in compressed sparse column (CSC) format. Instead of letting the Matrix package make this decision for you, you can be explicit about the storage format you want. Granted, this is usually going to be CSC.
# compressed sparse column format
matCSC <- as(matBaseR, "dgCMatrix")
matCSC
## 4 x 6 sparse Matrix of class "dgCMatrix"
##
## [1,] 2 1 . . . .
## [2,] . . . . . 1
## [3,] . 2 . . 1 .
## [4,] . 1 . . . .
# compressed sparse row format
matCSR <- as(matBaseR, "dgRMatrix")
matCSR
## 4 x 6 sparse Matrix of class "dgRMatrix"
##
## [1,] 2 1 . . . .
## [2,] . . . . . 1
## [3,] . 2 . . 1 .
## [4,] . 1 . . . .
Before moving on, the following exercise is worth doing.
- Modify the parameters defining the size of the matrix to
nrows = 10^4
andncols = 10^4
- Re-build
matBaseR
,matCSC
andmatCSR
- Compare their sizes in memory
1.2 Sparse Matrix From (Row, Column) Pairs
We start by building a table of customers, products, and orders.
# Build a table of customers
custs <- data.frame(Customer = c("Bob", "Sue", "Ralf", "John", "Mary"))
custs
## Customer
## 1 Bob
## 2 Sue
## 3 Ralf
## 4 John
## 5 Mary
# Build a table of products
prods <- data.frame(Product = c("AAA", "BBB", "CCC", "DDD", "EEE", "FFF"))
prods
## Product
## 1 AAA
## 2 BBB
## 3 CCC
## 4 DDD
## 5 EEE
## 6 FFF
# Build a table of orders: (customer, product, units) triplets
orders <- data.frame(
Customer = c("Bob", "Bob", "Sue", "Sue", "Bob", "Ralf", "John", "John"),
Product = c("AAA", "AAA", "FFF", "FFF", "EEE", "EEE", "AAA", "FFF")
)
# Insert RowIdx column, identifying the row index to assign each customer
orders$RowIdx <- match(orders$Customer, custs$Customer)
# Insert ColIdx column, identifying the column index to assign each product
orders$ColIdx <- match(orders$Product, prods$Product)
# Inspect
orders
## Customer Product RowIdx ColIdx
## 1 Bob AAA 1 1
## 2 Bob AAA 1 1
## 3 Sue FFF 2 6
## 4 Sue FFF 2 6
## 5 Bob EEE 1 5
## 6 Ralf EEE 3 5
## 7 John AAA 4 1
## 8 John FFF 4 6
RowIdx and ColIdx in the orders table will be the (i,j) location of each record in the resulting matrix. Let’s start by constructing a basic sparse matrix where element (i,j) = TRUE
if customer i purchased product j, and FALSE
otherwise.
# Build sparse matrix indicating if customer i ever purchased project j
matSparse1 <- sparseMatrix(
i = orders$RowIdx,
j = orders$ColIdx
)
matSparse1
## 4 x 6 sparse Matrix of class "ngCMatrix"
##
## [1,] | . . . | .
## [2,] . . . . . |
## [3,] . . . . | .
## [4,] | . . . . |
Notice that the resulting matrix has 4 rows, but we actually have 5 customers. The 5th customer, Mary, doesn’t have any orders. If we want to make sure every customer (and every product) appears in our matrix, we can set the dimensions explicitly. Furthermore, we can provide row names and column names to sparseMatrix()
.
matSparse2 <- sparseMatrix(
i = orders$RowIdx,
j = orders$ColIdx,
dims = c(nrow(custs), nrow(prods)),
dimnames = list(custs$Customer, prods$Product)
)
matSparse2
## 5 x 6 sparse Matrix of class "ngCMatrix"
## AAA BBB CCC DDD EEE FFF
## Bob | . . . | .
## Sue . . . . . |
## Ralf . . . . | .
## John | . . . . |
## Mary . . . . . .
Instead of building a matrix where element (i,j) indicates if customer i purchased product j, let’s build one where element (i,j) shows the number of times customer i purchased product j.
matSparse3 <- sparseMatrix(
i = orders$RowIdx,
j = orders$ColIdx,
x = 1L,
dims = c(nrow(custs), nrow(prods)),
dimnames = list(custs$Customer, prods$Product)
)
matSparse3
## 5 x 6 sparse Matrix of class "dgCMatrix"
## AAA BBB CCC DDD EEE FFF
## Bob 2 . . . 1 .
## Sue . . . . . 2
## Ralf . . . . 1 .
## John 1 . . . . 1
## Mary . . . . . .
Here, we use x = 1L
to “add 1” to element (i,j) every time it is seen in the input. If you replaced x = 1L
with x = 2L
, we’d get a matrix that’s twice the one we just created.
1.3 Sparse Matrix From (Row, Column, Value) Triplets
Now let’s assign a Units column to our orders table that indicates the number of units purchased for each order.
orders$Units <- c(1L, 1L, 3L, 1L, 1L, 2L, 1L, 1L)
orders
## Customer Product RowIdx ColIdx Units
## 1 Bob AAA 1 1 1
## 2 Bob AAA 1 1 1
## 3 Sue FFF 2 6 3
## 4 Sue FFF 2 6 1
## 5 Bob EEE 1 5 1
## 6 Ralf EEE 3 5 2
## 7 John AAA 4 1 1
## 8 John FFF 4 6 1
We can build a sparse matrix where element (i,j) shows the number of units of product j purchased by customer i.
matSparse4 <- sparseMatrix(
i = orders$RowIdx,
j = orders$ColIdx,
x = orders$Units,
dims = c(nrow(custs), nrow(prods)),
dimnames = list(custs$Customer, prods$Product)
)
matSparse4
## 5 x 6 sparse Matrix of class "dgCMatrix"
## AAA BBB CCC DDD EEE FFF
## Bob 2 . . . 1 .
## Sue . . . . . 4
## Ralf . . . . 2 .
## John 1 . . . . 1
## Mary . . . . . .