Statistics Saga 1: Matrix Factorization

This blogpost will give you a gentle (re)introduction to the idea of matrix factorization, an enormously useful technique in statistics and machine learning.

Matrix Factorization

Matrix factorization is a technique to decompose or factorize a matrix into a product of more fundamental matrices. If that sounds a bit confusing, it's analogous to factorizing a number: 48=4×12 or 48=6×8. Of course, a matrix is more complex than a number, so many kinds of factorization are possible.

Perhaps the easiest matrix factorization is LU decomposition. Here we decompose a square matrix A into a lower triangular matrix L and an upper triangular matrix U:

$$ \begin{bmatrix}4 & 3 \\6 & 3\end{bmatrix} = \begin{bmatrix}1 & 0 \\1.5 & 1\end{bmatrix} \begin{bmatrix}4 & 3 \\0 & -1.5\end{bmatrix}. $$

LU Decomposition

The LU decomposition is a very useful tool for some basic mathematical operations that are needed in statistical and machine learning models. Inverting matrices and solving systems of linear equations are probably the most common problems that can be solved by an LU decomposition. Perhaps the easiest application both familiar to statisticians, data scientists, and machine learning engineers is estimating the coefficients of a linear regression with LU decomposition, let's take look!

Example: linear regression using LU Decomposition

First things first, recall that estimating the coefficients of a linear regression (with an Ordinary Least Squares (OLS) estimator) is done as follows:

$$ (X^TX)b=X^Ty \\ b = (X^TX)^{-1}X^Ty. $$

This requires us to invert the matrix $ X^TX $. However, doing an LU decomposition of this matrix accomplishes the same goal and is much more computationally efficient. So let's code up this example.

We start by generating an x variable from a normal distribution.

import numpy as np

np.random.seed(42) # seed for reproducibility
mu, sigma = 0, 0.1 # mean and standard deviation
n = 500
x = np.random.normal(mu, sigma, n)

generating values for the predictor of our linear regression

Next we define the matrix of regression coefficients.

ones = np.ones(n) 
X = np.column_stack( (ones, x) )
b = np.vstack(np.array( (0, 3.1415) )) # set intercept and slope

creating the coefficient matrix

Now we can use this matrix of regression coefficients to calculate a y variable, with known coefficients.

y = X.dot(b) + np.vstack(np.random.normal(0,1,n))

creating y variable, predicted by x with coefficients b

We now have all the elements we need. We can now solve for b in $ X^TXb=X^Ty $. To do so, we first require a bit more algebra to make some of the next steps more clear:

$$ (X^TX) = LU \\ LUb = X^Ty \\ Ub = L^{-1}X^Ty.$$

Let's define $ Z = L^{-1}X^Ty $ . Allowing us to construct this final system of equations:

$$ Ub = Z \\ LZ = X^Ty. $$

We can then solve for Z in $ LZ = X^Ty $ and finally solve for b in $ Ub = Z $.

But let's use python to do the dirty work of finding L and U, and all that linear algebra. Computers are much better at it than you and I!

Let's go!

import scipy.linalg as la

# LU decomposition

P, L, U = la.lu((X.T).dot(X))

LU Decomposition

Unfair how easy this is! Now let's plug L and U into their respective equations, and solve for the unknowns.

# Solve for Z

def forward_substitution(L, b):
  # source: https://johnfoster.pge.utexas.edu/numerical-methods-    	             book/LinearAlgebra_LU.html#$\mathbf{LU}
    
    #Get number of rows
    n = L.shape[0]
    
    #Allocating space for the solution vector
    y = np.zeros_like(b, dtype=np.double);
    
    #Here we perform the forward-substitution.  
    #Initializing  with the first row.
    y[0] = b[0] / L[0, 0]
    
    #Looping over rows in reverse (from the bottom  up),
    #starting with the second to last row, because  the 
    #last row solve was completed in the last step.
    for i in range(1, n):
        y[i] = (b[i] - np.dot(L[i,:i], y[:i])) / L[i,i]
        
    return y

Z = forward_substitution(L, (X.T).dot(y))

solving for Z

The forward_substitution() function is used to solve linear equations using forward substitution. Having solved for Z, we can now use it to solve for b.

# Solve for b

def back_substitution(U, y):
    
    #Number of rows
    n = U.shape[0]
    
    #Allocating space for the solution vector
    x = np.zeros_like(y, dtype=np.double);

    #Here we perform the back-substitution.  
    #Initializing with the last row.
    x[-1] = y[-1] / U[-1, -1]
    
    #Looping over rows in reverse (from the bottom up), 
    #starting with the second to last row, because the 
    #last row solve was completed in the last step.
    for i in range(n-2, -1, -1):
        x[i] = (y[i] - np.dot(U[i,i:], x[i:])) / U[i,i]
        
    return x
b = back_substitution(U, Z)

solving for b

And we're done already! Think of all the time you wasted in linear algebra classes doing this by hand or on a calculator! Let's take a look at the results.

$$ \begin{bmatrix}0.03234184 \\ 2.38730346\end{bmatrix} $$

Remember that we set the intercept to be equal to 0, and our slope to be equal to 3.1415. It's normal to be a little bit off, since we're generating random numbers of course. But we can do two sanity checks: we can use python to directly calculate the inverse of $ X^TX $ and solve for b. We can also use a library like statsmodels to fit the linear regression and the coefficients it estimated.

Inverse

np.linalg.inv( (X.T).dot(X)).dot( (X.T).dot(y))

solving using inverse calculation

$$ \begin{bmatrix}0.03234184 \\ 2.38730346\end{bmatrix} $$

Nice, the results are exactly the same, as expected!

Statsmodels

import statsmodels.api as sm
x_t = sm.add_constant(x) # needed to also estimate the intercept
lm = sm.OLS(y, x_t)
linreg = lm.fit()
print(linreg.params)

linear regression using statsmodels

$$ \begin{bmatrix}0.03234184 \\ 2.38730346\end{bmatrix} $$

And once again, as expected, we get the exact same result. Which should not be surprising since most statistical or machine learning libraries use some kind of matrix calculation behind the scenes. Still, it's fun to take a peak behind the scenes once in a while!

Conclusion

Isn't it great that such a simple technique allows us to estimate linear regression coefficients? Of course, in reality LU decomposition is not used for these kinds of situations, more advanced decompositions like the Cholesky-decomposition, Eigen-decomposition or Singular Value Decomposition are seen instead.

Matrices and factorisation are hugely important in all kinds of state of the art machine learning models and neural networks. You can represent the input, hidden layers, and outputs of a neural network using matrices. The most important thing however is that matrix calculations enable us to make use of vectorization. This technique allows your computer to do calculations on all of the values at once, instead of having to loop through each value one by one. This is why neural networks are almost always trained on GPU's, computer hardware that is built for handling a huge amount of computations once.

I hope I've convinced you of how useful matrix factorisations are! If not, we have a blog coming up about dimensionality reduction, which will once again heavily feature matrix factorizations!