By Chiel Mues
This blog post will give you a gentle (re)introduction to the idea of matrix factorization, an enormously useful technique in statistics and machine learning.
Matrix Factorization
Matrix factorization is a technique to decompose or factorize a matrix into a product of more fundamental matrices. If that sounds a bit confusing, it's analogous to factorizing a number: \( 48 = 4 \times 12 \) or \( 48 = 6 \times 8 \). Of course, a matrix is more complex than a number, so many kinds of factorization are possible.
Perhaps the easiest matrix factorization is LU decomposition. Here we decompose a square matrix \( A \) into a lower triangular matrix \( L \) and an upper triangular matrix \( U \):
$$ \begin{bmatrix}4 & 3 \\6 & 3\end{bmatrix} = \begin{bmatrix}1 & 0 \\1.5 & 1\end{bmatrix} \begin{bmatrix}4 & 3 \\0 & -1.5\end{bmatrix}. $$
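You can verify a factorization like this numerically in a couple of lines (a quick numpy check, nothing more):

```python
import numpy as np

L = np.array([[1.0, 0.0], [1.5, 1.0]])
U = np.array([[4.0, 3.0], [0.0, -1.5]])

# Multiplying L and U reproduces the original matrix
print(L @ U)  # [[4. 3.]
              #  [6. 3.]]
```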
LU Decomposition
The LU decomposition is a very useful tool for some of the basic mathematical operations needed in statistical and machine learning models. Inverting matrices and solving systems of linear equations are probably the most common problems it is used for. Perhaps the simplest application familiar to statisticians, data scientists, and machine learning engineers alike is estimating the coefficients of a linear regression, so let's take a look!
Example: linear regression using LU Decomposition
First things first, recall that estimating the coefficients of a linear regression (with an Ordinary Least Squares (OLS) estimator) is done as follows:
$$ (X^TX)b=X^Ty \\ b = (X^TX)^{-1}X^Ty. $$
This requires us to invert the matrix \( X^TX \). However, doing an LU decomposition of this matrix accomplishes the same goal and is much more computationally efficient. So let's code up this example.
We start by generating an x variable from a normal distribution.
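A minimal sketch of this step, assuming numpy, a fixed seed, and a sample size of 100 (these choices are illustrative, not prescriptive):

```python
import numpy as np

# Assumed setup: a reproducible random generator and a sample size of 100
rng = np.random.default_rng(seed=42)
n = 100

# Draw the predictor from a standard normal distribution
x = rng.normal(loc=0, scale=1, size=n)
```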
Next we define the matrix of regression coefficients.
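Using the values mentioned later in the post (intercept 0, slope 3.1415), the coefficient vector could look like this:

```python
# True coefficients: intercept 0 and slope 3.1415
b_true = np.array([[0], [3.1415]])
```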
Now we can use this matrix of regression coefficients to calculate a y variable, with known coefficients.
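We stack a column of ones next to x to form the design matrix and compute y; the noise term and its scale are assumptions made for this example:

```python
# Design matrix: a column of ones (intercept) next to the predictor
X = np.column_stack([np.ones(n), x])

# Outcome generated from known coefficients plus normal noise
# (the noise scale of 1 is an arbitrary choice)
y = X @ b_true + rng.normal(loc=0, scale=1, size=(n, 1))
```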
We now have all the elements we need to solve for \( b \) in \( X^TXb = X^Ty \). First, a bit more algebra to make the next steps clearer:
$$ (X^TX) = LU \\ LUb = X^Ty \\ Ub = L^{-1}X^Ty.$$
Let's define \( Z = L^{-1}X^Ty \), which allows us to construct this final system of equations:
$$ Ub = Z \\ LZ = X^Ty. $$
We can then solve for Z in \( LZ = X^Ty \) and finally solve for b in \( Ub = Z \).
But let's use Python to do the dirty work of finding L and U, and all that linear algebra. Computers are much better at it than you and I!
Let's go!
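One way to do this is with scipy.linalg.lu, which factorizes a matrix as P @ L @ U, where P is a permutation matrix produced by pivoting; for this small system we can simply fold P into the right-hand side:

```python
from scipy.linalg import lu

# The normal equations: (X^T X) b = X^T y
XtX = X.T @ X
Xty = X.T @ y

# scipy's lu() returns P, L, U such that XtX = P @ L @ U,
# so we solve L @ U @ b = P^T @ X^T y
P, L, U = lu(XtX)
rhs = P.T @ Xty
```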
Unfair how easy this is! Now let's plug L and U into their respective equations, and solve for the unknowns.
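A simple forward-substitution routine could look like this (a sketch; in practice you might reach for scipy.linalg.solve_triangular instead):

```python
def forward_substitution(L, rhs):
    """Solve L z = rhs for z, where L is lower triangular."""
    n = L.shape[0]
    z = np.zeros_like(rhs, dtype=float)
    for i in range(n):
        # Subtract the contribution of already-solved entries, then divide by the pivot
        z[i] = (rhs[i] - L[i, :i] @ z[:i]) / L[i, i]
    return z

# Solve L Z = X^T y (with the pivoting folded into rhs) for Z
Z = forward_substitution(L, rhs)
```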
The forward_substitution() function is used to solve linear equations using forward substitution. Having solved for Z, we can now use it to solve for b.
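Back substitution works the same way, but starting from the bottom row and working upwards; again a sketch:

```python
def backward_substitution(U, rhs):
    """Solve U b = rhs for b, where U is upper triangular."""
    n = U.shape[0]
    b = np.zeros_like(rhs, dtype=float)
    for i in range(n - 1, -1, -1):
        # Subtract the contribution of already-solved entries, then divide by the pivot
        b[i] = (rhs[i] - U[i, i + 1:] @ b[i + 1:]) / U[i, i]
    return b

# Solve U b = Z for the regression coefficients
b = backward_substitution(U, Z)
```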
And we're done already! Think of all the time you wasted in linear algebra classes doing this by hand or on a calculator! Let's take a look at the results.
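Printing the estimated coefficients (the exact numbers depend on the random seed and the generated noise):

```python
print(b)
```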
$$ \begin{bmatrix}0.03234184 \\ 2.38730346\end{bmatrix} $$
Remember that we set the intercept to be equal to 0, and our slope to be equal to 3.1415. It's normal to be a little bit off, since we're generating random numbers of course. But we can do two sanity checks: we can use Python to directly calculate the inverse of \( X^TX \) and solve for b, and we can use a library like statsmodels to fit the linear regression and compare the coefficients it estimates.
Inverse
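A direct check, explicitly inverting \( X^TX \) (fine for a small example, though explicit inversion is generally discouraged numerically):

```python
# Solve the normal equations by explicit matrix inversion
b_inv = np.linalg.inv(X.T @ X) @ X.T @ y
print(b_inv)
```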
$$ \begin{bmatrix}0.03234184 \\ 2.38730346\end{bmatrix} $$
Nice, the results are exactly the same, as expected!
Statsmodels
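And the same fit with statsmodels' OLS (note that X already contains the intercept column, so there is no need for add_constant here):

```python
import statsmodels.api as sm

# Fit an ordinary least squares model; X already includes the intercept column
model = sm.OLS(y, X).fit()
print(model.params)
```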
$$ \begin{bmatrix}0.03234184 \\ 2.38730346\end{bmatrix} $$
And once again, as expected, we get the exact same result, which should not be surprising since most statistical and machine learning libraries use these kinds of matrix calculations under the hood. Still, it's fun to take a peek behind the scenes once in a while!
Conclusion
Isn't it great that such a simple technique allows us to estimate linear regression coefficients? Of course, in practice LU decomposition is not used for these kinds of situations; more advanced decompositions like the Cholesky decomposition, eigendecomposition, or singular value decomposition are used instead.
Matrices and factorization are hugely important in all kinds of state-of-the-art machine learning models and neural networks. You can represent the input, hidden layers, and outputs of a neural network using matrices. Most important of all, matrix calculations enable us to make use of vectorization. This technique allows your computer to do calculations on all of the values at once, instead of having to loop through each value one by one. This is also why neural networks are almost always trained on GPUs: hardware built for handling a huge number of computations at once.
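As a tiny illustration of the difference, compare a Python loop with the vectorized numpy equivalent (timings will vary by machine, but the vectorized version is dramatically faster):

```python
import numpy as np

values = np.random.default_rng(0).normal(size=1_000_000)

# Looping over each element one by one
squared_loop = np.array([v ** 2 for v in values])

# Vectorized: one call operates on the whole array at once
squared_vec = values ** 2

assert np.allclose(squared_loop, squared_vec)
```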
I hope I've convinced you of how useful matrix factorizations are! If not, we have a blog post coming up about dimensionality reduction, which will once again heavily feature matrix factorizations!