3 LLS: A linear algebra perspective
We now take a deeper dive into the structure of linear least squares problems. We will see how theoretical concepts from linear algebra can be used to describe and explain properties of learned linear regression models.
The intended learning outcomes of these notes are that students should be able to:
Determine if a least-squares problem is over-, under-, or well-determined.
Describe the set of all solutions to over-, under-, and well-determined least squares problems.
Use linear algebra concepts to describe how the choice of features in a linear regression model impacts the minimizer and minima of least squares problems.
3.1 Mathematical problem formulation
We briefly recap the mathematical problem formulation that we consider. Given a data set \(\{(z_i,y_i)\}_{i=1}^N\), where \(z_i\in\mathbb{R}^d\) and \(y_i\in\mathbb{R}\) are paired inputs and outputs, we choose a feature mapping \(x: \mathbb{R}^d\to\mathbb{R}^n\). We seek to find coefficients \(\beta\in\mathbb{R}^n\) so that the model
\[ f(z;\beta) = x(z)^\top\beta = \sum_{j=1}^n \beta_j x_j(z) \]
accurately predicts the output \(y\). To do this, we define the matrix \(X\in\mathbb{R}^{N\times n}\) and vector \(Y\in\mathbb{R}^N\),
\[ X = \begin{pmatrix} - &x^\top(z_1) & - \\ & \vdots & \\ - & x^\top(z_N) & - \end{pmatrix}, \qquad Y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}, \qquad(3.1)\]
and write the least-squares problem as follows:
\[ X\beta = Y \qquad(3.2)\]
This is a different notation than the minimization (?eq-linreg-minimization) introduced earlier, but is also fairly common. One way to think of the expression Equation 3.2 is to note that our goal is to find \(\beta\) so that \(X\beta\) is as close to \(Y\) is possible – where ideally we would have \(X\beta = Y\). But Equation 3.2 has a more precise and technical meaning. The goal of these notes is to explain that precise technical definition and its implications for model learning.
3.2 Well-posedness of \(X\beta = Y\)
In math, the term “well-posedness” refers to the existence and uniqueness of solutions to a problem. We say a mathematical problem is “well-posed” if it has a unique solution. Existence and uniqueness of solutions may seem like abstract theoretical concepts, but as we will soon see, they have very practical consequences for the accuracy and reliability of the models we learn.
3.2.1 Well-posedness of square linear systems
Suppose for a moment that \(N = n\). Then the square linear system \(X\beta = Y\) describes a system of \(n\) linear equations for \(n\) unknowns (the elements of the unknown coefficient vector \(\beta\in\mathbb{R}^n\)). The problem of solving \(X\beta = Y\) for \(\beta\) is well-posed if and only if \(X\) is invertible. In this case, the solution exists and is uniquely given by \(\beta=X^{-1}Y\).
Remember that in our setting, \(X\) is a matrix of feature data (one row per data point), and \(Y\) is a vector with the corresponding output data. For square invertible \(X\), the existence of a solution to \(X\beta=Y\) means that there is a choice of \(\beta\) so that the learned model will interpolate the data, i.e., the model will satisfy
\[ f(z_i;\beta) = y_i \quad \text{for all }i = 1,\ldots,N. \]
The fact that the solution is unique means there is only one choice of \(\beta\) that perfectly matches the data. Imagine for a moment if there were more than one choice of \(\beta\) that perfectly matched the data: how would you choose which \(\beta\) to use for predicting model outputs at new inputs?
What if \(X\) is not invertible? This is clearly the case in the general linear regression problem we have set up where \(N\neq n\) and \(X\) is rectangular, but \(X\) can also be non-invertible and square. There are essentially two ways \(X\) can fail to be invertible:
\(X\) can have column rank less than \(n\), i.e., the number of linearly independent columns in \(X\) is less than the number of columns \(n\). In this case, we say \(X\) is “column rank-deficient” or \(X\) “does not have full column rank”.
\(X\) can have row rank less than \(N\), i.e., the number of linearly independent rows in \(X\) is less than the number of rows \(N\). In this case, we say \(X\) is “row rank-deficient” or \(X\) “does not have full row rank”.
These are not mutually exclusive: \(X\) can be simultaneously column rank-deficient and row rank-deficient. However, these different types of rank deficiencies have different consequences for well-posedness the existence and uniqueness of solutions, and different implications for learning linear regression models.
3.2.2 Overdetermined least-squares problems
Let’s start with the row rank-deficient case. If \(X\) has fewer linearly independent rows than the total number of rows, this means that some of the rows of \(X\) are redundant. This means that some of the linear equations represented by rows of the linear system \(X\beta=Y\) have redundant left-hand sides. Because redundant equations place more constraints on the solution than can be simultaneously satisfied, this case is called the overdetermined case.
One way overdetermined linear systems can arise is if we have multiple copies of the same input in the data set, i.e., suppose \(z_1 = z_2\). Note that having multiple copies of the same input datum usually does not mean that the associated output data are also copies of each other (so you can still have \(y_1 \neq y_2\)): for example, you might take multiple experimental measurements with the same configuration and get different measured values due to sensor noise or external factors that you don’t account for in your inputs. If this happens, your data set contains “contradictory” information, because \(x(z_1)^\top\beta=y_1\) and \(x(z_2)^\top\beta= x(z_1)^\top\beta=y_2\) cannot simultaneously be true.
What does row rank-deficiency mean for solving \(X\beta=Y\)? In most cases this means that there is no choice of \(\beta\) that will exactly solve the equation — the equality cannot simultaneously hold for all rows of the system. That means no solution exists for the linear system. This means there is no choice of \(\beta\) that can perfectly match all the data.
Instead of looking for \(\beta\) that yields a perfect match to the data, the least-squares problem formulation looks to minimize the misfit to the data. Let’s define the residual of the system as follows:
\[ r(\beta) = X\beta-Y. \qquad(3.3)\]
For a given parameter \(\beta\), the residual norm \(\|r(\beta)\|\) is a measure of the model misfit over the training data set. Note that if \(X\beta=Y\) holds with equality for all rows, then the residual (norm) is zero. In the overdetermined case where \(X\beta=Y\), since we can’t have a zero residual, we aim for the next best thing, which is minimizing the size of the residual (recall that the norm is a measure of the size of a vector)1:
\[ \beta^* = \arg\min_{\beta\in\mathbb{R}^n} \|r(\beta)\| = \arg\min_{\beta\in\mathbb{R}^n}\|r(\beta)\|^2 \qquad(3.4)\]
With the residual defined in Equation 3.3, the expression Equation 3.4 recovers our original expression for the formulation of a least-squares problem2, ?eq-linreg-minimization. In this sense, the least-squares formulation “fixes” the lack of existence of a solution to the linear system due to row redundancy. For this reason, if you see \(X\beta=Y\) written for a non-invertible system3, the way you should understand that is in the sense of solving ?eq-linreg-minimization.
3.2.3 Underdetermined least-squares problems
We now turn our attention to the column rank-deficient case. If \(X\) has fewer linearly independent columns than the total number of columns, this means that some of the columns of \(X\) are redundant. Recall that \(X\beta\in\mathbb{R}^N=\sum_{j=1}^n \beta_j x_j\), where \(x_j\in\mathbb{R}^N\) is the \(j\)th column of \(X\). If \(X\) has redundant columns, this means that there are different ways to weight the columns of \(X\) using the coefficients in \(\beta\) that lead to the same model predictions. This means that requiring \(X\beta=Y\) will not fully specify what \(\beta\) is, so we call this the underdetermined case.
One way underdetermined systems arise is if there are more unknown coefficients in \(\beta\) than there data points to constrain the problem. This can happen if we choose a very large set of features (\(n\) large) but only have a small amount of data (\(N<n\)).
What does column rank-deficiency mean for solving \(X\beta=Y\)? This means that the solution is not unique. This non-uniqueness comes from the fact that column rank-deficiency of \(X\) means that \(X\) has a nonzero kernel (aka nullspace). Recall that this means there exists some nonzero \(v\in\mathbb{R}^n\) satisfying \(Xv = 0\). Let \(r^*\) be the minimum possible residual,
\[ r^* = \min_{\beta\in\mathbb{R}^n}\|r(\beta)\|. \]
Suppose we find some \(\beta^*\) that leads to this minimum residual, so that \(r^* = X\beta^* - Y\). Well, if \(v\) lies in the kernel of \(X\), then we can add any multiple of \(v\) to \(\beta^*\) and get the same minimum residual! That is, if \(\beta^*\) is a solution to Equation 3.4, then so is \(\beta^* + av\) for any \(a\in\mathbb{R}\). This means there are infinitely many solutions. To make underdetermined problems well-posed, we have to decide on a way of choosing a single unique solution out of the many possible minimizers of Equation 3.4.
Deciding on a method of choosing a single unique solution out of many possible solutions is called regularization and is a topic we will explore in more depth later. In the meantime, the most common and vanilla way of regularizing is to specify that we want to find what’s called the “minimum-norm” solution: i.e., out of all possible solutions, we take the smallest one. That is,
\[ \beta^* = \arg\min_{\beta\in\mathbb{R}^n} \|\beta\| \quad \text{subject to } \|X\beta - Y\| = r^*. \qquad(3.5)\]
3.2.4 Summary of linear least-squares problem formulation
For \(X\) and \(y\) defined as in Equation 3.1, to solve \(X\beta=Y\) means to find the minimum-norm least-squares solution as described in Equation 3.5. This is generally what is happening under the hood when you use any scientific computing or machine learning software that has a built-in least-squares function, including
- scipy.linalg.lstsq
- numpy.linalg.lstsq
- MATLAB’s “backslash operator” \ also known as mldivide
In a later chapter we will introduce some of the basic algorithms that are used to find (minimum-norm) least-squares solutions that these functions use. For now you should focus on understanding the precise mathematical definition of the least squares solution given in Equation 3.5.
3.3 Feature selection: expressivity, accuracy, and well-posedness
You may wonder why you should bother understanding what the mathematical solution is when there are built-in functions that just give it to you. One reason is that you can debug machine learning code much more effectively if you understand what the code is supposed to be doing theoretically. Another reason is that considerations of well-posedness have big implications for your job as an intelligent human designer of machine learning models.
For linear regression models, your job as an intelligent human is to decide on the feature mapping \(x:\mathbb{R}^d \to\mathbb{R}^n\), which determines what kinds of terms appear in your learned model. Examples of features can include:
monomials, e.g., \(1, z, z^2, z^3, \ldots\),
Fourier features, e.g. \(\sin(\pi z), \cos(\pi z), \sin(2\pi z),\cos(2\pi z), \ldots\),
physics-based features based on problem knowledge,
or any other function of \(z\) that you choose.
The choice of feature mapping \(x\) determines the expressivity of the model class. The term “expressivity” describes the set of functions that the model class can represent. A model class is considered highly expressive if it can represent many different types of functions. For example, recall that any continuous function can be arbitrarily well-approximated by an infinite sum of polynomial terms. Thus, the features \(x_j : z \mapsto z^j\) for \(j = 0,1,2,\ldots, n-1\) would be highly expressive if \(d\) is large. Fourier-based features have a similar approximation result – if our features are based on truncating a Fourier series at some frequency, the more features we have, the more expressive the model class would be.
Generally, highly expressive model classes are better able to match the training data than less expressive model classes. This is because there are more degrees of freedom (regression coefficients) that we can tweak to lower the mismatch between the model prediction and the training data. However, the more features we have, the greater the risk of overfitting, that is, achieving a very low error on the training data with a model that will not perform well on unseen (test) data. Overfitting is especially a risk for underdetermined problems, where there are more features than data points, but can also happen even when the problem is well-determined or over-determined.
Note that the second equality follows because squaring the norm doesn’t change where the minimum is because \(a^2\) is monotone increasing in \(a\) for non-negative \(a\).↩︎
up to a multiplying constant anyway, which I’ve emphasized before really doesn’t matter when it’s inside a minimization.↩︎
see, for example, the documentation for the numpy function linalg.lstsq or the scipy version↩︎