4 LLS: Computational Cost

In this section we will introduce fundamental algorithms from computational linear algebra that are used to solve linear systems and linear least squares problems. We will also introduce and discuss some of the key properties and theoretical considerations when using these algorithms.

The intended learning outcomes of these notes are that students should be able to:

Explain how LU decomposition and QR decompositions are used to solve linear systems and linear least squares problems, respectively.
Analyze the time complexity of numerical linear algebra algorithms by deriving expressions for the number of FLOPs in terms of matrix size
Analyze the memory complexity and cost of numerical linear algebra algorithms in terms of both matrix size and bits of storage

4.1 Measuring computational cost

A key property of any algorithm is its computational cost. The two types of cost we are interested in are time and memory: the former describes how long an algorithm takes to run, and the latter describes how many bytes of RAM we must have available in order to run the computation. For both of these measures, we are interested in understanding how the time and memory costs change as the dimensions of the problem (the number of data \(N\) or the number of features \(d\)) change. These properties are often referred to as the (time or memory) complexity of the algorithm, or as the (time or memory) cost scaling of the algorithm.

4.1.1 Measuring time complexity using FLOPs

We will focus on compute time first. Since compute time will vary from computer to computer, the way we measure time complexity in computational linear algebra is by counting the number of floating point operations (FLOPs) required by an algorithm. A single FLOP refers to the addition, subtraction, multiplication, or division of two scalar quantities. Compute speed of supercomputers is measured in FLOPs per second (also abbreviated FLOPs), and you’ll the fastest computers in the world measured in gigaFLOPs (\(10^9\)), teraFLOPs (\(10^{12}\)), or even exaFLOPs (\(10^{18}\)). All computational linear algebra algorithms are simply comprised of a bunch of FLOPs strung together. To see this, consider adding two vectors, \(a,b\in\mathbb{R}^n\):

\[ a + b = \begin{pmatrix}a_1 \\ a_2 \\ \vdots \\ a_n\end{pmatrix} + \begin{pmatrix}b_1 \\ b_2 \\ \vdots \\ b_n\end{pmatrix} \]

Since vector addition means that the corresponding elements each get added to each other, there are as many scalar addition operations as there are elements of the the vector. Thus vector addition has cost \(n\).

Let’s now consider another operation between vectors, the dot product:

\[ a\cdot b = a^\top b = a_1b_1 + a_2b_2 + \cdots + a_nb_n \]

The dot product involves \(n\) element-wise multiplications followed by \((n-1)\) additions. The total cost is thus \(2n-1\) FLOPs.

Recall that we are most interested in how the computational cost of a computation scales as the dimension of the problem changes. For both vector addition and the dot product between vectors, the computational cost is said to scale linearly with the size of the vectors \(n\). This is because the highest-order polynomial term in the cost is \(n^1\), so this linear term dominates the cost as \(n\) becomes large. This means that if we increase \(n\) by a factor of 10, the cost will also increase by approximately a factor of 10. Note that this is true regardless of what the constant factor in front of the \(n^1\) term is. Because of this, it is common to write that the cost is \(\mathcal{O}(n)\) (read this notation as “order n” or “big-O n”, although the technical definition of “big-O” is not one I plan to get into). Other common terminology is to say that the computation has “first-order” cost scaling/complexity.

As a final example of FLOP-counting in action before we move on, consider matrix-vector multiplication. That is, let \(A\in\mathbb{R}^{m\times n}\) and let \(b\in\mathbb{R}^n\). Then, note that

\[ Ab = \begin{pmatrix}a_{1,:}\cdot b \\ a_{2,:}\cdot b \\ \vdots \\ a_{m,:} \cdot b \end{pmatrix} \]

where \(a_{i,:}\in\mathbb{R}^n\) denotes the row vector that is the \(i\)th row of \(A\). Writing it this way shows that a matrix-vector multiplication consists of \(m\) dot products between two vectors of length \(n\). Thus, the total number of FLOPs is \(m(2n-1)\), which we say is \(\mathcal{O}(mn)\). If the matrix \(A\) is square so that \(m=n\), we would say that matrix-vector multiplication is \(\mathcal{O}(n^2)\), i.e., it has second-order or quadratic cost scaling/complexity.

Note

Cost scaling really describes the cost of algorithms as \(n\) becomes large. This means that as you increase \(n\) from 2 to 3, or from 3 to 30, you may not see the expected scaling in run time. This is because small computations happen fast enough the run time might be more influenced by how busy your computer is doing other things than by how many FLOPs it needs to do. However, as you increase \(n\) by orders of magnitude, to 1000 and 10000 and so on, you should see the expected cost scaling as \(n\) becomes larger.

4.1.2 Measuring memory complexity

There are essentially two considerations when measuring memory complexity of scientific computing algorithms: (1) how a computer stores a single number (scalar), and (2) how many scalars must be stored in order to carry out a computation.

4.1.2.1 Floating point numbers: an introduction

Computers represent numbers using binary, a base-2 numeral system (our usual number system is base-10, or “decimal”). Binary uses only two digits: 0 and 1. In binary, each digit is referred to as a “bit” (short for binary digit), and a group of eight bits is called a “byte.” In base-10, the right-most digit tells us how many ones (\(10^0\)) we have, the second-right-most how many tens (\(10^1\)), the third-right-most how many hundreds (\(10^2\)), etc. Similarly, in binary the right-most bit tells us how many ones (\(2^0\)) we have, the second-right-most how many 2s (\(2^1\)), the third-right-most how many 4s (\(2^2\)) etc. To see this in action, consider the binary number \(1101_2\) (the subscript \(\cdot_2\) tells us the number should be intepreted as a binary number as opposed to one-thousand-one-hundred-one). To convert \(1101_2\) to decimal, we sum the powers of 2 corresponding to each bit:

\[ 1101_2 = (1 \times 2^3) + (1 \times 2^2) + (0 \times 2^1) + (1 \times 2^0) = 8 + 4 + 0 + 1 = 13_{10} \]

How can we use binary to represent decimal numbers, as well as negative numbers? If you have ever learned programming in a language like Java or C, you will have learned that computers can store variables in different data types. In scientific computing, the main data types we care about are single- and double-precision floats (short for floating-point number), which represent numbers with decimal points (as opposed to ints which represent integers). The difference between single-precision float and a double-precision float is the number of binary bits the computer uses to store the number. Single precision uses 32 bits, whereas double precision uses 64 bits. These are standard data types that divide up the available bits into three parts that define a type of binary scientific notation: the sign, the mantissa or significand, and the exponent. We convert these to a value as follows:

\[ \text{value} = (-1)^\text{sign} \cdot \text{significand} \cdot 2^\text{exponent} \]

single precision uses 1 bit for the sign, 8 bits for the exponent, and 23 bits for the “mantissa” or “significand” (the significant digits that pre-multiply the exponent)
double precision uses 1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa/significand

The sign bit can be 0 or 1 — if the bit is 0 we get a positive number and otherwise the number is negative. We do something clever with the significand: just like you would usually not write down the decimal number \(13\) as \(0013\), in binary we also don’t want to “waste” bits with leading 0s. Thus, the significand is understood as having an implicit “1.” in front of all the bits that are actually stored, giving us 24 binary bits of precision in single-precision floats and 53 binary bits of precision in double-precision floats (note that all these bits of precision come after the implicit “1.”, meaning they’re associated with negative powers of 2). This translates to approximately 8 (15) decimal digits of precision for single-precision (double-precision) floats. Finally, for the exponent, the 8 (11) bits can represent the decimal numbers from 0 to 255 (0 to 2047), but we want to allow negative exponents in order to represent numbers that are smaller than 1. Thus, we subtract 127 (1023) from the number stored in the exponent bits to get the actual exponent.

Worked example: Converting the single-precision float “0 10000001 10100000000000000000000” to decimal

Extract the parts:
- Sign bit (S): 0 (positive)
- Exponent bits (E): 10000001
- Significand (fraction) bits (F): 10100000000000000000000
Convert the exponent:
- The exponent is 10000001 in binary, which is 129 in decimal.
- The actual exponent is: \(E - 127 = 129 - 127 = 2\).
Convert the fraction:
- The fraction is 1.F = 1.101 (because the leading 1 is implied).
- Convert 1.101 to decimal: \(1.101_2 = (1 \times 2^0) + (1 \times 2^{-1}) + (0 \times 2^{-2}) + (1 \times 2^{-3}) = 1 + 0.5 + 0 + 0.125 = 1.625\).
Calculate the value: \[ (-1)^S \times 1.F \times 2^{E - 127} = (-1)^0 \times 1.625 \times 2^2 = 1 \times 1.625 \times 4 = 6.5 \]

The default for most computers you will encounter is to store decimal numbers as double-precision floats, and often when you hear someone just say “float” they in fact mean a double. However, the default precision in some popular machine learning packages is single precision.

4.1.2.2 Storing matrices and vectors

If you’re exhausted by the details above the good news is that things get simpler now. When doing computations with matrices and vectors, each element of the matrix/vector is some kind of float, and the cost to store the whole matrix/vector is just the number of elements multiplied by either 32 or 64 bits, depending on what kind of float it is. So a matrix \(X\in\mathbb{R}^{1000\times 25}\) stored in double precision will require \[ 25000\times 64 \text{ bits}=1600000 \text{ bits} = 200000 \text{ bytes} = 200 \text{ kilobytes} \]

If you’ve ever run a computation on your laptop and gotten an “Out of Memory” error, that’s your computer telling you it doesn’t have enough RAM to store all the numbers you need. This is a common challenge when working with “big data” in many ML applications: there are so many data points that we cannot load them all into memory at one time — there are some clever tricks to deal with that case that we’ll discuss later on in the course.

Note that the cost to store a vector scales linearly with its length, and the cost to store an \(m\)-by-\(n\) matrix scales with \(mn\), or if the matrix is square with \(m=n\), then quadratically in \(n\).

4.2 Computational cost of solving linear systems

We now assess the computational cost of two common linear algebra algorithms: (1) solving linear systems via LU decomposition, and (2) solving least squares problems via QR decomposition. We will start with a brief review of these decompositions¹ before diving into the algorithms that are used to compute them and their computational cost.

4.2.1 LU and QR decomposition: linear algebra review

The LU decomposition is applicable to square matrices \(X\). The LU decomposition (or “factorization”) writes a square matrix \(X\) as a product of a lower triangular matrix \(L\) with an upper triangular matrix \(U\), that is, \(X=LU\). Sometimes we have to first re-order the rows of \(X\) in order to be able to give it an LU factorization; we do this with a permutation matrix \(P\), which is a square matrix made up of permuted rows of the identity matrix \(I\). In this case, the result is called an “LU factorization with partial pivoting”, and is written as \(PX = LU\). Such a factorization (with pivoting) always exists for any square matrix (note that if no row permutation is required then we just have \(P=I\)).

Note

Do you remember what it means for a matrix to be upper/lower triangular?

The QR decomposition is applicable to square or tall rectangular matrices, \(X\in\mathbb{R}^{N\times n}\), where \(N\geq n\), and allows the matrix \(X\) to be written as a product of an orthogonal (unitary) matrix \(Q\) with an upper triangular matrix \(R\). There are two versions of the QR decomposition, the full QR decomposition and the reduced QR decomposition:

in the full QR decomposition, \(X=QR\) where \(Q\in\mathbb{R}^{N\times N}\) is square with orthogonal columns so that \(Q^\top Q = I_N\) (the \(N\)-by-\(N\) identity matrix), and \(R\in\mathbb{R}^{N\times n}\) is an upper-triangular rectangular matrix. Note that in this case the columns of \(Q\) are an orthonormal basis for \(\mathbb{R}^N\).
in the reduced QR decomposition, \(X=QR\) where \(Q\in\mathbb{R}^{N\times n}\) is rectangular with orthogonal columns so that \(Q^\top Q = I_n\) (the \(n\)-by-\(n\) identity), and \(R\in\mathbb{R}^{n\times n}\) is a square upper triangular matrix. In this case, the columns of \(Q\) are an orthonormal basis for \(\textsf{Ran}(X)\) (the range/columnspace of \(X\)).

Unfortunately, people often say “QR decomposition” without explicitly specifying whether they mean the full or reduced version, and you have to consider the context carefully to see what they mean. But we will focus on the reduced QR decomposition because it’s most relevant for least squares problems.

Similar to the LU decomposition, it will sometimes be helpful to first permute the columns of \(X\) before doing a QR decomposition, leading to the expression \(XP=QR\). This is referred to as “QR factorization with partial pivoting”, or sometimes “rank-revealing QR” (for reasons that we’ll get into in a later chapter).

4.2.2 LU and QR decomposition: algorithms

We now turn our attention to algorithms that compute the LU and QR decompositions. To simplify our exposition, we are going to assume that no pivoting is required for either decomposition.

4.2.2.1 LU decomposition

Here is some pseudocode for the LU decomposition without pivoting:

LU decomposition of \(X\in\mathbb{R}^{n\times n}\) without pivoting: pseudocode

Initialize \(L = I\) and \(U = X\).
for \(j = 1\) to \(n\):
- for \(i = 1\) to \(j\):
  - \(s_1 = 0\)
  - for \(k = 1\) to \(i\):
    - \(s_1 = s_1 + u_{kj} \ell_{ik}\)
  - \(u_{ij} = x_{ij} - s_1\)
- for \(i = (j + 1)\) to \(n\):
  - \(s_2 = 0\)
  - for \(k = 1\) to \(j - 1\):
    - \(s_2 = s_2 + u_{kj} \ell_{ik}\)
  - \(\ell_{ij} = \frac{x_{ij} - s_2}{u_{jj}}\)
return \(L\), \(U\)

Time complexity analysis

Let’s understand the time complexity of LU decomposition by counting the number of FLOPs involved. Step 1 initializing two square matrices \(L,U\in\mathbb{R}^{n\times n}\). This involves initializing two array with \(n^2\) values and therefore is treated as having complexity \(2n^2 = \mathcal{O}(n^2)\), although no FLOPs are technically involved.

Step 2 is where things get hairy: note that there is an outer loop (for \(j=1,\ldots,n\)) around two inner loops, each of which has an inner-most loop. We have to count the operations very carefully here using arithmetic series.

Consider the first inner loop (for \(i=1,\ldots,j\)): for each \(i\), there is one initialization operation (let’s count this as 1 FLOP) for \(s_1\), then \(i\) addition operations and \(i\) multiplication operations due to the inner-most for loop, and finally 1 more subtraction after the inner-most for loop, for a total of \(2i+2\) operations. Thus, the total number of operations in the first inner loop is \(\sum_{i=1}^j (2i+2) = j(j+3)\). Finally we sum over the outer loop to get \(\sum_{j=1}^n j(j+3) = \frac13 n(n+1)(n+5) = \frac13n^3 + 2n^2 + \frac53 n\).

We do a similar calculation for the second inner loop (for \(i = (j+1),\ldots,n\)): for each \(i\), there is one initialization (1 FLOP) of \(s_2\), then there are \((j-1)\) additions and \((j-1)\) multiplications, followed by 1 subtraction and 1 division, for a total of \(2(j-1)+3=2j+1\) FLOPs. Summing over this inner loop then gives us \(\sum_{i=(j+1)}^n (2j+1) = (2j+1)(n-j)\) FLOPs, and then we sum over the outer loop to get \(\sum_{j=1}^n(2j+1)(n-j) = \frac16 n(2n^2 +3n-5) = \frac13n^3 + \frac12 n^2 - \frac56 n\) FLOPs.

Adding the total FLOPs over both loops gives us \(\frac23 n^3 + \frac52 n^2 + 5/6 n\) FLOPs, where the leading term is the cubic term so we just say that LU decomposition is \(\mathcal{O}(n^3)\) or “cubic” or “third-order” in cost. Again, we emphasize that this is large-\(n\) property – at small \(n\) the other terms of the polynomial expression may have more of an influence, but as \(n\to\infty\) you will see the leading term dominate the cost.

Memory complexity analysis

Memory complexity is a little bit simpler — we just want to look at what quantities need to be stored in RAM to do our computation: for this we can just look at the variables that appear in the algorithm, which are \(L, U, X\in\mathbb{R}^{n\times n}\) along with \(s_1,s_2\in\mathbb{R}\). The latter are just two scalars, so the cost is dominated by the cost of storing the three matrices \(L,U,X\), and thus the memory complexity is \(\mathcal{O}(n^2)\).

4.2.2.2 QR decomposition

Here’s some pseudocode for the QR decomposition using a method called the “modified Gram-Schmidt” procedure:

QR decomposition of \(X\in\mathbb{R}^{N\times n}\) without pivoting: pseudocode

Initialize \(V = X\), \(R=0_{n\times n}\), \(Q = 0_{N\times n}\).
for \(i = 1,\ldots, n\):
- \(r_{ii} = \|v_{:,i}\|\)
- \(q_{:,i} = v_{:,i}/r_{ii}\)
- for \(j=i+1,\ldots,n\)
  - \(r_{ij} = q_{:,i}^\top v_{:,j}\)
  - \(v_{:,j} = v_{:,j} - r_{ij}q_{:,i}\)

where \(v_{:,i}\) and \(q_{:,i}\) denote the \(i\)th column vectors of \(V\) and \(Q\), respectively.

The QR decomposition has time complexity \(\mathcal{O}(N\cdot n^2)\) and storage complexity \(\mathcal{O}(Nn)\). For square matrices where \(N=n\), the complexity is \(\mathcal{O}(N^3)\). In general, computing matrix decompositions will have third-order complexity: we will see another example when we get to the SVD in the next chapter.

4.3 Solving linear systems using matrix decompositions

Finally we consider the overall cost of solving linear systems of equations using matrix decompositions. In general, the best practice when you are coding something is to use functions that are part of standard linear algebra packages to solve these systems. In python, both the numpy and scipy libraries have linalg packages that contain various useful linear algebra functions. Both are based on the LAPACK library, which efficiently implements many linear algebra computations in Fortran. Popular deep learning packages like PyTorch and TensorFlow have their own linear algebra packages that build on other libraries that leverage GPU computing capabilities. The details in this section explain what happens under the hood in these linear algebra packages: LU decomposition is used to solve invertible linear systems and QR decomposition is used to solve least squares problems where the data matrix has full column rank.

Warning

In this section, we are assuming that \(X\) has full column rank (which guarantees invertibility if \(X\) is square), whereas previously we did not make this assumption.

4.3.1 Invertible linear systems

If \(X\) is invertible (and therefore square), the python functions we use to solve the system \(X\beta=Y\) are numpy.linalg.solve and scipy.linalg.solve. Both of these functions use LU decomposition to solve the system. This means they first compute the factors \(L\) and \(U\) and described above. Then, once the LU factorization of \(X\) is found, then we can solve for \(\beta\) in two steps:

Solve \(L\alpha = Y\) for \(\alpha\).
Solve \(U\beta = \alpha\) for \(\beta\).

This may look like we’ve just overcomplicated things by turning our one problem of solving \(X\beta = Y\) into two problems, but the triangular structure of \(L\) and \(U\) lets us solve these two new systems relatively efficiently. To see this, consider solving \(L\alpha = Y\) for a three-dimensional example:

\[ L = \begin{pmatrix}1 & & \\ 2 & 1 & \\ 2 & 1 & 2 \end{pmatrix}, \qquad Y = \begin{pmatrix}1 \\ 2 \\ 0\end{pmatrix}. \]

Recall that this represents a system of three linear equations:

\[ \begin{aligned} 1 \alpha_1 & & = 1\\ 2 \alpha_1 &+ 1 \alpha_2 &= 2 \\ 2 \alpha_1 &+ 1 \alpha_2 + 2 \alpha_3 & = 0 \end{aligned} \]

We can immediately read off the solution of the first equation, \(\alpha_1 = 1\) because there is only one unknown variable in it. From there we substitute that into the second equation to get \(2 + 1\alpha_2 = 2\), so we again one unknown variable, which we can easily solve to get \(\alpha_2 = 0\). From there we now substitute our values for \(\alpha_1\) and \(\alpha_2\) into the third equation, yielding \(2 + 2\alpha_3 = 0\), which again has one unknown variable, so we can easily solve to get \(\alpha_3 = -1\). This yields our overall solution \(\alpha = (1, 0, -1)^\top\).

The process we have just described is called forward substitution and is a standard algorithm for solving lower triangular systems. It generalizes easily to triangular systems of arbitrary dimension \(n\times n\) — we just keep substituting in the results of previous rows into the next row, as follows:

Forward substitution: pseudocode

\(\alpha_1 = Y_1 / L_{11}\).
for \(i\) = 2 to \(n\):
- \(s = 0\)
- for \(j\) = 1 to \((i - 1)\):
  - \(s = s + L_{ij}\alpha_j\)
- \(\alpha_i = (Y_i - s) / L_{ii}\)
return \(\alpha\)

Forward substitution has complexity \(\mathcal{O}(n^2)\).

Now that we have solved \(L\alpha = Y\) for \(\alpha\) using forward substitution, we need to solve \(U\beta = \alpha\) for \(\beta\), where \(U\) is upper triangular rather than lower triangular. This is done using backward substitution, which is a very similar algorithm to forward substitution and also has complexity \(\mathcal{O}(n^2)\). That gives us our final solution \(\beta\).

The overall computational cost of solving an invertible linear system is dominated by the cubic cost of the LU decomposition (for large \(n\), the quadratic costs of forward and backward substitution are much cheaper than the cubic cost of LU). Thus, we say linear solves have cubic complexity.

4.3.2 Least squares problems

If \(X\) is non-invertible, the python functions we use to solve \(X\beta = Y\) are numpy.linalg.lstsq and scipy.linalg.lstsq. These functions can use a number of different algorithms to solve the least squares problem, but a common choice for overdetermined least squares problems is to use QR decomposition. That is, we first compute the factors \(Q\in\mathbb{R}^{N\times n}\) and \(R\in\mathbb{R}^{n\times n}\) as described above. Then, we can solve for \(\beta\) is two steps:

Solve \(Q\alpha = Y\) for \(\alpha\).
Solve \(R\beta = \alpha\) for \(\beta\).

Recall that orthogonal matrices \(Q\) satisfy \(Q^\top Q=I\), so this special structure lets us find \(\alpha\) via a matrix multiplication, which is computationally more efficient than a full LU decomposition. Then, since \(R\) is upper triangular, we can solve for \(\beta\) using backward substitution.

If \(X\) is does not have full column rank, there are a couple of options for solving the least squares problem, but the standard one is based on the the singular value decomposition, which deserves its own set of notes, which come next.

4.4 Exercises

Show that the QR decomposition has time complexity \(\mathcal{O}(N n^2)\).
Provide pseudocode for a backward substitution algorithm analogous to the forward substitution algorithm provided (without looking it up, which you could easily do — but if you can derive it yourself then you know you understand it).
Show that forward (or backward) substitution has complexity \(\mathcal{O}(n^2)\).
Show that the overall cost of solving \(X\beta=Y\) via QR decomposition is \(\mathcal{O}(Nn^2)\).
Determine how much RAM you have on your personal computer. How many double precision floats can you store in that space? What does this mean for the largest square linear system you can solve using your computer? What are the dimensions of the largest least squares problem you can solve using the QR decomposition on your computer?
Write code to set up and solve least squares problems of arbitrary dimension and time the solve. Plot the solve times as the dimensions increase. Do you observe the expected trends? At what point does your computer run out of memory and is it approximately what you expected?

if you want more of a refresher, revisit your favorite linear algebra textbook or notes. My favorite is (Strang 2022). There’s also no substitute for your own notes that reflect how you think about things – if you have your own notes, go back and look at them now!↩︎