Appendix A — Linear algebra review

Before we dive into the world of machine learning and optimization, it’s important that we get comfortable with the fundamental building blocks of mathematics in higher dimensions: Linear Algebra. This course assumes you have seen basic concepts in Linear Algebra, but we’ll provide a refresher to get everyone on the same page.

A.1 Scalars, Vectors, and Matrices

A.1.1 Notation

Let’s introduce some notation: \(\mathbb{R}\) is the set of real numbers, \(\mathbb{C}\) is the set of complex numbers, and the operator \(\in\) designates that a variable belongs to a set. To define a real scalar (i.e. just a number), \(c\), we write:

\[ c \in \mathbb{R}\]

You can read this as “the variable \(c\) belongs to the set of 1-dimensional real numbers.”

To define a real \(n\) dimensional vector, \(\mathbf{x}\), we write:

\[ \mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^n \]

To define a \(m \times n\) matrix \(\mathbf{A}\) of real-valued entries, we write:

\[ \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & & a_{2n} \\ \vdots & & \ddots & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n } \]

A.1.2 The Transpose

The “transpose” of a matrix or vector, swaps the rows and columns. In other words, the first row becomes the first column, the second row the second column and so-on:

\[ \mathbf{x}^\top = \begin{bmatrix} x_1 & \dots & x_n \end{bmatrix} \in \mathbb{R}^{1 \times n} \]

\[ \mathbf{A}^\top = \begin{bmatrix} a_{11} & a_{21} & \dots & a_{m1} \\ a_{12} & a_{22} & & a_{m2} \\ \vdots & & \ddots & \vdots \\ a_{1n} & a_{2n} & \dots & a_{mn} \end{bmatrix} \in \mathbb{R}^{n \times m } \]

Transposing the product of matrices reverses their order:

\[ \left( \mathbf{A} \mathbf{B} \mathbf{C} \mathbf{D} \right)^\top = \mathbf{D}^\top \mathbf{C}^\top \mathbf{B}^\top \mathbf{A}^\top \]

A.1.3 Addition & subtraction

Technically, only objects of the same dimensionality can be added or subtracted with one another. So, scalars can only be added to scalars, vectors can only be added to vectors of the same dimension, and matrices can only be added to other matrices with the same numbers of rows and columns. Some software like MATLAB or NumPy might let you add scalars to vectors and matrices and under the hood they multiply that scalar by a vector/matrix of ones to make the dimensions match.

Mathematically, none of the following are valid operations. However, numpy allows us to do them by secretly adding in a vector of ones so that the vector/matrix dimensions are valid. This can be confusing for new programmers.

Adding a scalar to a vector

import numpy as np 
# Defining a vector 
u = np.array([
  1, 2, 3
])

# Adding a scalar to a vector
print(u + 5)
# What numpy does to make the dimensions match
print(u + 5*np.ones_like(u)) 
[6 7 8]
[6 7 8]

Adding a row-vector to a column-vector

# Reshaping u to a column-vector 
u = u.reshape((3,1))

# Defining a row-vector, v 
v = np.array([
  4, 5, 6
]).reshape((1,3)) 

# Adding the row-vector v to a column-vector u
print(v + u)
# What numpy does to make the dimensions match
print(np.outer(u, np.ones(3)) + np.outer(np.ones(3), v)) 
[[5 6 7]
 [6 7 8]
 [7 8 9]]
[[5. 6. 7.]
 [6. 7. 8.]
 [7. 8. 9.]]

A.2 Scalar Multiplication

Generally, scalars can multiply by any structure (scalar, vector, and matrix) and do not change its dimension. They only “scale” its value by a certain amount. Hence, multiplying a scalar by a scalar returns a scalar, multiplying a scalar times a vector returns a vector, and multiplying a scalar by a matrix returns a matrix. Scalar multiplication is also commutative, i.e. \(c\mathbf{A} = \mathbf{A}c\) for all scalars \(c\). Consider our scalar, \(c\), from the previous section, another scalar, \(b\), the vector \(\mathbf{x} \in \mathbb{R}^n\) and the matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\)

A.2.1 Scalar-Scalar Product

\[ b \times c = bc \in \mathbb{R} \]

A.2.2 Scalar-Vector Product

\[ c \mathbf{x} = \mathbf{x} c = \begin{bmatrix} c x_1 \\ \vdots \\ c x_n \end{bmatrix} \in \mathbb{R}^n \]

A.2.3 Scalar-Matrix Product

\[ c \mathbf{A} = \mathbf{A} c = \begin{bmatrix} c a_{11} & c a_{12} & \dots & c a_{1n} \\ c a_{21} & c a_{22} & & c a_{2n} \\ \vdots & & \ddots & \vdots \\ c a_{m1} & c a_{m2} & \dots & c a_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n } \]

A.3 Matrix & Vector Multiplication

A.3.1 Vector-Vector Multiplication

To multiply two matrices (or two vectors, or a matrix with a vector), their inner dimensions must always match. Hence, the only valid way to multiply two vectors is by “inner” or “outer” products. Consider two vectors, \(\mathbf{u} \in \mathbb{R}^n\) and \(x \in \mathbb{R}^n\). An “Inner Product” (sometimes called the Dot-Product) is defined as:

\[ \mathbf{u}^\top \mathbf{x} = \begin{bmatrix} u_1 & \dots & u_n \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = u_1 x_1 + u_2 x_2 + \dots u_n x_n \in \mathbb{R}\]

Some important notes about inner products:

  • If the dimensions of \(\mathbf{u}\) and \(\mathbf{x}\) do not match, we cannot multiply them this way as we need to multiply each entry of both vectors.
  • This operation returns a scalar value.
  • Inner Products are commutative, i.e. \(\mathbf{u}^\top \mathbf{x} = \mathbf{x}^\top \mathbf{u}\), as they are sums of scalar-scalar multiplications which are also commutative.

Recall \(\mathbf{x} \in \mathbb{R}^n\). Now consider a new vector \(\mathbf{w} \in \mathbb{R}^m\) where \(m \neq n\). An “Outer Product” is defined as:

\[ \mathbf{w} \mathbf{x}^\top = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix} \mathbf{x}^\top = \begin{bmatrix} w_1 \mathbf{x}^\top \\ \vdots \\ w_m \mathbf{x}^\top \end{bmatrix} = \begin{bmatrix} w_1 x_1 & w_1 x_2 & \dots & w_1 x_n \\ w_2 x_1 & w_2 x_2 & & w_2 x_n \\ \vdots & & \ddots & \vdots \\ w_m x_1 & w_m x_2 & \dots & w_m x_n \end{bmatrix} \in \mathbb{R}^{m \times n} \]

Important notes on outer products:

  • Outer products return matrices with dimensions according to the vectors multiplied.
  • As long as \(\mathbf{w}\) and \(\mathbf{x}\) are vectors, outer products share an inner dimension of \(1\), which means that any two vectors have an outer product.
  • Lastly, this operation is non-commutative, meaning \(\mathbf{w} \mathbf{x}^\top \neq \mathbf{x} \mathbf{w}^\top\). Rather, these two are transposes of each other.

A.3.2 Matrix-Vector Multiplication

Multiplying a matrix with a vector means the dimension of the vector must equal the number of columns of the matrix. Recall \(\mathbf{A} \in \mathbb{R}^{m \times n}\) and \(\mathbf{x} \in \mathbb{R}^n\). Note that the columns of \(\mathbf{A}\) match the dimension of \(\mathbf{x}\), which are both size \(n\). The product \(\mathbf{A}\mathbf{x}\) can be written as:

\[ \mathbf{A}\mathbf{x} = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & & a_{2n} \\ \vdots & & \ddots & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} a_{11} x_1 + a_{12} x_2 + \dots + a_{1n} x_n \\ a_{21} x_1 + a_{22} x_2 + \dots + a_{2n} x_n \\ \vdots \\a_{m1} x_1 + a_{m2} x_2 + \dots + a_{mn} x_n \end{bmatrix} \in \mathbb{R}^m \]

Perhaps a more useful way to visualize this is to break \(\mathbf{A}\) down into its constituent columns and show this as a sum of scalar-vector multiplications:

\[ \mathbf{A}\mathbf{x} = \begin{bmatrix} | & | & & | \\ \mathbf{a}_1 & \mathbf{a}_2 & \dots & \mathbf{a}_n \\ | & | & & | \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} | \\ \mathbf{a}_1 \\ | \end{bmatrix} x_1 + \begin{bmatrix} | \\ \mathbf{a}_2 \\ | \end{bmatrix} x_2 + \dots + \begin{bmatrix} | \\ \mathbf{a}_n \\ | \end{bmatrix} x_n \in \mathbb{R}^m \]

where \(\mathbf{a}_i\) is the \(i\)th column of \(\mathbf{A}\). Note that matrix-vector multiplication forms a linear map from one dimensional vector-space to another. In this case, \(\mathbf{A}\mathbf{x}\) maps a vector from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). This linear transformation from one dimensionality to another is a core component of many machine learning algorithms.

A.3.3 Matrix-Matrix Multiplication

Consider two matrices, \(\mathbf{A} \in \mathbb{R}^{m \times n}\) and \(\mathbf{B} \in \mathbb{R}^{n \times k}\), where \(m \neq k\). Again, the inner dimensions must match to multiply matrices. Because \(\mathbf{A}\) multiplied by \(\mathbf{B}\) have an inner-dimension of \(n\), they can be multiplied. Similar to matrix-vector multiplication, we can visualize the product of \(\mathbf{A}\) and \(\mathbf{B}\) by breaking up the columns and rows of \(\mathbf{A}\) and \(\mathbf{B}\), respectively:

\[ \mathbf{A}\mathbf{B} = \begin{bmatrix} | & | & & | \\ \mathbf{a}_1 & \mathbf{a}_2 & \dots & \mathbf{a}_n \\ | & | & & | \end{bmatrix} \begin{bmatrix} - & \mathbf{b}_1 & - \\ - & \mathbf{b}_2 & - \\ & \vdots & \\ - & \mathbf{b}_n & - \end{bmatrix} \in \mathbb{R}^{m \times k}\]

where \(\mathbf{a}_i\) is the \(i\)th column of \(\mathbf{A}\) and \(\mathbf{b}_i\) is the \(i\)th row of \(\mathbf{B}\). We can expand this expression by multiplying each column of \(\mathbf{A}\) with each row of \(\mathbf{B}\):

\[ \mathbf{A}\mathbf{B} = \begin{bmatrix} | \\ \mathbf{a}_1 \\ | \end{bmatrix} \begin{bmatrix} - & \mathbf{b}_1 & - \end{bmatrix} +\begin{bmatrix} | \\ \mathbf{a}_2 \\ | \end{bmatrix} \begin{bmatrix} - & \mathbf{b}_2 & - \end{bmatrix} + \dots + \begin{bmatrix} | \\ \mathbf{a}_n \\ | \end{bmatrix} \begin{bmatrix} - & \mathbf{b}_n & - \end{bmatrix}\]

As we can see, this is a sum of outer products (vectors multiplied by transposed vectors). We know \(\mathbf{A}\) has \(m\) rows and \(\mathbf{B}\) has \(k\) columns, so each outer product will form a matrix of dimension \(m \times k\):

\[ \mathbf{A}\mathbf{B} = \sum_{i=1}^n \mathbf{a}_i \mathbf{b}_i \in \mathbb{R}^{m \times k}\]

Note: matrix multiplication is generally non-commutative; The product \(\mathbf{A}\mathbf{B}\) is valid, because the number of columns of \(\mathbf{A}\) matches the number of rows of \(\mathbf{B}\). However, the product \(\mathbf{B}\mathbf{A}\) is not valid, because the number of columns of \(\mathbf{B}\) does not equal the number of rows of \(\mathbf{A}\). The product \(\mathbf{B}\mathbf{A}\) can be multiplied if \(\mathbf{A}\) and \(\mathbf{B}\) are both square matrices of the same size. However, even if the dimensions do match, generally speaking, \(\mathbf{BA \neq AB}\).

A.4 Vector Norms

Oftentimes, it is useful to know how “big” a vector is. And there are many different ways of computing this. The “norm” of an object is a measure of how “large” it is, according to some rule. Consider \(\mathbf{x} \in \mathbb{R}^n\). We’ll start with the most common norm:

A.4.1 The 2-Norm

The 2-norm of \(x\) (which we will use most-often in this class) is defined as the squareroot of the squares of its entries:

\[ ||\mathbf{x}||_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2} \]

The reason the 2-norm is so useful is the fact that the square of the \(l_2\) norm is an inner product of a vector with itself:

\[ \mathbf{x}^\top \mathbf{x} = x_1^2 + x_2^2 + \dots + x_3^2 = ||\mathbf{x}||_2^2 \]

This helps us write a norm in terms of matrix-vector operations, which is extremely useful for optimization. The 2-norm is also quite important because it is the first \(l_p\) norm with a continuous derivative, which cannot be said of the 1-norm since it uses absolute-value functions.

A.4.2 The \(l_p\)-Norm

We can take the definition of the 2-Norm and generalize it to describe a class of norms known as \(l_p\) norms. The \(l_p\) norm of a vector is defined by the following:

\[ ||\mathbf{x}||_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p} \]

A.4.3 The 1-Norm

The 1-norm of \(\mathbf{x}\) is simply the sum of the absolute-value of its entries:

\[ ||\mathbf{x}||_1 = |x_1| + |x_2| + \dots + |x_n| \]

A.4.4 The Infinity-Norm

The infinity-norm takes the limit as \(p \rightarrow \infty\) and when this limit is solved, this norm simply becomes the entry of \(x\) with the largest absolute value:

\[ ||\mathbf{x}||_\infty = \max \{ |x_1|, |x_2|, \dots, |x_n| \} \]

A.5 Rank, Range/Column-Space and Null-Spaces

A.5.1 Linear Combinations and Span

Many of the following ideas hinge upon the idea of “Linear Combinations” of vectors. Consider two vectors \(\mathbf{u}, \mathbf{v}\in \mathbb{R}^{d}\). A linear combination of \(\mathbf{u}\) and \(\mathbf{v}\) is the vector produced by the following:

\[ \mathbf{w}= c_1 \mathbf{u}+ c_2 \mathbf{v}\]

where \(c_1\) and \(c_2\) are scalars. We can generalize this idea by saying that all possible realizations of \(\mathbf{w}\) live in the vector-space \(\mathcal{W}\) defined by the “span” by \(\mathbf{u}\) and \(\mathbf{v}\). This is denoted:

\[ \mathbf{w}\in \mathcal{W} = \text{Span}\left( \mathbf{u}, \mathbf{v}\right) \]

We can read this in plain english as “the vector \(\mathbf{w}\) belongs to the vector-space \(\mathcal{W}\) which is defined by all possible linear combinations of \(\mathbf{u}\) and \(\mathbf{v}\)”.

A.5.2 Linear-Independence

This is a fundamental idea in Linear Algebra. A set of vectors is said to be “Linearly Independent” if no vector in this set can be reconstructed with any of the others. In other words, no one vector can be reproduced by scaling or combining any of the others. For example, consider the following three vectors:

\[ \{ \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\} \]

Because each vector in this set points in its own unique direction, it is impossible to reconstruct one of these vectors as a linear combination of any others. This may be true with more complex vectors that aren’t unit vectors along the dimensions of the vector-space.

A.5.3 Rank

Technically, there are two types of rank: Column Rank and Row-Rank. In this class, when we say “rank”, we will be referring to the column rank.

The column rank of a matrix is simply its number of linearly independent column-vectors (the row-rank is the number of linearly independent row-vectors). Consider the following matrix:

\[ \mathbf{A} = \begin{bmatrix} 1 & 2 & 1 & 4 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 2 & 0 \end{bmatrix} \]

The first-three columns of \(\mathbf{A}\) are linearly independent because they cannot be reproduced using linear combinations of the other columns. However, the last column is a multiple of column 2, which means it can be reproduced using a linear combination of columns. Hence, this matrix has a rank of 3.

Theorem: No matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) where \(m < n\) can be full-rank. This is because only \(m\) vectors can span \(\mathbb{R}^m\), and since there are more than \(m\) columns, we must have some redundancy.

A.5.4 Range/Column-Space

The Range, Column-Space or Image of a matrix, \(\mathbf{A}\), is the span of all possible combinations of its columns. In human words, it describes the space of \(\mathbf{Ax}\) for all possible \(\mathbf{x}\). Consider the following matrix:

\[ \mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 6 \\ 2 & 4 \end{bmatrix} \]

Note that the second column is a multiple of the first column, hence, the range of \(\mathbf{A}\) is:

\[ \text{Range}(\mathbf{A}) = \text{span} \left( \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}\right) \]

A.5.5 Null-Space

The Null-Space of a matrix, \(\mathbf{A}\) is the set of all nonzero vectors, \(\mathbf{x}\) such that \(\mathbf{Ax=0}\). Consider a simple example:

\[ \mathbf{A} = \begin{bmatrix} 1 & 0 & -1\\ 0 & 1 & 1 \end{bmatrix} \]

We can find the set of \(\mathbf{x}\) that produce the zero-vector by solving the following equation:

\[ \begin{bmatrix} 1 & 0 & -1\\ 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]

This produces the systems of equations:

\[ x_1 - x_3 = 0 \] \[ x_2 + x_3 = 0 \] \[ x_3 = x_3 \]

If we isolate each entry of \(\mathbf{x}\):

\[ x_1 = x_3 \] \[ x_2 = -x_3 \] \[ x_3 = x_3 \]

Hence, we see that \(x_3\) can be any scalar and satisfy this systems of equations:

\[ \mathbf{x} = \begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix} x_3 = \text{span}\left( \begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix} \right) = \text{Null}(\mathbf{A})\]

A.5.6 The Identity Matrix

The identity matrix is a square, \(n\)-dimensional matrix consisting of ones on its diagonal and zeros elsewhere. It is called the identity matrix because it preserves the identity when multiplied by matrices and vectors.

\[ \mathbf{I} = \begin{bmatrix} 1 & 0 & \dots & 0 \\ 0 & 1 & & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & 1 \end{bmatrix} \]

Consider \(\mathbf{I} \in \mathbb{R}^{n \times n}\) to be an \(n \times n\) identity matrix. If we recall \(\mathbf{A} \in \mathbb{m \times n}\) and \(\mathbf{x} \in \mathbb{R}^n\):

\[ \mathbf{A I = I A = A} \]

\[ \mathbf{I x = x } \]

\[ \mathbf{x^\top I = x^\top} \]

Why is the identity matrix useful?

The identity matrix simplifies expressions involving matrices. If we can transform a term into an identity matrix, then we can reduce the expression to only the remaining terms.

A.5.7 Matrix Inverses

You may have noticed that, thus far, we have covered the addition, subtraction, and multiplication of matrices, which conspicuously leaves division. Well, unfortunately, matrix division is complex to say the least. For square matrices, division is accomplished via matrix inversion. However, not all matrices are invertible.

Definition: A matrix \(\mathbf{B}\) is called the “inverse” of a square matrix \(\mathbf{A}\) if and only if the following condition holds:

\[ \mathbf{A} \mathbf{B} = \mathbf{B} \mathbf{A} = \mathbf{I} \]

where \(\mathbf{I}\) is the identity matrix. Such a \(\mathbf{B}\) is denoted \(\mathbf{A}^{-1}\).

A.6 Further Reading