# Math

For a book on mathematical derivations, this text assumes knowledge of relatively few mathematical methods. Most of the mathematical background required is summarized in the three following sections on calculus, matrices, and matrix calculus.

## Calculus

The most important mathematical prerequisite for this book is calculus. Almost all of the methods covered involve minimizing a loss function or maximizing a likelihood function, done by taking the function’s derivative with respect to one or more parameters and setting it equal to 0.
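
As a minimal sketch of this recipe, consider estimating a single number \(m\) by minimizing a sum of squared errors. The data `xs` and the names `loss`, `loss_derivative`, and `m_hat` are hypothetical and chosen only for illustration; setting the derivative to 0 and solving for \(m\) gives the sample mean.

```python
# Hypothetical data, chosen only for illustration.
xs = [2.0, 3.0, 7.0]

def loss(m):
    """Sum of squared errors between each observation and the estimate m."""
    return sum((x - m) ** 2 for x in xs)

def loss_derivative(m):
    """dL/dm = -2 * sum(x_i - m), from the power and chain rules."""
    return -2 * sum(x - m for x in xs)

# Setting dL/dm = 0 and solving for m gives the sample mean.
m_hat = sum(xs) / len(xs)

print(m_hat, loss_derivative(m_hat))
```

The derivative vanishes at `m_hat`, and nudging the estimate in either direction increases the loss, which is what we expect at a minimum.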

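Let’s start by reviewing some of the most common derivatives used in this book:

\[
\begin{aligned}
f(x) = x^a &\rightarrow f'(x) = a x^{a-1} \\
f(x) = \exp(x) &\rightarrow f'(x) = \exp(x) \\
f(x) = \log(x) &\rightarrow f'(x) = \frac{1}{x}
\end{aligned}
\]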

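We will also often use the sum, product, and quotient rules:

\[
\begin{aligned}
f(x) = g(x) + h(x) &\rightarrow f'(x) = g'(x) + h'(x) \\
f(x) = g(x) h(x) &\rightarrow f'(x) = g'(x) h(x) + g(x) h'(x) \\
f(x) = \frac{g(x)}{h(x)} &\rightarrow f'(x) = \frac{g'(x) h(x) - g(x) h'(x)}{h(x)^2}
\end{aligned}
\]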
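
The product and quotient rules can be sanity-checked numerically. The sketch below assumes \(g(x) = \sin(x)\) and \(h(x) = \exp(x)\) purely as example functions, and compares each rule against a central finite-difference approximation; `numeric_derivative` is a hypothetical helper, not part of any library.

```python
import math

def numeric_derivative(func, x, eps=1e-6):
    """Central finite-difference approximation of func'(x)."""
    return (func(x + eps) - func(x - eps)) / (2 * eps)

# Example functions (an assumption for illustration) and their derivatives.
g, g_prime = math.sin, math.cos
h, h_prime = math.exp, math.exp

x = 0.7

# Derivatives predicted by the product and quotient rules.
product_rule = g_prime(x) * h(x) + g(x) * h_prime(x)
quotient_rule = (g_prime(x) * h(x) - g(x) * h_prime(x)) / h(x) ** 2

# Numerical derivatives of g*h and g/h for comparison.
product_numeric = numeric_derivative(lambda t: g(t) * h(t), x)
quotient_numeric = numeric_derivative(lambda t: g(t) / h(t), x)

print(abs(product_rule - product_numeric), abs(quotient_rule - quotient_numeric))
```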

Finally, we will rely heavily on the chain rule:

\[
f(x) = g(h(x)) \rightarrow f'(x) = g'(h(x)) \, h'(x)
\]

As an example of the chain rule, suppose \(f(x) = \log(x^2)\). Let \(h(x) = x^2\), meaning \(f(x) = \log(h(x))\). Then

\[
f'(x) = \frac{1}{h(x)} \cdot h'(x) = \frac{1}{x^2} \cdot 2x = \frac{2}{x}.
\]
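
For \(f(x) = \log(x^2)\), the chain rule gives \(f'(x) = \frac{1}{x^2} \cdot 2x = \frac{2}{x}\). A short sketch, assuming the arbitrary test point \(x = 3\), compares this against a finite-difference estimate:

```python
import math

def f(x):
    """f(x) = log(h(x)) with inner function h(x) = x**2."""
    return math.log(x ** 2)

def f_prime(x):
    """Chain rule: (1 / h(x)) * h'(x) = (1 / x**2) * 2x = 2 / x."""
    return (1 / x ** 2) * (2 * x)

x = 3.0    # arbitrary test point
eps = 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)

print(f_prime(x), numeric)
```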

## Matrices

While little linear algebra is used in this book, matrix and vector representations of data are very common. The most important matrix and vector operations are reviewed below.

Let \(\mathbf{u}\) and \(\mathbf{v}\) be two column vectors of length \(D\). The **dot product** of \(\mathbf{u}\) and \(\mathbf{v}\) is a scalar value given by

\[
\mathbf{u} \cdot \mathbf{v} = \mathbf{u}^\top \mathbf{v} = \sum_{d=1}^D u_d v_d.
\]

If \(\bv\) is a vector of features (with a leading 1 for the intercept term) and \(\bu\) is a vector of weights, this dot product is also referred to as a *linear combination* of the predictors in \(\bv\).
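
A small NumPy sketch of the dot product as a linear combination follows. The feature and weight values are made up for illustration; the first entry of `v` is the leading 1 that multiplies the intercept weight.

```python
import numpy as np

# Hypothetical values: v holds a leading 1 plus two predictors,
# u holds the intercept weight plus one weight per predictor.
v = np.array([1.0, 2.0, -3.0])
u = np.array([0.5, 2.0, 1.0])

# Explicit sum of elementwise products u_d * v_d.
dot_explicit = sum(u_d * v_d for u_d, v_d in zip(u, v))

# The same quantity using NumPy's matrix-multiplication operator.
dot_numpy = u @ v

print(dot_explicit, dot_numpy)
```

Both expressions compute \(0.5 \cdot 1 + 2 \cdot 2 + 1 \cdot (-3) = 1.5\).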

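The **L1 norm** and **L2 norm** measure a vector’s magnitude. For a vector \(\bu\) of length \(D\), these are given respectively by

\[
\|\bu\|_1 = \sum_{d=1}^D |u_d|, \qquad \|\bu\|_2 = \sqrt{\sum_{d=1}^D u_d^2}.
\]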
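
As a quick illustration, the sketch below computes both norms for an arbitrary example vector, once directly from the definitions (sum of absolute values, and square root of the sum of squares) and once with `np.linalg.norm`:

```python
import numpy as np

u = np.array([3.0, -4.0])  # arbitrary example vector

# L1 norm: sum of absolute values.
l1_manual = np.sum(np.abs(u))
# L2 norm: square root of the sum of squares.
l2_manual = np.sqrt(np.sum(u ** 2))

# The same quantities via NumPy's built-in norm function.
l1_numpy = np.linalg.norm(u, ord=1)
l2_numpy = np.linalg.norm(u, ord=2)

print(l1_manual, l2_manual)
```

For this vector the L1 norm is \(3 + 4 = 7\) and the L2 norm is \(\sqrt{9 + 16} = 5\).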

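Let \(\mathbf{A}\) be an \((N \times D)\) matrix defined as

\[
\mathbf{A} = \begin{pmatrix}
A_{11} & A_{12} & \dots & A_{1D} \\
A_{21} & A_{22} & \dots & A_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
A_{N1} & A_{N2} & \dots & A_{ND}
\end{pmatrix}.
\]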

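The transpose of \(\mathbf{A}\) is a \((D \times N)\) matrix whose rows are the columns of \(\mathbf{A}\), given by

\[
\mathbf{A}^\top = \begin{pmatrix}
A_{11} & A_{21} & \dots & A_{N1} \\
A_{12} & A_{22} & \dots & A_{N2} \\
\vdots & \vdots & \ddots & \vdots \\
A_{1D} & A_{2D} & \dots & A_{ND}
\end{pmatrix}.
\]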
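
In NumPy, the transpose is available as the `.T` attribute. The sketch below uses an arbitrary \((3 \times 2)\) matrix to show that transposing swaps the two dimensions, so entry \((d, n)\) of the transpose equals entry \((n, d)\) of the original.

```python
import numpy as np

# Arbitrary example with N = 3 rows and D = 2 columns.
A = np.array([[1, 2],
              [3, 4],
              [5, 6]])

A_T = A.T  # transpose: D = 2 rows and N = 3 columns

print(A.shape, A_T.shape)
print(A_T)
```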

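If \(\mathbf{A}\) is a square \((N \times N)\) matrix, its inverse (when it exists), written \(\mathbf{A}^{-1}\), is the matrix such that

\[
\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}_N,
\]

where \(\mathbf{I}_N\) is the \((N \times N)\) identity matrix.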
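
The defining property of the inverse can be checked numerically with `np.linalg.inv`. The matrix below is an arbitrary invertible example; multiplying it by its inverse on either side should recover the identity matrix up to floating-point error.

```python
import numpy as np

# Arbitrary invertible (2 x 2) matrix, chosen for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

A_inv = np.linalg.inv(A)

# Multiplying on either side should give the identity matrix.
left = A_inv @ A
right = A @ A_inv

print(np.allclose(left, np.eye(2)), np.allclose(right, np.eye(2)))
```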

## Matrix Calculus

Dealing with multiple parameters, multiple observations, and sometimes multiple loss functions, we will often have to take multiple derivatives at once in this book. This is done with matrix calculus.

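In this book, we will use the numerator layout convention for matrix derivatives. This is most easily shown with examples. First, let \(a\) be a scalar and \(\mathbf{u}\) be a vector of length \(I\). The derivative of \(a\) with respect to \(\bu\) is given by

\[
\frac{\partial a}{\partial \bu} = \begin{pmatrix} \frac{\partial a}{\partial u_1} & \dots & \frac{\partial a}{\partial u_I} \end{pmatrix} \in \R^{1 \times I},
\]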

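and the derivative of \(\bu\) with respect to \(a\) is given by

\[
\frac{\partial \bu}{\partial a} = \begin{pmatrix} \frac{\partial u_1}{\partial a} \\ \vdots \\ \frac{\partial u_I}{\partial a} \end{pmatrix} \in \R^{I \times 1}.
\]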

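Note that in either case, the first dimension of the derivative is determined by what’s in the numerator. Similarly, letting \(\bv\) be a vector of length \(J\), the derivative of \(\bu\) with respect to \(\bv\) is given by

\[
\frac{\partial \bu}{\partial \bv} = \begin{pmatrix}
\frac{\partial u_1}{\partial v_1} & \dots & \frac{\partial u_1}{\partial v_J} \\
\vdots & \ddots & \vdots \\
\frac{\partial u_I}{\partial v_1} & \dots & \frac{\partial u_I}{\partial v_J}
\end{pmatrix} \in \R^{I \times J}.
\]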
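
To make the shape convention concrete, the sketch below builds the derivative of a vector \(\bu \in \R^2\) with respect to a vector \(\bv \in \R^3\) by finite differences and confirms that the result has shape \((2 \times 3)\), with rows indexed by the numerator. The map `u_of_v` and the helper `numeric_jacobian` are hypothetical and exist only for this illustration.

```python
import numpy as np

def u_of_v(v):
    """Hypothetical map from R^3 to R^2, used only as an example."""
    return np.array([v[0] * v[1], v[1] + v[2] ** 2])

def numeric_jacobian(func, v, eps=1e-6):
    """Finite-difference derivative with entry (i, j) = d u_i / d v_j."""
    v = np.asarray(v, dtype=float)
    out = func(v)
    jac = np.zeros((out.size, v.size))
    for j in range(v.size):
        step = np.zeros_like(v)
        step[j] = eps
        # Column j holds the sensitivity of every u_i to v_j.
        jac[:, j] = (func(v + step) - func(v - step)) / (2 * eps)
    return jac

v0 = np.array([1.0, 2.0, 3.0])
jacobian = numeric_jacobian(u_of_v, v0)

# Analytic derivative of the example map, for comparison.
analytic = np.array([[v0[1], v0[0], 0.0],
                     [0.0, 1.0, 2 * v0[2]]])

print(jacobian.shape)
```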

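We will also have to take derivatives of or with respect to matrices. Let \(\bX\) be an \((N \times D)\) matrix. The derivative of \(\bX\) with respect to a scalar \(a\) is given by

\[
\frac{\partial \bX}{\partial a} = \begin{pmatrix}
\frac{\partial X_{11}}{\partial a} & \dots & \frac{\partial X_{1D}}{\partial a} \\
\vdots & \ddots & \vdots \\
\frac{\partial X_{N1}}{\partial a} & \dots & \frac{\partial X_{ND}}{\partial a}
\end{pmatrix} \in \R^{N \times D},
\]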

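and conversely the derivative of \(a\) with respect to \(\bX\) is given by

\[
\frac{\partial a}{\partial \bX} = \begin{pmatrix}
\frac{\partial a}{\partial X_{11}} & \dots & \frac{\partial a}{\partial X_{1D}} \\
\vdots & \ddots & \vdots \\
\frac{\partial a}{\partial X_{N1}} & \dots & \frac{\partial a}{\partial X_{ND}}
\end{pmatrix} \in \R^{N \times D}.
\]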

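Finally, we will occasionally need to take derivatives of vectors with respect to matrices or vice versa. This results in a *tensor* of 3 or more dimensions. Two examples are given below. First, the derivative of \(\bu \in \R^I\) with respect to \(\bX \in \R^{N \times D}\) is given by

\[
\frac{\partial \bu}{\partial \bX} \in \R^{I \times N \times D}, \qquad \left[\frac{\partial \bu}{\partial \bX}\right]_{i,n,d} = \frac{\partial u_i}{\partial X_{nd}},
\]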

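and the derivative of \(\bX\) with respect to \(\bu\) is given by

\[
\frac{\partial \bX}{\partial \bu} \in \R^{N \times D \times I}, \qquad \left[\frac{\partial \bX}{\partial \bu}\right]_{n,d,i} = \frac{\partial X_{nd}}{\partial u_i}.
\]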

Notice again that what we are taking the derivative *of* determines the first dimension(s) of the derivative and what we are taking the derivative with respect *to* determines the last.
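
The shape rule can be illustrated with a small numerical sketch. It assumes a hypothetical linear map \(u_i = \sum_{d} M_{id} \sum_{n} X_{nd}\) with \(I = 4\), \(N = 2\), and \(D = 3\), so that \(\partial u_i / \partial X_{nd} = M_{id}\). The names `M`, `u_of_X`, and `tensor` are illustrative only.

```python
import numpy as np

I, N, D = 4, 2, 3
M = np.arange(12.0).reshape(I, D)  # hypothetical fixed coefficients
X = np.ones((N, D))

def u_of_X(X):
    """Hypothetical map from an (N x D) matrix to a vector of length I."""
    return M @ X.sum(axis=0)

# Fill the derivative one matrix entry at a time using central differences.
eps = 1e-6
tensor = np.zeros((I, N, D))
for n in range(N):
    for d in range(D):
        step = np.zeros((N, D))
        step[n, d] = eps
        # Entry (i, n, d) holds d u_i / d X_nd.
        tensor[:, n, d] = (u_of_X(X + step) - u_of_X(X - step)) / (2 * eps)

# Analytic derivative: M_id, repeated along the n axis.
analytic = np.broadcast_to(M[:, None, :], (I, N, D))

print(tensor.shape)
```

The leading dimension comes from \(\bu\) (what we differentiate) and the trailing two from \(\bX\) (what we differentiate with respect to).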