Math

Probability and Statistics

Sample space: the space of events that are assigned probabilities; events may be binary, multi-valued, or continuous, but they are always mutually exclusive


Random variable: a variable, \(x\), whose domain is the sample space and whose value is somewhat uncertain

The Axioms of Probability

  1. For event \(A\), \(P(A) \in [0,1]\)
  2. For sample space \(S\), \(P(S) = 1\); and \(P(true) = 1\), \(P(false) = 0\)
  3. If events \(A_1, A_2, A_3, \cdots\) are disjoint (mutually exclusive), then \(P(A_1 \cup A_2 \cup A_3 \cdots)=P(A_1)+P(A_2)+P(A_3)+\cdots\)
    It follows that for any two events \(A\) and \(B\) (not necessarily disjoint)
    \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)

Joint probability: the probability of two events occurring together at the same point in time.

$$P(A, B) = P(A \cap B)$$ If \(A\) and \(B\) are independent, \(P(A, B) = P(A)*P(B)\)

Marginal probability: the probability of a single event occurring unconditioned on any other events

Conditional probability: the probability of an event \(B\) given that event \(A\) has already occurred

$$P(B|A) = \frac{P(A \cap B)}{P(A)}$$ Conditioning everything additionally on another event \(C\) $$P(A| B,C) = \frac{P(A,B |C)}{P(B|C)}$$

Chain rule: derived from the definition of conditional probability

$$P(A, B) = P(B)*P(A|B) = P(A)*P(B|A)$$ $$P(A_1,A_2,\cdots,A_n) = P(A_1)*P(A_2|A_1)*P(A_3|A_1,A_2)*\cdots*P(A_n|A_1,A_2 \cdots A_{n-1})$$
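
A minimal Python sketch of the chain rule on a small joint distribution; the probability values are made up for illustration.

```python
# Check the chain rule P(A, B) = P(B)*P(A|B) = P(A)*P(B|A) on a toy joint
# distribution over two binary events (values are arbitrary but sum to 1).
joint = {
    (True, True): 0.20,
    (True, False): 0.30,
    (False, True): 0.10,
    (False, False): 0.40,
}

p_A = sum(p for (a, b), p in joint.items() if a)   # marginal P(A) = 0.5
p_B = sum(p for (a, b), p in joint.items() if b)   # marginal P(B) = 0.3
p_A_given_B = joint[(True, True)] / p_B            # P(A|B)
p_B_given_A = joint[(True, True)] / p_A            # P(B|A)

print(joint[(True, True)])   # P(A, B)      = 0.2
print(p_B * p_A_given_B)     # P(B)*P(A|B)  = 0.2
print(p_A * p_B_given_A)     # P(A)*P(B|A)  = 0.2
```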

Bayes' Rule

$$P(A|B) = \frac{P(A,B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)}$$
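
A sketch of Bayes' rule in Python; the prior and likelihood numbers are hypothetical.

```python
# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B), with P(B) expanded by the law of
# total probability. All numbers below are assumed for illustration.
p_A = 0.01              # prior P(A)
p_B_given_A = 0.90      # P(B|A)
p_B_given_notA = 0.05   # P(B|not A)

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)   # marginal P(B)
p_A_given_B = p_B_given_A * p_A / p_B                  # posterior P(A|B)
print(round(p_A_given_B, 4))                           # ~0.1538
```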

Identities for Independent Events

If events \(A\) and \(B\) are independent, then \(P(A \cap B) = P(A)*P(B)\), \(P(A|B) = P(A)\), and \(P(B|A) = P(B)\)

Conditional independence: two events \(A\) and \(B\) are conditionally independent given an event \(C\) if \(P(A \cap B|C) = P(A|C)P(B|C)\); equivalently

$$P(A| B,C) = \frac{P(A \cap B |C)}{P(B|C)} = \frac{P(A|C)P(B|C)}{P(B|C)} = P(A|C)$$

Natural Language Processing

Word token: an occurrence of a word in text

Word type: a distinct word, counted once no matter how many tokens of it appear

Vocabulary: the list of word types

Zipf's Law: given a large sample of words used, the frequency of any word is inversely proportional to its rank on the frequency table

Corpus (pl. corpora): a large, representative collection of text
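
A small Python illustration of tokens, types, vocabulary, and the rank-frequency table behind Zipf's law; the toy corpus and whitespace tokenization are assumptions.

```python
from collections import Counter

corpus = "the cat sat on the mat and the dog sat on the cat"
tokens = corpus.lower().split()      # word tokens: every occurrence
counts = Counter(tokens)             # frequency of each word type
vocabulary = sorted(counts)          # word types (unique words)

print(len(tokens))       # 13 tokens
print(len(vocabulary))   # 7 types
# Rank-frequency table; Zipf's law (frequency roughly proportional to 1/rank)
# only emerges clearly on large corpora, but the shape is visible even here.
for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    print(rank, word, freq)
```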

Multinomial distribution for a \(k\)-sided die with probability vector \(\theta\), \(N\) throws, and outcome counts \(n_1, \cdots, n_k\)

$$P(n_1,\cdots ,n_k|\theta) = \binom{N}{n_1 \cdots n_k} \prod_{i=1}^{k} \theta_i^{n_i}$$
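
A minimal Python check of the multinomial probability above; the probability vector and counts are example values.

```python
from math import factorial, prod

theta = [0.5, 0.3, 0.2]   # probability vector for a 3-sided die (assumed)
n = [3, 1, 1]             # outcome counts, so N = 5 throws
N = sum(n)

# N! / (n_1! ... n_k!) * prod_i theta_i^{n_i}
coefficient = factorial(N) // prod(factorial(ni) for ni in n)
probability = coefficient * prod(t ** ni for t, ni in zip(theta, n))
print(probability)   # 20 * 0.125 * 0.3 * 0.2 = 0.15
```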

The Maximum Likelihood Estimate (MLE): chooses the parameters that maximize the likelihood of the observed data, \(\hat{\theta} = \arg\max_{\theta} P(\text{data}|\theta)\); for the multinomial above, \(\hat{\theta}_i = \frac{n_i}{N}\)

The Maximum A Posteriori (MAP) Estimate: also incorporates a prior \(P(\theta)\) over the parameters, \(\hat{\theta} = \arg\max_{\theta} P(\theta|\text{data}) = \arg\max_{\theta} P(\text{data}|\theta)P(\theta)\)

Language modeling: tries to capture the notion that some text is more likely than others by estimating the probability \(P(s)\) of any text \(s\)

Unigram Language Model: makes a strong independence assumption that words are generated independently from a multinomial distribution \(\theta\) (of dimension \(V\) = size of the vocabulary)

N-gram Language Models: condition each word on the previous \(n-1\) words, \(P(s) = \prod_i P(w_i | w_{i-n+1}, \cdots, w_{i-1})\)

Smoothing: reserves some probability mass for unseen words so that no word type is assigned probability zero

add-1 smoothing $$\hat{\theta}_w = \frac{c_w + 1}{|C| + V}$$ add-\(\epsilon\) smoothing $$\hat{\theta}_w = \frac{c_w + \epsilon}{|C| + V\epsilon}$$ where \(c_w\) is the count of word \(w\), \(|C|\) is the total number of tokens, and \(V\) is the vocabulary size
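
A sketch of unigram estimates in Python comparing the MLE with add-1 and add-ε smoothing; the toy counts and the value of ε are assumptions.

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
counts = Counter(tokens)    # c_w for each word type
total = len(tokens)         # |C|, total number of tokens
V = len(counts)             # vocabulary size (observed types only, for simplicity)
epsilon = 0.1

def mle(word):
    return counts[word] / total                     # c_w / |C|

def add_one(word):
    return (counts[word] + 1) / (total + V)         # (c_w + 1) / (|C| + V)

def add_epsilon(word):
    return (counts[word] + epsilon) / (total + V * epsilon)

print(mle("cat"), add_one("cat"), add_epsilon("cat"))
print(mle("dog"), add_one("dog"), add_epsilon("dog"))   # unseen word: MLE gives 0
```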

Linear Algebra

Scalar: a number, has magnitude (1 × 1)
Vector: a list of numbers, has magnitude and direction (default column vector n × 1)
Matrix: an array of numbers (m × n)

Matrix Basics

Transpose: \((A^T)_{ij} = A_{ji}\) and \((A+B)^{\top} = A^{\top}+B^{\top}\) and \((AB)^{\top} = B^{\top}A^{\top}\). If \(A = A^{\top}\) then matrix \(A\) is symmetric

$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}^\top = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}$$

Multiplication: if matrix \(A\) is \(m \times n\) and matrix \(B\) is \(n \times p\), then \(AB = C\), where \(C\) is \(m \times p\) and \(C_{ij} = \sum\limits_{k=1}^n A_{ik}B_{kj}\); in general \(AB \neq BA\)

$$\begin{bmatrix} 2 & 8 & 3 \\ 5 & 4 & 1 \end{bmatrix} \begin{bmatrix} 4 & 1 \\ 6 & 3 \\ 2 & 4 \end{bmatrix} = \begin{bmatrix} 2(4)+8(6)+3(2) & 2(1)+8(3)+3(4) \\ 5(4)+4(6)+1(2) & 5(1)+4(3)+1(4) \end{bmatrix} = \begin{bmatrix} 62 & 38 \\ 46 & 21 \end{bmatrix}$$
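
The same transpose and multiplication examples, reproduced with NumPy (assumes NumPy is installed).

```python
import numpy as np

A = np.array([[2, 8, 3],
              [5, 4, 1]])   # 2 x 3
B = np.array([[4, 1],
              [6, 3],
              [2, 4]])      # 3 x 2

print(A.T)      # transpose, 3 x 2
print(A @ B)    # [[62 38]
                #  [46 21]], a 2 x 2 matrix
print(np.array_equal((A @ B).T, B.T @ A.T))   # (AB)^T = B^T A^T -> True
```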

Square Matrices

Diagonal Matrix: all entries off the main diagonal are zero (example is 3 × 3 but can be any size)

$$\begin{bmatrix} x_1 & 0 & 0 \\ 0 & x_2 & 0 \\ 0 & 0 & x_3 \end{bmatrix}$$

Determinant of a Matrix: the scaling factor of the linear transformation described by the matrix, written as \(det(A)\) or \(|A|\); matrix \(A\) is invertible iff \(|A| \neq 0\). If \(|A| = 0\), then matrix \(A\) is singular and its rows (and columns) are linearly dependent

$$|A| = \begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc, |A^{-1}| = \frac{1}{|A|}$$

Inverse of a Matrix: matrix \(A^{-1}\) such that \(AA^{-1} = I\), where \(I\) is the identity matrix. \((AB)^{-1} = B^{-1}A^{-1}\) and \((A^{\top})^{-1} = (A^{-1})^{\top}\)

$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, A^{-1} = \frac{1}{det(A)} adj(A) = \frac{1}{ad-bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$
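
A NumPy check of the 2 × 2 determinant and inverse formulas; the matrix entries are arbitrary.

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

det = np.linalg.det(A)     # ad - bc = 4*6 - 7*2 = 10
A_inv = np.linalg.inv(A)   # (1/10) * [[6, -7], [-2, 4]] = [[0.6, -0.7], [-0.2, 0.4]]

print(det)
print(A_inv)
print(np.allclose(A @ A_inv, np.eye(2)))          # A A^{-1} = I -> True
print(np.isclose(np.linalg.det(A_inv), 1 / det))  # |A^{-1}| = 1/|A| -> True
```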

Identity Matrix: diagonal matrix whose nonzero entries are all 1. \(AI = IA = A\) and \(AA^{-1} = A^{-1}A = I\)

Trace of a Matrix: sum of elements along the diagonal, for \(n \times n\) matrix \(A\), \(tr(A) = \sum\limits_{i=1}^n a_{ii}\)

$$A = \begin{bmatrix}\def\b{\color{blue}} \b 1 & 2 & 3 \\ 4 & \b 5 & 6 \\ 7 & 8 & \b 9 \end{bmatrix}, tr(A) = 1+5+9 = 15$$

Eigenspaces

An \(m \times m\) matrix \(A\) has \(m\) eigenvalues \(\lambda_i\) and \(m\) corresponding eigenvectors \(x_i\) such that \(Ax_i = \lambda_i x_i\). It is not necessary to know how to compute the eigen decomposition by hand, but the general form used to do so is \(det(A - \lambda I) = 0\).
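
A short NumPy example of an eigen decomposition; the matrix is an arbitrary symmetric choice.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)   # columns of eigenvectors are the x_i
print(eigenvalues)                             # 3 and 1 (order not guaranteed)
for i in range(len(eigenvalues)):
    x = eigenvectors[:, i]
    print(np.allclose(A @ x, eigenvalues[i] * x))   # A x_i = lambda_i x_i -> True
```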

Calculus

Derivative: slope of a tangent line

$$f'(x) = \frac{df}{dx} = \lim_{\delta \to 0} \frac{f(x+ \delta) - f(x)}{\delta}$$

Second Derivative: curvature

$$f''(x) = \frac{d^2 f}{dx^2} = \frac{df'}{dx}$$

Chain Rule

$$\frac{df(y)}{dx} = \frac{df(y)}{dy} \frac{dy}{dx}$$
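
Numerical sanity checks of these definitions using forward finite differences; the functions \(f(x) = x^3\) and \(y(x) = \sin x\) are arbitrary choices.

```python
import math

delta = 1e-5

def f(x):
    return x ** 3

def derivative(g, x, d=delta):
    # the limit definition (g(x + delta) - g(x)) / delta with a small delta
    return (g(x + d) - g(x)) / d

x = 2.0
print(derivative(f, x))                                        # ~ f'(x) = 3x^2 = 12
print((derivative(f, x + delta) - derivative(f, x)) / delta)   # ~ f''(x) = 6x = 12

# Chain rule check: d/dx f(y(x)) = (df/dy)(dy/dx)
composed = lambda t: f(math.sin(t))
lhs = derivative(composed, x)
rhs = derivative(f, math.sin(x)) * derivative(math.sin, x)
print(abs(lhs - rhs) < 1e-3)   # True, up to finite-difference error
```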

Principal Component Analysis

PCA is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set. This procedure transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. PCA is traditionally performed on a square, symmetric matrix (here, the sample covariance matrix).

  1. Let \(x_1, \dots, x_n \in \mathbb{R}^D\), where \(\mathbb{R}^D\) is the set of \(D\)-dimensional real vectors. PCA is performed on centered points, i.e., \(\sum_i x_i = 0\). Center the data by computing the sample mean \(\mu\) and subtracting it from each data point.
  2. $$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i, \quad x_i \leftarrow x_i - \mu$$
  3. Compute the sample covariance matrix \(S\) using the \(n \times D\) matrix \(X\) of centered data
  4. $$S = \frac{1}{n-1} X^{\intercal}X$$
  5. Perform the eigen decomposition. You do not need to know how to do this by hand; use software to calculate the eigenvalues \(\lambda_1, \dots, \lambda_D\) and corresponding eigenvectors \(u_1, \dots, u_D\)
  6. $$S = U \Lambda U^{\intercal}, U = \begin{bmatrix} u_1 & \dots & u_D \end{bmatrix}, \Lambda = \begin{bmatrix} \lambda_1 & 0 & 0 \\ \vdots & \ddots & 0 \\ 0 & \dots & \lambda_D \end{bmatrix}$$
  7. Project each centered point onto the \(d\) eigenvectors with the largest eigenvalues, where \(d < D\): with \(U_d = \begin{bmatrix} u_1 & \dots & u_d \end{bmatrix}\), the reduced representation is \(z_i = U_d^{\intercal} x_i\). The columns of \(U_d\) are the principal directions and the projected data are the principal components (see the sketch below)
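
A NumPy sketch of the whole procedure on a tiny made-up data set, reducing \(D = 3\) dimensions to \(d = 2\).

```python
import numpy as np

X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.1]])   # n = 5 points in R^3 (made-up values)

X_centered = X - X.mean(axis=0)                    # steps 1-2: subtract the sample mean
S = X_centered.T @ X_centered / (X.shape[0] - 1)   # steps 3-4: D x D sample covariance

eigenvalues, U = np.linalg.eigh(S)                 # steps 5-6: eigen decomposition (S is symmetric)
order = np.argsort(eigenvalues)[::-1]              # sort eigenvalues in decreasing order
U_d = U[:, order[:2]]                              # top d = 2 eigenvectors (principal directions)

Z = X_centered @ U_d                               # step 7: project, giving an n x d representation
print(Z)
```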