Math

Probability and Statistics

Sample space: the space of events that are assigned probabilities; events may be binary, multi-valued, or continuous, but they are always mutually exclusive


Random variable: a variable, \(x\), whose domain is the sample space and whose value is somewhat uncertain

The Axioms of Probability

  1. For event \(A\), \(P(A) \in [0,1]\)
  2. For sample space \(S\), \(P(S) = 1\); and \(P(true) = 1\), \(P(false) = 0\)
  3. If events \(A_1, A_2, A_3, \cdots\) are disjoint (mutually exclusive), then \(P(A_1 \cup A_2 \cup A_3 \cdots)=P(A_1)+P(A_2)+P(A_3)+\cdots\)
    It follows that for any two events \(A\) and \(B\) (not necessarily disjoint)
    \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)

Joint probability: the probability of two events occurring together at the same point in time.

$$P(A, B) = P(A \cap B)$$ If \(A\) and \(B\) are independent, \(P(A, B) = P(A)*P(B)\)

Marginal probability: the probability of a single event occurring unconditioned on any other events

Conditional probability: the probability of an event \(B\) given that event \(A\) has already occurred

$$P(B|A) = \frac{P(A \cap B)}{P(A)}$$ Conditioning everything additionally on another event \(C\) $$P(A| B,C) = \frac{P(A,B |C)}{P(B|C)}$$

Chain rule: derived from the definition of conditional probability

$$P(A, B) = P(B)*P(A|B) = P(A)*P(B|A)$$ $$P(A_1,A_2,\cdots,A_n) = P(A_1)*P(A_2|A_1)*P(A_3|A_1,A_2)*\cdots*P(A_n|A_1,A_2 \cdots A_{n-1})$$
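
A minimal Python sketch of the chain rule on a small joint distribution; the probability values are made up for illustration.

```python
# Check the chain rule P(A, B) = P(B)*P(A|B) = P(A)*P(B|A) on a toy joint
# distribution over two binary events (values are arbitrary but sum to 1).
joint = {
    (True, True): 0.20,
    (True, False): 0.30,
    (False, True): 0.10,
    (False, False): 0.40,
}

p_A = sum(p for (a, b), p in joint.items() if a)   # marginal P(A) = 0.5
p_B = sum(p for (a, b), p in joint.items() if b)   # marginal P(B) = 0.3
p_A_given_B = joint[(True, True)] / p_B            # P(A|B)
p_B_given_A = joint[(True, True)] / p_A            # P(B|A)

print(joint[(True, True)])   # P(A, B)      = 0.2
print(p_B * p_A_given_B)     # P(B)*P(A|B)  = 0.2
print(p_A * p_B_given_A)     # P(A)*P(B|A)  = 0.2
```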

Bayes' Rule

$$P(A|B) = \frac{P(A,B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)}$$
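
A sketch of Bayes' rule in Python; the prior and likelihood numbers are hypothetical.

```python
# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B), with P(B) expanded by the law of
# total probability. All numbers below are assumed for illustration.
p_A = 0.01              # prior P(A)
p_B_given_A = 0.90      # P(B|A)
p_B_given_notA = 0.05   # P(B|not A)

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)   # marginal P(B)
p_A_given_B = p_B_given_A * p_A / p_B                  # posterior P(A|B)
print(round(p_A_given_B, 4))                           # ~0.1538
```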

Identities for Independent Events

If events \(A\) and \(B\) are independent, then \(P(A \cap B) = P(A)*P(B)\), \(P(A|B) = P(A)\), and \(P(B|A) = P(B)\)

Conditional independence: two events \(A\) and \(B\) are conditionally independent given an event \(C\) if \(P(A \cap B|C) = P(A|C)P(B|C)\); equivalently

$$P(A| B,C) = \frac{P(A \cap B |C)}{P(B|C)} = \frac{P(A|C)P(B|C)}{P(B|C)} = P(A|C)$$

Natural Language Processing

Word token: an occurrence of a word in text

Word type: a distinct word, counted once no matter how many tokens of it appear

Vocabulary: the list of word types

Zipf's Law: given a large sample of words used, the frequency of any word is inversely proportional to its rank on the frequency table

Corpus (pl. corpora): a large, representative collection of text
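
A small Python illustration of tokens, types, vocabulary, and the rank-frequency table behind Zipf's law; the toy corpus and whitespace tokenization are assumptions.

```python
from collections import Counter

corpus = "the cat sat on the mat and the dog sat on the cat"
tokens = corpus.lower().split()      # word tokens: every occurrence
counts = Counter(tokens)             # frequency of each word type
vocabulary = sorted(counts)          # word types (unique words)

print(len(tokens))       # 13 tokens
print(len(vocabulary))   # 7 types
# Rank-frequency table; Zipf's law (frequency roughly proportional to 1/rank)
# only emerges clearly on large corpora, but the shape is visible even here.
for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    print(rank, word, freq)
```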

Multinomial distribution for a \(k\)-sided die with probability vector \(\theta\), \(N\) throws, and outcome counts \(n_1, \cdots, n_k\)

$$P(n_1,\cdots ,n_k|\theta) = \binom{N}{n_1 \cdots n_k} \prod_{i=1}^{k} \theta_i^{n_i}$$
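
A minimal Python check of the multinomial probability above; the probability vector and counts are example values.

```python
from math import factorial, prod

theta = [0.5, 0.3, 0.2]   # probability vector for a 3-sided die (assumed)
n = [3, 1, 1]             # outcome counts, so N = 5 throws
N = sum(n)

# N! / (n_1! ... n_k!) * prod_i theta_i^{n_i}
coefficient = factorial(N) // prod(factorial(ni) for ni in n)
probability = coefficient * prod(t ** ni for t, ni in zip(theta, n))
print(probability)   # 20 * 0.125 * 0.3 * 0.2 = 0.15
```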

The Maximum Likelihood Estimate (MLE): chooses the parameters that maximize the likelihood of the observed data, \(\hat{\theta} = \arg\max_{\theta} P(\text{data}|\theta)\); for the multinomial above, \(\hat{\theta}_i = \frac{n_i}{N}\)

The Maximum A Posteriori (MAP) Estimate: also incorporates a prior \(P(\theta)\) over the parameters, \(\hat{\theta} = \arg\max_{\theta} P(\theta|\text{data}) = \arg\max_{\theta} P(\text{data}|\theta)P(\theta)\)

Language modeling: tries to capture the notion that some text is more likely than others by estimating the probability \(P(s)\) of any text \(s\)

Unigram Language Model: makes a strong independence assumption that words are generated independently from a multinomial distribution \(\theta\) (of dimension \(V\) = size of the vocabulary)

N-gram Language Models: condition each word on the previous \(n-1\) words, \(P(s) = \prod_i P(w_i | w_{i-n+1}, \cdots, w_{i-1})\)

Smoothing: reserves some probability mass for unseen words so that no word type is assigned probability zero

add-1 smoothing $$\hat{\theta}_w = \frac{c_w + 1}{|C| + V}$$ add-\(\epsilon\) smoothing $$\hat{\theta}_w = \frac{c_w + \epsilon}{|C| + V\epsilon}$$ where \(c_w\) is the count of word \(w\), \(|C|\) is the total number of tokens, and \(V\) is the vocabulary size
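
A sketch of unigram estimates in Python comparing the MLE with add-1 and add-ε smoothing; the toy counts and the value of ε are assumptions.

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
counts = Counter(tokens)    # c_w for each word type
total = len(tokens)         # |C|, total number of tokens
V = len(counts)             # vocabulary size (observed types only, for simplicity)
epsilon = 0.1

def mle(word):
    return counts[word] / total                     # c_w / |C|

def add_one(word):
    return (counts[word] + 1) / (total + V)         # (c_w + 1) / (|C| + V)

def add_epsilon(word):
    return (counts[word] + epsilon) / (total + V * epsilon)

print(mle("cat"), add_one("cat"), add_epsilon("cat"))
print(mle("dog"), add_one("dog"), add_epsilon("dog"))   # unseen word: MLE gives 0
```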

Linear Algebra

Scalar: a number, has magnitude (1 × 1)
Vector: a list of numbers, has magnitude and direction (default column vector n × 1)
Matrix: an array of numbers (m × n)

Matrix Basics

Transpose: \((A^T)_{ij} = A_{ji}\) and \((A+B)^{\top} = A^{\top}+B^{\top}\) and \((AB)^{\top} = B^{\top}A^{\top}\). If \(A = A^{\top}\) then matrix \(A\) is symmetric

$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}^\top = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}$$

Multiplication: if matrix \(A\) is \(m \times n\) and matrix \(B\) is \(n \times p\), then \(AB = C\), where \(C\) is \(m \times p\) and \(C_{ij} = \sum\limits_{k=1}^n A_{ik}B_{kj}\); in general \(AB \neq BA\)

$$\begin{bmatrix} 2 & 8 & 3 \\ 5 & 4 & 1 \end{bmatrix} \begin{bmatrix} 4 & 1 \\ 6 & 3 \\ 2 & 4 \end{bmatrix} = \begin{bmatrix} 2(4)+8(6)+3(2) & 2(1)+8(3)+3(4) \\ 5(4)+4(6)+1(2) & 5(1)+4(3)+1(4) \end{bmatrix} = \begin{bmatrix} 62 & 38 \\ 46 & 21 \end{bmatrix}$$
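
The same transpose and multiplication examples, reproduced with NumPy (assumes NumPy is installed).

```python
import numpy as np

A = np.array([[2, 8, 3],
              [5, 4, 1]])   # 2 x 3
B = np.array([[4, 1],
              [6, 3],
              [2, 4]])      # 3 x 2

print(A.T)      # transpose, 3 x 2
print(A @ B)    # [[62 38]
                #  [46 21]], a 2 x 2 matrix
print(np.array_equal((A @ B).T, B.T @ A.T))   # (AB)^T = B^T A^T -> True
```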

Square Matrices

Diagonal Matrix: all entries off the main diagonal are zero (example is 3 × 3 but can be any size)

$$\begin{bmatrix} x_1 & 0 & 0 \\ 0 & x_2 & 0 \\ 0 & 0 & x_3 \end{bmatrix}$$

Determinant of a Matrix: the scaling factor of the linear transformation described by the matrix, written as \(det(A)\) or \(|A|\); matrix \(A\) is invertible iff \(|A| \neq 0\). If \(|A| = 0\), then matrix \(A\) is singular and its rows (and columns) are linearly dependent

$$|A| = \begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc, |A^{-1}| = \frac{1}{|A|}$$

Inverse of a Matrix: matrix \(A^{-1}\) such that \(AA^{-1} = I\), where \(I\) is the identity matrix. \((AB)^{-1} = B^{-1}A^{-1}\) and \((A^{\top})^{-1} = (A^{-1})^{\top}\)

$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, A^{-1} = \frac{1}{det(A)} adj(A) = \frac{1}{ad-bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$
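
A NumPy check of the 2 × 2 determinant and inverse formulas; the matrix entries are arbitrary.

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

det = np.linalg.det(A)     # ad - bc = 4*6 - 7*2 = 10
A_inv = np.linalg.inv(A)   # (1/10) * [[6, -7], [-2, 4]] = [[0.6, -0.7], [-0.2, 0.4]]

print(det)
print(A_inv)
print(np.allclose(A @ A_inv, np.eye(2)))          # A A^{-1} = I -> True
print(np.isclose(np.linalg.det(A_inv), 1 / det))  # |A^{-1}| = 1/|A| -> True
```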

Identity Matrix: diagonal matrix whose nonzero entries are all 1. \(AI = IA = A\) and \(AA^{-1} = A^{-1}A = I\)

Trace of a Matrix: sum of elements along the diagonal, for \(n \times n\) matrix \(A\), \(tr(A) = \sum\limits_{i=1}^n a_{ii}\)

$$A = \begin{bmatrix}\def\b{\color{blue}} \b 1 & 2 & 3 \\ 4 & \b 5 & 6 \\ 7 & 8 & \b 9 \end{bmatrix}, tr(A) = 1+5+9 = 15$$

Eigenspaces

An \(m \times m\) matrix \(A\) has \(m\) eigenvalues \(\lambda_i\) and \(m\) corresponding eigenvectors \(x_i\) such that \(Ax_i = \lambda_i x_i\). It is not necessary to know how to compute the eigen decomposition by hand, but the general form used to do so is \(det(A - \lambda I) = 0\).
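
A short NumPy example of an eigen decomposition; the matrix is an arbitrary symmetric choice.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)   # columns of eigenvectors are the x_i
print(eigenvalues)                             # 3 and 1 (order not guaranteed)
for i in range(len(eigenvalues)):
    x = eigenvectors[:, i]
    print(np.allclose(A @ x, eigenvalues[i] * x))   # A x_i = lambda_i x_i -> True
```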

Calculus

Derivative: slope of a tangent line

$$f'(x) = \frac{df}{dx} = \lim_{\delta \to 0} \frac{f(x+ \delta) - f(x)}{\delta}$$

Second Derivative: curvature

$$f''(x) = \frac{d^2 f}{dx^2} = \frac{df'}{dx}$$

Chain Rule

$$\frac{df(y)}{dx} = \frac{df(y)}{dy} \frac{dy}{dx}$$
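
Numerical sanity checks of these definitions using forward finite differences; the functions \(f(x) = x^3\) and \(y(x) = \sin x\) are arbitrary choices.

```python
import math

delta = 1e-5

def f(x):
    return x ** 3

def derivative(g, x, d=delta):
    # the limit definition (g(x + delta) - g(x)) / delta with a small delta
    return (g(x + d) - g(x)) / d

x = 2.0
print(derivative(f, x))                                        # ~ f'(x) = 3x^2 = 12
print((derivative(f, x + delta) - derivative(f, x)) / delta)   # ~ f''(x) = 6x = 12

# Chain rule check: d/dx f(y(x)) = (df/dy)(dy/dx)
composed = lambda t: f(math.sin(t))
lhs = derivative(composed, x)
rhs = derivative(f, math.sin(x)) * derivative(math.sin, x)
print(abs(lhs - rhs) < 1e-3)   # True, up to finite-difference error
```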

Principal Component Analysis

PCA is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set. This procedure transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. PCA is traditionally performed on a square, symmetric matrix (here, the sample covariance matrix).

  1. Let \(x_1, \dots, x_n \in \mathbb{R}^D\), where \(\mathbb{R}^D\) is the set of \(D\)-dimensional real vectors. PCA is performed on centered points, i.e., \(\sum_i x_i = 0\). Center the data by computing the sample mean \(\mu\) and subtracting it from each data point.
  2. $$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i, \quad x_i \leftarrow x_i - \mu$$
  3. Compute the sample covariance matrix \(S\) using the \(n \times D\) matrix \(X\) of centered data
  4. $$S = \frac{1}{n-1} X^{\intercal}X$$
  5. Perform the eigen decomposition. You do not need to know how to do this by hand; use software to calculate the eigenvalues \(\lambda_1, \dots, \lambda_D\) and corresponding eigenvectors \(u_1, \dots, u_D\)
  6. $$S = U \Lambda U^{\intercal}, U = \begin{bmatrix} u_1 & \dots & u_D \end{bmatrix}, \Lambda = \begin{bmatrix} \lambda_1 & 0 & 0 \\ \vdots & \ddots & 0 \\ 0 & \dots & \lambda_D \end{bmatrix}$$
  7. Project each centered point onto the \(d\) eigenvectors with the largest eigenvalues, where \(d < D\): with \(U_d = \begin{bmatrix} u_1 & \dots & u_d \end{bmatrix}\), the reduced representation is \(z_i = U_d^{\intercal} x_i\). The columns of \(U_d\) are the principal directions and the projected data are the principal components (see the sketch below)
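
A NumPy sketch of the whole procedure on a tiny made-up data set, reducing \(D = 3\) dimensions to \(d = 2\).

```python
import numpy as np

X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.1]])   # n = 5 points in R^3 (made-up values)

X_centered = X - X.mean(axis=0)                    # steps 1-2: subtract the sample mean
S = X_centered.T @ X_centered / (X.shape[0] - 1)   # steps 3-4: D x D sample covariance

eigenvalues, U = np.linalg.eigh(S)                 # steps 5-6: eigen decomposition (S is symmetric)
order = np.argsort(eigenvalues)[::-1]              # sort eigenvalues in decreasing order
U_d = U[:, order[:2]]                              # top d = 2 eigenvectors (principal directions)

Z = X_centered @ U_d                               # step 7: project, giving an n x d representation
print(Z)
```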