Sample space: the set of possible outcomes over which probabilities are assigned; outcomes may be binary, multi-valued, or continuous, but are always mutually exclusive
Examples
Coin flip: {head, tail}
Die roll: {1, 2, 3, 4, 5, 6}
Random variable: a variable, \(x\), whose domain is the sample space and whose value is uncertain
\(x\) = coin flip outcome
\(x\) = tomorrow's temperature
The Axioms of Probability
For any event \(A\), \(P(A) \in [0,1]\)
For sample space \(S\), \(P(S) = 1\); and \(P(true) = 1\), \(P(false) = 0\)
If events \(A_1, A_2, A_3, \cdots\) are disjoint (mutually exclusive), then \(P(A_1 \cup A_2 \cup A_3 \cdots)=P(A_1)+P(A_2)+P(A_3)+\cdots\)
It follows that for any events \(A\) and \(B\)
\(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
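As a quick sanity check, here is a minimal Python sketch (a fair six-sided die; the event choices are illustrative) verifying additivity for disjoint events and the inclusion-exclusion identity:

```python
from fractions import Fraction

# Fair six-sided die: each outcome has probability 1/6
S = {1, 2, 3, 4, 5, 6}
P = lambda event: Fraction(len(event), len(S))

A = {1, 2}        # roll is 1 or 2
B = {5, 6}        # roll is 5 or 6 (disjoint from A)
C = {2, 4, 6}     # roll is even (overlaps A)

assert P(S) == 1                              # P(S) = 1
assert P(A | B) == P(A) + P(B)                # additivity for disjoint events
assert P(A | C) == P(A) + P(C) - P(A & C)     # inclusion-exclusion for any events
```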
Joint probability: the probability of two events occurring together at the same point in time
$$P(A, B) = P(A \cap B)$$
This factors as \(P(A \cap B) = P(A)P(B)\) only when \(A\) and \(B\) are independent
Marginal probability: the probability of a single event occurring unconditioned on any other events
Conditional probability: the probability of an event \(B\) given that event \(A\) has already occurred, defined whenever \(P(A) > 0\)
$$P(B|A) = \frac{P(A \cap B)}{P(A)}$$
For conditional probability given some other event \(C\)
$$P(A| B,C) = \frac{P(A,B |C)}{P(B|C)}$$
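Continuing the die sketch, a small check of the definition of conditional probability (events chosen arbitrarily):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
P = lambda event: Fraction(len(event), len(S))

A = {2, 4, 6}     # roll is even
B = {4, 5, 6}     # roll is at least 4

p_b_given_a = P(A & B) / P(A)   # P(B|A) = P(A ∩ B) / P(A)
print(p_b_given_a)              # 2/3: of the even rolls {2, 4, 6}, two are >= 4
```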
Chain rule: derived by repeatedly applying the definition of conditional probability
$$P(A_1, A_2, \ldots, A_n) = P(A_1)\,P(A_2|A_1)\,P(A_3|A_1,A_2) \cdots P(A_n|A_1,\ldots,A_{n-1})$$
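And a check that the chain rule reproduces the joint probability, again with arbitrary events on the same die:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
P = lambda event: Fraction(len(event), len(S))

A, B, C = {2, 4, 6}, {4, 5, 6}, {1, 4, 6}

joint = P(A & B & C)
chained = P(A) * (P(A & B) / P(A)) * (P(A & B & C) / P(A & B))
assert joint == chained         # P(A,B,C) = P(A) P(B|A) P(C|A,B)
```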
Language modeling: tries to capture the notion that some text is more likely than others by estimating the probability \(P(s)\) of any text \(s\)
Unigram Language Model: makes a strong independence assumption that words are generated independently from a multinomial distribution \(\theta\) (of dimension \(V\) = size of the vocabulary), so that for a text \(s = w_1 \cdots w_n\)
$$P(s) = \prod_{i=1}^{n} P(w_i|\theta)$$
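A minimal sketch of a unigram model, assuming a maximum-likelihood estimate of \(\theta\) (the toy corpus and function name are illustrative):

```python
from collections import Counter
import math

corpus = "the cat sat on the mat the cat ran".split()

# Maximum-likelihood estimate: theta_w = count(w) / total token count
counts = Counter(corpus)
total = sum(counts.values())
theta = {w: c / total for w, c in counts.items()}

def unigram_log_prob(text):
    """log P(s) under the unigram model: words contribute independently."""
    return sum(math.log(theta[w]) for w in text.split())

print(unigram_log_prob("the cat sat"))   # higher (less negative) ...
print(unigram_log_prob("mat ran on"))    # ... than this rarer sequence
```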
Scalar: a number, has magnitude (1 × 1)
Vector: a list of numbers, has magnitude and direction (default column vector, n × 1)
Matrix: an array of numbers (m × n)
Matrix Basics
Transpose: \((A^T)_{ij} = A_{ji}\) and \((A+B)^{\top} = A^{\top}+B^{\top}\) and \((AB)^{\top} = B^{\top}A^{\top}\). If \(A = A^{\top}\) then matrix \(A\) is symmetric
Multiplication: if matrix \(A\) is \(m \times n\) and matrix \(B\) is \(n \times p\), then \(AB = C\) where \(C\) is \(m \times p\) and \(C_{ij} = \sum\limits_{k=1}^n A_{ik}B_{kj}\); generally \(AB \neq BA\)
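A quick NumPy sketch (matrices chosen arbitrarily) verifying the transpose and multiplication facts above:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])                 # 2 x 3
B = np.array([[1, 0],
              [2, 1],
              [0, 3]])                    # 3 x 2

C = A @ B                                 # (2 x 3)(3 x 2) -> 2 x 2
assert C.shape == (2, 2)
assert np.array_equal(C.T, B.T @ A.T)     # (AB)^T = B^T A^T
print((A @ B).shape, (B @ A).shape)       # (2, 2) vs (3, 3): AB != BA in general
```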
Determinant of a Matrix: the scaling factor of the linear transformation described by the matrix, written as \(det(A)\) or \(|A|\); matrix \(A\) is invertible iff \(|A| \neq 0\). If \(|A| = 0\) then matrix \(A\) is singular and its columns are linearly dependent
$$|A| = \begin{vmatrix} a & b \\
c & d \end{vmatrix} = ad - bc, |A^{-1}| = \frac{1}{|A|}$$
Inverse of a Matrix: matrix \(A^{-1}\) such that \(AA^{-1} = I\), where \(I\) is the identity matrix. \((AB)^{-1} = B^{-1}A^{-1}\) and \((A^{\top})^{-1} = (A^{-1})^{\top}\)
$$A = \begin{bmatrix} a & b \\
c & d \end{bmatrix}, A^{-1} = \frac{1}{det(A)} adj(A) =
\frac{1}{ad-bc} \begin{bmatrix} d & -b \\
-c & a \end{bmatrix}$$
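A short NumPy check of the 2 × 2 determinant and inverse formulas (values chosen arbitrarily):

```python
import numpy as np

A = np.array([[3., 1.],
              [2., 4.]])

det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]       # ad - bc = 10
assert np.isclose(det, np.linalg.det(A))

adj = np.array([[ A[1, 1], -A[0, 1]],
                [-A[1, 0],  A[0, 0]]])            # adj(A) for the 2 x 2 case
A_inv = adj / det
assert np.allclose(A_inv, np.linalg.inv(A))
assert np.isclose(np.linalg.det(A_inv), 1 / det)  # |A^{-1}| = 1/|A|
```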
Identity Matrix: square matrix with 1s on the diagonal and 0s everywhere else. \(AI = IA = A\) and \(AA^{-1} = A^{-1}A = I\)
Trace of a Matrix: sum of elements along the diagonal, for \(n \times n\) matrix \(A\), \(tr(A) = \sum\limits_{i=1}^n a_{ii}\)
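And the identity and trace properties in the same style:

```python
import numpy as np

A = np.array([[3., 1.],
              [2., 4.]])
I = np.eye(2)

assert np.array_equal(A @ I, A) and np.array_equal(I @ A, A)  # AI = IA = A
assert np.allclose(A @ np.linalg.inv(A), I)                   # A A^{-1} = I
assert np.trace(A) == A[0, 0] + A[1, 1]                       # tr(A) = sum of diagonal
```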
PCA is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set. This procedure transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. PCA is traditionally performed on a square, symmetric matrix, such as the covariance matrix of the data.
Let \(x_1, \dots, x_n \in \mathbb{R}^D\), where \(\mathbb{R}^D\) is the set of \(D\)-dimensional real vectors. PCA is performed on a set of centered points, i.e., points satisfying \(\sum_i x_i = 0\); center the data by computing the sample mean \(\mu\) and subtracting it from each data point.
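A minimal PCA sketch following these steps, assuming the principal components come from an eigendecomposition of the sample covariance matrix (the random data and the choice of k = 2 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # n = 100 points in D = 5 dimensions

# Center the data: subtract the sample mean so that sum_i x_i = 0
mu = X.mean(axis=0)
Xc = X - mu

# Sample covariance matrix: D x D, square and symmetric
cov = Xc.T @ Xc / (len(X) - 1)

# Eigendecomposition; eigenvectors of the covariance are the principal components
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh is for symmetric matrices
order = np.argsort(eigvals)[::-1]        # sort by decreasing variance explained
W = eigvecs[:, order[:2]]                # keep the top k = 2 components

X_reduced = Xc @ W                       # project the centered data: n x k
print(X_reduced.shape)                   # (100, 2)
```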