statistics

some material based on probability and statistics cookbook by matthias vallentin

# basics

• mutually exclusive: $P(AB)=0$
• independence $A \perp B$ means $P(AB) = P(A)P(B)$
• conditional independence $A \perp B \mid C$ means $P(AB\vert C) = P(A\vert C) P(B\vert C)$
• conditional prob: $P(A\vert B) = \frac{P(AB)}{P(B)} = \frac{P(B\vert A)P(A)}{\sum_i P(B\vert A_i)P(A_i)}$ (Bayes’ thm, with the sum over a partition $\{A_i\}$ that contains $A$)
• $E[X] = \int P(x)x dx$
• $E[h(X)] \approx h(E[X])$
• $V[X] = E[(X-\mu)^2] = E[X^2]-E[X]^2$
• for an unbiased estimate of the variance from $n$ samples, divide by $n-1$ instead of $n$
• $V(X_1-X_2) = V(X_1) + V(X_2)$ if $X_1,X_2$ independent
• $V(a_1X_1+\dots+a_nX_n) = \sum_{i=1}^{n}\sum_{j=1}^{n}a_ia_jCov(X_i,X_j)$
• $V[h(X)] \approx h'(E[X])^2 V[X]$
• standard deviation - sqrt of variance
• standard error - standard deviation of an estimator (most often of the sample mean)
• $Cov[X,Y] = E[(X-\mu_X)(Y-\mu_Y)] = E[XY]-E[X]E[Y]$
• $Cov(aX+bY,Z) = aCov(X,Z)+bCov(Y,Z)$
• $Corr(Y,X) = \rho = \frac{Cov(Y,X)}{s_xs_y}$
• $Corr(aX+b,cY+d) = Corr(X,Y)$ if a and c have same sign
• $R^2 = \rho^2$ (for simple linear regression with a single predictor)
• skewness = $E[(\frac{X-\mu}{\sigma})^3]$
• law of total expectation: $E[X] = E_Y[E(X\vert Y)]$
• law of total variance: $V[Y] =\underbrace{E[V(Y\vert X)]}_{\text{unexplained variance}} + \underbrace{V(E[Y\vert X])}_{\text{explained variance}}$
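
The variance, covariance, and correlation identities above can be checked numerically. This is a minimal sketch on made-up data; the helper names (`sample_var`, `sample_cov`, `corr`) are illustrative and not from the notes.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    """Unbiased sample variance: divide by n - 1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def sample_cov(xs, ys):
    """Sample covariance with the same n - 1 denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def corr(xs, ys):
    return sample_cov(xs, ys) / sqrt(sample_var(xs) * sample_var(ys))

# made-up data, for illustration only
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 1.0, 5.0, 6.0]

# V(X + Y) = V(X) + V(Y) + 2 Cov(X, Y): the double-sum formula with a_1 = a_2 = 1
s = [a + b for a, b in zip(x, y)]
lhs = sample_var(s)
rhs = sample_var(x) + sample_var(y) + 2 * sample_cov(x, y)

# correlation is unchanged by affine maps whose slopes share a sign
r = corr(x, y)
r_affine = corr([3 * a + 1 for a in x], [0.5 * b - 2 for b in y])

print(lhs, rhs)
print(r, r_affine)
```

The two printed pairs should agree up to floating-point rounding.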

## error bars

• always write what you use
• standard dev
• standard error = standard dev / sqrt(n); this is the standard error of the mean, used when you’re estimating a mean
• 95% confidence interval ≈ mean ± 1.96 * standard error (normal approximation; roughly ± 2 standard errors)
• can get prediction intervals for on-line data using conformal prediction
• nonconformity measure - how unusual an example looks relative to previous examples
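
A small sketch of the three kinds of error bar for a sample mean. It assumes a normal approximation with z = 1.96; for small n a t quantile would be the more careful choice. The function name `mean_with_ci` and the data are made up.

```python
from math import sqrt
from statistics import mean, stdev

def mean_with_ci(xs, z=1.96):
    """Return (mean, standard error, (lo, hi)) using a normal approximation."""
    m = mean(xs)
    se = stdev(xs) / sqrt(len(xs))  # stdev uses the n - 1 denominator
    return m, se, (m - z * se, m + z * se)

# made-up measurements
data = [4.0, 6.0, 5.0, 7.0, 3.0, 5.0, 6.0, 4.0]
m, se, (lo, hi) = mean_with_ci(data)

# always write which error bar is being reported
print(f"mean = {m:.2f}, SD = {stdev(data):.2f}, SE = {se:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```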

## inter-rater agreement

• cohen’s kappa - measures how well different raters agree (just taking fraction may be too simple, because they might agree by chance)
• from -1 to 1 (1 is perfect agreement)
• $\kappa = 1 - \frac{1 - p_o}{1-p_e}$ where $p_o$ is the relative observed agreement among raters (identical to accuracy), and $p_e$ is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly seeing each category
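
The kappa formula above can be sketched for two raters as follows. The labels are invented, and `cohens_kappa` is an illustrative name rather than a library function.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    # p_o: observed fraction of items on which the raters agree
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: chance agreement, from each rater's observed category frequencies
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    # undefined (division by zero) when p_e == 1, i.e. both raters use one category
    return 1 - (1 - p_o) / (1 - p_e)

# invented ratings for 10 items
a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"]
b = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]
kappa = cohens_kappa(a, b)
print(kappa)
```

Here the raters agree on 8 of 10 items ($p_o = 0.8$), but $p_e = 0.52$, so kappa is noticeably lower than the raw agreement.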

## sample-size calculation

• how many samples must I collect?

## inequalities

• cauchy-schwarz: $\vert x \cdot y \vert \leq \vert \vert x \vert \vert \: \vert \vert y \vert \vert$
• $E[XY]^2 \leq E[X^2] E[Y^2]$
• triangle: $\vert \vert x + y \vert \vert \leq \vert \vert x \vert \vert + \vert \vert y \vert \vert$
• markov’s: $P(X \geq a) \leq \frac{E[X]}{a}$ for nonnegative $X$ and $a > 0$
• X is typically running time of the algorithm
• if we don’t have E[X], can use upper bound for E[X]
• chebyshev’s: $P(\vert X-\mu\vert \geq a) \leq \frac{Var[X]}{a^2}$
• utilizes the variance to get a better bound
• jensen’s: $f(E[X]) \leq E[f(X)]$ for convex $f$
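
To see how loose or tight these bounds are, the sketch below evaluates Markov, Chebyshev, and Jensen exactly for a fair six-sided die. The die is an illustrative choice, not from the notes; `Fraction` keeps everything exact so there is no sampling noise.

```python
from fractions import Fraction

# fair six-sided die: exact distribution
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

mu = sum(k * p for k, p in pmf.items())               # E[X] = 7/2
var = sum((k - mu) ** 2 * p for k, p in pmf.items())  # V[X] = 35/12

def prob(event):
    """Probability of the set of outcomes k for which event(k) is true."""
    return sum(p for k, p in pmf.items() if event(k))

# Markov: P(X >= a) <= E[X] / a
a = 5
markov_exact = prob(lambda k: k >= a)   # 1/3
markov_bound = mu / a                   # 7/10

# Chebyshev: P(|X - mu| >= c) <= V[X] / c^2
c = 2
cheb_exact = prob(lambda k: abs(k - mu) >= c)   # 1/3
cheb_bound = var / c ** 2                       # 35/48

# Jensen with the convex function f(x) = x^2: f(E[X]) <= E[f(X)]
jensen_lhs = mu ** 2                                  # 49/4
jensen_rhs = sum(k ** 2 * p for k, p in pmf.items())  # 91/6

print(markov_exact, markov_bound)
print(cheb_exact, cheb_bound)
print(jensen_lhs, jensen_rhs)
```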

## moment-generating function

• $M_X(t) = E(e^{tX})$
• derivatives yield moments: $\frac{d^r}{dt^r}M_X (0) = E(X^r)$
• derivatives of $\ln[M_X(t)]$ at $t=0$ yield $\mu$ (first derivative) and $V(X)$ (second derivative)
• $Y = aX+b \implies M_y(t) = e^{bt}M_x(at)$
• $Y = a_1X_1+a_2X_2 \implies M_Y(t) = M_{X_1}(a_1t)M_{X_2}(a_2t)$ if $X_i$ independent
• order statistics - variables $Y_i$ such that $Y_i$ is the ith smallest value in the sample
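
A worked example of the moment-generating function rules, assuming $X \sim \text{Exponential}(\lambda)$ with density $\lambda e^{-\lambda x}$ on $x \geq 0$ (the distribution is an illustrative choice, not from the notes):

$$
\begin{aligned}
M_X(t) &= E(e^{tX}) = \int_0^\infty e^{tx} \lambda e^{-\lambda x} dx = \frac{\lambda}{\lambda - t}, \quad t < \lambda \\
M_X'(0) &= \frac{\lambda}{(\lambda - t)^2}\Big\vert_{t=0} = \frac{1}{\lambda} = E(X) \\
M_X''(0) &= \frac{2\lambda}{(\lambda - t)^3}\Big\vert_{t=0} = \frac{2}{\lambda^2} = E(X^2) \\
V(X) &= E(X^2) - E(X)^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2} \\
\frac{d}{dt}\ln M_X(t)\Big\vert_{t=0} &= \frac{1}{\lambda - t}\Big\vert_{t=0} = \frac{1}{\lambda}, \qquad
\frac{d^2}{dt^2}\ln M_X(t)\Big\vert_{t=0} = \frac{1}{(\lambda - t)^2}\Big\vert_{t=0} = \frac{1}{\lambda^2}
\end{aligned}
$$

The last line recovers the mean and variance directly from the log of the MGF, matching the values obtained from the raw moments.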

# distributions

• PMF: $f_X(x) = P(X=x)$
• PDF: $P(a \leq X \leq b) = \int_a^b f(x) dx$

• multivariate gaussian
• 2 parameterizations ($x \in \mathbb{R}^n$)
1. moment parameterization: $p(x\vert\mu, \Sigma) = \frac{1}{(2\pi )^{n/2} \vert\Sigma\vert^{1/2}} \exp\left[ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right]$
2. canonical parameterization: $p(x\vert\eta, \Omega) = \text{exp}\left( a + \eta^T x - \frac{1}{2} x^T \Omega x\right)$
• also called information parameterization
• $\Omega = \Sigma^{-1}$
• $\eta = \Sigma^{-1} \mu$
• joint distr - split parameters into block matrices
• want to block diagonalize the matrix
• Schur complement of matrix M w.r.t. H: $M/H$
• $\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$
• $\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$
• $p(x_1, x_2) = \underbrace{p(x_1\vert x_2)}_{\text{conditional}}\cdot\underbrace{p(x_2)}_{\text{marginal}}$
• marginal
• $\mu_2^m = \mu_2$
• $\Sigma_2^m = \Sigma_{22}$
• conditional
• $\mu_{1\vert 2}^c = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1} (x_2 - \mu_2)$
• $\Sigma_{1\vert 2}^c = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$
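
The conditional formulas are easiest to see in two dimensions, where every block is a scalar. A minimal sketch under that assumption; the function name and the numbers are made up.

```python
def condition_bivariate_gaussian(mu1, mu2, s11, s12, s22, x2):
    """Mean and variance of p(x1 | x2) for a 2-D Gaussian with scalar blocks."""
    gain = s12 / s22                    # Sigma_12 Sigma_22^{-1}
    mu_cond = mu1 + gain * (x2 - mu2)   # conditional mean shifts with x2
    var_cond = s11 - gain * s12         # Schur complement; Sigma_21 = Sigma_12 here
    return mu_cond, var_cond

# made-up parameters: a valid covariance since s11 * s22 - s12^2 > 0
mu_c, var_c = condition_bivariate_gaussian(
    mu1=0.0, mu2=1.0, s11=4.0, s12=1.5, s22=1.0, x2=3.0
)
print(mu_c, var_c)
```

The conditional variance does not depend on the observed $x_2$ and is never larger than the marginal variance $\Sigma_{11}$.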

# law of large numbers

## law of large numbers

• equivalent statements
• $E(\bar{X}-\mu)^2 \to 0$ as $n \to \infty,$
• $P(\vert\bar{X}-\mu\vert \geq \epsilon) \to 0$ as $n \to \infty$
• $T_o = X_1+\dots+X_n, E(T_o) = n\mu , V(T_o) = n\sigma^2$
• implications
• $E(\bar{X}) = \mu$
• $V(\bar{X}) = \frac{\sigma_x^2}{n}$
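
A quick simulation of these implications, assuming $X_i \sim \text{Uniform}(0,1)$ so that $\mu = 1/2$ and $\sigma^2 = 1/12$. The distribution, seed, and replicate count are arbitrary choices for the sketch.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

mu, sigma2 = 0.5, 1 / 12   # mean and variance of Uniform(0, 1)
reps = 5000                # number of simulated sample means per n

def sample_mean(n):
    """Mean of n independent Uniform(0, 1) draws."""
    return sum(random.random() for _ in range(n)) / n

results = {}
for n in (1, 10, 100):
    means = [sample_mean(n) for _ in range(reps)]
    avg = sum(means) / reps
    var = sum((m - avg) ** 2 for m in means) / (reps - 1)
    results[n] = (avg, var)
    print(f"n={n:3d}  mean of X-bar={avg:.4f}  V(X-bar)={var:.5f}  sigma^2/n={sigma2 / n:.5f}")
```

The empirical variance of $\bar{X}$ should track $\sigma^2/n$, shrinking by roughly a factor of 10 each time $n$ grows by 10.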

## central limit thm

• 2 characterizations
• the sample mean of a random sample is approximately normally distributed if n is large
• $\lim_{n\to\infty}P(\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\leq z)=P(Z\leq z) = \Phi(z)$
• implications
• the product $X_1 X_2 \cdots X_n$ has approximately lognormal distribution if all $P(X_i>0) = 1$ (apply the CLT to $\sum \ln X_i$)

# bias and point estimation

• point estimator $\hat{\theta}$ - statistic that predicts a parameter
• point estimate - single number prediction
• bias: $E(\hat{\theta}) - \theta$
• more complex models (more nonzero parameters) have lower bias, higher variance
• if high bias, train and test error will be very close (model isn’t complex enough)
• after unbiased we want MVUE (minimum variance unbiased estimator)
• need inductive inference property: must make prior assumptions in order to classify unseen instances
• define inductive bias of a learner as the set of additional assumptions B sufficient to justify its inductive inferences as deductive inferences
• bias types
• preference bias = search bias - models can search entire space (e.g. NN, decision tree)
• restriction bias = language bias - models that can’t express entire space (e.g. linear)
• consistent: $\hat{\theta}_n \to \theta$ in probability as $n \to \infty$
• basically it converges to the true value (can still be biased for any finite $n$)
• bias/variance trade-off
• MSE - mean squared error - $E[(\hat{\theta}-\theta)^2]$ = $V(\hat{\theta})+[E(\hat{\theta})-\theta]^2$
• defs
• bias = approximation err
• variance = estimation err
• confidence intervals - take sample data + produce a range of values that likely contains population parameter you are interested in
• confidence interval of the prediction is a range that likely contains the mean value of the dependent variable given specific values of the independent variables
• prediction interval is a range that likely contains a single new observation at those values - usually wider because it is one point, not a mean
• MLE example
• MLE - maximize likelihood $L(\theta) = p(X_1,\dots,X_n;\theta_1,\dots,\theta_m)$ (the agreement with a chosen distribution)
• $\hat{\theta} = \underset{\theta}{argmax} \: L(\theta)$
• $L(\theta)=P(X_1,\dots,X_n\vert\theta)=\prod_{i=1}^n P(X_i\vert\theta)$ for i.i.d. samples
• $\log L(\theta)= \ell(\theta) = \sum_{i=1}^n \log P(X_i\vert\theta)$
• to maximize, set $\frac{\partial \ell (\theta)}{\partial \theta} = 0$
• fisher information $I(\theta)=V[\overbrace{\frac{\partial}{\partial\theta} \ln(f[x;\theta])}^{\text{Fisher score function}} ] = -E[\frac{\partial^2}{\partial\theta^2} \ln(f[x;\theta])]$ under regularity conditions (for n samples, multiply by n)
• higher info $\implies$ lower estimation error
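
A sketch of the MLE recipe for a Bernoulli parameter $p$, chosen here only because everything has a closed form: setting $\partial \ell / \partial p = 0$ gives $\hat{p} = k/n$, and the per-sample Fisher information is $I(p) = \frac{1}{p(1-p)}$. The data and function names are made up; the grid search is just a sanity check on the closed form.

```python
from math import log, sqrt

def bernoulli_log_lik(p, xs):
    """Log-likelihood l(p) = k log p + (n - k) log(1 - p) for 0/1 data."""
    k = sum(xs)
    return k * log(p) + (len(xs) - k) * log(1 - p)

def bernoulli_mle(xs):
    """Closed-form maximizer of the log-likelihood: the sample proportion."""
    return sum(xs) / len(xs)

def fisher_information(p):
    """Per-sample Fisher information for a Bernoulli(p) observation."""
    return 1 / (p * (1 - p))

xs = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]   # made-up data: 7 successes in 10 trials
p_hat = bernoulli_mle(xs)

# asymptotic standard error: 1 / sqrt(n * I(p_hat))
se = 1 / sqrt(len(xs) * fisher_information(p_hat))

# crude grid search to confirm the closed form maximizes the log-likelihood
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_lik(p, xs))

print(p_hat, p_grid, se)
```

Higher Fisher information (p near 0 or 1) gives a smaller standard error for the same n.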

# overview - J. 5

• prob theory: given model $\theta$, infer data $X$
• statistics: given data $X$, infer model $\theta$
• 2 statistical schools of thought: Bayesian and frequentist
1. Bayesian: $\overbrace{p(\theta \vert x)}^{\text{posterior}} = \frac{\overbrace{p(x\vert\theta)}^{\text{likelihood}} \overbrace{p(\theta)}^{\text{prior}}}{p(x)}$
• assumes $\theta$ is a RV, find its distr.
• prior probability $p(\theta)$= statistician’s uncertainty
• posterior $p(\theta\vert x)$ is what you don’t observe
• $\hat{\theta}_{Bayes} = \int \theta \: p(\theta \vert x) d\theta$ ~ mean of the posterior
• $\hat{\theta}_{MAP} = \underset{\theta}{argmax} \: p(\theta\vert x) = \underset{\theta}{argmax} \: p(x\vert \theta) p(\theta) = \underset{\theta}{argmax} \: [ \log p(x\vert\theta) + \log p(\theta) ]$
• like penalized likelihood
• bayesians prefer whole distr. rather than parameter estimates
2. frequentist - use estimators (ex. MLE)
• no prior - only use priors when they correspond to objective frequencies of observing values
• neyman / pearson
• $\hat{\theta}_{MLE} = \underset{\theta}{argmax} \: p(x\vert\theta)$
• really likelihood is whatever we model (ex. for discriminative models would be $p(y\vert x, \theta)$)
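
To make the contrast concrete, here is a sketch for the mean of a Gaussian with known variance $\sigma^2$ and prior $\mu \sim N(\mu_0, \tau^2)$. In this conjugate case the posterior is Gaussian, so the posterior mean and the MAP estimate coincide. The function name and data are hypothetical.

```python
def gaussian_mean_estimates(xs, sigma2, mu0, tau2):
    """MLE, posterior mean (= MAP), and posterior variance for a Gaussian mean.

    Assumes known observation variance sigma2 and prior mu ~ N(mu0, tau2).
    """
    n = len(xs)
    mle = sum(xs) / n                       # frequentist estimate: sample mean
    post_precision = 1 / tau2 + n / sigma2  # precisions add
    post_mean = (mu0 / tau2 + n * mle / sigma2) / post_precision
    post_var = 1 / post_precision
    return mle, post_mean, post_var

xs = [2.0, 3.0, 4.0, 3.0]   # made-up observations
mle, post_mean, post_var = gaussian_mean_estimates(xs, sigma2=1.0, mu0=0.0, tau2=1.0)
print(mle, post_mean, post_var)
```

The prior pulls the estimate from the sample mean toward $\mu_0$, which is the penalized-likelihood view of MAP; as $\tau^2$ grows the posterior mean approaches the MLE.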

## 3 problems

1. density estimation - given samples of X, estimate P(X)
• ex. univariate Gaussian density estimation
• frequentist
• derive MLE for mean and variance
• bayesian
• assume distr. for $\mu$
• ex. $\mu \sim N(\mu_0, \tau^2)$
• derive MAP for mean and variance (assuming some prior)
• can use plate to show repeated element
• ex. discrete, multinomial prob. distr.
• derive MLE
• $P(x\vert\theta) \sim$ multinomial distr.
• derive MAP
• want to be able to plug in posterior as prior recursively
• this requires a Dirichlet prior to multiply the multinomial
• Dirichlet: $p(\theta) = C(\alpha) \theta_1^{\alpha_1 - 1}\cdot \cdot \cdot \theta_M^{\alpha_M-1}$
• ex. mixture models - $p(x\vert\theta)=\sum_k \alpha_k f_k (x\vert\theta_k)$
• here $f_k$ represent densities (mixture components)
• $\alpha_k$ are weights (mixing proportions)
• can do inference on this - given x, figure out which cluster it fits into better
• learning requires EM
• can be used nonparametrically - mixture sieve
• however, means are allowed to vary
• solving with random projection: project to low dim and keep track of means etc.
• ex. nonparametric density estimation
• ex. kernel density estimator - stacking up mass
• each point contributes a kernel function $k(x,x_n, \lambda)$
• $x_n$ is location, $\lambda$ is smoothing
• $\hat{p}(x) = \frac{1}{N}\sum_n k(x,x_n,\lambda)$
• nonparametric models sometimes called infinite-dimensional
2. regression - want $p(y \vert x)$
• conditional mixture model - variable z can be used to pick out regions of input space where different regression functions are used
• $p(y_n\vert x_n,\theta) = \sum_k p(y_n\vert z_n^k = 1, x_n, \theta) \cdot p(z_n^k=1\vert x_n,\theta)$
• nonparametric regression
• ex. kernel regression $\hat{f}(x) = \frac{\sum_{i=1}^N k(x, x_i) \cdot y_i}{\sum_{j=1}^N k(x, x_j)}$
3. classification
• ex. Gaussian class-conditional densities
• posterior probability is logistic function
• clustering - use mixture models
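
The two nonparametric estimators above (kernel density estimation and kernel regression) can be sketched in a few lines. This assumes a Gaussian kernel with bandwidth $\lambda$, which is one common choice and not something the notes specify; the data are made up.

```python
from math import exp, pi, sqrt

def gaussian_kernel(x, x_n, lam):
    """Gaussian density centered at x_n with bandwidth lam (integrates to 1 in x)."""
    return exp(-0.5 * ((x - x_n) / lam) ** 2) / (lam * sqrt(2 * pi))

def kde(x, data, lam):
    """Kernel density estimate: average of one kernel per data point."""
    return sum(gaussian_kernel(x, x_n, lam) for x_n in data) / len(data)

def kernel_regression(x, xs, ys, lam):
    """Kernel-weighted average of the y_i, weights given by k(x, x_i)."""
    weights = [gaussian_kernel(x, x_i, lam) for x_i in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# made-up samples for density estimation
data = [1.0, 1.5, 2.0, 4.0, 4.5]
density_at_2 = kde(2.0, data, lam=0.5)

# made-up (x, y) pairs for regression
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]
pred = kernel_regression(1.5, xs, ys, lam=0.5)

print(density_at_2, pred)
```

A larger $\lambda$ smooths both estimates more; a smaller one makes the density spikier and the regression closer to nearest-neighbour interpolation.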

## model selection / averaging

• bayesian
• for model m, want to maximize $p(m\vert x) = \frac{p(x\vert m) p(m)}{p(x)}$
• usually, just take $m$ that maximizes $p(m\vert x)$
• otherwise integrate over $\theta, m$ - model averaging: $p(x_{new}\vert x) = \int dm \int d\theta \: p(x_{new}\vert \theta, m) p(\theta\vert x, m) p(m\vert x)$
• frequentist
• can’t compare models by maximized likelihood alone - it will always prefer more complex models
• use some criteria such as KL-divergence, AIC, cross-validation
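
A small sketch of the frequentist point, using AIC $= 2k - 2\ln \hat{L}$ to compare two nested Gaussian models on made-up data: model A fixes the mean at 0 (one free parameter), model B fits both mean and variance (two free parameters). The models and function names are illustrative assumptions.

```python
from math import log, pi

def gaussian_max_log_lik(residuals):
    """Maximized Gaussian log-likelihood given residuals about the model mean."""
    n = len(residuals)
    sigma2_hat = sum(r * r for r in residuals) / n   # MLE of the variance
    return -0.5 * n * (log(2 * pi * sigma2_hat) + 1)

def aic(log_lik, k):
    """Akaike information criterion; lower is better."""
    return 2 * k - 2 * log_lik

xs = [2.1, 1.9, 2.4, 1.6, 2.2, 1.8, 2.0, 2.3]   # made-up data

# model A: mean fixed at 0, variance free (k = 1)
ll_a = gaussian_max_log_lik(xs)

# model B: mean and variance both free (k = 2)
x_bar = sum(xs) / len(xs)
ll_b = gaussian_max_log_lik([x - x_bar for x in xs])

aic_a, aic_b = aic(ll_a, 1), aic(ll_b, 2)
best = "B" if aic_b < aic_a else "A"
print(ll_a, ll_b, aic_a, aic_b, best)
```

Because A is nested in B, `ll_b >= ll_a` holds for any data set, which is why likelihood alone cannot choose between them; AIC charges B for its extra parameter and only prefers it when the fit improves enough.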