# 4.2. data analysisÂ¶

## 4.2.1. pqrsÂ¶

• Goal: inference - conclusion or opinion formed from evidence

• PQRS

• P - population

• Q - question - 2 types

1. hypothesis driven - does a new drug work

2. discovery driven - find a drug that works

• R - representative data colleciton

• simple random sampling = SRS

• w/ replacement: $$var(\bar{X}) = \sigma^2 / n$$

• w/out replacement: $$var(\bar{X}) = (1 - \frac{n}{N}) \sigma^2 / n$$

## 4.2.2. visualizationÂ¶

First 5 parts here are based on the book storytelling with data by cole nussbaumer knaflic

• difference between showing data + storytelling with data

### 4.2.2.1. understand the context (1)Â¶

• who is your audience? what do you need them to know/do?

• exploratory vs explanatory analysis

• slides (need little details) vs email (needs lots of detail) - usually need to make both in slideument

• should know how much nonsupporting data to show

• distill things down into a 3-minute story or a 1-sentence Big Idea

• easiest to start things on paper/post-it notes

### 4.2.2.2. choose an effective visual (2)Â¶

• generally avoid pie/donut charts, 3D charts, 2nd y-axes

• tables

• best for when people will actually read off numbers

• minimalist is best

• bar charts should basically always start at 0

• horizontal charts typically easy to read

• on axes, retain things like dollar signs, percent, etc.

### 4.2.2.3. eliminate clutter (3)Â¶

• gestalt principles of vision

• proximity - close things are grouped

• similarity - similar things are grouped

• connection - connected things are grouped

• enclosure

• closure

• continuity

• generally good to have titles and such at top-left!

• diagonal lines / text should be avoided

• center-aligned text should be avoided

• label lines directly

### 4.2.2.4. focus attention (4)Â¶

• visual hierarchy - outlines what is important

### 4.2.2.5. tell a story / think like a designer (5)Â¶

• affordances - aspects that make it obvious how something will be used (e.g. a button affords pushing)

• âYou know youâve achieved perfection, not when you have nothing more to add, but when you have nothing to take awayâ (SaintâExupery, 1943)

• stories have different parts, which include conflict + tension

• beginning - introduce a problem / promise

• middle - what could be

• end - call to action

• horizontal logic - people can just read title slides and get out what they need

• can either convince ppl through conventional rhetoric or through a story

### 4.2.2.6. visual summariesÂ¶

• numerical summaries

• mean vs. median

• sd vs. iq range

• visual summaries

• histogram

• kernel density plot - Gaussian kernels

• with bandwidth h $$K_h(t) = 1/h K(t/h)$$

• plots

1. box plot / pie-chart

2. scatter plot / q-q plot

• q-q plot = probability plot - easily check normality

• plot percentiles of a data set against percentiles of a theoretical distr.

• should be straight line if they match

1. transformations = feature engineering

• log/sqrt make long-tail data more centered and more normal

• delta-method - sets comparable bw (wrt variance) after log or sqrt transform: $$Var(g(X)) \approx [g'(\mu_X)]^2 Var(X)$$ where $$\mu_X = E(X)$$

• if assumptions donât work, sometimes we can transform data so they work

• transform x - if residuals generally normal and have constant variance

• corrects nonlinearity

• transform y - if relationship generally linear, but non-constant error variance

• stabilizes variance

• if both problems, try y first

• Box-Cox: Yâ = $$Y^l \: if \: l \neq 0$$, else log(Y)

1. least squares

• inversion of pxp matrix ~O(p^3)

• regression effect - things tend to the mean (ex. bball children are shorter)

• in high dims, l2 worked best

1. kernel smoothing + lowess

• can find optimal bandwidth

• nadaraya-watson kernel smoother - locally weighted scatter plot smoothing

• $$$g_h(x) = \frac{\sum K_h(x_i - x) y_i}{\sum K_h (x_i - x)}$$$ where h is bandwidth

• loess - multiple predictors / lowess - only 1 predictor

• also called local polynomial smoother - locally weighted polynomial

• take a window (span) around a point and fit weighted least squares line to that point

• replace the point with the prediction of the windowed line

• can use local polynomial fits rather than local linear fits

1. silhouette plots - good clusters members are close to each other and far from other clustersf

1. popular graphic method for K selection

2. measure of separation between clusters $$s(i) = \frac{b(i) - a(i)}{max(a(i), b(i))}$$

3. a(i) - ave dissimilarity of data point i with other points within same cluster

4. b(i) - lowest average dissimilarity of point i to any other cluster

5. good values of k maximize the average silhouette score

2. lack-of-fit test - based on repeated Y values at same X values

## 4.2.3. imbalanced dataÂ¶

1. randomly oversample minority class

2. randomly undersample majority class

3. weighting classes in the loss function - more efficient, but requires modifying model code

4. generate synthetic minority class samples

1. smote (chawla et al. 2002) - interpolate betwen points and their nearest neighbors (for minority class) - some heuristics for picking which points to interpolate

1. adasyn (he et al. 2008) - smote, generate more synthetic data for minority examples which are harder to learn (number of samples is proportional to number of nearby samples in a different class)

2. smrt - generate with vae

5. selectively removing majority class samples

1. tomek links (tomek 1976) - selectively remove majority examples until al lminimally distanced nearest-neighbor pairs are of the same class

2. near-miss (zhang & mani 2003) - select samples from the majority class which are close to the minority class. Example: select samples from the majority class for which the average distance of the N closest samples of a minority class is smallest

3. edited nearest neighbors (wilson 1972) - âeditâ the dataset by removing samples that donât agree âenoughâ with their neighborhood

6. feature selection and extraction

1. minority class samples can be discarded as noise - removing irrelevant features can reduce this risk

2. feature selection - select a subset of features and classify in this space

3. feature extraction - extract new features and classify in this space

4. ideas

1. use majority class to find different low dimensions to investigate

2. in this dim, do density estimation

3. residuals - iteratively reweight these (like in boosting) to improve performance

7. incorporate sampling / class-weighting into ensemble method (e.g. treat different trees differently)

1. ex. undersampling + ensemble learning (e.g. IFME, Beccaâs work)

8. algorithmic classifier modifications

9. misc papers

1. enrichment (jegierski & saganowski 2019) - add samples from an external dataset

10. ref

11. imblanced-learn package with several methods for dealing with imbalanced data

12. good blog post

13. Learning from class-imbalanced data: Review of methods and applications (Haixiang et al. 2017)

14. sample majority class w/ density (to get best samples)

15. log-spline - doesnât scale

## 4.2.4. missing-data imputationÂ¶

• Missing value imputation: a review and analysis of the literature (lin & tsai 2019)

• Causal Inference: A Missing Data Perspective (ding & li, 2018)

• different missingness mechanisms (little & rubin, 1987)

• MCAR - missing completely at random - no relationship between the missingness of the data and any values, observed or missing

• MAR - missing at random - propensity of missing values depends on observed data, but not the missing data

• can easily test for this vs MCAR

• MNAR - missing not at random - propensity of missing values depends both on observed and unobserved data

• connections to causal: MCAR is much like randomization, MAR like ignorability (although slightly more general), and MNAR like unmeasured unconfounding

• imputation problem: propensity of missing values depends on the unobserved values themselves (not ignorable)

• simplest approach: drop rows with missing vals

• mean/median imputation

• probabilistic approach

• EM approach, MCMC, GMM, sampling

• matrix completion: low-rank, PCA, SVD

• nearest-neighbor / matching: hot-deck

• (weighted) prediction approaches

• linear regr, LDA, naive bayes, regr. trees, LDA

• can do weighting using something similar to inverse propensities, although less common to check things like covariate balance

• multiple imputation: impute multiple times to get better estimates

• MICE (passes / imputes data multiple times sequentially)

• can perform sensitivity analysis to evaluate the assumption that things are not MNAR

• two standard models for nonignorable missing data are the selection models and the pattern-mixture models (Little and Rubin, 2002, Chapter 15)

• performance evaluation

• acc at finding missing vals

## 4.2.6. principlesÂ¶

### 4.2.6.1. breimanÂ¶

• conversation

• moved sf -> la -> caltech (physics) -> columbia (math) -> berkeley (math)

• info theory + gambling

• CART, ace, and prob book, bagging

• ucla prof., then consultant, then founded stat computing at berkeley

• lots of cool outside activities

• ex. selling ice in mexico

• 2 cultures paper

1. generative - data are generated by a given stochastic model

• stat does this too much and needs to move to 2

• ex. assume y = f(x, noise, parameters)

• validation: goodness-of-fit and residuals

1. predictive - use algorithmic model and data mechanism unknown

• assume nothing about x and y

• ex. generate P(x, y) with neural net

• validation: prediction accuracy

• axioms

1. Occam

2. Rashomon - lots of different good models, which explains best?

• ex. rf is not robust at all

1. Bellman - curse of dimensionality

• might actually want to increase dimensionality (ex. svms embedded in higher dimension)

### 4.2.6.2. box + tukeyÂ¶

• questions

1. what points are relevant and irrelevant today in both papers?

• relevant

• box

• thoughts on scientific method

• solns should be simple

• necessity for developing experimental design

• flaws (cookbookery, mathematistry)

• tukey

• separating data analysis and stats

• all models have flaws

• no best models

• lots of goold old techniques (e.g. LSR)

• irrelevant

• some of the data techniques (I think)

• tukey multiple-response data has been better attacked (graphical models)

1. how do you think the personal traits of Tukey and Box relate to the scientific opinions expressed in their papers?

• probably both pretty critical of the science at the time

• box - great respect for Fisher

• both very curious in different fields of science

1. what is the most valuable msg that you get from each paper?

• box - data analysis is a science

• tukey - models must be useful

• no best models

• find data that is useful

• no best models

• box_79 âscience and statisticsâ

• scientific method - iteration between theory and practice

• learning - discrepancy between theory and practice

• solns should be simple

• fisher - founder of statistics (early 1900s)

• couples math with applications

• data analysis - subiteration between tentative model and tentative analysis

• develops experimental design

• flaws

• cookbookery - forcing all problems into 1 or 2 routine techniques

• mathematistry - development of theory for theoryâs sake

• tukey_62 âthe future of data analysisâ

• general considerations

• data analysis - different from statistics, is a science

• lots of techniques are very old (LS - Gauss, 1803)

• all models have flaws

• no best models

• must teach multiple data analysis methods

• spotty data - lots of irregularly non-constant variability

• could just trim highest and lowest values

• winzorizing - replace suspect values with closest values that arenât

• must decide when to use new techniques, even when not fully understood

• want some automation

• FUNOP - fulll normal plot

• can be visualized in table

• spotty data in more complex situations

• FUNOR-FUNOM

• multiple-response data

• understudied except for factor analysis

• multiple-response procedures have been modeled upon how early single-response procedures were supposed to have been used, rather than upon how they were in fact used

• factor analysis

1. reduce dimensionality with new coordinates

2. rotate to find meaningful coordinates

• can use multiple regression factors as one factor if they are very correlated

• regression techniques always offer hopes of learning more from less data than do variance-component techniques

• flexibility of attack

• ex. what unit to measure in

### 4.2.6.3. modelsÂ¶

• normative - fully interpretable + modelled

• idealized

• probablistic

• ~mechanistic - somewhere in between

• descriptive - based on reality

• empirical

### 4.2.6.4. exaggerated claimsÂ¶

• video by Rob Kass

• concepts are ambiguous and have many mathematical instantiations

• e.g. âcentral tendencyâ can be mean or median

• e.g. âinformationâ can be mutual info (reduction in entropy) or squared correlation (reduction in variance)

• e.g. measuring socioeconomic status and controlling for it

• regression âwhen controlling for another variableâ makes causal assumptions

• must make sure that everything that could confound is controlled for

• Idan Segev: âmodeling is the lie that reveals the truthâ

• picasso: âart is the lie that reveals the truthâ

• box: âall models are wrong but some are usefulâ - statistical pragmatism

• moves from true to useful - less emphasis on truth

• âtruthâ is contingent on the purposes to which it will be put

• the scientific method aims to provide explanatory models (theories) by collecting and analyzing data, according to protocols, so that

• the data provide info about models

• replication is possible

• the models become increasingly accurate

• scientific knowledge is always uncertain - depends on scientific method