Data and Sampling


Study types

Variables in studies

Variable types

Sampling Techniques

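Stratified sampling is used when we want to ensure that each group (stratum) of interest is represented: the population is divided into strata and a random sample is taken from each one.

Cluster sampling is used to reduce the cost of sampling: the population is divided into clusters, a few clusters are chosen at random, and only those clusters are sampled.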
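The difference between the two techniques can be sketched in a few lines of Python. This is only an illustration: the population of `people`, the town labels and the helper names `stratified_sample` and `cluster_sample` are hypothetical, and the cluster version keeps every unit in each chosen cluster.

```python
import random


def stratified_sample(units, stratum_of, per_stratum, rng):
    """Draw `per_stratum` units at random from every stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, per_stratum))
    return sample


def cluster_sample(units, cluster_of, n_clusters, rng):
    """Pick `n_clusters` whole clusters at random and keep every unit in them."""
    clusters = {}
    for unit in units:
        clusters.setdefault(cluster_of(unit), []).append(unit)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]


rng = random.Random(0)
# Hypothetical population: 40 people, each tagged with one of four towns.
people = [(i, "ABCD"[i % 4]) for i in range(40)]

strat = stratified_sample(people, lambda p: p[1], 3, rng)  # 3 people per town
clust = cluster_sample(people, lambda p: p[1], 2, rng)     # everyone in 2 towns
```

The stratified sample touches every town, while the cluster sample only visits two of them, which is where the cost saving comes from.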


Experimental Design

  1. Control - control differences between treatment groups under study
  2. Randomisation - randomly assign to treatment groups
  3. Replication - studies should be replicable; a single study should use a large sample
  4. Block - handle other variables which might impact response variable

Numerical Data

Measures of Centre

Measures of Spread

Inter-quartile range (IQR)

`Q3 - Q1`


Range

`max - min`


Variance

`s^2 = (sum_(i=1)^n (x_i - bar x)^2) / (n - 1)`

Standard Deviation

`s = sqrt(s^2) = sqrt((sum_(i=1)^n (x_i - bar x)^2) / (n - 1))`
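As a rough check of the spread formulas above, Python's standard `statistics` module can compute each measure. The `data` list is a made-up sample, and the quartiles depend on the interpolation method the module uses by default, so other tools may report slightly different values for the IQR.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

# Quartiles: quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

value_range = max(data) - min(data)

# Sample variance and standard deviation, both using the n - 1 denominator.
variance = statistics.variance(data)
std_dev = statistics.stdev(data)
```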


Z-score

The Z-score is the number of standard deviations a value is from the mean

`Z = (x - mu) / sigma`
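A minimal sketch of the Z-score calculation, assuming a small invented list of `scores` and using the sample mean and sample standard deviation as stand-ins for the population values:

```python
import statistics


def z_score(value, mean, std_dev):
    """Number of standard deviations `value` lies from `mean`."""
    return (value - mean) / std_dev


scores = [60, 70, 80, 90, 100]  # hypothetical data
mean = statistics.mean(scores)
sd = statistics.stdev(scores)

z_top = z_score(100, mean, sd)    # positive: above the mean
z_centre = z_score(80, mean, sd)  # zero: exactly at the mean
```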


Sampling Error

A statistic computed from a sample will almost always differ from the true population value

Standard Error

Sampling variability of the mean

`SE = σ / sqrt(n)`

When the population standard deviation (σ) is not known, the sample standard deviation s is used to estimate the standard error.

`SE = s / sqrt(n)`
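Both versions of the standard error can be sketched as follows. The function names and the numbers are illustrative assumptions, not part of the notes.

```python
import math
import statistics


def se_known_sigma(sigma, n):
    """Standard error of the mean when the population sd is known."""
    return sigma / math.sqrt(n)


def se_estimated(sample):
    """Standard error of the mean estimated from the sample sd s."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


se_a = se_known_sigma(sigma=15, n=25)

sample = [12.1, 11.8, 12.4, 12.0, 11.7, 12.3, 12.2, 11.9, 12.0]  # hypothetical
se_b = se_estimated(sample)
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.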

Central Limit Theorem (baby version)

`bar x ~ N(mean = μ, SE = σ / sqrt(n))`


Conditions:

  1. Sample observations must be independent. If sampling without replacement, n must be less than 10% of the population
  2. Population is either normal or n is large (> 30 approx)
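The theorem can be checked informally by simulation. The sketch below assumes a right-skewed exponential population with mean 1 and standard deviation 1 (chosen only for illustration), draws many samples of size 50, and compares the spread of the sample means with `σ / sqrt(n)`.

```python
import random
import statistics

rng = random.Random(42)
n = 50            # sample size; > 30, so a skewed population is acceptable
n_samples = 2000  # number of repeated samples

# Hypothetical population: exponential with rate 1, so mu = 1 and sigma = 1.
sample_means = [
    statistics.mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

observed_centre = statistics.mean(sample_means)  # should be close to mu
observed_se = statistics.stdev(sample_means)     # should be close to sigma / sqrt(n)
predicted_se = 1.0 / n ** 0.5
```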

Confidence Intervals

A confidence interval gives a margin of error for a point estimate from a sample

point estimate ± margin of error

Margin of error

Half the width of the confidence interval

Margin of error = `z_r * SE = (z_r * σ) / sqrt(n)`

Where `z_r` is the Z-score of the cut-off point
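A hedged sketch of the interval calculation, using `statistics.NormalDist` from the standard library to look up `z_r`. The function name `z_confidence_interval` and the example numbers (mean 100, σ = 15, n = 36) are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(mean, sigma, n, confidence=0.95):
    """Return (low, high, margin) for a mean with known population sd."""
    # Cut-off Z-score leaving (1 - confidence) / 2 in each tail.
    z_r = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_r * sigma / sqrt(n)
    return mean - margin, mean + margin, margin


low, high, margin = z_confidence_interval(mean=100, sigma=15, n=36)
```

For a 95% interval the cut-off is about 1.96, so the margin here is roughly 1.96 × 2.5.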

WKU/David Neal - Z–scores and Confidence Intervals

Hypothesis Testing

Competing claims

Construction of the hypothesis

Alternative hypothesis is one of:

UCLA - What are the differences between one-tailed and two-tailed tests?

Notes on hypothesis Construction:

Errors in test

Significance level

Write significance level as α

α = P(type 1 error)

α = P(rejecting a true null hypothesis)

Yale - Tests of Significance


p-value

p-value = P(observed or more extreme sample statistic | H0 is true)

Z-score for hypothesis Testing

Z = (sample statistic - null value) / SE

e.g. for the mean:

`Z = (bar x - mu) / (SE)`
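Putting the Z-score and p-value together, a test for a single mean might be sketched like this. It assumes a known population standard deviation; the function name `z_test` and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, null_value, sigma, n, two_sided=True):
    """Return (Z, p-value) for a test of a single mean."""
    se = sigma / sqrt(n)
    z = (sample_mean - null_value) / se
    # Probability of a statistic at least this far from the null value,
    # measured in the direction of the observed result.
    tail = 1 - NormalDist().cdf(abs(z))
    p_value = 2 * tail if two_sided else tail
    return z, p_value


z, p = z_test(sample_mean=103, null_value=100, sigma=15, n=100)
reject_null = p < 0.05  # compare the p-value with the significance level
```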



T-Distribution

Used when the sample size is small and the population standard deviation is unknown

T-Distribution With Single Mean

Degrees of Freedom

The t-distribution is a family of distributions with a degrees of freedom parameter.

degrees of freedom = n-1

Confidence Interval

`CI = bar x ± t_(df)^** * SE`

Hypothesis Testing

`T_(df) = (bar x - mu) / (SE)`
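The single-mean formulas can be sketched as below. Python's standard library has no t-distribution, so this example assumes the critical value `t_star` is read from a t-table (2.262 is the two-sided 95% value for df = 9); the sample and the function name are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, null_value, t_star):
    """Return (df, T statistic, confidence interval) for a single mean."""
    n = len(sample)
    df = n - 1
    x_bar = mean(sample)
    se = stdev(sample) / sqrt(n)
    t_stat = (x_bar - null_value) / se
    ci = (x_bar - t_star * se, x_bar + t_star * se)
    return df, t_stat, ci


sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5, 5.2, 5.7, 5.9]  # hypothetical
# t_star comes from a t-table: two-sided 95% critical value for df = 9.
df, t_stat, ci = one_sample_t(sample, null_value=5.0, t_star=2.262)
```

If the absolute value of `t_stat` exceeds `t_star`, the result is significant at the matching level.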

T-Distribution For Comparing Two Independent Means

The t-distribution can be used to test the significance of the difference between two sample means when the population standard deviations are unknown


Degrees of Freedom

`df = min(n_1 - 1, n_2 - 1)`

Standard Error

`SE_((bar x_1 - bar x_2)) = sqrt( s_1^2 / n_1 + s_2^2 / n_2 )`

Confidence Interval

`CI = (bar x_1 - bar x_2) ± t_(df)^** * SE`

Hypothesis Testing

`T_(df) = ((bar x_1 - bar x_2) - (mu_1 - mu_2)) / (SE)`
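A minimal sketch of the two-sample calculation, following the conservative `df = min(n_1 - 1, n_2 - 1)` rule used in these notes. The two groups and the function name `two_sample_t` are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def two_sample_t(sample_1, sample_2, null_diff=0.0):
    """Return (df, SE, T statistic) for two independent means."""
    n_1, n_2 = len(sample_1), len(sample_2)
    df = min(n_1 - 1, n_2 - 1)
    se = sqrt(stdev(sample_1) ** 2 / n_1 + stdev(sample_2) ** 2 / n_2)
    t_stat = ((mean(sample_1) - mean(sample_2)) - null_diff) / se
    return df, se, t_stat


group_a = [10, 12, 14, 16, 18]  # hypothetical data
group_b = [9, 10, 11, 12]
df, se, t_stat = two_sample_t(group_a, group_b)
```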

T-Distribution For Comparing Two Dependent Means

Degrees of Freedom

`df = n_{d i f f} - 1`

Standard Error

`SE_{d i f f} = s_{d i f f} / sqrt(n_{d i f f})`

Confidence Interval

`CI = bar x_{d i f f} ± t_(df)^** * SE`

Hypothesis Testing

`T_(df) = (bar x_{d i f f} - mu_{d i f f}) / (SE)`
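For paired data the test reduces to a single-mean test on the differences. The sketch below assumes invented `before` and `after` measurements on the same five subjects.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after, null_diff=0.0):
    """Return (df, SE, T statistic) for two dependent (paired) means."""
    diffs = [a - b for b, a in zip(before, after)]
    n_diff = len(diffs)
    df = n_diff - 1
    se = stdev(diffs) / sqrt(n_diff)
    t_stat = (mean(diffs) - null_diff) / se
    return df, se, t_stat


before = [72, 65, 80, 77, 69]  # hypothetical paired measurements
after = [75, 70, 82, 78, 74]
df, se, t_stat = paired_t(before, after)
```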

ANOVA - Analysis of Variance

Used to compare more than two groups


                    Degrees of Freedom   Sum of Squares   Mean Squares   F Value   Pr(>F)
Group               df_G                 SSG              MSG
Error (Residuals)   df_E                 SSE              MSE
Total               df_T                 SST

SST - Sum Of Squares Total

`SST = sum_(i=1)^n (y_i - bar y)^2`

SSG - Sum Of Squares Group

`SSG = sum_(j=1)^k n_j * (bar y_j - bar y)^2`

SSE - Sum Of Squares Error

`S S E = S S T - S S G`

Degrees Of Freedom

`d f_T = n - 1`

`d f_G = k - 1`

`d f_E = d f_T - d f_G`

Mean Squares

`M S G = (S S G) / (d f_G)`

`M S E = (S S E) / (d f_E)`


F Statistic

`F = (M S G) / (M S E)`
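The ANOVA quantities above can be computed step by step. This is a sketch under assumptions: the three small `groups` are made up, and `one_way_anova` is a hypothetical helper that returns the table entries but not the p-value, since the standard library has no F-distribution.

```python
from statistics import mean


def one_way_anova(groups):
    """Return the entries of a one-way ANOVA table as a dict."""
    all_values = [y for group in groups for y in group]
    n, k = len(all_values), len(groups)
    grand_mean = mean(all_values)

    sst = sum((y - grand_mean) ** 2 for y in all_values)
    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    sse = sst - ssg

    df_t = n - 1
    df_g = k - 1
    df_e = df_t - df_g

    msg = ssg / df_g
    mse = sse / df_e
    return {"SST": sst, "SSG": ssg, "SSE": sse,
            "df_T": df_t, "df_G": df_g, "df_E": df_e,
            "MSG": msg, "MSE": mse, "F": msg / mse}


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical data
table = one_way_anova(groups)
```

A large F value means the variability between group means is large relative to the variability within groups.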

Bonferroni Correction

Adjustment of significance level

`alpha^** = alpha / K`, where `K = (k(k-1)) / 2` is the number of pairwise comparisons between `k` groups

Standard Error (for multiple pairwise comparisons)

`SE = sqrt( (M S E)/n_1 + (M S E)/n_2 )`

Degrees of Freedom (for multiple pairwise comparisons)

`d f = d f_E`
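A short sketch of the correction and the pairwise standard error. The helper names and the inputs (`alpha = 0.05`, three groups, `MSE = 1.0`) are assumptions for illustration.

```python
from math import sqrt


def bonferroni_alpha(alpha, k):
    """Adjusted significance level for all pairwise comparisons of k groups."""
    comparisons = k * (k - 1) // 2
    return alpha / comparisons


def pairwise_se(mse, n_1, n_2):
    """Standard error for comparing two group means after ANOVA."""
    return sqrt(mse / n_1 + mse / n_2)


alpha_star = bonferroni_alpha(0.05, k=3)  # 3 groups give 3 comparisons
se = pairwise_se(mse=1.0, n_1=3, n_2=3)
```

Each pairwise test is then run at `alpha_star` rather than `alpha`, using the t-distribution with `df_E` degrees of freedom.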

Categorical Stuff

Sampling & CLT for Proportions

Distribution of sample proportions

`hat p ~ N(mean = p, SE = sqrt((p(1-p))/n))`


CI for Proportions

point estimate ± margin of error

`hat p +- z^** SE_{hat p}`

`SE_{hat p} = sqrt((hat p(1- hat p))/n)`
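The interval for a proportion might be sketched as follows, again using `statistics.NormalDist` for `z^**`. The counts (60 successes out of 100) and the function name `proportion_ci` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Return (low, high) for a confidence interval of a sample proportion."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p_hat - z_star * se, p_hat + z_star * se


low, high = proportion_ci(successes=60, n=100)
```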

Linear Regression


Fan-shaped residuals plot - as the value of the explanatory variable increases, the variability of the response variable increases. This violates the constant-variability condition for linear regression.

Strength of fit


© Will Robertson