# Data and Sampling

## Studies

### Study types

• Observational study - no random assignment. Only correlation can be established; causation cannot be inferred
• Experimental study - random assignment is used, so causation may be inferred

### Variables in studies

• Response variable - the variable being studied
• Explanatory variable - a variable which explains changes in the response variable
• Confounding variable - a variable which explains a correlation between the response and explanatory variables
• Blocking variable - a variable used to split the data into categories (blocks) if it is suspected the variable may have an impact on the response variable

### Variable types

• Variables can either be numerical or categorical
• Numerical variables can either be discrete or continuous
• Categorical variables may be ordinal in which case categories have a natural ordering

### Sampling Techniques

• Simple random sampling - randomly sample from whole population
• Stratified sampling - split population into strata, randomly sample from each stratum
• Cluster sampling - split population into clusters, randomly select some clusters then randomly select from those clusters

Stratified sampling is used when we want to ensure we gather statistics for particular groups.

Cluster sampling is used to reduce the number of samples required.
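The three techniques can be sketched with Python's `random` module. The population here (1000 people tagged with a region) is purely illustrative:

```python
import random

# Illustrative population: 1000 people, each tagged with a region.
random.seed(42)
population = [{"id": i, "region": random.choice(["north", "south", "east", "west"])}
              for i in range(1000)]

# Simple random sampling: draw directly from the whole population.
srs = random.sample(population, 50)

# Stratified sampling: split into strata by region, sample from each stratum.
strata = {}
for person in population:
    strata.setdefault(person["region"], []).append(person)
stratified = [p for stratum in strata.values() for p in random.sample(stratum, 10)]

# Cluster sampling: treat regions as clusters, randomly pick some clusters,
# then randomly sample within only the chosen clusters.
chosen_clusters = random.sample(list(strata), 2)
clustered = [p for c in chosen_clusters for p in random.sample(strata[c], 20)]
```

Note that stratified sampling guarantees every region appears in the sample, while cluster sampling only visits the selected clusters.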

### Bias

• Convenience sample - individuals who are easiest to reach are over-represented
• Non-response - only a (possibly non-random) fraction of the chosen sample responds
• Voluntary response - the sample consists of people who volunteer themselves, who often hold strong opinions

### Experimental Design

1. Control - control differences between treatment groups under study
2. Randomisation - randomly assign to treatment groups
3. Replication - studies should be replicable; within a single study, use a large enough sample
4. Block - handle other variables which might impact response variable

## Numerical Data

### Measures of Centre

• Mean - arithmetic average
• Median - “middle” number, or if even number of values, average of middle 2
• Mode - most frequent value
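A quick check of all three with Python's stdlib `statistics` module, on a toy sample with an even number of values:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # illustrative sample, even number of values

mean = statistics.mean(data)      # arithmetic average: 30 / 6 = 5
median = statistics.median(data)  # average of the middle two values (3 and 5) = 4
mode = statistics.mode(data)      # 3 appears most often
```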

### Measures of Spread

#### Interquartile Range (IQR)

Q3 - Q1

#### Range

max - min

#### Variance

s^2 = (sum_(i=1)^n (x_i - bar x)^2) / (n - 1)

#### Standard Deviation

s = sqrt(s^2) = sqrt((sum_(i=1)^n (x_i - bar x)^2) / (n - 1))
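The n − 1 denominator (Bessel's correction) is exactly what `statistics.variance` and `statistics.stdev` use; a manual version on toy data:

```python
import math
import statistics

data = [4, 7, 6, 3, 5]  # illustrative sample
xbar = statistics.mean(data)  # 5

# Sample variance with the n - 1 denominator (Bessel's correction).
s2 = sum((x - xbar) ** 2 for x in data) / (len(data) - 1)
s = math.sqrt(s2)

# The stdlib agrees: statistics.variance / statistics.stdev divide by n - 1.
assert s2 == statistics.variance(data)
```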

### Z-score

The Z-score is the number of standard deviations a value is from the mean

Z = (x − μ) / σ
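As a sketch (the values are made up, in the style of IQ scores distributed N(100, 15)):

```python
# Z-score: number of standard deviations a value lies from the mean.
def z_score(x, mu, sigma):
    return (x - mu) / sigma

# Illustrative: a score of 130 on a scale with mean 100 and sd 15
z = z_score(130, 100, 15)  # 2.0 standard deviations above the mean
```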

### Sampling Error

Sample statistics taken from a population will always vary from the actual population values

#### Standard Error

Sampling variability of the mean

SE = σ / sqrt(n)

When standard deviation (σ) is not known, sample standard deviation s is used to estimate standard error.

SE = s / sqrt(n)

### Central Limit Theorem (baby version)

bar x ~ N(mean = μ, SE = σ / sqrt(n))

Conditions:

1. Sample observations must be independent. If sampling without replacement, n must be less than 10% of the population
2. Population is either normal or n is large (> 30 approx)
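The theorem can be checked by simulation: draw many samples from a deliberately skewed (exponential) population and compare the spread of the sample means against σ / sqrt(n). All numbers here are illustrative:

```python
import math
import random
import statistics

random.seed(0)

# Skewed population; the CLT says means of large-enough samples are ~normal anyway.
population = [random.expovariate(1.0) for _ in range(100_000)]
mu = statistics.mean(population)
sigma = statistics.pstdev(population)

# Draw 1000 samples of size n = 50 and record each sample mean.
n = 50
sample_means = [statistics.mean(random.sample(population, n)) for _ in range(1000)]

# The simulated standard error should be close to sigma / sqrt(n).
simulated_se = statistics.stdev(sample_means)
predicted_se = sigma / math.sqrt(n)
```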

### Confidence Intervals

A confidence interval gives a margin of error for a point estimate from a sample

point estimate ± margin of error

#### Margin of error

Half the width of confidence interval

Margin of error = z_r * SE = (z_r * σ) / sqrt(n)

Where z_r is the Z-score of the cut-off point

WKU/David Neal - Z–scores and Confidence Intervals
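A 95% interval for a mean can be assembled from these pieces. The measurements below are illustrative, and 1.96 is the usual z cut-off for 95% confidence:

```python
import math
import statistics

sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]  # illustrative data
n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)  # estimates the unknown sigma

z_star = 1.96                 # cut-off for a 95% interval
se = s / math.sqrt(n)
margin = z_star * se          # margin of error = half the interval width
ci = (xbar - margin, xbar + margin)
```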

## Hypothesis Testing

### Competing claims

• Null Hypothesis - H0 - skeptical perspective
• Alternative Hypothesis - Ha - point of view under consideration

### Construction of the hypothesis

Alternative hypothesis is one of:

• One Sided - μ < null value or μ > null value
• Two Sided - μ ≠ null value

UCLA - What are the differences between one-tailed and two-tailed tests?

Notes on hypothesis Construction:

• Always construct hypotheses about population parameters, not sample statistics

### Errors in test

• Type 1 - rejecting a true null hypothesis
• Type 2 - failing to reject a false null hypothesis

### Significance level

Write significance level as α

α = P(type 1 error)

α = P(rejecting a true null hypothesis)

Yale - Tests of Significance

### P-value

p−value = P(observed or more extreme sample statistic | H0 is true)

• p-value < significance level: reject the null hypothesis
• p-value > significance level: fail to reject the null hypothesis

### Z-score for hypothesis Testing

Z = (sample statistic - null value) / SE

e.g. for the mean:

Z = (bar x - mu) / (SE)
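A worked one-sample z-test with made-up numbers (H0: μ = 100, σ known), using `statistics.NormalDist` for the normal CDF:

```python
import math
from statistics import NormalDist

# Hypothetical example: H0: mu = 100 vs Ha: mu != 100, known sigma = 15, n = 36.
mu0, sigma, n = 100, 15, 36
xbar = 105.5  # observed sample mean (illustrative)

se = sigma / math.sqrt(n)                     # 15 / 6 = 2.5
z = (xbar - mu0) / se                         # (105.5 - 100) / 2.5 = 2.2
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
```

Here p ≈ 0.028 < 0.05, so at α = 0.05 we would reject H0.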



## T-Distribution

Used when the sample size is small and the population standard deviation is unknown

• Similar to the normal distribution but with fatter tails - more probability of values far from the centre.
• Used for confidence intervals / hypothesis testing on a single mean

### T-Distribution With Single Mean

#### Degrees of Freedom

The t-distribution is a family of distributions with a degrees of freedom parameter.

degrees of freedom = n-1

#### Confidence Interval

CI = bar x ± t_(df)^** * SE

#### Hypothesis Testing

T_(df) = (bar x - mu) / (SE)
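Putting the single-mean pieces together on toy data. The critical value t* for 5 degrees of freedom is taken from a t-table (≈ 2.571 at 95% confidence):

```python
import math
import statistics

# Illustrative small sample; H0: mu = 5.
sample = [4.2, 5.1, 4.8, 5.6, 4.9, 5.3]
mu0 = 5
n = len(sample)
df = n - 1  # 5

xbar = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(n)
t_stat = (xbar - mu0) / se

# 95% CI using t* for df = 5 (from a t-table: approximately 2.571).
t_star = 2.571
ci = (xbar - t_star * se, xbar + t_star * se)
```

Since |T| < t* (equivalently, the CI contains the null value 5), this sample fails to reject H0.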

### T-Distribution For Comparing Two Independent Means

The t-distribution can be used to test the significance of the difference between two sample means when the population standard deviations are unknown

Conditions:

• Sampled observations are independent within groups
• The two groups should be independent
• Skew / sample size - need low skew otherwise large sample size

#### Degrees of Freedom

df = min(n_1 - 1, n_2 - 1)

This is a conservative approximation; statistical software typically uses the more accurate Welch–Satterthwaite formula.

#### Standard Error

SE_((bar x_1 - bar x_2)) = sqrt( s_1^2 / n_1 + s_2^2 / n_2 )

#### Confidence Interval

CI = (bar x_1 - bar x_2) ± t_(df)^** * SE

#### Hypothesis Testing

T_(df) = ((bar x_1 - bar x_2) - (mu_1 - mu_2)) / (SE)
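The two-sample formulas in sequence, on two illustrative groups (e.g. scores under two treatments), testing H0: μ1 − μ2 = 0:

```python
import math
import statistics

# Illustrative independent groups.
g1 = [23, 25, 28, 21, 26, 24, 27]
g2 = [19, 22, 20, 23, 18, 21]

n1, n2 = len(g1), len(g2)
xbar1, xbar2 = statistics.mean(g1), statistics.mean(g2)
s1, s2 = statistics.stdev(g1), statistics.stdev(g2)

# SE of the difference of means.
se = math.sqrt(s1**2 / n1 + s2**2 / n2)
df = min(n1 - 1, n2 - 1)             # conservative degrees of freedom
t_stat = ((xbar1 - xbar2) - 0) / se  # H0: mu_1 - mu_2 = 0
```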

### T-Distribution For Comparing Two Dependent Means

#### Degrees of Freedom

df = n_(diff) - 1

#### Standard Error

SE_(diff) = s_(diff) / sqrt(n_(diff))

#### Confidence Interval

CI = bar x_(diff) ± t_(df)^** * SE

#### Hypothesis Testing

T_(df) = (bar x_(diff) - mu_(diff)) / (SE)
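Dependent (paired) data reduces to a one-sample problem on the per-subject differences. The before/after values below are made up:

```python
import math
import statistics

# Illustrative before/after measurements on the same five subjects.
before = [200, 195, 210, 190, 205]
after = [192, 190, 204, 188, 196]

diffs = [b - a for b, a in zip(before, after)]
n_diff = len(diffs)
df = n_diff - 1

xbar_diff = statistics.mean(diffs)
se_diff = statistics.stdev(diffs) / math.sqrt(n_diff)
t_stat = (xbar_diff - 0) / se_diff  # H0: mu_diff = 0
```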

## ANOVA - Analysis of Variance

Used to compare more than two groups

### Conditions

• independence - within and between groups (unless using repeated measures ANOVA)
• nearly normal distributions
• constant variance across groups - homoscedastic

|                   | Degrees of Freedom | Sum of Squares | Mean Squares | F Value | Pr(>F)  |
| ----------------- | ------------------ | -------------- | ------------ | ------- | ------- |
| Group             | df_G               | SSG            | MSG          | F       | p-value |
| Error (Residuals) | df_E               | SSE            | MSE          |         |         |
| Total             | df_T               | SST            |              |         |         |

### SST - Sum Of Squares Total

SST = sum_(i=1)^n (y_i - bar y)^2

• y_i - value of the response variable for each observation
• bar y - mean of the response variable

### SSG - Sum Of Squares Group

SSG = sum_(j=1)^k n_j * (bar y_j - bar y)^2

• n_j - number of observations in group j
• bar y_j - mean of the response variable for group j
• bar y - mean of the response variable

### SSE - Sum Of Squares Error

SSE = SST - SSG

### Degrees Of Freedom

df_T = n - 1

df_G = k - 1

df_E = df_T - df_G

### Mean Squares

MSG = SSG / df_G

MSE = SSE / df_E

### F-Statistic

F = MSG / MSE
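The whole SST → SSG → SSE → F chain fits in a few lines. The three groups here are illustrative:

```python
import statistics

# Three illustrative groups of observations.
groups = {
    "A": [4, 5, 6, 5],
    "B": [7, 8, 6, 7],
    "C": [2, 3, 4, 3],
}

all_values = [y for g in groups.values() for y in g]
grand_mean = statistics.mean(all_values)
n, k = len(all_values), len(groups)

# Total, group (between), and error (within) sums of squares.
sst = sum((y - grand_mean) ** 2 for y in all_values)
ssg = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups.values())
sse = sst - ssg

df_g, df_e = k - 1, n - k  # df_E = df_T - df_G = (n - 1) - (k - 1)
msg, mse = ssg / df_g, sse / df_e
f_stat = msg / mse
```

A large F means the between-group variability dominates the within-group variability.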

### Bonferroni Correction

Adjustment of the significance level for multiple comparisons

alpha^** = alpha / K, where K = (k(k-1)) / 2 is the number of pairwise comparisons

### Standard Error (for multiple pairwise comparisons)

SE = sqrt( (M S E)/n_1 + (M S E)/n_2 )

### Degrees of Freedom (for multiple pairwise comparisons)

df = df_E

## Sampling & CLT for Proportions

Distribution of sample proportions

hat p ~ N(mean = p, SE = sqrt((p(1-p))/n))

Conditions:

• Independence - random sample. If without replacement n < 10% of population
• Sample size / skew (at least 10 successes and 10 failures)

## CI for Proportions

point estimate ± margin of error

hat p ± z^** * SE_(hat p)

SE_(hat p) = sqrt((hat p (1 - hat p))/n)
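A 95% interval for a proportion, using a made-up poll of 520 successes in 1000 trials:

```python
import math

# Illustrative poll: 520 successes out of 1000 trials.
successes, n = 520, 1000
p_hat = successes / n

# Success/failure condition: at least 10 successes and 10 failures.
assert successes >= 10 and n - successes >= 10

se = math.sqrt(p_hat * (1 - p_hat) / n)
z_star = 1.96  # 95% confidence
ci = (p_hat - z_star * se, p_hat + z_star * se)
```

Note the interval here straddles 0.5, so this hypothetical poll would not show a significant majority.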

## Linear Regression

Conditions:

• Linearity - relationship between explanatory and response
• Nearly normal residuals
• Constant variability

A fan-shaped residuals plot indicates that as the explanatory variable increases, the variability of the response variable also increases. This violates the constant variability condition for linear regression.

Strength of fit

R^2

• Square of the correlation coefficient
• Percentage of variability in the response variable explained by the model
• Always between 0 and 1
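Computing R^2 as the squared correlation coefficient, on illustrative data with a nearly linear relationship:

```python
import math

# Illustrative x/y data with a roughly linear relationship.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Sums of squares and cross-products for the correlation coefficient.
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)  # correlation coefficient
r_squared = r ** 2              # fraction of variability in y explained by the model
```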
© Will Robertson