Foundations of Statistics: Data, Sampling, Graphs, and Measures of Center & Variation

Study Guide - Smart Notes

Tailored notes based on your materials, expanded with key definitions, examples, and context.

Introduction to Statistics

Statistical Terminology

Statistics is the science of collecting, analyzing, interpreting, and presenting data. Understanding key terminology is essential for interpreting statistical results accurately.

Data: Collections of observations, such as measurements, genders, or survey responses.
Population: The complete collection of all measurements or data that are being considered.
Census: The collection of data from every member of a population.
Sample: A subcollection of members selected from a population.
Voluntary Response Sample: Respondents themselves decide whether to be included, often leading to bias.

Potential Pitfalls of Analyzing Data

Statistical analysis can be compromised by several pitfalls, which must be avoided to ensure valid conclusions.

Misleading Conclusions: Conclusions should be clear and understandable to all audiences.
Sample Data Reported Instead of Measured: Direct measurement is preferred over self-reported data.
Loaded Questions: Poorly worded survey questions can bias results.
Nonresponse: Occurs when individuals refuse or are unavailable to respond, potentially skewing results.
Low Response Rates: Decreases reliability and increases bias.
Percentages: Misleading percentages, especially those exceeding 100%, should be avoided.

Statistical and Practical Significance

Statistical significance means a result is unlikely to occur by chance, while practical significance means the result is meaningful in real-world terms.

Statistical Significance: Determined by probability and sample size.
Practical Significance: Assesses whether the result has real-world impact.
Example: Exam prep increases average test scores from 50% to 80% (practically significant), but may not be statistically significant if the sample size is too small.

Correlation and Causation

Correlation measures the relationship between two variables, but does not imply causation.

Example: Increase in pirates and cell phones over time does not mean one causes the other.

Bias

Bias occurs when a study's design or data collection favors certain outcomes.

Example: AAA surveys its members about driving preferences, likely favoring car use.

Describing, Exploring, and Comparing Data

Parameter and Statistic

Parameters and statistics are numerical measurements describing characteristics of populations and samples, respectively.

Parameter: Describes a population characteristic.
Statistic: Describes a sample characteristic.
Example: 35% of sampled employees own a computer (statistic).

Quantitative and Categorical Data

Data can be classified as quantitative (numerical) or categorical (labels).

Quantitative Data: Numbers representing counts or measurements (e.g., salaries).
Categorical Data: Names or labels (e.g., favorite musicians).

Discrete and Continuous Data

Quantitative data can be discrete (countable) or continuous (uncountable).

Discrete Data: Countable values (e.g., number of murders).
Continuous Data: Uncountable, infinitely precise values (e.g., temperature).

Levels of Measurement

Data can be measured at four levels: nominal, ordinal, interval, and ratio.

Nominal: Categories only (e.g., favorite musicians).
Ordinal: Categories with order (e.g., survey responses).
Interval: Differences, no natural zero (e.g., room temperature).
Ratio: Differences and a natural zero (e.g., voltage measurements).

Collecting Sample Data

Observational Study vs. Experiment

Data can be collected through observational studies or experiments.

Observational Study: Observes individuals and measures variables without influencing responses.
Experiment: Imposes treatments to observe effects.
Example: Observing sleep patterns vs. depriving sleep to measure exam scores.

Design of Experiments

Proper experimental design includes replication, blinding, and randomization to reduce bias and confounding.

Replication: Repeating an experiment.
Blinding: Subjects (and sometimes administrators) do not know if treatment is real or placebo.
Randomization: Assigning individuals to groups randomly.

Sampling Methods

Sampling methods determine how subjects are selected from a population.

Random Sampling: Every individual has an equal chance of selection.
Systematic Sampling: Every kth element is selected.
Convenience Sampling: Easy to obtain, but often biased.
Stratified Sampling: Sample from subgroups, selecting individuals from each.
Cluster Sampling: Select entire groups (clusters) and sample all members within.

Example: Convenience sampling by asking Facebook friends does not represent the general population.

Systematic sampling example Stratified sampling example Cluster sampling example

Types of Observational Studies

Observational studies can be retrospective, cross-sectional, or prospective, depending on when data is collected.

Retrospective: Data collected from past records.
Cross-sectional: Data measured at one point in time.
Prospective: Data collected going forward, observing groups sharing common factors.

Types of observational studies

Confounding and Experimental Design

Confounding occurs when the effect of one variable cannot be separated from another. Experimental designs address confounding through randomization and blocking.

Completely Randomized Design: Subjects are randomly assigned to treatment or control groups.
Randomized Block Design: Subjects are grouped (blocked) by a characteristic, then randomly assigned within blocks.
Matched Pairs Design: Subjects are paired (e.g., twins or before/after) and each pair receives different treatments.

Completely randomized design Randomized block design Matched pairs design

Exploring Data with Tables and Graphs

Frequency Distributions

Frequency distributions summarize data by showing how often each value occurs.

Relative Frequency Distribution: Shows the percentage of each frequency.
Cumulative Distribution: Accumulates frequencies as values increase.
Normal Distribution: Bell-shaped, symmetric distribution.
Other Distributions: Uniform (all values equal), V-shaped, skewed.

Histograms

Histograms are graphical representations of frequency distributions, using bins or classes to group data.

Bin Width: The distance between numbers in a histogram.
Class Width: Difference between consecutive lower class limits.
Lower/Upper Class Limits: Smallest/largest values in each bin.
Too Few Bins: Over-smoothed, hard to see patterns.
Too Many Bins: Under-smoothed, noisy representation.
Histograms Show: Center, variation, shape, outliers.
Skewness: Right-skewed (long right tail), left-skewed (long left tail).

Graphs

Various graphs are used to visualize data, each serving a specific purpose.

Time-Series Graph: Plots quantitative data over time to identify trends.
Pareto Chart: Bar graph for categorical data, bars in descending order of frequency.
Pie Chart: Shows proportions of categories as slices of a circle.
Scatterplot: Plots two quantitative variables to show relationships.

Time-series graph example Pareto chart example Pie chart example Scatterplot example

Graphs That Deceive

Graphs can be misleading if not constructed carefully. Always check for nonzero vertical axes and pictographs that exaggerate differences.

Nonzero Vertical Axis: Can exaggerate differences between groups.
Pictographs: Use images to represent data, often distorting proportions.

Nonzero vertical axis example Pictograph example

Describing, Exploring, and Comparing Data

Measures of Center

Measures of center describe the central tendency of a dataset.

Mean: Arithmetic average, sensitive to outliers.
Median: Middle value, robust to outliers.
Mode: Most frequent value, useful for categorical data.
Midrange: Average of maximum and minimum values, rarely used.

Example: For the dataset 20, 25, 25, 25, 30, 35, 45, 50, 50, 50, 75:

Mean:
Median: 35
Mode: 25 and 50 (both occur three times)
Midrange:

Mean calculation example Summary of measures of center

Measures of Center and Skewness

The relationship between mean, median, and mode indicates the skewness of a distribution.

Symmetric: Mean = Median = Mode.
Skewed Left (Negative): Mean < Median < Mode.
Skewed Right (Positive): Mode < Median < Mean.

Measures of center and skewness

Measures of Variation

Range, Standard Deviation, and Variance

Measures of variation describe the spread of data values.

Range: Difference between maximum and minimum values.
Standard Deviation: Measures variation about the mean. Outliers increase standard deviation.
Variance: Square of the standard deviation.
Notation: s = sample standard deviation, s² = sample variance, σ = population standard deviation, σ² = population variance.

Example: For the "Space Mountain" dataset:

Range: 75 - 20 = 55
Standard Deviation: 16.56
Variance: 274.09

Dataset with measures of variation Empirical rule for bell-shaped distributions

The Empirical Rule for Bell-Shaped Distributions

For data sets with a normal (bell-shaped) distribution, the Empirical Rule states:

About 68% of data values fall within one standard deviation of the mean.
About 95% fall within two standard deviations.
About 99.7% fall within three standard deviations.

Range Rule of Thumb: