Foundations of Statistics: Data Collection, Organization, and Summarization

Study Guide - Smart Notes

Introduction to the Practice of Statistics

Definitions and Key Concepts

Statistics is the science of collecting, organizing, analyzing, and interpreting data to make informed decisions. Understanding the foundational terminology is essential for further study in statistics.

Data: Information from observations, counts, measurements, or responses.
Population: The entire group being studied.
Sample: A subset of the population selected for analysis.
Parameter: A numerical description of a population characteristic.
Statistic: A numerical description of a sample characteristic.
Qualitative Variable: A variable that categorizes or describes an element of a population.
Quantitative Variable: A variable that quantifies an element of a population.
Discrete Variable: A quantitative variable with a countable number of values.
Continuous Variable: A quantitative variable that can take any value within an interval, so its possible values cannot be counted.

Example: Surveying 100 students about their favorite color (qualitative) or their height (quantitative).

Distinguishing Data Sets

Population Data Set: Includes all outcomes, responses, measurements, or counts of interest.
Sample Data Set: Includes only part of the population.

Example: Measuring the average height of all students in a school (population) vs. a selected class (sample).

Parameters vs. Statistics

Parameter: Describes a characteristic of a population.
Statistic: Describes a characteristic of a sample.

Example: The mean age of all U.S. senators (parameter) vs. the mean age of a sample of 10 senators (statistic).

Qualitative vs. Quantitative Data

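Qualitative: Descriptive categories or labels (e.g., colors, labels). Numeric codes such as zip codes are still qualitative because arithmetic on them is meaningless.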
Quantitative: Numeric, measurable (e.g., height, weight).

Discrete vs. Continuous Variables

Discrete: Countable values (e.g., number of students).
Continuous: Any value within an interval, measured rather than counted (e.g., height, time).

Observational Studies vs. Designed Experiments

Definitions

Observational Study: Observes outcomes without influencing them.
Experiment: Applies a treatment and observes its effect.
Confounding: Occurs when the effects of two or more explanatory variables on the response cannot be separated.
Lurking Variable: A variable not considered in the study that affects the outcome.

Example: Studying the effect of a new drug (experiment) vs. observing health outcomes in a population (observational study).

Sampling Methods

Simple Random Sample

Every possible sample of size n has an equal chance of being selected, so each member of the population is equally likely to be chosen.
Use random number tables or generators to select samples.
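
A simple random sample can be drawn with a random number generator, as in the minimal sketch below. The population of student ID numbers 1 to 500 and the sample size of 10 are hypothetical, chosen only for illustration.

```python
import random

# Hypothetical population: student ID numbers 1 through 500 (illustrative only).
population = list(range(1, 501))

# A fixed seed makes the draw reproducible; omit it for a fresh sample each run.
rng = random.Random(42)

# sample() draws without replacement, so no ID can be picked twice,
# and every possible set of 10 IDs is equally likely.
srs = rng.sample(population, k=10)

print(sorted(srs))
```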

Other Sampling Methods

Stratified Sampling: Divides the population into non-overlapping groups (strata) and takes a random sample from each one.
Systematic Sampling: Selects every k-th individual from a list.
Cluster Sampling: Divides the population into clusters, randomly selects clusters, and includes every individual in the selected clusters.
Convenience Sampling: Uses readily available subjects (not recommended for valid inference).
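
The difference between stratified and cluster sampling is easiest to see side by side. The sketch below uses a hypothetical school of four rooms with five students each; the room names, student labels, and sample sizes are assumptions made for illustration.

```python
import random

# Hypothetical population: 4 rooms (A-D) with 5 students each.
groups = {f"Room {r}": [f"{r}{i}" for i in range(1, 6)] for r in "ABCD"}

rng = random.Random(0)  # fixed seed so the example is reproducible

# Stratified: every room is a stratum; draw 2 students at random from EACH room.
stratified = {room: rng.sample(students, 2) for room, students in groups.items()}

# Cluster: every room is a cluster; pick 2 WHOLE rooms at random
# and include all students in the chosen rooms.
chosen_rooms = rng.sample(sorted(groups), 2)
cluster = [s for room in chosen_rooms for s in groups[room]]

print("stratified:", stratified)
print("cluster:", chosen_rooms, cluster)
```

Stratified sampling guarantees that every group is represented; cluster sampling surveys fewer groups but covers each chosen group completely.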

How to Do a Systematic Sample

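Know the sample size n and the population size N.
Compute the sampling interval k by dividing N by n and rounding down.
Pick a random starting point between 1 and k.
Select that individual and every k-th individual after it until n individuals are chosen.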
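
The procedure can be sketched in a few lines of Python. The function name `systematic_sample` and the roster of 100 numbered individuals are hypothetical; the interval k is taken as the population size divided by the sample size, rounded down, which is one common convention.

```python
import random

def systematic_sample(population, n, rng=random):
    """Return a systematic sample of n items from an ordered list."""
    N = len(population)
    if not 0 < n <= N:
        raise ValueError("n must be between 1 and the population size")
    k = N // n                  # sampling interval, rounded down
    start = rng.randint(1, k)   # random start between 1 and k (1-based position)
    # Take positions start, start + k, start + 2k, ... and keep the first n.
    return population[start - 1::k][:n]

# Hypothetical roster of 100 individuals numbered 1 to 100.
roster = list(range(1, 101))
sample = systematic_sample(roster, 10, random.Random(7))
print(sample)
```

With N = 100 and n = 10 the interval is k = 10, so consecutive selected numbers always differ by 10.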

Bias in Sampling

Types of Bias

Sampling Bias: Sample does not represent the population.
Nonresponse Bias: Selected individuals do not respond.
Response Bias: Survey answers do not reflect true feelings.
Data-Entry Error: Mistakes in recording data.
Funding Bias: Results influenced by the source of funding.

Design of Experiments

Characteristics of an Experiment

Controlled study with treatments applied to experimental units.
May include control groups, blinding, and randomization.

Steps in Conducting an Experiment

Identify the problem.
Determine factors affecting the response variable.
Determine the number of experimental units.
Assign treatments to units (randomized or block design).
Conduct the experiment and collect data.

Randomized and Block Designs

Randomized Design: Units are randomly assigned to treatments.
Block Design: Units are grouped by similarity, then randomized within blocks.
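
Both designs come down to shuffling units before assigning treatments. The sketch below is one way to do this; the function names, the age-based blocks, and the drug/placebo treatments are hypothetical.

```python
import random

def completely_randomized(units, treatments, rng):
    """Shuffle the units, then deal them out to the treatments in turn."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    assignment = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        assignment[treatments[i % len(treatments)]].append(unit)
    return assignment

def randomized_block(blocks, treatments, rng):
    """Randomize separately inside each block of similar units."""
    return {name: completely_randomized(units, treatments, rng)
            for name, units in blocks.items()}

rng = random.Random(1)  # fixed seed so the example is reproducible

# Hypothetical experimental units, blocked by age group.
blocks = {
    "under 40": ["u1", "u2", "u3", "u4"],
    "40 and over": ["o1", "o2", "o3", "o4"],
}
design = randomized_block(blocks, ["drug", "placebo"], rng)
print(design)
```

Because randomization happens within each block, every age group ends up split evenly between the two treatments.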

Organizing Qualitative Data

Frequency Distributions

Lists each category and the number of occurrences (frequency).
Relative frequency is the proportion of observations in each category.

Formula: $\text{relative frequency} = \frac{\text{frequency}}{\text{sum of all frequencies}}$

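Color	Tally	Frequency	Relative Frequency
Brown	||||| ||||| ||	12	0.2667
Yellow	||||| |||||	10	0.2222
Red	||||| ||||	9	0.2000
Orange	||||| |	6	0.1333
Blue	|||	3	0.0667

Note: The relative frequencies are computed from a total of 45 observations. The rows shown sum to 40, so one category with frequency 5 (relative frequency 0.1111) is missing from this table.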
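
A frequency and relative frequency table can be built directly from raw responses. The sketch below uses a small hypothetical list of survey answers, not data from these notes.

```python
from collections import Counter

# Hypothetical survey responses (illustrative data only).
responses = ["Red", "Blue", "Red", "Green", "Blue",
             "Red", "Red", "Green", "Blue", "Red"]

counts = Counter(responses)    # frequency of each category
total = sum(counts.values())   # sum of all frequencies

# most_common() lists categories from most to least frequent.
table = [(cat, freq, freq / total) for cat, freq in counts.most_common()]

for category, freq, rel in table:
    print(f"{category:<6} {freq:>3} {rel:.4f}")
```

The relative frequencies always sum to 1, which is a quick check on a hand-built table.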

Graphs for Qualitative Data

Bar Graph: Rectangles represent frequencies for each category.
Pareto Chart: Bars in decreasing order of frequency.
Pie Chart: Circle divided into sectors proportional to category frequencies.

Organizing Quantitative Data: Popular Displays

Frequency Distributions and Histograms

Frequency Distribution: Table showing classes or intervals and their frequencies.
Histogram: Rectangles for each class, height represents frequency or relative frequency.
Dot Plot: Dots above a number line for each data point.
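
For quantitative data, the first step is to sort values into classes. The sketch below assumes hypothetical exam scores and five classes of width 10 starting at 50, then prints a rough text histogram.

```python
# Hypothetical exam scores (illustrative data only).
scores = [52, 58, 61, 64, 67, 70, 71, 73, 75, 78, 80, 82, 85, 88, 91, 97]

# Assumed classes: 50-59, 60-69, 70-79, 80-89, 90-99.
lower, width, n_classes = 50, 10, 5

freq = [0] * n_classes
for x in scores:
    freq[(x - lower) // width] += 1   # index of the class that contains x

for i, f in enumerate(freq):
    lo = lower + i * width
    print(f"{lo}-{lo + width - 1}: {'*' * f} ({f})")
```

Each row of stars plays the role of one histogram bar, with its length equal to the class frequency.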

Shapes of Distributions

Uniform: Evenly spread frequencies.
Bell-shaped (Symmetric): Highest frequency in the middle.
Skewed Right: Tail on the right.
Skewed Left: Tail on the left.

Additional Displays of Quantitative Data

Stem-and-Leaf Plot

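Splits each value into a stem (leading digits) and a leaf (final digit), showing the shape of the distribution while preserving the original data values.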

Frequency Polygon

Uses points plotted above the class midpoints and connected by lines to represent class frequencies.

Ogive Graph

Represents cumulative frequency or cumulative relative frequency.
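
The heights plotted on an ogive are running totals of the class frequencies. A minimal sketch, assuming hypothetical frequencies for five classes:

```python
from itertools import accumulate

# Hypothetical class frequencies for classes 50-59, 60-69, ..., 90-99.
freq = [2, 3, 5, 4, 2]

cum_freq = list(accumulate(freq))            # running total of frequencies
total = cum_freq[-1]                         # last value equals the number of observations
cum_rel = [c / total for c in cum_freq]      # cumulative relative frequency

print(cum_freq)
print(cum_rel)
```

The final cumulative relative frequency is always 1, so an ogive rises from left to right and levels off at 100%.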

Time Series Graph

Plots data measured at successive points in time.

Graphical Misrepresentations of Data

Guidelines for Good Graphics

Label axes and provide units.
Avoid distortion and misleading scales.
Use appropriate graph types for the data.
Avoid three-dimensional effects that distort perception.

Measures of Central Tendency

Mean

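Population Mean: $\mu = \frac{\sum x_i}{N}$, where N is the population size.
Sample Mean: $\bar{x} = \frac{\sum x_i}{n}$, where n is the sample size.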

Median

The middle value when data are ordered. With an even number of observations, it is the mean of the two middle values.

Mode

The value that occurs most frequently.
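
All three measures are available in Python's standard `statistics` module. The data set below is hypothetical and has an even number of values, so the median is the mean of the two middle values.

```python
import statistics

# Hypothetical data set (illustrative only).
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)         # sum of values divided by the count: 30 / 6
median = statistics.median(data)     # mean of the two middle values, 3 and 5
modes = statistics.multimode(data)   # list of the most frequent value(s)

print(mean, median, modes)
```

`multimode` returns a list because a data set can have more than one mode.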

Measures of Dispersion

Definitions

Range: Max value minus min value.
Variance: Average of squared deviations from the mean. The sample variance divides by n - 1 instead of n.
Standard Deviation: Square root of the variance.

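Population Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$

Sample Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

Standard Deviation: $\sigma = \sqrt{\sigma^2}$ or $s = \sqrt{s^2}$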
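
The only difference between the population and sample versions is the divisor: N for a population, n - 1 for a sample. The sketch below computes both by hand on a hypothetical data set and cross-checks them against the standard `statistics` module.

```python
import math
import statistics

# Hypothetical data set (illustrative only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n                          # 40 / 8 = 5.0
ss = sum((x - mean) ** 2 for x in data)       # sum of squared deviations = 32.0

pop_var = ss / n                              # treat data as a whole population
samp_var = ss / (n - 1)                       # treat data as a sample
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

# Cross-check with the standard library.
assert math.isclose(pop_var, statistics.pvariance(data))
assert math.isclose(samp_var, statistics.variance(data))

print(pop_var, pop_sd)
print(round(samp_var, 4), round(samp_sd, 4))
```

The sample variance is always slightly larger than the population variance computed from the same numbers.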

Empirical Rule

For bell-shaped distributions:
~68% of data within 1 standard deviation of the mean
~95% within 2 standard deviations
~99.7% within 3 standard deviations
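
Applying the rule means computing the mean plus or minus 1, 2, and 3 standard deviations. The sketch below assumes a hypothetical bell-shaped distribution with mean 100 and standard deviation 15; the function name is illustrative.

```python
def empirical_rule_intervals(mean, sd):
    """Return {k: (low, high, approx_percent)} for k = 1, 2, 3 standard deviations."""
    percents = {1: 68, 2: 95, 3: 99.7}
    return {k: (mean - k * sd, mean + k * sd, pct) for k, pct in percents.items()}

# Hypothetical bell-shaped distribution: mean 100, standard deviation 15.
intervals = empirical_rule_intervals(mean=100, sd=15)

for k, (low, high, pct) in intervals.items():
    print(f"about {pct}% of values lie between {low} and {high}")
```

The percentages are approximations and apply only when the distribution is roughly bell-shaped.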

Measures of Position

Z-scores

Represents the number of standard deviations a value is from the mean.

Formula: $z = \frac{x - \mu}{\sigma}$ for a population, or $z = \frac{x - \bar{x}}{s}$ for a sample.
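
A z-score is the value minus the mean, divided by the standard deviation. The sketch below uses hypothetical exam results to compare scores from two tests with different means and spreads.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd

# Hypothetical exam results (illustrative only).
z_math = z_score(85, mean=75, sd=5)      # 2 standard deviations above the mean
z_history = z_score(66, mean=75, sd=6)   # 1.5 standard deviations below the mean

print(z_math, z_history)
```

Because z-scores have no units, they allow values from different distributions to be compared directly.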

Percentiles and Quartiles

Percentile: Value below which a given percentage of observations fall.
Quartiles: Divide data into four equal parts (Q1, Q2, Q3).

Interquartile Range (IQR)

Range of the middle 50% of the data: $\text{IQR} = Q_3 - Q_1$

Outliers

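Values outside the fences are considered outliers: below the lower fence $Q_1 - 1.5(\text{IQR})$ or above the upper fence $Q_3 + 1.5(\text{IQR})$.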
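
Quartiles, the IQR, and an outlier check can be computed as sketched below. This assumes the median-of-halves method for quartiles (the overall median is left out of both halves when the number of values is odd) and the common 1.5 × IQR fence rule; software packages use several slightly different quartile conventions, so results may differ. The data set and function names are hypothetical.

```python
import statistics

def quartiles(data):
    """Q1, Q2, Q3 by the median-of-halves method (needs at least 2 values)."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                                # values below the median position
    upper = xs[half + 1:] if n % 2 else xs[half:]    # values above the median position
    return statistics.median(lower), statistics.median(xs), statistics.median(upper)

def outlier_fences(q1, q3):
    """Lower and upper fences using the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical data set (illustrative only).
data = [3, 5, 7, 8, 9, 11, 13, 15, 40]

q1, q2, q3 = quartiles(data)
low_fence, high_fence = outlier_fences(q1, q3)
outliers = [x for x in data if x < low_fence or x > high_fence]

print(q1, q2, q3, q3 - q1)
print(low_fence, high_fence, outliers)
```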

The Five-Number Summary and Boxplots

Consists of minimum, Q1, median, Q3, and maximum.
Boxplots visually display the five-number summary and identify outliers.

Example: Drawing a boxplot for exam scores using the five-number summary: 60, 68, 77, 89, 98.
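
For the exam-score summary above, a quick check shows whether any outliers need to be marked before drawing the whiskers. The sketch below assumes the common 1.5 × IQR fence rule and draws a rough one-line text boxplot; the function name and the drawing symbols are illustrative choices.

```python
def text_boxplot(minimum, q1, median, q3, maximum, width=39):
    """One-line text boxplot: | whisker ends, - whiskers, = box, M median.

    Assumes minimum < maximum.
    """
    def col(v):
        # Map a data value to a column index between 0 and width - 1.
        return round((v - minimum) * (width - 1) / (maximum - minimum))

    line = ["-"] * width                     # start with whiskers everywhere
    for i in range(col(q1), col(q3) + 1):    # overwrite the box from Q1 to Q3
        line[i] = "="
    line[0] = line[-1] = "|"                 # whisker ends at the minimum and maximum
    line[col(median)] = "M"
    return "".join(line)

# Five-number summary from the example: min, Q1, median, Q3, max.
minimum, q1, median, q3, maximum = 60, 68, 77, 89, 98

iqr = q3 - q1                    # 89 - 68
lower_fence = q1 - 1.5 * iqr     # values below this would be outliers
upper_fence = q3 + 1.5 * iqr     # values above this would be outliers
no_outliers = lower_fence <= minimum and maximum <= upper_fence

plot = text_boxplot(minimum, q1, median, q3, maximum)
print(iqr, lower_fence, upper_fence, no_outliers)
print(plot)
```

Since both the minimum and the maximum lie inside the fences, there are no outliers and the whiskers extend all the way to 60 and 98.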

Additional info: These notes provide a comprehensive overview of the foundational concepts in statistics, including data collection, sampling, bias, experimental design, and the organization and summarization of data. They are suitable for exam preparation and as a reference for introductory statistics courses.