Chapter 1: Data Collection – Foundations of Statistical Practice

Data Collection

Introduction to the Practice of Statistics

Statistics is the science of collecting, organizing, summarizing, and analyzing information to draw conclusions or answer questions. It also involves providing a measure of confidence in any conclusions. The information used in statistics is called data, which describes characteristics of individuals and exhibits variability.

Statistics: The science of data analysis for decision-making.
Data: Facts or propositions used to draw conclusions or make decisions; they vary among individuals.
Variability: The phenomenon that data values differ among individuals or over time.

Example: Not all students have the same height or sleep the same number of hours each night. Understanding and describing this variability is a key goal of statistics.

The Process of Statistics

The process of statistics involves several key steps:

Identify the Research Objective: Clearly state the question and define the population to be studied.
Collect Data: Gather data from the population or a sample. Proper data collection is crucial for meaningful results.
Describe the Data: Use descriptive statistics (numerical summaries, tables, graphs) to obtain an overview.
Perform Inference: Apply inferential statistics to generalize from the sample to the population and report reliability (e.g., margin of error).

Example: Pew Research surveyed 1,628 adult Americans to estimate the percentage who trust their neighbors, reporting a margin of error to reflect uncertainty.
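Sketch: the notes do not say how the margin of error is calculated. One common approach, assumed here, is the normal-approximation formula for a sample proportion at roughly 95% confidence, z * sqrt(p(1 - p) / n) with z = 1.96. The sample size comes from the example above; the proportion 0.5 is a hypothetical worst-case value, not a figure from the survey, and the function name is illustrative.

```python
import math


def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample and the normal approximation;
    z = 1.96 corresponds to roughly 95% confidence.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


n = 1628       # sample size from the example above
p_hat = 0.5    # hypothetical proportion (worst case), not from the survey

moe = margin_of_error(p_hat, n)
print(f"Margin of error: about {moe * 100:.1f} percentage points")
```

Under these assumptions the margin is a little under 2.5 percentage points. A published margin can differ because real surveys usually adjust for weighting and survey design.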

Populations, Samples, Parameters, and Statistics

Population: The entire group of individuals to be studied.
Sample: A subset of the population.
Parameter: A numerical summary of a population.
Statistic: A numerical summary of a sample.

Example: If 48.2% of all students own a car (population), this is a parameter. If 46% of a sample of 100 students own a car, this is a statistic.
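Sketch: the difference between a parameter and a statistic can be seen by building a small made-up population and sampling from it. The population of 1,000 students below is hypothetical; it is constructed so that exactly 48.2% own a car, matching the example above.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 students, 482 of whom own a car (48.2%).
population = [True] * 482 + [False] * 518

# Parameter: computed from every individual in the population.
parameter = sum(population) / len(population)

# Statistic: computed from a simple random sample of 100 students.
sample = random.sample(population, 100)
statistic = sum(sample) / len(sample)

print(f"Parameter (population proportion): {parameter:.3f}")
print(f"Statistic (sample proportion):     {statistic:.3f}")
```

The parameter is fixed, while the statistic changes from sample to sample; changing the seed gives a different sample proportion.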

Types of Variables

Variable: A characteristic of individuals in the population that can vary.
Qualitative (Categorical) Variable: Classifies individuals based on attributes or characteristics (e.g., race, zip code).
Quantitative Variable: Provides numerical measures (e.g., temperature, number of study days).

Example: Race is qualitative; temperature is quantitative.

Discrete vs. Continuous Variables

Discrete Variable: Quantitative variable with a finite or countable number of values (e.g., number of cars, heads in coin flips).
Continuous Variable: Quantitative variable with infinitely many possible values that are not countable; it can take any value within a range, to whatever precision it is measured (e.g., distance, time).

Example: Number of cars is discrete; distance a car can travel is continuous.

Levels of Measurement

Nominal: Values name, label, or categorize; no order (e.g., race).
Ordinal: Values can be ranked or ordered (e.g., letter grades).
Interval: Ordered, differences have meaning, zero does not mean absence (e.g., temperature in Celsius).
Ratio: Ordered, differences and ratios have meaning, zero means absence (e.g., number of study days).

Example: Temperature in degrees Celsius is interval (0 °C does not mean an absence of temperature); number of study days is ratio (0 days means no study days).
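Sketch: one way to check whether ratios have meaning is to see whether a ratio survives a change of units. The readings below are hypothetical. For Celsius temperatures the ratio changes with the scale, so "twice as hot" has no meaning; for study time the ratio stays the same whether it is counted in days or hours.

```python
def celsius_to_fahrenheit(c: float) -> float:
    return c * 9 / 5 + 32


def celsius_to_kelvin(c: float) -> float:
    return c + 273.15


warm_c, cool_c = 20.0, 10.0  # hypothetical temperature readings

# Interval level: the ratio depends on the scale, so it is not meaningful.
ratio_c = warm_c / cool_c                                                # 2.0
ratio_f = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)  # 68 / 50
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Differences keep their meaning: 10 Celsius degrees is always 18 Fahrenheit degrees.
diff_f = celsius_to_fahrenheit(warm_c) - celsius_to_fahrenheit(cool_c)

# Ratio level: 6 study days is twice 3 study days in any unit.
days_a, days_b = 6, 3  # hypothetical counts
ratio_days = days_a / days_b
ratio_hours = (days_a * 24) / (days_b * 24)

print(ratio_c, ratio_f, round(ratio_k, 4))
print(ratio_days, ratio_hours)
```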

Observational Studies Versus Designed Experiments

Observational Studies and Experiments

Observational Study: Measures the value of the response variable without attempting to influence the value of either the response or the explanatory variables; the researcher only observes (e.g., surveying people about phone use).
Designed Experiment: Researcher assigns treatments and observes effects (e.g., exposing rats to radio-frequency radiation).

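Explanatory Variable: Variable thought to explain or cause changes in the response; in a designed experiment the researcher manipulates it, while in an observational study it is only observed.
Response Variable: Outcome measured in the study.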

Confounding and Lurking Variables

Confounding: Effects of two or more explanatory variables are mixed, making it hard to distinguish their individual effects.
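Lurking Variable: An explanatory variable that was not considered in the study but affects the response variable; it is often also related to the explanatory variables that were studied.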

Note: Observational studies can show association, not causation.
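Sketch: a small simulation can show why association does not imply causation. All data below are simulated and hypothetical. The variables x and y never influence each other; both are driven by a lurking variable, yet they end up clearly correlated. The helper pearson_r is written out by hand to keep the example self-contained.

```python
import random

rng = random.Random(3)  # seeded for repeatability


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Lurking variable that the (hypothetical) study never recorded.
lurking = [rng.gauss(0, 1) for _ in range(2000)]

# x and y each depend on the lurking variable plus independent noise,
# but neither one depends on the other.
x = [z + rng.gauss(0, 1) for z in lurking]
y = [z + rng.gauss(0, 1) for z in lurking]

r = pearson_r(x, y)
print(f"Correlation between x and y: {r:.2f}")
```

A researcher who observed only x and y would see an association of moderate strength and might wrongly conclude that one causes the other.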

Types of Observational Studies

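Cross-sectional Study: Collects data at a specific point in time or over a very short period.
Case-control Study: Retrospective; looks back at records or recollections to compare individuals who have a characteristic (cases) with similar individuals who do not (controls).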
Cohort Study: Prospective; follows a group over time to record outcomes.

Census and Data Collection from the Web

Census: List of all individuals in a population with their characteristics.
Web Scraping: Extracting data from the Internet, often for analysis.

Sampling Methods

Simple Random Sampling

Random Sampling: Using chance to select individuals for a representative sample.
Simple Random Sample: Every possible sample of size n from a population of size N has an equal chance of selection.

Example: Selecting 3 friends out of 6 for a concert; each group of 3 is equally likely.
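Sketch: the concert example can be checked by listing every possible group of 3 and then letting chance pick one. The friends' names are hypothetical.

```python
import itertools
import random

friends = ["Ana", "Ben", "Cal", "Dee", "Eli", "Fay"]  # hypothetical names

# Every possible sample of size n = 3 from a population of size N = 6.
all_groups = list(itertools.combinations(friends, 3))
print(len(all_groups))  # 20 possible groups

# random.sample draws without replacement, so each of the 20 groups
# is equally likely to be the one selected.
random.seed(7)
chosen = random.sample(friends, 3)
print(chosen)
```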

Sampling with and without Replacement

Without Replacement: Selected individuals are not returned to the population.
With Replacement: Selected individuals can be chosen again.
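Sketch: in Python's standard library, random.sample draws without replacement and random.choices draws with replacement. The population of ten numbered individuals is hypothetical.

```python
import random

rng = random.Random(42)  # seeded for repeatability

population = list(range(1, 11))  # hypothetical individuals numbered 1 to 10

# Without replacement: no individual can appear twice.
without_replacement = rng.sample(population, 5)

# With replacement: the same individual may be chosen more than once.
with_replacement = rng.choices(population, k=5)

print("Without replacement:", without_replacement)
print("With replacement:   ", with_replacement)
```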

Other Effective Sampling Methods

Stratified Sample: Divide population into nonoverlapping groups (strata) whose members are similar in some way, then take a simple random sample from each stratum.
Systematic Sample: Select every k-th individual after a random start.
Cluster Sample: Divide population into groups (clusters), randomly select some clusters, and include all individuals from those clusters.
Convenience Sample: Individuals are easily obtained; not random and often biased.
Voluntary Response Sample: Individuals self-select to participate; not random and often biased toward people with strong opinions.
Multistage Sampling: Combines several sampling methods, often used in large-scale surveys.
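Sketch: the stratified, systematic, and cluster methods above can be written as short functions. The population of 60 students, the class-year strata, and the function names are all hypothetical; clusters are formed from consecutive blocks of the list purely for illustration.

```python
import random

rng = random.Random(0)  # seeded for repeatability

# Hypothetical population: 60 students, each tagged with a class year.
years = ["freshman", "sophomore", "junior"]
population = [(f"student{i:02d}", years[i % 3]) for i in range(60)]


def stratified_sample(pop, per_stratum):
    """Randomly sample per_stratum individuals from every stratum."""
    strata = {}
    for name, stratum in pop:
        strata.setdefault(stratum, []).append(name)
    return {s: rng.sample(members, per_stratum) for s, members in strata.items()}


def systematic_sample(pop, k):
    """Pick a random start among the first k individuals, then every k-th one."""
    start = rng.randrange(k)
    return pop[start::k]


def cluster_sample(pop, cluster_size, n_clusters):
    """Split pop into clusters, pick some at random, keep everyone in them."""
    clusters = [pop[i:i + cluster_size] for i in range(0, len(pop), cluster_size)]
    picked = rng.sample(clusters, n_clusters)
    return [person for cluster in picked for person in cluster]


strat = stratified_sample(population, 4)   # 4 students from each class year
syst = systematic_sample(population, 10)   # every 10th student
clus = cluster_sample(population, 10, 2)   # everyone in 2 of the 6 clusters

print({stratum: len(names) for stratum, names in strat.items()})
print(len(syst), len(clus))
```

Note the contrast: a stratified sample takes some individuals from every group, while a cluster sample takes every individual from some groups.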

Sampling Techniques Comparison Table

Sampling Method	Description	When to Use
Simple Random	Every group of size n equally likely	Small, well-defined populations
Stratified	Divide into strata, sample from each	Population has distinct subgroups
Systematic	Select every k-th individual	List of population available, quick estimate needed
Cluster	Randomly select groups, sample all in group	Population naturally divided into groups
Convenience	Sample easily obtained individuals	Exploratory or pilot studies (not recommended for inference)

Bias in Sampling

Sources of Bias

Sampling Bias: Sampling method favors one part of the population (e.g., undercoverage).
Nonresponse Bias: Individuals who do not respond differ from those who do.
Response Bias: Survey answers do not reflect true feelings due to interviewer error, misrepresented answers, question wording, order, or data-entry errors.

Types of Errors

Nonsampling Error: Due to undercoverage, nonresponse, response bias, or data-entry error; can occur even in a census.
Sampling Error: Due to using a sample to estimate population information; inherent in sampling.
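Sketch: sampling error can be seen by drawing several samples from the same population and comparing each estimate with the true value. The population of sleep times below is simulated and hypothetical.

```python
import random
import statistics

rng = random.Random(11)  # seeded for repeatability

# Hypothetical population: nightly hours of sleep for 5,000 students.
population = [rng.gauss(7, 1.2) for _ in range(5000)]
true_mean = statistics.mean(population)

# Each sample of 50 gives a slightly different estimate of the mean.
estimates = [statistics.mean(rng.sample(population, 50)) for _ in range(5)]
for est in estimates:
    print(f"estimate = {est:.2f}, sampling error = {est - true_mean:+.2f}")

# A census uses every individual, so its sampling error is zero.
census_mean = statistics.mean(population)
print(f"census sampling error = {census_mean - true_mean:+.2f}")
```

The simulation covers sampling error only. Nonsampling errors such as nonresponse or data-entry mistakes are not modeled here and could still affect a census.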

The Design of Experiments

Characteristics of an Experiment

Experiment: Controlled study to determine the effect of varying explanatory variables (factors) on a response variable.
Treatment: Any combination of factor values applied to experimental units.
Experimental Unit (Subject): The individual receiving a treatment.
Control Group: Baseline group that receives no treatment, a placebo, or the standard treatment, against which the treated groups are compared.
Placebo: Inactive treatment resembling the real treatment.
Blinding: Nondisclosure of treatment assignment (single-blind: subject unaware; double-blind: both subject and researcher unaware).

Example: In a double-blind, placebo-controlled trial of Lipitor, neither subjects nor researchers knew who received the drug or placebo.

Steps in Designing an Experiment

Identify the Problem: Clearly state the research question and response variable.
Determine Factors: Identify and decide which factors to control, manipulate, or leave uncontrolled.
Determine Number of Experimental Units: Use as many as resources allow.
Determine Level of Each Factor: Control or randomize factor levels; combinations define treatments.
Conduct the Experiment: Randomly assign units, replicate treatments, collect and process data.
Test the Claim: Use inferential statistics to generalize results and state confidence levels.
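Sketch: the steps on factor levels, random assignment, and replication can be illustrated with a made-up two-factor experiment. The factors, levels, and plant labels below are hypothetical. Each combination of factor levels is one treatment, and shuffled units are dealt out so that every treatment is applied to several units.

```python
import itertools
import random

rng = random.Random(5)  # seeded for repeatability

# Hypothetical factors and their levels.
factors = {
    "fertilizer": ["none", "low", "high"],
    "watering": ["daily", "weekly"],
}

# Each combination of factor levels defines one treatment: 3 x 2 = 6.
treatments = list(itertools.product(*factors.values()))

# Hypothetical experimental units, shuffled so assignment is random.
units = [f"plant{i:02d}" for i in range(24)]
rng.shuffle(units)

# Replication: deal the shuffled units round-robin, 4 units per treatment.
assignment = {t: units[i::len(treatments)] for i, t in enumerate(treatments)}

for treatment, assigned in assignment.items():
    print(treatment, assigned)
```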

Experimental Designs

Completely Randomized Design: Experimental units are randomly assigned to treatments.
Matched-Pairs Design: Experimental units are paired based on similarity (e.g., twins, or the same person before and after); the two units in each pair receive different treatments, assigned at random.
Randomized Block Design: Experimental units are grouped into homogeneous blocks; random assignment to treatments occurs within each block.
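Sketch: the three designs differ only in where the randomization happens. The functions, unit labels, and the age-based blocks below are hypothetical, and equal group sizes are assumed for simplicity.

```python
import random

rng = random.Random(9)  # seeded for repeatability


def completely_randomized(units, treatments):
    """Shuffle all units, then deal them out evenly across treatments."""
    shuffled = units[:]
    rng.shuffle(shuffled)
    return {u: treatments[i % len(treatments)] for i, u in enumerate(shuffled)}


def matched_pairs(pairs, treatments=("A", "B")):
    """Within each pair, chance decides which unit gets which treatment."""
    result = {}
    for first, second in pairs:
        order = list(treatments)
        rng.shuffle(order)
        result[first], result[second] = order
    return result


def randomized_block(blocks, treatments):
    """Run a completely randomized assignment separately inside each block."""
    result = {}
    for members in blocks.values():
        result.update(completely_randomized(members, treatments))
    return result


units = [f"u{i}" for i in range(8)]
crd = completely_randomized(units, ["drug", "placebo"])

pairs = [("twin1a", "twin1b"), ("twin2a", "twin2b")]
mp = matched_pairs(pairs)

blocks = {"younger": ["y1", "y2", "y3", "y4"], "older": ["o1", "o2", "o3", "o4"]}
rbd = randomized_block(blocks, ["drug", "placebo"])

print(crd)
print(mp)
print(rbd)
```

Randomizing within blocks guarantees that each block contributes equally to every treatment, so a difference between blocks cannot be mistaken for a treatment effect.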

Experimental Design Comparison Table

Design	Description	When to Use
Completely Randomized	Random assignment to treatments	No obvious subgroups
Matched-Pairs	Pairs of similar units, each gets different treatment	Two treatments, units can be paired
Randomized Block	Group into blocks, randomize within blocks	Known sources of variability (blocks)

Key Terms and Definitions

Replication: Applying each treatment to more than one experimental unit, so that an observed effect is less likely to be due to chance or to the peculiarities of a single unit.
Blocking: Grouping similar experimental units to reduce variability.

Summary

Statistics is the science of data collection, analysis, and inference.
Proper sampling and experimental design are essential for valid conclusions.
Understanding types of variables, sampling methods, and sources of bias is foundational for statistical practice.
Experimental design principles (randomization, control, replication, blocking) help ensure reliable results.