Chapter 1: Data Collection – Foundations of Statistical Practice
Study Guide - Smart Notes
Data Collection
Introduction to the Practice of Statistics
Statistics is the science of collecting, organizing, summarizing, and analyzing information to draw conclusions or answer questions. It also involves providing a measure of confidence in any conclusions. The information used in statistics is called data, which describes characteristics of individuals and exhibits variability.
Statistics: The science of collecting, organizing, summarizing, and analyzing data to support conclusions and decisions.
Data: Facts or propositions used to draw conclusions or make decisions; they vary among individuals.
Variability: The phenomenon that data values differ among individuals or over time.
Example: Not all students have the same height or sleep the same number of hours each night. Understanding and describing this variability is a key goal of statistics.
The Process of Statistics
The process of statistics involves several key steps:
Identify the Research Objective: Clearly state the question and define the population to be studied.
Collect Data: Gather data from the population or a sample. Proper data collection is crucial for meaningful results.
Describe the Data: Use descriptive statistics (numerical summaries, tables, graphs) to obtain an overview.
Perform Inference: Apply inferential statistics to generalize from the sample to the population and report reliability (e.g., margin of error).
Example: Pew Research surveyed 1,628 adult Americans to estimate the percentage who trust their neighbors, reporting a margin of error to reflect uncertainty.
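To make the idea of a margin of error concrete, here is a minimal sketch using the common large-sample formula for a proportion at 95% confidence. The formula, the 1.96 multiplier, and the conservative choice of 0.5 for the sample proportion are assumptions for illustration; these notes do not say how Pew computed its reported margin of error.

```python
import math

def margin_of_error(n, p_hat=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample and the normal approximation.
    p_hat=0.5 gives the most conservative (largest) value.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Sample size taken from the Pew example; everything else is assumed.
moe = margin_of_error(1628)
print(f"Margin of error for n = 1628: about {moe * 100:.1f} percentage points")
```

Under these assumptions the result is roughly 2.4 percentage points. Larger samples shrink the margin of error, but only in proportion to the square root of n.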
Populations, Samples, Parameters, and Statistics
Population: The entire group of individuals to be studied.
Sample: A subset of the population.
Parameter: A numerical summary of a population.
Statistic: A numerical summary of a sample.
Example: If 48.2% of all students own a car (population), this is a parameter. If 46% of a sample of 100 students own a car, this is a statistic.
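The distinction can be sketched in code. The population below is hypothetical (1,000 students, 482 of whom own a car, chosen to match the 48.2% figure), and the seed is fixed only so the run is repeatable.

```python
import random

# Hypothetical population: 1 = owns a car, 0 = does not.
population = [1] * 482 + [0] * 518

# Parameter: computed from every individual in the population.
parameter = sum(population) / len(population)

# Statistic: computed from a random sample of 100 students.
rng = random.Random(1)  # fixed seed, an arbitrary choice
sample = rng.sample(population, 100)
statistic = sum(sample) / len(sample)

print(f"Parameter (population proportion): {parameter:.3f}")
print(f"Statistic (sample proportion):     {statistic:.3f}")
```

The parameter is fixed once the population is fixed. The statistic changes from sample to sample, which is why it is reported with a measure of reliability.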
Types of Variables
Variable: A characteristic of individuals in the population that can vary.
Qualitative (Categorical) Variable: Classifies individuals based on attributes or characteristics (e.g., race, zip code).
Quantitative Variable: Provides numerical measures (e.g., temperature, number of study days).
Example: Race is qualitative; temperature is quantitative.
Discrete vs. Continuous Variables
Discrete Variable: Quantitative variable with a finite or countable number of values (e.g., number of cars, heads in coin flips).
Continuous Variable: Quantitative variable with infinitely many possible values that are not countable; it can take any value within an interval (e.g., distance, time).
Example: Number of cars is discrete; distance a car can travel is continuous.
Levels of Measurement
Nominal: Values name, label, or categorize; no order (e.g., race).
Ordinal: Values can be ranked or ordered (e.g., letter grades).
Interval: Ordered, differences have meaning, zero does not mean absence (e.g., temperature in Celsius).
Ratio: Ordered, differences and ratios have meaning, zero means absence (e.g., number of study days).
Example: Temperature is interval; number of study days is ratio.
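A short sketch of why ratios are not meaningful on an interval scale. The specific temperatures and the conversion to Kelvin (a scale whose zero does mean absence) are illustrative additions, not part of the original notes.

```python
def celsius_to_kelvin(c):
    """Convert Celsius to Kelvin, where zero is a true zero."""
    return c + 273.15

c_low, c_high = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = c_high - c_low

# The Celsius ratio suggests "twice as hot", but the zero point is arbitrary.
naive_ratio = c_high / c_low

# On a scale with a true zero, the same temperatures differ by only a few percent.
kelvin_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

# Study days are ratio-level: 4 days really is twice as many as 2 days.
study_ratio = 4 / 2

print(difference, naive_ratio, round(kelvin_ratio, 3), study_ratio)
```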
Observational Studies Versus Designed Experiments
Observational Studies and Experiments
Observational Study: Measures the value of the response variable without attempting to influence the response or explanatory variables (e.g., surveying people about phone use).
Designed Experiment: Researcher assigns treatments and observes effects (e.g., exposing rats to radio-frequency radiation).
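Explanatory Variable: Variable whose effect on the response is being studied. It is manipulated by the researcher in a designed experiment and only observed in an observational study.
Response Variable: Outcome measured in the study.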
Confounding and Lurking Variables
Confounding: Effects of two or more explanatory variables are mixed, making it hard to distinguish their individual effects.
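Lurking Variable: An explanatory variable that was not considered in the study but affects the response variable; it is often related to the explanatory variables that were considered.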
Note: Observational studies can show association, not causation.
Types of Observational Studies
Cross-sectional Study: Collects data at a specific point in time.
Case-control Study: Retrospective; compares individuals with and without a characteristic.
Cohort Study: Prospective; follows a group over time to record outcomes.
Census and Data Collection from the Web
Census: List of all individuals in a population with their characteristics.
Web Scraping: Extracting data from the Internet, often for analysis.
Sampling Methods
Simple Random Sampling
Random Sampling: Using chance to select individuals for a representative sample.
Simple Random Sample: Every possible sample of size n from a population of size N has an equal chance of selection.
Example: Selecting 3 friends out of 6 for a concert; each group of 3 is equally likely.
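The concert example can be sketched as follows. The friend labels and the seed are placeholders; the code lists every possible group of 3 and then draws one simple random sample.

```python
import itertools
import random

friends = ["A", "B", "C", "D", "E", "F"]  # hypothetical labels for the 6 friends

# Every possible sample of size n = 3 from a population of size N = 6.
all_groups = list(itertools.combinations(friends, 3))
print(f"Number of possible groups: {len(all_groups)}")

# random.sample draws without replacement, with every group equally likely.
rng = random.Random(42)  # fixed seed, an arbitrary choice
chosen = rng.sample(friends, 3)
print(f"Selected group: {sorted(chosen)}")
```

There are 20 possible groups, so under simple random sampling each has probability 1/20 of being selected.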
Sampling with and without Replacement
Without Replacement: Selected individuals are not returned to the population.
With Replacement: Selected individuals can be chosen again.
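A minimal sketch of the two schemes using Python's standard `random` module. The population of ID numbers 1 to 10 is made up for illustration.

```python
import random

rng = random.Random(0)           # fixed seed, an arbitrary choice
population = list(range(1, 11))  # hypothetical ID numbers 1..10

# Without replacement: an individual can appear at most once.
without_replacement = rng.sample(population, 5)

# With replacement: the same individual may be drawn more than once.
with_replacement = rng.choices(population, k=5)

print("Without replacement:", without_replacement)
print("With replacement:   ", with_replacement)
```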
Other Effective Sampling Methods
Stratified Sample: Divide population into homogeneous groups (strata), then randomly sample from each stratum.
Systematic Sample: Select every kth individual after a random start.
Cluster Sample: Divide population into groups (clusters), randomly select some clusters, and include all individuals from those clusters.
Convenience Sample: Individuals are easily obtained; not random and often biased.
Voluntary Response Sample: Individuals self-select to participate.
Multistage Sampling: Combines several sampling methods, often used in large-scale surveys.
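The stratified, systematic, and cluster methods above can be sketched as small functions. The function names, the population of 100 ID numbers, and the strata and cluster labels are all hypothetical; this is one simple way to implement each idea, not a standard library API.

```python
import random

def stratified_sample(strata, n_per_stratum, rng):
    """Randomly sample n_per_stratum individuals from every stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def systematic_sample(population, k, rng):
    """Pick a random start among the first k individuals, then every k-th."""
    start = rng.randrange(k)
    return population[start::k]

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep everyone inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for name in chosen for person in clusters[name]]

rng = random.Random(2024)          # fixed seed, an arbitrary choice
population = list(range(1, 101))   # hypothetical ID numbers 1..100
strata = {"first_year": population[:60], "senior": population[60:]}
clusters = {f"dorm_{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

by_stratum = stratified_sample(strata, 5, rng)
every_tenth = systematic_sample(population, 10, rng)
two_dorms = cluster_sample(clusters, 2, rng)
print(by_stratum, every_tenth, two_dorms, sep="\n")
```

Note the contrast: stratified sampling takes some individuals from every group, while cluster sampling takes every individual from some groups.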
Sampling Techniques Comparison Table
| Sampling Method | Description | When to Use |
|---|---|---|
| Simple Random | Every group of size n equally likely | Small, well-defined populations |
| Stratified | Divide into strata, sample from each | Population has distinct subgroups |
| Systematic | Select every k-th individual | List of population available, quick estimate needed |
| Cluster | Randomly select groups, sample all in group | Population naturally divided into groups |
| Convenience | Sample easily obtained individuals | Exploratory or pilot studies (not recommended for inference) |
Bias in Sampling
Sources of Bias
Sampling Bias: Sampling method favors one part of the population (e.g., undercoverage).
Nonresponse Bias: Individuals who do not respond differ from those who do.
Response Bias: Survey answers do not reflect true feelings due to interviewer error, misrepresented answers, question wording, order, or data-entry errors.
Types of Errors
Nonsampling Error: Due to undercoverage, nonresponse, response bias, or data-entry error; can occur even in a census.
Sampling Error: Due to using a sample to estimate population information; inherent in sampling.
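Sampling error can be illustrated with a small simulation. The population of 10,000 heights (mean 170, standard deviation 10), the sample size of 50, and the seed are all made-up values for the sketch.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed, an arbitrary choice

# Hypothetical population of 10,000 heights in centimeters.
population = [rng.gauss(170, 10) for _ in range(10_000)]
population_mean = statistics.mean(population)

# A sample gives only incomplete information about the population.
sample = rng.sample(population, 50)
sample_mean = statistics.mean(sample)
sampling_error = sample_mean - population_mean

# A census uses every individual, so it has no sampling error
# (it can still suffer from nonsampling error, such as data-entry mistakes).
census_error = statistics.mean(population) - population_mean

print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
print(f"Sampling error:  {sampling_error:+.2f}")
```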
The Design of Experiments
Characteristics of an Experiment
Experiment: Controlled study to determine the effect of varying explanatory variables (factors) on a response variable.
Treatment: Any combination of factor values applied to experimental units.
Experimental Unit (Subject): The individual receiving a treatment.
Control Group: Baseline group for comparison.
Placebo: Inactive treatment resembling the real treatment.
Blinding: Nondisclosure of treatment assignment (single-blind: subject unaware; double-blind: both subject and researcher unaware).
Example: In a double-blind, placebo-controlled trial of Lipitor, neither subjects nor researchers knew who received the drug or placebo.
Steps in Designing an Experiment
Identify the Problem: Clearly state the research question and response variable.
Determine Factors: Identify and decide which factors to control, manipulate, or leave uncontrolled.
Determine Number of Experimental Units: Use as many as resources allow.
Determine Level of Each Factor: Hold a factor fixed (control), set it at chosen levels (manipulate), or randomize it; each combination of manipulated factor levels defines a treatment.
Conduct the Experiment: Randomly assign units, replicate treatments, collect and process data.
Test the Claim: Use inferential statistics to generalize results and state confidence levels.
Experimental Designs
Completely Randomized Design: Experimental units are randomly assigned to treatments.
Matched-Pairs Design: Experimental units are paired based on similarity; the two units in each pair receive different treatments, assigned at random.
Randomized Block Design: Experimental units are grouped into homogeneous blocks; random assignment to treatments occurs within each block.
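The random assignment step in these designs can be sketched as below. The function names, unit labels, treatment names, and block labels are hypothetical, and dealing shuffled units out in turn is just one simple way to get equal group sizes.

```python
import random

def completely_randomized(units, treatments, rng):
    """Shuffle the units, then deal them out to the treatments in turn."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {unit: treatments[i % len(treatments)]
            for i, unit in enumerate(shuffled)}

def randomized_block(blocks, treatments, rng):
    """Randomize separately inside each homogeneous block."""
    assignment = {}
    for block_units in blocks.values():
        assignment.update(completely_randomized(block_units, treatments, rng))
    return assignment

rng = random.Random(3)  # fixed seed, an arbitrary choice
treatments = ["drug", "placebo"]

# Completely randomized design: 12 hypothetical subjects, no blocking.
crd = completely_randomized(list(range(12)), treatments, rng)

# Randomized block design: randomize within each hypothetical age block.
blocks = {"younger": [0, 1, 2, 3], "older": [4, 5, 6, 7]}
rbd = randomized_block(blocks, treatments, rng)

print(crd)
print(rbd)
```

A matched-pairs design with two treatments behaves like the special case where every block contains exactly two similar units.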
Experimental Design Comparison Table
| Design | Description | When to Use |
|---|---|---|
| Completely Randomized | Random assignment to treatments | No obvious subgroups |
| Matched-Pairs | Pairs of similar units; the two units in a pair get different treatments | Two treatments, units can be paired |
| Randomized Block | Group into blocks, randomize within blocks | Known sources of variability (blocks) |
Key Terms and Definitions
Replication: Applying each treatment to more than one experimental unit, so that an observed effect is less likely to be due to chance or to the peculiarities of a single unit.
Blocking: Grouping similar experimental units to reduce variability.
Summary
Statistics is the science of data collection, analysis, and inference.
Proper sampling and experimental design are essential for valid conclusions.
Understanding types of variables, sampling methods, and sources of bias is foundational for statistical practice.
Experimental design principles (randomization, control, replication, blocking) help ensure reliable results.