Chapter 1: Data Collection – Foundations of Statistical Study

Study Guide - Smart Notes


Data Collection

Introduction to the Practice of Statistics

Statistics is the science of collecting, organizing, summarizing, and analyzing information to draw conclusions or answer questions. It also provides a measure of confidence in any conclusions. The information used in statistics is called data, which describes characteristics of individuals and is subject to variability.

Statistics: The science of data analysis, including collection, organization, and interpretation.
Data: Facts or propositions used to draw conclusions or make decisions.
Variability: The tendency of data to differ among individuals or over time.

Example: Not everyone in a class has the same height or hair color; these differences are examples of variability.

The Process of Statistics

The process of statistics involves several key steps to ensure valid and reliable conclusions:

Identify the Research Objective: Clearly state the question and define the population.
Collect the Data: Gather data relevant to the research objective, often using a sample due to practical constraints.
Describe the Data: Use descriptive statistics (numerical summaries, tables, graphs) to summarize the data.
Perform Inference: Apply inferential statistics to generalize findings from the sample to the population, including a measure of reliability (e.g., margin of error).

Example: A survey of 1,628 adults found that 52% trust their neighbors. With a margin of error of 2.5 percentage points, the true proportion in the population is likely between 49.5% and 54.5%.
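The arithmetic behind this example can be sketched in a few lines of Python. The function names are illustrative, not from the course materials. The second function uses the common large-sample formula for a proportion at roughly 95% confidence, which is an assumption here: the notes do not say how the 2.5-point margin was obtained.

```python
import math

def interval_from_margin(estimate, margin):
    """Return (low, high) for estimate plus or minus margin, as proportions."""
    return estimate - margin, estimate + margin

def approx_margin_95(p_hat, n, z=1.96):
    """Large-sample margin of error for a proportion (assumed ~95% confidence)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Interval implied by the survey: 52% with a margin of 2.5 percentage points.
low, high = interval_from_margin(0.52, 0.025)
print(f"{low:.3f} to {high:.3f}")

# Margin suggested by the assumed formula for n = 1,628 respondents.
margin = approx_margin_95(0.52, 1628)
print(f"{margin:.4f}")
```

Under that assumed formula the margin comes out a little under 2.5 points, which is consistent with a reported figure that has been rounded up.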

Populations, Samples, Parameters, and Statistics

Population: The entire group of individuals to be studied.
Sample: A subset of the population.
Parameter: A numerical summary of a population.
Statistic: A numerical summary of a sample.

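Example: If 48.2% of all students at a school own a car, that figure is a parameter because it describes the entire population. If 46% of a sample of 100 of those students own a car, that figure is a statistic because it describes only the sample.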
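A minimal sketch of the distinction, assuming a hypothetical population of 1,000 students in which exactly 482 own a car. The population size, the seed, and the variable names are invented for illustration.

```python
import random

# Hypothetical population: 1 = owns a car, 0 = does not (482 of 1,000 own one).
population = [1] * 482 + [0] * 518

# Parameter: computed from every individual in the population.
parameter = sum(population) / len(population)

# Statistic: computed from a sample of 100 drawn without replacement.
rng = random.Random(0)  # fixed seed only so the sketch is reproducible
sample = rng.sample(population, 100)
statistic = sum(sample) / len(sample)

print(f"parameter = {parameter:.3f}")   # always 0.482 for this population
print(f"statistic = {statistic:.3f}")   # changes from sample to sample
```

The parameter is fixed, while the statistic depends on which 100 students happen to be drawn.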

Types of Variables

Variables are characteristics of individuals in a population and can be classified as follows:

Qualitative (Categorical) Variables: Classify individuals based on attributes or characteristics (e.g., race, zip code).
Quantitative Variables: Provide numerical measures (e.g., temperature, number of study days).

Quantitative variables can be further classified as:

Discrete Variables: Have a finite or countable number of possible values (e.g., number of cars, number of heads in coin flips).
Continuous Variables: Have infinitely many possible values that cannot be counted and can be measured to any desired level of accuracy (e.g., distance traveled, duration of parking).

Example: The number of cars at a drive-thru is discrete; the distance a car can travel is continuous.

Levels of Measurement

Variables can be measured at different levels, which determine the types of statistical analyses that are appropriate:

Nominal: Values name, label, or categorize without a specific order (e.g., race).
Ordinal: Values can be ranked or ordered (e.g., letter grades).
Interval: Differences between values are meaningful, but a value of zero does not indicate the absence of the quantity, so ratios are not meaningful (e.g., temperature in Celsius).
Ratio: Ratios of values are meaningful, and zero indicates absence (e.g., number of study days).

Example: Temperature in Celsius is interval; number of study days is ratio.
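One way to see why Celsius temperature is interval rather than ratio is to redo a comparison on a scale with a true zero. The sketch below uses the Kelvin scale, which is not mentioned in the notes and is brought in only for illustration.

```python
def celsius_to_kelvin(c):
    """Convert Celsius to Kelvin, a scale whose zero means no thermal energy."""
    return c + 273.15

c_low, c_high = 10.0, 20.0

# Differences are meaningful: a 10-degree gap is the same on both scales.
diff_c = c_high - c_low
diff_k = celsius_to_kelvin(c_high) - celsius_to_kelvin(c_low)

# Ratios are not: 20 C is numerically twice 10 C, but not "twice as hot".
ratio_c = c_high / c_low
ratio_k = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

print(diff_c, round(diff_k, 6))
print(ratio_c, round(ratio_k, 3))
```

By contrast, 4 study days really is twice as many as 2 study days, because zero study days means no studying at all.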

Observational Studies Versus Designed Experiments

Observational Studies and Experiments

Research can be conducted using observational studies or designed experiments:

Observational Study: Measures the value of the response variable without attempting to influence the value of either the response or the explanatory variables. Can show association but not causation.
Designed Experiment: Researcher assigns treatments to groups and manipulates explanatory variables to observe effects on the response variable. Can establish causation.

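Example: A study of whether cell phone use is associated with brain tumors in humans is observational, because researchers only record existing usage. A study in which rats are assigned to exposure levels by the researcher is a designed experiment.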

Confounding and Lurking Variables

Confounding: When the effects of two or more explanatory variables cannot be separated.
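Lurking Variable: An explanatory variable that was not considered in the study but that affects the response variable.

Example: In a flu shot study, age or health status may be lurking variables. If they are related both to who gets the shot and to who gets sick, their effects are confounded with the effect of the shot.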

Types of Observational Studies

Cross-sectional Studies: Collect data at a specific point in time.
Case-control Studies: Retrospective; compare individuals with a characteristic to those without.
Cohort Studies: Prospective; follow a group over time and record characteristics.

Other Data Collection Methods

Census: List of all individuals in a population with their characteristics.
Web Scraping: Extracting data from the internet, often for large-scale data analysis.

Sampling Methods

Simple Random Sampling

Simple random sampling ensures every possible sample of size n from a population of size N has an equal chance of selection.

Requires a frame: a list of all individuals in the population.
Can be done with or without replacement.

Example: Selecting 5 clients from a list of 30 using a table of random numbers or a calculator.
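In software, the table of random numbers is usually replaced by a random number generator. A minimal sketch using Python's standard `random` module follows. The client labels and the seed are made up for illustration.

```python
import random

# Hypothetical frame: a list of all 30 clients in the population.
clients = [f"client_{i:02d}" for i in range(1, 31)]

rng = random.Random(42)  # fixed seed only so the sketch is reproducible

# Without replacement: no client can be selected twice.
srs = rng.sample(clients, 5)

# With replacement: the same client may appear more than once.
with_replacement = rng.choices(clients, k=5)

print(srs)
print(with_replacement)
```

`sample` draws without replacement and `choices` draws with replacement, which matches the two options listed above.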

Other Effective Sampling Methods

Stratified Sampling: Divide the population into homogeneous groups (strata) and sample from each stratum.
Systematic Sampling: Select every kth individual after a random start.
Cluster Sampling: Divide the population into groups (clusters), randomly select some clusters, and sample all individuals in those clusters.
Convenience Sampling: Use individuals who are easy to reach; generally leads to bias.
Multistage Sampling: Combine several sampling methods, often used in large-scale surveys.

Sampling Method	Description	Example
Simple Random	Every sample of size n equally likely	Randomly select 5 clients from 30
Stratified	Divide into strata, sample from each	Sample students by year in school
Systematic	Select every kth individual after a random start	Survey every 7th customer
Cluster	Randomly select clusters, sample all in cluster	Survey all households in selected city blocks
Convenience	Sample easiest to reach	Survey people at a mall
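The stratified, systematic, and cluster methods in the table can be sketched as short functions. This is a sketch under stated assumptions: the function names, the data, and the sample sizes are hypothetical, chosen to mirror the table's examples.

```python
import random

def stratified_sample(strata, per_stratum, rng):
    """Draw a simple random sample of per_stratum individuals from every stratum."""
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

def systematic_sample(frame, k, rng):
    """Pick a random start among the first k individuals, then take every kth."""
    start = rng.randrange(k)
    return frame[start::k]

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every individual in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for name in chosen for person in clusters[name]]

rng = random.Random(7)  # fixed seed only so the sketch is reproducible

# Hypothetical data, invented for illustration.
students_by_year = {
    year: [f"{year}_{i}" for i in range(50)]
    for year in ("freshman", "sophomore", "junior", "senior")
}
customers = [f"customer_{i}" for i in range(70)]
city_blocks = {
    f"block_{b}": [f"block_{b}_house_{h}" for h in range(8)]
    for b in range(12)
}

by_year = stratified_sample(students_by_year, 5, rng)   # 5 students per year
every_7th = systematic_sample(customers, 7, rng)        # 10 of the 70 customers
households = cluster_sample(city_blocks, 3, rng)        # all houses in 3 blocks
```

Stratified sampling takes some individuals from every group, while cluster sampling takes every individual from some groups.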

Bias in Sampling

Sources of Bias

Sampling Bias: The sampling technique favors one part of the population over another; undercoverage is a common cause.
Nonresponse Bias: Individuals who do not respond differ from those who do.
Response Bias: Survey answers do not reflect true feelings because of interviewer error, misrepresented answers, question wording, question order, or question type.

Other Errors:

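Nonsampling Errors: Errors that arise from the survey process rather than from sampling itself, such as undercoverage, nonresponse, response bias, or data-entry error.
Sampling Error: The difference between a sample estimate and the true population value that arises because a sample gives incomplete information about the population.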
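The two kinds of error behave differently, and a small simulation can make that concrete. The sketch below assumes a hypothetical population of the values 1 through 500 and a frame that misses the top 100 values. All names and numbers are invented for illustration.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed only so the sketch is reproducible

# Hypothetical population with a known mean of 250.5.
population = list(range(1, 501))
true_mean = statistics.mean(population)

def sampling_error(n):
    """Sample mean minus population mean for one simple random sample of size n."""
    sample = rng.sample(population, n)
    return statistics.mean(sample) - true_mean

# Sampling error changes from sample to sample and vanishes for a census.
error_small = sampling_error(10)
error_large = sampling_error(250)
census_error = sampling_error(len(population))

# Undercoverage (a nonsampling error): the frame omits the 100 largest values,
# so even measuring everyone in the frame misses the true mean.
frame = population[:400]
undercoverage_error = statistics.mean(frame) - true_mean

print(error_small, error_large, census_error, undercoverage_error)
```

Taking a larger sample tends to shrink sampling error, but it does nothing to fix a frame that leaves part of the population out.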

The Design of Experiments

Characteristics of an Experiment

Experiment: Controlled study to determine the effect of varying explanatory variables (factors) on a response variable.
Treatment: Any combination of factor values.
Experimental Unit (Subject): The item or person receiving a treatment.
Control Group: Baseline group for comparison.
Placebo: Inactive treatment designed to resemble the experimental treatment.
Blinding: Nondisclosure of treatment to subjects (single-blind) or both subjects and researchers (double-blind).

Example: A double-blind, placebo-controlled study of Lipitor in diabetic patients.

Steps in Designing an Experiment

Identify the problem and response variable.
Determine factors affecting the response variable.
Determine the number of experimental units.
Determine the level of each factor: control a factor by holding it fixed, or randomize to average out the effects of factors that cannot be controlled.
Randomly assign experimental units to treatments and conduct the experiment (replication is important).
Test the claim using inferential statistics.

Experimental Designs

Completely Randomized Design: Each experimental unit is randomly assigned to a treatment.
Matched-Pairs Design: Experimental units are paired based on similarity; the two units in each pair receive different treatments, assigned at random.
Randomized Block Design: Experimental units are grouped into homogeneous blocks, and treatments are randomly assigned within each block.

Design	Description	Example
Completely Randomized	Random assignment to treatments	Assigning 60 plants to 3 fertilizer levels
Matched-Pairs	Pairs matched on characteristics, each gets different treatment	Students matched by IQ and gender, one studies with music, one without
Randomized Block	Group by block, randomize within block	Divide plants by variety, randomize fertilizer within each variety
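The random assignment step in each design can be sketched with Python's standard `random` module. The function names, unit labels, group sizes, and seed are hypothetical, chosen to mirror the examples in the table.

```python
import random

def completely_randomized(units, treatments, rng):
    """Shuffle the units, then deal them out evenly across the treatments."""
    shuffled = units[:]
    rng.shuffle(shuffled)
    return {t: shuffled[i::len(treatments)] for i, t in enumerate(treatments)}

def randomized_block(blocks, treatments, rng):
    """Run a completely randomized assignment separately inside each block."""
    return {name: completely_randomized(members, treatments, rng)
            for name, members in blocks.items()}

def matched_pairs(pairs, treatments, rng):
    """Within each pair, randomly decide which unit gets which of two treatments."""
    assignment = {}
    for unit_a, unit_b in pairs:
        first, second = rng.sample(treatments, 2)
        assignment[unit_a] = first
        assignment[unit_b] = second
    return assignment

rng = random.Random(3)  # fixed seed only so the sketch is reproducible

# Hypothetical experimental units.
plants = [f"plant_{i}" for i in range(60)]
levels = ["low", "medium", "high"]

crd = completely_randomized(plants, levels, rng)        # 20 plants per level

varieties = {"variety_A": plants[:30], "variety_B": plants[30:]}
rbd = randomized_block(varieties, levels, rng)          # 10 per level per variety

pairs = [(f"student_{2 * i}", f"student_{2 * i + 1}") for i in range(10)]
mpd = matched_pairs(pairs, ["music", "no music"], rng)  # one of each per pair
```

In every design the researcher fixes the structure (groups, blocks, or pairs) and leaves the assignment of treatments to chance.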

Additional info: Replication, control, and randomization are key principles in experimental design to ensure valid and reliable results.