Chapter 1: Data Collection – Structured Study Notes for Statistics Students

Data Collection

Introduction to the Practice of Statistics

Statistics is the science of collecting, organizing, summarizing, and analyzing information to draw conclusions or answer questions. It also provides a measure of confidence in any conclusions. The information used in statistics is called data, which describes characteristics of individuals and exhibits variability.

Statistics: The study of data and variability.
Statistical Thinking: Understanding and accounting for variability in data.
Data: Facts or propositions used to draw conclusions or make decisions.
Variability: The tendency of data to differ among individuals or over time.

The Process of Statistics

The process of statistics involves several steps to ensure meaningful and reliable results:

Identify the research objective: Clearly define the question and population to be studied.
Collect the data: Gather information using appropriate methods, often from a sample rather than the entire population.
Describe the data: Use descriptive statistics to summarize and organize the data.
Perform inference: Apply inferential statistics to extend sample results to the population and report reliability.

Figure: Population, sample, and individual.

Population: The entire group of individuals to be studied.
Sample: A subset of the population.
Individual: A single member of the population.

Descriptive Statistics: Organizing and summarizing data using numerical summaries, tables, and graphs.
Inferential Statistics: Methods to extend sample results to the population and measure reliability.
Parameter: Numerical summary of a population.
Statistic: Numerical summary based on a sample.
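
The difference between a parameter and a statistic can be illustrated with a short sketch. The population below is made-up data (hypothetical ages), and the sample size of 50 is an arbitrary choice for illustration.

```python
import random
import statistics

# Hypothetical population: ages of 1,000 individuals (made-up data).
rng = random.Random(1)
population = [rng.randint(18, 90) for _ in range(1000)]

# Parameter: a numerical summary of the whole population.
population_mean = statistics.mean(population)

# Statistic: the same summary computed from a sample of 50 individuals.
sample = rng.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two values are usually close but not equal, which is why inference also reports how reliable the sample result is.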

Types of Variables

Variables are characteristics of individuals within the population. Variables are classified as qualitative or quantitative, and quantitative variables are further classified as discrete or continuous.

Qualitative (Categorical) Variables: Classification based on attributes or characteristics.
Quantitative Variables: Numerical measures that can be meaningfully added or subtracted.
Discrete Variables: Quantitative variables with a finite or countable number of values (e.g., number of children).
Continuous Variables: Quantitative variables with an infinite number of possible values that are not countable; they can be measured to any desired level of accuracy (e.g., household income).

Figure: Classification of variables.

Examples:
Nationality: Qualitative
Number of children: Quantitative, discrete
Household income: Quantitative, continuous
Daily intake of whole grains: Quantitative, continuous

Levels of Measurement

Variables can be measured at different levels, which determine the types of statistical analyses that can be performed:

Nominal: Values name, label, or categorize; no ranking (e.g., gender).
Ordinal: Values can be ranked or ordered (e.g., class rank).
Interval: Differences between values have meaning; zero does not indicate absence (e.g., year of birth).
Ratio: Ratios of values have meaning; zero indicates absence (e.g., number of vending machines).
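
One way to remember the four levels is that each level keeps the meaningful operations of the level before it and adds one more. The sketch below encodes that idea; the names `LEVELS`, `NEW_OPERATION`, and `meaningful_operations` are hypothetical and chosen only for this illustration.

```python
# Levels of measurement, ordered from least to most informative.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# The operation that first becomes meaningful at each level.
NEW_OPERATION = {
    "nominal": "categorize",
    "ordinal": "rank",
    "interval": "subtract",
    "ratio": "divide",
}

def meaningful_operations(level):
    """Return every operation that is meaningful at the given level."""
    position = LEVELS.index(level)
    return [NEW_OPERATION[name] for name in LEVELS[: position + 1]]

print(meaningful_operations("interval"))  # ['categorize', 'rank', 'subtract']
```

For example, year of birth is interval level: differences between years are meaningful, but a ratio of two years is not.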

Observational Studies Versus Designed Experiments

Observational Study vs. Experiment

Research can be conducted using observational studies or designed experiments:

Observational Study: Measures the value of the response variable without attempting to influence the value of either the explanatory or the response variable.
Designed Experiment: Researcher assigns individuals to groups, manipulates explanatory variables, and records response variable values.

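Confounding: Occurs when the effects of two or more explanatory variables are not separated, so it is unclear which variable is responsible for a change in the response variable.
Lurking Variable: An explanatory variable that was not considered in the study but that affects the response variable.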

Observational studies can only claim association, not causation.

Types of Observational Studies

Cross-sectional Studies: Collect information at a specific point in time.
Case-control Studies: Retrospective; compare individuals with and without certain characteristics.
Cohort Studies: Prospective; follow a group over time and record characteristics.

Census: List of all individuals in a population with their characteristics.

Sampling Methods

Simple Random Sampling

Random sampling uses chance to select individuals from a population. A sample of size n is a simple random sample if every possible sample of size n is equally likely to be selected.

Obtain a frame listing all individuals in the population.
Number individuals from 1 to N.
Use a random number generator to select n distinct numbers between 1 and N; the individuals with those numbers form the sample.
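
The three steps above can be sketched in Python with the standard `random` module. The frame of 30 students is hypothetical, and the function name `simple_random_sample` is an illustrative choice, not a standard library call.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n individuals from the frame without replacement."""
    rng = random.Random(seed)
    N = len(frame)
    # Number the individuals 1..N, then pick n distinct numbers at random.
    chosen_numbers = rng.sample(range(1, N + 1), n)
    # Map each chosen number back to the individual it labels.
    return [frame[number - 1] for number in chosen_numbers]

# Hypothetical frame listing all N = 30 individuals in the population.
frame = [f"student_{i}" for i in range(1, 31)]
print(simple_random_sample(frame, n=5, seed=42))
```

Passing a seed makes the selection reproducible, which is useful when a sample needs to be documented or checked later.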

Other Effective Sampling Methods

Several sampling methods are used to obtain representative samples:

Stratified Sample: Divide population into homogeneous groups (strata), then take a simple random sample from each stratum.
Systematic Sample: Select every kth individual from the population, starting with a random number between 1 and k.
Cluster Sample: Select all individuals within randomly chosen groups (clusters).
Convenience Sample: Individuals are chosen because they are easy to obtain, not by chance; results are often suspect.
Multistage Sampling: Combination of sampling techniques, often used in large-scale surveys.
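
As a rough illustration of how stratified and cluster sampling differ, the sketch below uses made-up groups (class years as strata, dorms as clusters). Taking the same number of individuals from every stratum is an assumption made for simplicity; the function names are hypothetical.

```python
import random

def stratified_sample(strata, n_per_stratum, rng):
    """Take a simple random sample of the same size from every stratum."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every individual in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for name in chosen for person in clusters[name]]

rng = random.Random(7)

# Hypothetical strata: 20 students in each class year.
strata = {
    year: [f"{year}_{i}" for i in range(1, 21)]
    for year in ("freshman", "sophomore", "junior", "senior")
}
# Hypothetical clusters: 10 residents in each of five dorms.
clusters = {
    f"dorm_{d}": [f"dorm_{d}_resident_{i}" for i in range(1, 11)]
    for d in "ABCDE"
}

print(stratified_sample(strata, 3, rng))   # 3 students from every class year
print(cluster_sample(clusters, 2, rng))    # everyone in 2 randomly chosen dorms
```

A stratified sample includes some individuals from every group, while a cluster sample includes all individuals from only some groups.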

Systematic Sampling Steps:

Step 1: Approximate population size, N.
Step 2: Determine sample size, n.
Step 3: Compute k = N/n and round down to the nearest integer.
Step 4: Randomly select a number between 1 and k (call this p).
Step 5: Sample consists of individuals: p, p + k, p + 2k, ..., p + (n − 1)k.
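
The steps can be sketched as a short function, assuming k is N/n rounded down. The function name and the values N = 100 and n = 8 are illustrative only.

```python
import random

def systematic_sample(N, n, seed=None):
    """Return the 1-based positions selected by a systematic sample."""
    k = N // n                  # Step 3: k = N / n, rounded down
    rng = random.Random(seed)
    p = rng.randint(1, k)       # Step 4: random start between 1 and k
    # Step 5: p, p + k, p + 2k, ..., p + (n - 1)k
    return [p + i * k for i in range(n)]

# Hypothetical values: population of about 100, sample of 8, so k = 12.
print(systematic_sample(N=100, n=8, seed=3))
```

Because p is at most k, the last position p + (n − 1)k is at most nk, which never exceeds N.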

Figure: Comparison of simple random, stratified, systematic, and cluster sampling.

Bias in Sampling

Sources of Bias

If a sample is not representative of the population, it is biased. Main sources of bias include:

Sampling Bias: Technique favors one part of the population.
Nonresponse Bias: Individuals who do not respond differ from those who do.
Response Bias: Survey answers do not reflect true feelings due to interviewer error, misrepresented answers, wording, or order of questions.
Data-entry Error: Mistakes in recording data can lead to inaccurate results.

Nonsampling Errors: Errors from nonresponse, response bias, or data-entry error; can occur even in a census.
Sampling Error: Error from using a sample to estimate population information.
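
Sampling error can be seen in a small simulation. The population below is made-up data (hypothetical household sizes), and the sample size of 25 is arbitrary.

```python
import random
import statistics

rng = random.Random(2024)

# Hypothetical population of 500 household sizes (made-up values).
population = [rng.randint(1, 6) for _ in range(500)]
true_mean = statistics.mean(population)

# Sampling error: sample mean minus population mean, for five samples.
errors = []
for _ in range(5):
    sample = rng.sample(population, 25)
    errors.append(statistics.mean(sample) - true_mean)

# A census uses every individual, so its sampling error is zero.
census_error = statistics.mean(population) - true_mean

print([round(e, 3) for e in errors])
print(census_error)
```

The simulation shows only sampling error. Nonsampling errors, such as nonresponse or data-entry mistakes, would still affect a census.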

The Design of Experiments

Characteristics of an Experiment

An experiment is a controlled study to determine the effect of varying explanatory variables (factors) on a response variable. Treatments are combinations of factor values.

Experimental Unit: The subject or item receiving a treatment.
Control Group: The group that receives the baseline treatment, used for comparison.
Placebo: An innocuous treatment resembling the experimental treatment.
Blinding: Nondisclosure of treatment received; single-blind (subject unaware), double-blind (subject and researcher unaware).

Steps in Designing an Experiment

Identify the problem and response variable.
Determine factors affecting the response variable.
Determine the number of experimental units.
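Determine the level of each factor (predictor variable): control some factors by fixing them at one level, set others at chosen levels, and randomize to average out the effects of factors that cannot be controlled.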
Conduct the experiment (replication, data collection).
Test the claim using inferential statistics.

Completely Randomized Design

Each experimental unit is randomly assigned to a treatment. Example: Testing the effect of octane level on gas mileage using 12 cars randomly assigned to three octane levels.

Figure: Completely randomized design.

Replication: Each treatment is applied to more than one experimental unit, so that an observed effect is not due to the peculiarities of a single unit.
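
The random assignment in the octane example can be sketched as follows. Equal group sizes (4 cars per octane level) and the treatment labels are assumptions made for illustration; the function name is hypothetical.

```python
import random

def completely_randomized_design(units, treatments, seed=None):
    """Randomly split the units into equal-sized groups, one per treatment."""
    rng = random.Random(seed)
    shuffled = units[:]          # copy so the original list is not changed
    rng.shuffle(shuffled)
    group_size = len(units) // len(treatments)
    return {
        treatment: shuffled[i * group_size:(i + 1) * group_size]
        for i, treatment in enumerate(treatments)
    }

cars = [f"car_{i}" for i in range(1, 13)]                # 12 experimental units
octane_levels = ["octane_87", "octane_89", "octane_92"]  # hypothetical labels
assignment = completely_randomized_design(cars, octane_levels, seed=5)

for level, group in assignment.items():
    print(level, group)
```

Each octane level is applied to four cars, which also provides replication.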

Matched-Pairs Design

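Experimental units are paired based on related characteristics (e.g., twins, before-and-after measurements). Within each pair, the two treatments are assigned at random: one unit receives each treatment or, when the pair is a single individual measured twice, the individual receives both treatments in random order. Only two levels of treatment are used.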

Example: Children taste milk with and without Xylitol; order of tasting is randomized to avoid order effects.
Double-blind: Recommended to prevent bias.
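
Randomizing the tasting order in the milk example might look like the sketch below. The six children and the treatment labels are hypothetical, as is the function name.

```python
import random

def randomize_order(subjects, treatments, seed=None):
    """Give every subject both treatments in an independently shuffled order."""
    rng = random.Random(seed)
    orders = {}
    for subject in subjects:
        order = list(treatments)
        rng.shuffle(order)       # a fresh random order for each subject
        orders[subject] = tuple(order)
    return orders

children = [f"child_{i}" for i in range(1, 7)]           # hypothetical subjects
treatments = ("milk_with_xylitol", "milk_without_xylitol")
orders = randomize_order(children, treatments, seed=11)

for child, order in orders.items():
    print(child, "tastes", order[0], "first, then", order[1])
```

Because each child tastes both kinds of milk, every child serves as their own comparison, and the random order guards against order effects.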