Chapter 1: Introduction to Data – Essential Concepts in Statistics

Learning Objectives

Distinguish between numerical and categorical variables.
Understand and use rates (including percentages) and know when they are more useful than counts for describing and comparing groups.
Recognize when it is possible to infer a cause-and-effect relationship.
Explain how confounding variables prevent us from inferring causation and suggest confounding variables that are likely to occur in some situations.
Distinguish between observational studies and controlled experiments.

What Are Data?

What is Statistics?

Statistics is the science (and art) of collecting, organizing, summarizing, and analyzing data to answer questions and draw conclusions. The process involves several steps:

Formulating Questions
Collecting Data
Organizing and Summarizing Data
Making Conclusions

Statistics is essential for exploring the world, verifying beliefs, discovering patterns, and sharing findings. However, it must be used carefully: inappropriate use can lead to inaccurate beliefs, and conclusions drawn from data always carry some uncertainty.

Major Concepts in Statistics

Variation: Differences or changes in an item or measurement. For example, writing the letter "A" with slight differences each time demonstrates variation.
Data: Observations gathered to draw conclusions. Examples include measurements (weight, height, distance), counts (number of customers), or lists (song titles).

Data are More Than Just Numbers

Data are numbers in context. They consist not only of the numbers themselves but also the story behind them. For example, the numbers 7.91, 9.64, 9.18, 10.33, 7.46 could represent birth weights or prices for lunch. Understanding the context is crucial for meaningful analysis.

Data Analysis

Data analysis is the process of examining collected data to explain what the data tell us about the real world. It involves summarizing, visualizing, and interpreting data to answer research questions.

Classifying and Storing Data

Sample and Population

Population: The complete set of people or objects being studied. Obtaining all data from the population is usually impractical.
Sample: A subset of the population from which data are obtained. Samples are used to make inferences about the population and should be representative to allow generalization.

Example: A company samples 30 similar companies to study 401(k) participation rates. The population of interest is all similar companies.
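One common way to aim for a representative sample is to choose units at random. The sketch below assumes a hypothetical population of 500 company IDs (the notes do not give the population size or say how the 30 companies were chosen) and draws a simple random sample of 30.

```python
import random

# Hypothetical population: 500 similar companies, identified by ID.
# The population size is an assumption made for illustration only.
population = [f"company_{i:03d}" for i in range(1, 501)]

# Draw a simple random sample of 30 companies without replacement.
rng = random.Random(2024)  # fixed seed so the sketch is repeatable
sample = rng.sample(population, k=30)

print(len(sample))  # 30
print(sample[:3])   # the first few sampled IDs
```

Any conclusions drawn from the 30 sampled companies are then generalized to the full population of similar companies.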

Variables

Numerical (Quantitative) Variable: Describes quantities and contains measured numerical values with units. Examples: age, mileage, number of calories.
Categorical (Qualitative) Variable: Describes qualities or categories. Examples: gender, eye color, country of birth, type of meat.

Example: For Arby's sandwiches, the type of meat is categorical, while number of calories and serving size are numerical.

Coding Categorical Data

Categorical data can be coded with numbers for easier computer input (e.g., 1 for "Personal Growth", 2 for "Career Opportunities").
Yes/No questions are often coded as 0 for "No" and 1 for "Yes".
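A minimal sketch of 0/1 coding, using made-up responses: once "No" is coded as 0 and "Yes" as 1, the sum of the codes counts the "Yes" answers and the average gives their proportion.

```python
# Hypothetical responses to a Yes/No survey question.
responses = ["Yes", "No", "Yes", "Yes", "No"]

# Code "No" as 0 and "Yes" as 1.
codes = {"No": 0, "Yes": 1}
coded = [codes[r] for r in responses]

yes_count = sum(coded)                   # number of "Yes" answers
yes_proportion = yes_count / len(coded)  # share of "Yes" answers

print(coded)           # [1, 0, 1, 1, 0]
print(yes_count)       # 3
print(yes_proportion)  # 0.6
```

The codes are only labels: the variable is still categorical, even though it is stored as numbers.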

Storing Data: Stacked vs. Unstacked Data

Stacked Data: Data stored in a spreadsheet format where each row is an individual and each column is a variable.
Unstacked Data: Data stored such that each column represents a variable from a different group. Useful for comparing two groups.
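The two layouts hold the same information. As a rough illustration with invented scores for two groups, the stacked form below has one row per individual, and unstacking it produces one column of scores per group.

```python
# Hypothetical stacked data: one row per individual, one column per variable.
stacked = [
    {"group": "A", "score": 82},
    {"group": "B", "score": 75},
    {"group": "A", "score": 90},
    {"group": "B", "score": 68},
]

# Unstack: collect the scores into one column (list) per group.
unstacked = {}
for row in stacked:
    unstacked.setdefault(row["group"], []).append(row["score"])

print(unstacked)  # {'A': [82, 90], 'B': [75, 68]}
```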

Organizing Categorical Data

Frequency Tables

A frequency table displays each distinct outcome and its frequency (number of times observed). It helps organize raw data to reveal patterns.

Relative Frequency Tables

A relative frequency table shows the percentage of values in each category, making it easier to compare groups of different sizes.
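Both kinds of table can be built from raw data in a few lines. The sketch below uses a small, invented list of eye colors: the frequency table counts each category, and the relative frequency table divides each count by the total.

```python
from collections import Counter

# Hypothetical raw categorical data (eye colors of 8 people).
eye_colors = ["brown", "blue", "brown", "green",
              "brown", "blue", "brown", "blue"]

# Frequency table: how many times each category was observed.
frequency = Counter(eye_colors)

# Relative frequency table: each count as a percentage of the total.
total = sum(frequency.values())
relative_frequency = {color: 100 * n / total for color, n in frequency.items()}

for color in frequency:
    print(color, frequency[color], f"{relative_frequency[color]:.1f}%")
```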

Two-Way Tables

A two-way table displays the relationship between two categorical variables, showing how many times each combination of categories occurs.

Class	Alive	Dead	Total
First	203	122	325
Second	118	167	285
Third	178	528	706
Crew	212	673	885
Total	711	1490	2201

Additional info: This table summarizes survival on the Titanic by ticket class, with the crew listed as a separate row.
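Because the four groups differ in size, survival percentages are easier to compare than raw counts. The sketch below recomputes them from the counts in the table; the variable names are illustrative.

```python
# Counts taken from the two-way table above.
titanic = {
    "First":  {"Alive": 203, "Dead": 122},
    "Second": {"Alive": 118, "Dead": 167},
    "Third":  {"Alive": 178, "Dead": 528},
    "Crew":   {"Alive": 212, "Dead": 673},
}

# Percentage of each group that survived.
survival_pct = {
    cls: 100 * row["Alive"] / (row["Alive"] + row["Dead"])
    for cls, row in titanic.items()
}

most_survivors = max(titanic, key=lambda cls: titanic[cls]["Alive"])
highest_rate = max(survival_pct, key=survival_pct.get)
lowest_rate = min(survival_pct, key=survival_pct.get)

for cls, pct in survival_pct.items():
    print(f"{cls}: {pct:.1f}%")
print(most_survivors, highest_rate, lowest_rate)  # Crew First Crew
```

The crew had the largest number of survivors (212) but the lowest survival percentage (about 24%), while first class had the highest (about 62%). Counts alone would give a misleading comparison.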

Percentages and Rates

Percentages are useful for comparing groups of different sizes.
Rates are often reported as "number of events per 1,000 objects" or similar units.

Example: If 65% of 400 students carry calculators, then 0.65 × 400 = 260 students carry calculators.
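The calculator example can be checked with a short sketch. The helper names `percentage` and `rate_per` are hypothetical, chosen only for this illustration.

```python
def percentage(count, total):
    """Return count as a percentage of total."""
    return 100 * count / total


def rate_per(count, total, per=1000):
    """Return the number of events per `per` objects."""
    return per * count / total


students = 400
carrying = 65 * students // 100  # 65% of 400 students

print(carrying)                        # 260
print(percentage(carrying, students))  # 65.0
print(rate_per(carrying, students))    # 650.0
```

The same fact can be stated as a count (260 students), a percentage (65%), or a rate (650 per 1,000 students); only the last two allow a fair comparison with a school of a different size.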

Collecting Data to Understand Causality

Causality

Treatment Variable: Whether a specific treatment is used.
Outcome (Response) Variable: Whether a certain outcome is observed.

The goal is to determine whether the treatment variable causes a change in the outcome variable.

Groups in Experiments

Treatment Group: Receives the treatment.
Control Group: Does not receive the treatment.

Controlled Experiments vs. Observational Studies

Controlled Experiment: Researchers assign subjects to groups and control conditions. Only controlled experiments can establish cause-and-effect relationships.
Observational Study: Researchers observe groups that already exist without assigning subjects. Cannot establish causality.

Random Assignment

Participants are randomly assigned to treatment or control groups using methods such as coin flips or computer randomization.
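Computer randomization might look like the following sketch, which shuffles a list of hypothetical subject labels and splits it in half. The function name `randomly_assign` and the `seed` parameter are assumptions for illustration, not part of any particular study protocol.

```python
import random


def randomly_assign(subjects, seed=None):
    """Shuffle the subjects and split them into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)  # copy so the original order is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Ten hypothetical participants.
subjects = [f"subject_{i}" for i in range(1, 11)]
treatment, control = randomly_assign(subjects, seed=1)

print("Treatment:", treatment)
print("Control:  ", control)
```

Because chance alone decides who lands in each group, neither the researchers nor the participants can influence the composition of the groups.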

Designing Controlled Experiments

Large sample sizes are needed so that the natural variability among subjects is represented in each group.
Random assignment minimizes bias.
All conditions should be as similar as possible except for the treatment.

Example: Effects of Light on Mice

Group	Tumors	No Tumors
LD (12h light/12h dark)	4	46
LL (24h light)	14	36

Conclusion: Random assignment allows us to infer causality. Tumors developed in 14 of 50 mice (28%) in the LL group but in only 4 of 50 (8%) in the LD group, suggesting that constant light exposure may cause tumors.

Bias and Confounding Variables

Bias: Occurs when assignments are not random, leading to unrepresentative groups.
Confounding Variable: A characteristic other than the treatment that differs between the groups and affects the outcome, so its effect cannot be separated from the effect of the treatment.

Example: Assigning the heaviest people to the exercise group introduces bias.

Association is Not Causation

Just because two variables are associated does not mean one causes the other.
Confounding variables may explain the association.

Example: Children with larger shoe sizes tend to have higher vocabulary scores, but age is a confounding variable: older children have both bigger feet and larger vocabularies.

Anecdotal Evidence

Anecdotes are personal stories and are not reliable for scientific conclusions.

Placebo Effect

A placebo is a fake treatment.
The placebo effect occurs when participants respond to a fake treatment because they believe they are receiving a real one.

Blind and Double-Blind Studies

Blind Study: Participants do not know their group assignment.
Double-Blind Study: Neither participants nor researchers know group assignments.

Gold Standard for Experiments

Large sample size
Controlled and randomized assignment
Double-blind design
Use of placebo (if appropriate)

Example: Medical Study Comparing Treatments

Treatment	Remission	No Remission
Combination	78	59
Inflix Alone	94	118
Azath Alone	19	44

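Conclusion: The combination treatment was most effective, with the highest remission rate: about 57% (78 of 137), compared with about 44% for Inflix alone (94 of 212) and about 30% for Azath alone (19 of 63). Randomization and placebo use allow us to infer causality.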
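The three treatment groups differ in size, so the comparison rests on remission percentages rather than counts. A short sketch using the counts from the table:

```python
# Counts from the table above: (remission, no remission).
results = {
    "Combination":  (78, 59),
    "Inflix Alone": (94, 118),
    "Azath Alone":  (19, 44),
}

# Percentage of each group that reached remission.
remission_pct = {
    treatment: 100 * yes / (yes + no)
    for treatment, (yes, no) in results.items()
}

best = max(remission_pct, key=remission_pct.get)
most_remissions = max(results, key=lambda t: results[t][0])

for treatment, pct in remission_pct.items():
    print(f"{treatment}: {pct:.1f}%")
print(best)             # Combination
print(most_remissions)  # Inflix Alone
```

Inflix alone produced the most remissions by count (94), but it also had the largest group (212 patients); by percentage, the combination treatment comes out ahead.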

Key Terms and Formulas

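Rate: number of events ÷ number of objects, often scaled to a convenient base (for example, events per 1,000 objects)
Percentage: (count in category ÷ total count) × 100%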

Summary Table: Types of Variables

Type	Description	Examples
Numerical (Quantitative)	Measures quantities	Age, Height, Number of calories
Categorical (Qualitative)	Describes categories or groups	Gender, Type of meat, Eye color

Summary Table: Study Types

Type	Description	Can Infer Causality?
Controlled Experiment	Researchers assign groups and control conditions	Yes
Observational Study	Groups exist naturally; no assignment	No