Introduction to Data: Foundations of Statistics (Chapter 1 Study Notes)
Chapter 1: Introduction to Data
Learning Objectives
Distinguish between numerical and categorical variables.
Understand and use rates (including percentages) and know when they are more useful than counts for describing and comparing groups.
Recognize when it is possible to infer a cause-and-effect relationship.
Explain how confounding variables prevent us from inferring causation and suggest confounding variables that are likely to occur in some situations.
Distinguish between observational studies and controlled experiments.
What Are Data?
What is Statistics?
Statistics is the science (and art) of collecting, organizing, summarizing, and analyzing data to answer questions and draw conclusions. The process involves:
Formulating questions
Collecting data
Organizing and summarizing data
Making conclusions
Why Do We Care About Statistics?
Statistics allows us to explore the world around us.
Use evidence to check whether our beliefs are accurate.
Find patterns that lead to discoveries.
Share new discoveries with others.
Keep in Mind:
Statistics must be used carefully.
Inappropriate use will result in inaccurate beliefs.
Results are always uncertain.
Statistics Rests on Two Major Concepts
Variation: Differences or changes in an item or group. For example, writing the letter "A" with slight differences each time.
Data: Observations gathered to draw conclusions. Examples include measurements (weight, height, distance), number of customers, or a list of song titles stored on a computer.
Data are More Than Just Numbers
Data are numbers in context. They consist not only of the numbers, but also of the story behind the numbers. For example, the numbers 7.91, 9.64, 9.18, 10.33, 7.46 may represent birth weights or prices for lunch.
Data Analysis
Data analysis is the process of examining collected data to explain what the data tell us about the real world.
Classifying and Storing Data
Sample and Population
Population: The complete set of people or objects being studied. Information about the population is usually the goal, but obtaining all data values from the population is often impossible.
Sample: A subset of the population from which data are obtained. A sample is used to make inferences about the population. The goal is to describe the population, and the sample should be representative.
Context of Data: The "W's"
Who: Describe the individuals who were surveyed.
What: Determine what is being measured.
When: When was the research conducted?
Where: Where was the research conducted?
Why: What was the purpose of the survey or experiment?
How: Describe how the survey or experiment was conducted.
Answers to "Who" and "What" are essential; without them, data values are meaningless.
Types of Individuals in Data
Respondents: Individuals who answer surveys (e.g., customers at Amazon).
Subjects/Participants: People who are experimented on (e.g., patients receiving medication).
Experimental Units: Objects of the experiment that are not people (e.g., animals, plants, websites).
Records: Rows in a database, such as each person's purchase record at Amazon.
Variables
Numerical (Quantitative) Variable: Describes quantities of the objects of interest. Contains measured numerical values with measurement units. Quantitative data represent a quantity that can be measured (e.g., age, mileage).
Categorical (Qualitative) Variable: Describes qualities of the objects of interest. Tells which group or category an individual belongs to (e.g., gender, eye color, country of birth).
Example:
Age: Quantitative
Gender: Qualitative
Mileage of a car: Quantitative
Color of a car: Qualitative
Coding Categorical Data
Categorical data can be coded with numbers for easier input into computers (e.g., 1 for "Personal Growth", 2 for "Career Opportunities").
Yes/No questions can be coded as 0 for "No" and 1 for "Yes".
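As a minimal sketch of the 0/1 coding described above, the following Python snippet codes a short list of Yes/No answers. The responses are invented for illustration and do not come from any survey in these notes.

```python
# Hypothetical survey responses (invented for illustration).
responses = ["Yes", "No", "Yes", "Yes", "No"]

# Code "No" as 0 and "Yes" as 1.
yes_no_codes = {"No": 0, "Yes": 1}
coded = [yes_no_codes[answer] for answer in responses]

# With 0/1 coding, the sum counts the "Yes" answers
# and the mean is the proportion of "Yes" answers.
yes_count = sum(coded)
yes_proportion = yes_count / len(coded)

print(coded, yes_count, yes_proportion)
```

The codes are only labels for categories. A variable coded this way is still categorical, even though it is stored as numbers.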
Storing Data: Stacked vs. Unstacked Data
Stacked Data: Data stored in a spreadsheet style format. Each row contains several characteristics of an individual; each column represents a variable. Stacked data can store many variables.
Unstacked Data: Data stored such that each column represents a variable from a different group. A single variable is broken into different groups. Unstacked data can only store two variables: the variable of interest and a categorical variable indicating group membership.
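The difference between the two layouts can be sketched in Python. The scores and group labels below are invented for illustration, and `unstack` is a hypothetical helper name, not a function from any library.

```python
# Stacked: one row per individual, one column per variable.
# Values are invented for illustration.
stacked = [
    {"group": "A", "score": 82},
    {"group": "B", "score": 75},
    {"group": "A", "score": 90},
    {"group": "B", "score": 68},
]


def unstack(rows, group_key, value_key):
    """Split one variable into a separate list ("column") per group."""
    columns = {}
    for row in rows:
        columns.setdefault(row[group_key], []).append(row[value_key])
    return columns


# Unstacked: one column of scores for each group.
unstacked = unstack(stacked, "group", "score")
print(unstacked)
```

The unstacked form holds only the variable of interest (score) and the grouping variable, while the stacked form could carry additional columns for each individual.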
Organizing Categorical Data
Why Organize Data?
Raw data can be messy and hard to read, which makes patterns difficult to see.
Organizing data helps reveal patterns and relationships.
Frequency Tables
A frequency table displays each distinct outcome and its frequency (number of times observed).
Relative Frequency Tables
A relative frequency table displays the percentage, rather than the count, of values in each category.
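A frequency table and its relative frequency counterpart can be built in a few lines of Python. This is a sketch using the standard library `collections.Counter`. The eye-color values are invented for illustration.

```python
from collections import Counter

# Hypothetical categorical data (invented for illustration).
eye_colors = ["brown", "blue", "brown", "green",
              "brown", "blue", "brown", "blue"]

# Frequency table: each distinct outcome and how often it was observed.
frequency = Counter(eye_colors)

# Relative frequency table: the same outcomes as percentages of the total.
total = sum(frequency.values())
relative_frequency = {color: 100 * count / total
                      for color, count in frequency.items()}

for color, count in frequency.items():
    print(f"{color}: {count} ({relative_frequency[color]:.1f}%)")
```

The relative frequencies always add up to 100%, which is a quick check that no category was missed.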
Two-Way Tables
A two-way table shows how individuals are distributed across two categorical variables, which helps reveal how the variables are related. Each cell gives the count for one combination of values.
| Class | Alive | Dead | Total |
|---|---|---|---|
| First | 203 | 122 | 325 |
| Second | 118 | 167 | 285 |
| Third | 178 | 528 | 706 |
| Crew | 212 | 673 | 885 |
| Total | 711 | 1490 | 2201 |
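Because the four classes differ in size, the raw counts in the table are hard to compare directly. The sketch below converts each row to a percentage alive, using the counts from the table. The helper name `row_percentages` is hypothetical.

```python
# Counts copied from the two-way table above.
two_way = {
    "First":  {"Alive": 203, "Dead": 122},
    "Second": {"Alive": 118, "Dead": 167},
    "Third":  {"Alive": 178, "Dead": 528},
    "Crew":   {"Alive": 212, "Dead": 673},
}


def row_percentages(table, column):
    """Percentage of each row's total that falls in the given column."""
    return {row: 100 * counts[column] / sum(counts.values())
            for row, counts in table.items()}


alive_pct = row_percentages(two_way, "Alive")
for cls, pct in alive_pct.items():
    print(f"{cls}: {pct:.1f}% alive")
```

Crew has the largest count alive (212) but the lowest percentage alive, which is why percentages are the fairer comparison here.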
Percentages and Rates
Percentages are useful for comparing groups of different sizes.
Rates are often reported as "number of events per 1,000 objects" or similar.
Example: If 400 students were surveyed and 65% were carrying calculators, then 0.65 × 400 = 260 students were carrying calculators.
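The conversions between counts, percentages, and rates can be sketched in Python. The calculator figures come from the example above. The figures in the rate example (30 events among 12,000 objects) are invented for illustration, and both function names are hypothetical.

```python
def count_from_percent(percent, total):
    """Number of individuals that a given percentage of a group represents."""
    return round(total * percent / 100)


def rate_per_1000(events, population):
    """Number of events per 1,000 objects."""
    return 1000 * events / population


# 65% of 400 surveyed students.
calculator_count = count_from_percent(65, 400)

# Hypothetical: 30 events observed among 12,000 objects.
example_rate = rate_per_1000(30, 12_000)

print(calculator_count, example_rate)
```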
Collecting Data to Understand Causality
Causality
Treatment variable: Whether or not a specific treatment is used.
Outcome (response) variable: Whether or not a certain outcome is seen.
The goal is to determine whether the treatment variable causes a change in the outcome variable.
Groups in Experiments
Treatment Group: Individuals who receive the treatment.
Control (Comparison) Group: Individuals who do not receive the treatment.
Controlled Experiments vs. Observational Studies
Controlled Experiment: Researchers assign subjects to control or treatment groups. Only controlled experiments can establish cause-and-effect relationships.
Observational Study: Uses groups that already exist; researchers do not assign subjects to them. An observational study can show an association but cannot establish causation.
Random Assignment
Participants are randomly assigned to control and treatment groups using methods such as rolling dice or flipping coins.
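In software, random assignment is often done by shuffling the participant list and splitting it in half. This is a stand-in for flipping a coin for each person, with the difference that it guarantees equal group sizes. The sketch below uses Python's standard `random` module. The participant labels and the function name `randomly_assign` are hypothetical.

```python
import random


def randomly_assign(participants, seed=None):
    """Shuffle the participants and split them into two groups."""
    rng = random.Random(seed)      # a seed makes the assignment reproducible
    shuffled = list(participants)  # copy so the original list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical participant labels P1..P10.
participants = [f"P{i}" for i in range(1, 11)]
treatment, control = randomly_assign(participants, seed=1)

print("Treatment:", treatment)
print("Control:  ", control)
```

Because chance alone decides who lands in each group, neither the researchers nor the participants can influence the assignment.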
Designing Controlled Experiments
Sample sizes should be large enough to account for the natural variability among subjects.
Subjects should be assigned randomly.
All conditions should be as similar as possible except for the treatment.
Example: Effects of Light on Mice
| Group | Tumors | No Tumors |
|---|---|---|
| LD (12h light/12h dark) | 4 | 46 |
| LL (24h light) | 14 | 36 |
Percentage of mice with tumors in LL group: 14/50 = 28%
Percentage of mice with tumors in LD group: 4/50 = 8%
This is a controlled experiment because of random assignment. Cause and effect can be inferred.
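The comparison between the two groups can be checked with a short Python sketch, using the counts from the table in this example. The variable names are illustrative.

```python
# Counts copied from the mice table above.
mice = {
    "LD": {"tumors": 4, "no_tumors": 46},
    "LL": {"tumors": 14, "no_tumors": 36},
}

# Percentage of mice with tumors in each group.
tumor_pct = {
    group: 100 * counts["tumors"] / (counts["tumors"] + counts["no_tumors"])
    for group, counts in mice.items()
}

# Difference between the groups, in percentage points.
difference = tumor_pct["LL"] - tumor_pct["LD"]

print(tumor_pct, difference)
```

Both groups contain 50 mice, so counts and percentages tell the same story here. Percentages become essential when group sizes differ.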
Example: Vitamin C and Allergies
Observational study: Mothers chose whether to take vitamin C. Cannot conclude causation.
Bias and Confounding Variables
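Bias: Occurs when the treatment and control groups differ in some systematic way that affects the outcome, for example when assignment is not random.
Confounding Variable: A characteristic other than the treatment that differs between the groups and may itself cause the outcome, so that its effect cannot be separated from the effect of the treatment.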
Association is not Causation: Just because two variables are associated does not mean one causes the other.
Anecdotal Evidence
Anecdotes are stories about individual experiences and are not reliable for scientific conclusions.
Placebo Effect
A placebo is a "fake" treatment.
The placebo effect is the phenomenon of reacting to a treatment even if it is not real.
Blind and Double Blind Studies
Blind Study: Participants do not know their group assignment.
Double Blind Study: Neither participants nor researchers know the group assignments.
Gold Standard for Experiments
Large sample size
Controlled and randomized
Double-blind
Placebo (if appropriate)
Example: Medical Study on Crohn's Disease
| Treatment | Remission | No Remission |
|---|---|---|
| Combination | 50 | 38 |
| Inflix Alone | 44 | 55 |
| Azath Alone | 30 | 70 |
Remission rates:
Combination: 50/88 ≈ 56.8%
Inflix Alone: 44/99 ≈ 44.4%
Azath Alone: 30/100 = 30%
The combination treatment was the most effective. Cause and effect can be inferred due to randomization and placebo control.
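The three treatment groups differ in size (88, 99, and 100 patients), so the comparison has to be made on rates rather than counts. A minimal Python sketch using the counts from the table in this example:

```python
# Counts copied from the Crohn's disease table above.
results = {
    "Combination":  {"remission": 50, "no_remission": 38},
    "Inflix Alone": {"remission": 44, "no_remission": 55},
    "Azath Alone":  {"remission": 30, "no_remission": 70},
}

# Remission rate (%) = remissions / group size * 100.
remission_rate = {
    treatment: 100 * c["remission"] / (c["remission"] + c["no_remission"])
    for treatment, c in results.items()
}

# Treatment with the highest remission rate.
best = max(remission_rate, key=remission_rate.get)

for treatment, rate in remission_rate.items():
    print(f"{treatment}: {rate:.1f}%")
print("Highest remission rate:", best)
```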
Additional info: Some tables and percentages were inferred and completed for clarity and completeness.