Introduction to Data: Foundations of Statistics (Chapter 1 Study Notes)

Chapter 1: Introduction to Data

Learning Objectives

  • Distinguish between numerical and categorical variables.

  • Understand and use rates (including percentages) and know when they are more useful than counts for describing and comparing groups.

  • Recognize when it is possible to infer a cause-and-effect relationship.

  • Explain how confounding variables prevent us from inferring causation and suggest confounding variables that are likely to occur in some situations.

  • Distinguish between observational studies and controlled experiments.

What Are Data?

What is Statistics?

Statistics is the science (and art) of collecting, organizing, summarizing, and analyzing data to answer questions and draw conclusions. The process involves:

  • Formulating questions

  • Collecting data

  • Organizing and summarizing data

  • Making conclusions

Why Do We Care About Statistics?

  • Statistics allows us to explore the world around us.

  • It lets us use evidence to check whether our beliefs are accurate.

  • It helps us find patterns that lead to discoveries.

  • It gives us a way to share new discoveries with others.

Keep in Mind:

  • Statistics must be used carefully.

  • Inappropriate use can lead to inaccurate beliefs.

  • Conclusions drawn from data always carry some uncertainty.

Statistics Rests on Two Major Concepts

  • Variation: Differences or changes in an item or group. For example, writing the letter "A" with slight differences each time.

  • Data: Observations gathered to draw conclusions. Examples include measurements (weight, height, distance), number of customers, or a list of song titles stored on a computer.

Data are More Than Just Numbers

Data are numbers in context. They consist not only of the numbers, but also of the story behind the numbers. For example, the numbers 7.91, 9.64, 9.18, 10.33, 7.46 may represent birth weights or prices for lunch.

Data Analysis

Data analysis is the process of examining collected data to explain what the data tell us about the real world.

Classifying and Storing Data

Sample and Population

  • Population: The complete set of people or objects being studied. Information about the population is usually the goal, but obtaining all data values from the population is often impossible.

  • Sample: A subset of the population from which data are obtained. A sample is used to make inferences about the population. Because the goal is to describe the population, the sample should be representative of it.

Context of Data: The "W's"

  • Who: Describe the individuals who were surveyed.

  • What: Determine what is being measured.

  • When: When was the research conducted?

  • Where: Where was the research conducted?

  • Why: What was the purpose of the survey or experiment?

  • How: Describe how the survey or experiment was conducted.

Answers to "Who" and "What" are essential; without them, data values are meaningless.

Types of Individuals in Data

  • Respondents: Individuals who answer surveys (e.g., customers at Amazon).

  • Subjects/Participants: People who are experimented on (e.g., patients receiving medication).

  • Experimental Units: Objects of the experiment that are not people (e.g., animals, plants, websites).

  • Records: Rows in a database, such as each person's purchase record at Amazon.

Variables

  • Numerical (Quantitative) Variable: Describes a quantity of the objects of interest. Its values are measured numbers, usually with units (e.g., age, mileage).

  • Categorical (Qualitative) Variable: Describes qualities of the objects of interest. Tells which group or category an individual belongs to (e.g., gender, eye color, country of birth).

Example:

  • Age: Quantitative

  • Gender: Qualitative

  • Mileage of a car: Quantitative

  • Color of a car: Qualitative

Coding Categorical Data

  • Categorical data can be coded with numbers for easier input into computers (e.g., 1 for "Personal Growth", 2 for "Career Opportunities").

  • Yes/No questions can be coded as 0 for "No" and 1 for "Yes".
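The coding scheme above can be sketched in a few lines of Python. The survey responses below are made up for illustration; only the category labels and codes come from the notes.

```python
# Hypothetical survey responses; the labels and codes follow the examples above.
reasons = ["Personal Growth", "Career Opportunities", "Personal Growth"]
answers = ["Yes", "No", "Yes", "Yes"]

# Codebooks mapping each category label to its numeric code.
reason_codes = {"Personal Growth": 1, "Career Opportunities": 2}
yes_no_codes = {"No": 0, "Yes": 1}

coded_reasons = [reason_codes[r] for r in reasons]
coded_answers = [yes_no_codes[a] for a in answers]

# With 0/1 coding, the sum counts the "Yes" answers
# and the mean is the proportion of "Yes" answers.
yes_count = sum(coded_answers)
yes_proportion = yes_count / len(coded_answers)

print(coded_reasons, coded_answers, yes_count, yes_proportion)
```

The codes 1 and 2 are only labels: the variable is still categorical, so averaging them would not be meaningful. The 0/1 coding is the exception, because its mean is the proportion of "Yes" answers.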

Storing Data: Stacked vs. Unstacked Data

  • Stacked Data: Data stored in a spreadsheet style format. Each row contains several characteristics of an individual; each column represents a variable. Stacked data can store many variables.

  • Unstacked Data: Data stored such that each column represents a variable from a different group. A single variable is broken into different groups. Unstacked data can only store two variables: the variable of interest and a categorical variable indicating group membership.
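As a rough illustration of the two layouts, the sketch below converts a small stacked data set into unstacked form. The group names and scores are invented, and plain Python lists and dictionaries stand in for spreadsheet rows and columns.

```python
from collections import defaultdict

# Stacked: one row per individual, one column per variable (values are made up).
stacked = [
    {"group": "A", "score": 82},
    {"group": "B", "score": 75},
    {"group": "A", "score": 91},
    {"group": "B", "score": 68},
]

# Unstacked: one column (here, one list) per group,
# each holding the values of the variable of interest.
columns = defaultdict(list)
for row in stacked:
    columns[row["group"]].append(row["score"])
unstacked = dict(columns)

print(unstacked)
```

The unstacked form keeps only the score and the group it belongs to; any other variables recorded in the stacked rows would be lost in the conversion.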

Organizing Categorical Data

Why Organize Data?

  • Raw data can be messy and hard to read, which makes patterns difficult to see.

  • Organizing data helps reveal patterns and relationships.

Frequency Tables

A frequency table displays each distinct outcome and its frequency (number of times observed).

Relative Frequency Tables

A relative frequency table displays the percentage, rather than the count, of values in each category.
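A frequency table and its relative frequency counterpart can be built together, as in this minimal sketch. The eye-color observations are hypothetical.

```python
from collections import Counter

# Hypothetical eye-color observations for eight people.
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "blue"]

# Frequency table: each distinct outcome and how many times it was observed.
frequency = Counter(eye_colors)

# Relative frequency table: the same outcomes as percentages of the total.
n = len(eye_colors)
relative_frequency = {color: 100 * count / n for color, count in frequency.items()}

for color, count in frequency.most_common():
    print(f"{color:<6} {count:>3} {relative_frequency[color]:>6.1f}%")
```

The relative frequencies always add up to 100%, which is a quick check that no category was missed.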

Two-Way Tables

A two-way table shows how individuals are distributed across the combinations of two categorical variables, which helps reveal whether the variables are related. Each cell gives the count for one combination of values.

Class     Alive   Dead   Total
First       203    122     325
Second      118    167     285
Third       178    528     706
Crew        212    673     885
Total       711   1490    2201
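The counts in the two-way table above are hard to compare directly because the classes differ in size. The sketch below converts each row to a percentage; the helper name percent_alive is only illustrative.

```python
# Counts from the two-way table above: class -> (alive, dead).
counts = {
    "First": (203, 122),
    "Second": (118, 167),
    "Third": (178, 528),
    "Crew": (212, 673),
}

def percent_alive(alive, dead):
    """Percentage of a group that is in the Alive column."""
    return 100 * alive / (alive + dead)

survival = {cls: percent_alive(alive, dead) for cls, (alive, dead) in counts.items()}

for cls, pct in survival.items():
    print(f"{cls:<7} {pct:5.1f}%")
```

The crew has the largest Alive count (212) but the lowest percentage, about 24.0%, while First class has a slightly smaller count (203) and a much higher percentage, about 62.5%. This is the sense in which rates can be more useful than counts for comparing groups.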

Percentages and Rates

  • Percentages are useful for comparing groups of different sizes.

  • Rates are often reported as "number of events per 1,000 objects" or similar.

Example: If 400 students were surveyed and 65% were carrying calculators, then 0.65 × 400 = 260 students were carrying calculators.
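Both calculations can be written as small helper functions. This is a sketch: the function names are made up, and the numbers in the rate example (18 events among 12,000 objects) are hypothetical.

```python
def count_from_percent(total, percent):
    """Number of individuals that a given percentage of the total represents."""
    return total * percent / 100

def rate_per(events, population, per=1000):
    """Number of events per `per` objects (per 1,000 by default)."""
    return events * per / population

# The calculator example: 65% of 400 students.
calculators = count_from_percent(400, 65)

# Hypothetical rate example: 18 events observed among 12,000 objects.
events_per_thousand = rate_per(18, 12_000)

print(calculators, events_per_thousand)
```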

Collecting Data to Understand Causality

Causality

  • Treatment variable: Whether or not a specific treatment is used.

  • Outcome (response) variable: Whether or not a certain outcome is seen.

  • The goal is to determine whether the treatment variable causes a change in the outcome variable.

Groups in Experiments

  • Treatment Group: Individuals who receive the treatment.

  • Control (Comparison) Group: Individuals who do not receive the treatment.

Controlled Experiments vs. Observational Studies

  • Controlled Experiment: Researchers assign subjects to control or treatment groups. Only controlled experiments can establish cause-and-effect relationships.

  • Observational Study: Uses groups that already exist; researchers do not assign subjects to groups. An observational study alone cannot establish causation.

Random Assignment

  • Participants are randomly assigned to control and treatment groups using methods such as rolling dice or flipping coins.
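On a computer, the coin flips can be replaced by a random shuffle. The sketch below is one simple way to do this, assuming two equal-sized groups; the function name randomly_assign and the participant labels are illustrative.

```python
import random

def randomly_assign(participants, seed=None):
    """Shuffle the participants and split them into treatment and control halves."""
    rng = random.Random(seed)      # a seed makes the assignment reproducible
    shuffled = list(participants)  # copy so the original order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

participants = [f"P{i}" for i in range(1, 11)]  # ten hypothetical participants
groups = randomly_assign(participants, seed=1)

print(groups["treatment"])
print(groups["control"])
```

Because chance alone decides who lands in each group, neither the researchers nor the participants can influence the assignment.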

Designing Controlled Experiments

  • Sample sizes should be large enough to account for the natural variability among subjects.

  • Subjects should be assigned randomly.

  • All conditions should be as similar as possible except for the treatment.

Example: Effects of Light on Mice

Group                      Tumors   No Tumors
LD (12h light/12h dark)         4          46
LL (24h light)                 14          36

Percentage of mice with tumors in LL group: 14/50 = 28%

Percentage of mice with tumors in LD group: 4/50 = 8%

This is a controlled experiment: the researchers assigned the mice to the lighting conditions at random. Cause and effect can therefore be inferred.

Example: Vitamin C and Allergies

This was an observational study: mothers chose for themselves whether to take vitamin C, so causation cannot be concluded.

Bias and Confounding Variables

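  • Bias: A systematic distortion of results. In an experiment, it can occur when assignment to groups is not random.

  • Confounding Variable: A characteristic other than the treatment that differs between the groups and may explain the difference in outcomes.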

Association is not Causation: Just because two variables are associated does not mean one causes the other.

Anecdotal Evidence

  • Anecdotes are stories about individual experiences and are not reliable for scientific conclusions.

Placebo Effect

  • A placebo is a "fake" treatment.

  • The placebo effect is the phenomenon of reacting to a treatment even if it is not real.

Blind and Double Blind Studies

  • Blind Study: Participants do not know their group assignment.

  • Double Blind Study: Neither the participants nor the researchers know the group assignments.

Gold Standard for Experiments

  • Large sample size

  • Controlled and randomized

  • Double-blind

  • Placebo (if appropriate)

Example: Medical Study on Crohn's Disease

Treatment      Remission   No Remission
Combination           50             38
Inflix Alone          44             55
Azath Alone           30             70

Remission rates:

  • Combination: 50/88 ≈ 56.8%

  • Inflix Alone: 44/99 ≈ 44.4%

  • Azath Alone: 30/100 = 30.0%

The combination treatment was the most effective. Cause and effect can be inferred due to randomization and placebo control.
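The comparison can be checked with a short calculation on the counts in the table above. This is only a sketch; the variable names are illustrative.

```python
# Counts from the table above: treatment -> (remission, no remission).
trial = {
    "Combination": (50, 38),
    "Inflix Alone": (44, 55),
    "Azath Alone": (30, 70),
}

# Remission rate = remissions divided by the size of that treatment group.
remission_rates = {
    treatment: 100 * remission / (remission + no_remission)
    for treatment, (remission, no_remission) in trial.items()
}

# Treatment with the highest remission rate.
best = max(remission_rates, key=remission_rates.get)

for treatment, rate in remission_rates.items():
    print(f"{treatment:<13} {rate:5.1f}%")
print("Highest remission rate:", best)
```

Rates are used here rather than counts because the three groups are not the same size (88, 99, and 100 patients).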

Additional info: Some tables and percentages were inferred and completed for clarity and completeness.
