Chapter 1: Introduction to Data – Essential Concepts in Statistics

Learning Objectives

  • Distinguish between numerical and categorical variables.

  • Understand and use rates (including percentages) and know when they are more useful than counts for describing and comparing groups.

  • Recognize when it is possible to infer a cause-and-effect relationship.

  • Explain how confounding variables prevent us from inferring causation and suggest confounding variables that are likely to occur in some situations.

  • Distinguish between observational studies and controlled experiments.

What Are Data?

What is Statistics?

Statistics is the science (and art) of collecting, organizing, summarizing, and analyzing data to answer questions and draw conclusions. The process involves several steps:

  • Formulating Questions

  • Collecting Data

  • Organizing and Summarizing Data

  • Making Conclusions

Statistics is essential for exploring the world, verifying beliefs, discovering patterns, and sharing findings. However, it must be used carefully: inappropriate use can lead to inaccurate beliefs, and statistical results always carry some uncertainty.

Major Concepts in Statistics

  • Variation: Differences or changes in an item or measurement. For example, writing the letter "A" with slight differences each time demonstrates variation.

  • Data: Observations gathered to draw conclusions. Examples include measurements (weight, height, distance), counts (number of customers), or lists (song titles).

Data are More Than Just Numbers

Data are numbers in context. They consist not only of the numbers themselves but also the story behind them. For example, the numbers 7.91, 9.64, 9.18, 10.33, 7.46 could represent birth weights or prices for lunch. Understanding the context is crucial for meaningful analysis.

Data Analysis

Data analysis is the process of examining collected data to explain what the data tell us about the real world. It involves summarizing, visualizing, and interpreting data to answer research questions.

Classifying and Storing Data

Sample and Population

  • Population: The complete set of people or objects being studied. Obtaining all data from the population is usually impractical.

  • Sample: A subset of the population from which data are obtained. Samples are used to make inferences about the population and should be representative to allow generalization.

Example: A company samples 30 similar companies to study 401(k) participation rates. The population of interest is all similar companies.

Variables

  • Numerical (Quantitative) Variable: Describes quantities and contains measured numerical values with units. Examples: age, mileage, number of calories.

  • Categorical (Qualitative) Variable: Describes qualities or categories. Examples: gender, eye color, country of birth, type of meat.

Example: For Arby's sandwiches, the type of meat is categorical, while number of calories and serving size are numerical.

Coding Categorical Data

  • Categorical data can be coded with numbers for easier computer input (e.g., 1 for "Personal Growth", 2 for "Career Opportunities").

  • Yes/No questions are often coded as 0 for "No" and 1 for "Yes".
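As a minimal sketch of this coding scheme, the snippet below maps Yes/No answers to 1 and 0. The response list and the variable names are made up for illustration; only the 0/1 convention comes from the notes above.

```python
# Hypothetical survey responses (illustrative data, not from the chapter).
responses = ["Yes", "No", "Yes", "Yes", "No"]

# Map each category label to its numeric code: 0 for "No", 1 for "Yes".
yes_no_codes = {"No": 0, "Yes": 1}
coded = [yes_no_codes[answer] for answer in responses]

# With 0/1 coding, the sum counts the "Yes" answers
# and the mean is the proportion of "Yes" answers.
yes_count = sum(coded)
yes_proportion = yes_count / len(coded)

print(coded)           # [1, 0, 1, 1, 0]
print(yes_count)       # 3
print(yes_proportion)  # 0.6
```

The codes are labels, not quantities: a variable coded this way is still categorical, even though it is stored as numbers.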

Storing Data: Stacked vs. Unstacked Data

  • Stacked Data: Data stored in a spreadsheet format where each row is an individual and each column is a variable.

  • Unstacked Data: Data stored such that each column represents a variable from a different group. Useful for comparing two groups.
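The difference between the two layouts can be sketched in a few lines. The group labels, the values, and the helper name `unstack` below are hypothetical; this is one possible way to convert between the formats, not a prescribed method.

```python
from collections import defaultdict

# Stacked: one row per individual, one column per variable (group, value).
# The groups and values are made up for illustration.
stacked = [
    ("A", 12),
    ("B", 15),
    ("A", 9),
    ("B", 11),
    ("A", 14),
]

def unstack(rows):
    """Collect the values into one list (one "column") per group."""
    columns = defaultdict(list)
    for group, value in rows:
        columns[group].append(value)
    return dict(columns)

unstacked = unstack(stacked)
print(unstacked)  # {'A': [12, 9, 14], 'B': [15, 11]}
```

Note that the unstacked columns can have different lengths, and that the stacked form extends more naturally when further variables are recorded for each individual.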

Organizing Categorical Data

Frequency Tables

A frequency table displays each distinct outcome and its frequency (number of times observed). It helps organize raw data to reveal patterns.

Relative Frequency Tables

A relative frequency table shows the percentage of values in each category, making it easier to compare groups of different sizes.
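A short sketch of both kinds of table, assuming a small made-up list of eye colors as the raw data:

```python
from collections import Counter

# Hypothetical raw data: eye color recorded for ten people.
eye_colors = ["brown", "blue", "brown", "green", "brown",
              "blue", "brown", "hazel", "blue", "brown"]

# Frequency table: each distinct outcome and how many times it was observed.
frequency = Counter(eye_colors)

# Relative frequency table: each count as a percentage of all observations.
total = sum(frequency.values())
relative_frequency = {
    color: 100 * count / total for color, count in frequency.items()
}

# Print both tables side by side, most common category first.
for color, count in frequency.most_common():
    print(f"{color:<6} {count:>3} {relative_frequency[color]:>6.1f}%")
```

The relative frequencies always sum to 100%, which is what makes them comparable across groups of different sizes.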

Two-Way Tables

A two-way table displays the relationship between two categorical variables, showing how many times each combination of categories occurs.

Class     Alive   Dead   Total
First       203    122     325
Second      118    167     285
Third       178    528     706
Crew        212    673     885
Total       711   1490    2201

Additional info: This table summarizes survival on the Titanic by passenger class, with the crew listed as a separate group.
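Because the groups differ greatly in size, the counts are easier to compare once converted to percentages. The sketch below does this for the table above; the counts are copied from the table, while the function and variable names are illustrative.

```python
# Counts copied from the two-way table: (alive, dead) for each group.
titanic = {
    "First": (203, 122),
    "Second": (118, 167),
    "Third": (178, 528),
    "Crew": (212, 673),
}

def survival_percentage(alive, dead):
    """Percentage of a group that survived."""
    return 100 * alive / (alive + dead)

for group, (alive, dead) in titanic.items():
    print(f"{group:<7} {survival_percentage(alive, dead):5.1f}% survived")

# Marginal totals, recomputed as a check against the Total row.
total_alive = sum(alive for alive, _ in titanic.values())
total_dead = sum(dead for _, dead in titanic.values())
print(total_alive, total_dead, total_alive + total_dead)  # 711 1490 2201
```

First class has the highest survival percentage (about 62%) even though the crew has the largest count of survivors (212), which is why percentages rather than counts are the right basis for comparison here.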

Percentages and Rates

  • Percentages are useful for comparing groups of different sizes.

  • Rates are often reported as "number of events per 1,000 objects" or similar units.

Example: If 65% of 400 students carry calculators, then 0.65 × 400 = 260 students carry calculators.
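Both calculations can be written as small helper functions. The function names and the second example (30 events among 12,000 objects) are hypothetical; the first call reproduces the calculator example.

```python
def count_from_percentage(percent, total):
    """Number of items that a given percentage of a total represents."""
    return percent * total / 100

def rate_per(events, objects, per=1000):
    """Number of events per `per` objects (per 1,000 by default)."""
    return events * per / objects

# The calculator example: 65% of 400 students.
print(count_from_percentage(65, 400))  # 260.0

# Hypothetical rate: 30 events observed among 12,000 objects.
print(rate_per(30, 12_000))  # 2.5
```

Changing the `per` argument (for example to 100 or 100,000) changes only the unit in which the rate is reported, not the comparison between groups.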

Collecting Data to Understand Causality

Causality

  • Treatment Variable: Whether a specific treatment is used.

  • Outcome (Response) Variable: Whether a certain outcome is observed.

The goal is to determine whether the treatment variable causes a change in the outcome variable.

Groups in Experiments

  • Treatment Group: Receives the treatment.

  • Control Group: Does not receive the treatment.

Controlled Experiments vs. Observational Studies

  • Controlled Experiment: Researchers assign subjects to groups and control conditions. Only controlled experiments can establish cause-and-effect relationships.

  • Observational Study: Researchers observe groups that already exist without assigning subjects. Cannot establish causality.

Random Assignment

  • Participants are randomly assigned to treatment or control groups using methods such as coin flips or computer randomization.
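Computer randomization might look like the following sketch. The participant IDs, the function name `randomly_assign`, and the even split into two groups are assumptions made for illustration.

```python
import random

def randomly_assign(participants, seed=None):
    """Shuffle the participants and split them into two groups."""
    rng = random.Random(seed)      # a fixed seed makes the assignment reproducible
    shuffled = list(participants)  # copy so the original order is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participant IDs 1 through 20.
treatment, control = randomly_assign(range(1, 21), seed=1)
print("Treatment:", sorted(treatment))
print("Control:  ", sorted(control))
```

Because chance alone decides who lands in each group, neither the researchers nor the participants can influence the assignment.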

Designing Controlled Experiments

  • A large sample size is needed so that the groups reflect the natural variability among subjects.

  • Random assignment minimizes bias.

  • All conditions should be as similar as possible except for the treatment.

Example: Effects of Light on Mice

Group                      Tumors   No Tumors
LD (12h light/12h dark)         4          46
LL (24h light)                 14          36

Conclusion: Random assignment allows us to infer causality. Tumors developed in 14 of 50 mice (28%) in the LL group but in only 4 of 50 mice (8%) in the LD group, suggesting that constant light exposure may cause tumors.
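The comparison rests on the percentage of mice with tumors in each group, which can be checked directly from the counts in the table. The variable names are illustrative.

```python
# Counts from the table: (tumors, no tumors) for each lighting group.
mice = {"LD": (4, 46), "LL": (14, 36)}

# Percentage of mice in each group that developed tumors.
tumor_percent = {
    group: 100 * tumors / (tumors + no_tumors)
    for group, (tumors, no_tumors) in mice.items()
}
print(tumor_percent)  # {'LD': 8.0, 'LL': 28.0}
```

Both groups contain 50 mice, so counts and percentages tell the same story here; the percentages would matter more if the group sizes differed.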

Bias and Confounding Variables

  • Bias: Occurs when assignments are not random, leading to unrepresentative groups.

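  • Confounding Variable: A characteristic other than the treatment that differs between the groups and also affects the outcome, so its effect cannot be separated from the effect of the treatment.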

Example: Assigning the heaviest people to the exercise group introduces bias.

Association is Not Causation

  • Just because two variables are associated does not mean one causes the other.

  • Confounding variables may explain the association.

Example: Children with larger shoe sizes tend to have higher vocabulary scores, but age is a confounding variable: older children have both larger feet and larger vocabularies.

Anecdotal Evidence

  • Anecdotes are personal stories and are not reliable for scientific conclusions.

Placebo Effect

  • A placebo is a fake treatment.

  • The placebo effect occurs when participants respond to a placebo because they believe they are receiving a real treatment.

Blind and Double-Blind Studies

  • Blind Study: Participants do not know their group assignment.

  • Double-Blind Study: Neither participants nor researchers know group assignments.

Gold Standard for Experiments

  • Large sample size

  • Controlled and randomized assignment

  • Double-blind design

  • Use of placebo (if appropriate)

Example: Medical Study Comparing Treatments

Treatment       Remission   No Remission
Combination            78             59
Inflix Alone           94            118
Azath Alone            19             44

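Conclusion: The combination treatment was most effective, with remission in 78 of 137 patients (about 57%), compared with 94 of 212 (about 44%) for Inflix alone and 19 of 63 (about 30%) for Azath alone. Randomization and placebo use allow us to infer causality.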
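Since the three treatment groups have different sizes, the comparison should use remission percentages rather than raw counts. A minimal sketch using the counts from the table (variable names are illustrative):

```python
# Counts from the table: (remission, no remission) for each treatment.
results = {
    "Combination": (78, 59),
    "Inflix Alone": (94, 118),
    "Azath Alone": (19, 44),
}

# Percentage of patients in remission within each treatment group.
remission_percent = {
    treatment: 100 * remission / (remission + no_remission)
    for treatment, (remission, no_remission) in results.items()
}

for treatment, percent in remission_percent.items():
    print(f"{treatment:<13} {percent:5.1f}%")

# The treatment with the highest remission percentage.
best = max(remission_percent, key=remission_percent.get)
print(best)  # Combination
```

Inflix alone has the largest count of remissions (94) but a lower remission percentage than the combination, because its group is much larger.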

Key Terms and Formulas

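  • Rate: $\text{rate} = \dfrac{\text{number of events}}{\text{number of objects}} \times 1{,}000$, reported as events per 1,000 objects (other bases, such as 100 or 100,000, are also used)

  • Percentage: $\text{percentage} = \dfrac{\text{part}}{\text{whole}} \times 100\%$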

Summary Table: Types of Variables

Type                        Description                      Examples
Numerical (Quantitative)    Measures quantities              Age, Height, Number of calories
Categorical (Qualitative)   Describes categories or groups   Gender, Type of meat, Eye color

Summary Table: Study Types

Type                    Description                                        Can Infer Causality?
Controlled Experiment   Researchers assign groups and control conditions   Yes
Observational Study     Groups exist naturally; no assignment              No
