Introduction to Data and Statistical Thinking

Introduction to Statistics and Data

What is Statistics?

Statistics is the science of collecting, organizing, summarizing, and analyzing data to answer questions and draw conclusions. It is essential for making informed decisions and understanding patterns in the world around us.

  • Variation: Refers to differences or changes in items or observations.

  • Data: Observations gathered to draw conclusions, which can include measurements (e.g., weight, height), counts, or categories.

The statistical process: Formulating Questions, Collecting Data, Organizing and Summarizing Data, Making Conclusions

Key Steps in Statistical Investigation:

  • Formulating Questions

  • Collecting Data

  • Organizing and Summarizing Data

  • Making Conclusions

Classifying and Storing Data

Populations and Samples

A population is the entire group of individuals or objects of interest, while a sample is a subset of the population used to make inferences about the whole.

  • Population: The complete set being studied.

  • Sample: A subset of the population that is actually observed; ideally it is representative of the population.

To interpret data meaningfully, context is crucial. The "Six W's" help provide this context:

  • Who: Individuals measured

  • What: Variables measured

  • When: Time of data collection

  • Where: Location of data collection

  • Why: Purpose of the study

  • How: Method of data collection

Types of Variables

  • Numerical (Quantitative) Variables: Describe quantities and are measured with units (e.g., age, weight).

  • Categorical (Qualitative) Variables: Describe qualities or categories (e.g., gender, color).

Examples:

  • Age (Numerical)

  • Gender (Categorical)

  • GPA (Numerical)

  • Major (Categorical)

Storing Data: Stacked vs. Unstacked

Data can be stored in different formats:

  • Stacked Data: Each row contains all variables for one individual; columns represent variables.

Example of stacked data in a spreadsheet format

  • Unstacked Data: Each column represents a variable for a different group; typically used for comparing two groups.

Example of unstacked data with separate columns for men and women
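The difference between the two layouts can be made concrete with a short sketch. The data values and the function name `unstack` below are hypothetical, chosen only for illustration; the sketch assumes each stacked row is a (group, value) pair.

```python
from collections import defaultdict

# Hypothetical stacked data: one row per individual, columns are (group, value).
stacked = [
    ("Men", 72),
    ("Women", 65),
    ("Men", 70),
    ("Women", 63),
]

def unstack(rows):
    """Collect the values for each group into that group's own column."""
    columns = defaultdict(list)
    for group, value in rows:
        columns[group].append(value)
    return dict(columns)

unstacked = unstack(stacked)
print(unstacked)  # {'Men': [72, 70], 'Women': [65, 63]}
```

The stacked form keeps the link between each value and its individual, while the unstacked form is only convenient for side-by-side group comparisons.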

Organizing Categorical Data

Frequency and Relative Frequency Tables

Organizing data helps reveal patterns and makes interpretation easier.

  • Frequency Table: Lists each category and the number of occurrences (counts).

Frequency table showing counts by class on the Titanic

  • Relative Frequency Table: Lists each category and the proportion or percentage of all observations that fall in it (count divided by total).

Relative frequency table showing percentages by class on the Titanic
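As a rough illustration, relative frequencies can be computed from a frequency table by dividing each count by the total. The sketch below uses the Titanic class counts listed in the summary table at the end of this guide; the variable names are assumptions made for the example.

```python
# Frequency table: counts by class on the Titanic (from the summary table).
counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}

total = sum(counts.values())

# Relative frequency table: each count as a percentage of the total,
# rounded to two decimal places.
relative = {cls: round(100 * n / total, 2) for cls, n in counts.items()}

for cls in counts:
    print(f"{cls:<8}{counts[cls]:>6}{relative[cls]:>8.2f}")
```

The percentages should sum to roughly 100; small differences come from rounding.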

Two-Way Tables

Two-way tables display the relationship between two categorical variables, showing how individuals are distributed across combinations of categories.

  • Each cell shows the count for a combination of values.

Two-way table showing survival by class on the Titanic
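One way to build a two-way table is to count how often each pair of category values occurs. The records below are made up for illustration and are not the actual Titanic figures; `two_way_table` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical records: one (class, outcome) pair per individual.
records = [
    ("First", "Survived"),
    ("First", "Died"),
    ("Third", "Died"),
    ("Third", "Died"),
    ("Third", "Survived"),
    ("First", "Survived"),
]

def two_way_table(pairs):
    """Count how many individuals fall in each combination of categories."""
    return Counter(pairs)

table = two_way_table(records)

# Print one row per class, one column per outcome.
classes = sorted({c for c, _ in records})
outcomes = sorted({o for _, o in records})
print("Class", outcomes)
for c in classes:
    print(c, [table[(c, o)] for o in outcomes])
```

Each printed number corresponds to one cell of the table, and the cell counts add up to the number of individuals.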

Rates and Percentages

Why Use Rates?

Rates (such as percentages or counts per 1,000) allow for fair comparisons between groups of different sizes. For example, comparing the number of injuries per 1,000 players in different sports gives a better sense of risk than raw counts.

  • Rate Formula: $\text{rate} = \dfrac{\text{count}}{\text{group size}} \times k$ (where $k$ is a scaling factor, e.g., 1,000 or 10,000)
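A minimal sketch of a rate calculation, assuming the usual definition of a rate as a count divided by the group size and multiplied by a scaling factor. The function name `rate_per` and the injury figures are hypothetical.

```python
def rate_per(count, group_size, scale=1000):
    """Return the number of occurrences per `scale` members of the group."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return count * scale / group_size

# Hypothetical figures: sport A has more injuries in total,
# but sport B has the higher injury rate.
rate_a = rate_per(50, 2000)  # 50 injuries among 2,000 players
rate_b = rate_per(30, 500)   # 30 injuries among 500 players

print(rate_a, rate_b)  # 25.0 60.0
```

Comparing raw counts (50 versus 30) would point the wrong way here, which is why the rate is the fairer measure of risk.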

Collecting Data to Understand Causality

Causality and Study Design

To determine if one variable causes changes in another, researchers use different study designs:

  • Treatment Variable: The variable manipulated by the researcher.

  • Outcome (Response) Variable: The variable measured for change.

  • Treatment Group: Receives the treatment.

  • Control Group: Does not receive the treatment.

Types of Studies:

  • Controlled Experiment: Participants are randomly assigned to groups; only this design can establish causation.

  • Observational Study: Groups are pre-existing; can show association but not causation.

Random Assignment and Bias

  • Random Assignment: Makes the groups similar on average in everything except the treatment, so differences in the outcome are less likely to be due to bias.

  • Bias: Systematic error due to non-random assignment or other flaws in study design.
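Random assignment can be sketched as shuffling the participants and splitting the shuffled list in half. This is one simple scheme among several; the function name `randomly_assign` and the `seed` parameter are assumptions introduced for the example.

```python
import random

def randomly_assign(subjects, seed=None):
    """Shuffle the subjects and split them into treatment and control groups."""
    rng = random.Random(seed)   # a seed makes the assignment reproducible
    shuffled = list(subjects)   # copy so the caller's data is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participant IDs 0 through 9.
treatment, control = randomly_assign(range(10), seed=1)
print("Treatment:", treatment)
print("Control:  ", control)
```

Because chance alone decides who lands in which group, neither the researcher nor the participants can steer the assignment.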

Confounding Variables

A confounding variable is an outside factor that is related to both the treatment and the outcome, making it difficult to tell whether the treatment itself caused the change in the outcome.

Placebo Effect and Blinding

  • Placebo: A fake treatment used to control for psychological effects.

  • Single-Blind Study: Participants do not know their group assignment.

  • Double-Blind Study: Neither participants nor researchers know group assignments, reducing bias.

Gold Standard for Experiments

  • Large sample size

  • Randomized and controlled

  • Double-blind

  • Use of placebo (if appropriate)

Example: Effects of Light on Mice

One hundred mice were randomly assigned to two groups of 50: 12 hours light/12 hours dark (LD) and 24 hours light (LL). The number of mice developing tumors in each group was recorded.

Outcome      LD    LL
Tumors        4    14
No tumors    46    36

Table showing tumor development in mice under different light conditions

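This is a controlled experiment because the researchers assigned the mice to the light conditions at random. The tumor rate was higher in the LL group (14 of 50, or 28%) than in the LD group (4 of 50, or 8%), which suggests a possible causal effect, but further studies are needed to confirm it.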
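The comparison of the two groups can be checked with a short calculation. The sketch below takes the counts from the table above and uses each column total as the group size; the variable names are assumptions.

```python
# Counts from the mice table: (tumors, no tumors) for each light condition.
mice = {"LD": (4, 46), "LL": (14, 36)}

tumor_rate = {}
for group, (tumors, no_tumors) in mice.items():
    group_size = tumors + no_tumors
    tumor_rate[group] = 100 * tumors / group_size  # percentage with tumors

print(tumor_rate)  # {'LD': 8.0, 'LL': 28.0}
```

Because both groups are the same size here, the raw counts and the rates tell the same story; with unequal group sizes only the rates would be comparable.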

Association vs. Causation

Association between two variables does not imply that one causes the other. Confounding variables or coincidence may explain the relationship.

  • Anecdotal Evidence: Personal stories are not reliable for establishing scientific conclusions.

Summary Table: Types of Variables

Variable    Type
Age         Numerical
Gender      Categorical
GPA         Numerical
Major       Categorical

Summary Table: Frequency and Relative Frequency

Class     Count    %
First       325    14.77
Second      285    12.95
Third       706    32.08
Crew        885    40.21
