Foundations of Statistics: Data, Distributions, and the Normal Model
Study Guide - Smart Notes
Introduction to Statistics: The Five "W's" and One "H"
Understanding the Context of Data (Ch. 1 & 2)
Statistics is the science of collecting, analyzing, and interpreting data. To make sense of data, it is essential to understand the context in which it was collected. The Five "W's" and One "H" provide a framework for this:
Who: The subjects or cases about whom data is collected.
What: The variables measured or recorded.
When: The time at which the data was collected.
Where: The location or setting of the data collection.
Why: The purpose or motivation for collecting the data.
How: The method or process used to collect the data.
Example: In a survey of college students' study habits, "Who" refers to the students, "What" could be hours spent studying, "When" is the semester, "Where" is the university, "Why" is to improve academic support, and "How" is via an online questionnaire.
Types of Variables
Categorical Variables: Variables that place individuals into groups or categories (e.g., gender, major).
Quantitative Variables: Variables that take numerical values and for which arithmetic operations make sense (e.g., height, test scores).
Identifier Variables: Unique identifiers for each case (e.g., student ID).
Additional info: Categorical variables can be further classified as nominal (no natural order) or ordinal (ordered categories).
Displaying and Describing Categorical Data
Frequency Tables: Show counts for each category.
Relative Frequency Tables: Show proportions or percentages for each category.
Bar Charts: Visualize the distribution of categorical variables using bars for each category.
Pie Charts: Show the proportion of each category as a slice of a circle.
Example: A bar chart showing the number of students in each major.
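As a rough illustration of the frequency table, relative frequency table, and bar chart described above, the following sketch tallies a small made-up list of majors. The data and the function names `frequency_table` and `relative_frequency_table` are assumptions for this example, not part of the course materials.

```python
from collections import Counter

# Hypothetical sample: declared majors for ten students (made-up data).
majors = ["Biology", "Math", "Biology", "History", "Math",
          "Biology", "History", "Biology", "Math", "Biology"]

def frequency_table(values):
    """Return {category: count}."""
    return dict(Counter(values))

def relative_frequency_table(values):
    """Return {category: proportion of all cases}."""
    category_counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in category_counts.items()}

counts = frequency_table(majors)
proportions = relative_frequency_table(majors)

# A text-mode bar chart: one '#' per student in the category.
for category, count in sorted(counts.items()):
    print(f"{category:<8} {'#' * count} ({proportions[category]:.0%})")
```

The proportions always sum to 1, which is a quick sanity check on any relative frequency table.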
Contingency Tables
Used to display the relationship between two categorical variables. Each cell shows the count for a combination of categories.
| | Category A | Category B |
|---|---|---|
| Group 1 | Count 1 | Count 2 |
| Group 2 | Count 3 | Count 4 |
Conditional Distributions: The distribution of one variable for a fixed value of the other variable.
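A minimal sketch of how a contingency table and a conditional distribution might be computed from raw cases. The (class year, study location) pairs are invented for illustration, and the helper names are hypothetical.

```python
from collections import Counter

# Hypothetical cases: (class year, study location) pairs, made up for illustration.
cases = [
    ("First-year", "Library"), ("First-year", "Library"), ("First-year", "Dorm"),
    ("Senior", "Library"), ("Senior", "Dorm"), ("Senior", "Dorm"),
    ("Senior", "Dorm"), ("First-year", "Library"),
]

def contingency_table(pairs):
    """Count how often each (row category, column category) pair occurs."""
    return Counter(pairs)

def conditional_distribution(pairs, row_value):
    """Distribution of the column variable among cases whose row variable equals row_value."""
    column_counts = Counter(col for row, col in pairs if row == row_value)
    total = sum(column_counts.values())
    if total == 0:
        raise ValueError(f"no cases with row value {row_value!r}")
    return {col: count / total for col, count in column_counts.items()}

table = contingency_table(cases)
first_year = conditional_distribution(cases, "First-year")
senior = conditional_distribution(cases, "Senior")

print(first_year)  # study location among first-years only
print(senior)      # study location among seniors only
```

When the conditional distributions differ between rows, as they do in this made-up data, the two variables are associated.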
Displaying and Summarizing Quantitative Data (Ch. 3)
Visualizing Quantitative Data
Histograms: Show the distribution of quantitative data by grouping values into bins.
Stem-and-Leaf Plots: Preserve individual data values while showing distribution shape.
Dotplots: Simple way to display small data sets.
Example: A histogram of exam scores for a class.
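The binning behind a histogram and the stem/leaf split can be sketched in a few lines. The exam scores below are made up, and the bin width of 10 starting at 60 is an arbitrary choice for this example.

```python
from collections import defaultdict

# Hypothetical exam scores (made-up data).
scores = [62, 67, 71, 74, 75, 78, 81, 83, 85, 85, 88, 92, 95]

def histogram_counts(values, bin_width, start):
    """Count values per bin; each bin is keyed by its left edge and excludes its right edge."""
    bin_counts = defaultdict(int)
    for v in values:
        bin_index = (v - start) // bin_width
        bin_counts[start + bin_index * bin_width] += 1
    return dict(sorted(bin_counts.items()))

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem); the ones digits are the leaves."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)

bins = histogram_counts(scores, bin_width=10, start=60)
for left_edge, count in bins.items():
    print(f"{left_edge}-{left_edge + 9}: {'#' * count}")

for stem, leaves in stem_and_leaf(scores).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Both displays show the same shape, but only the stem-and-leaf plot lets you read the individual scores back out.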
Describing Distributions
Shape: Symmetric, skewed, unimodal, bimodal, etc.
Center: Mean (average), median (middle value).
Spread: Range, interquartile range (IQR), standard deviation (SD).
Outliers: Unusual values that do not fit the general pattern.
Summary Statistics
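Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, the sum of the values divided by the number of values $n$.
Median: The middle value when data are ordered.
Mode: The most frequently occurring value.
Range: $\text{Range} = \max - \min$.
Interquartile Range (IQR): $\text{IQR} = Q_3 - Q_1$, the spread of the middle 50% of the data.
Standard Deviation (SD): $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$.
Five-Number Summary: Minimum, $Q_1$, Median, $Q_3$, Maximum.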
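The summary statistics listed above can be computed with Python's standard `statistics` module. This is a sketch on a made-up data set. Quartile conventions differ between textbooks and software, so the `method="inclusive"` setting is an assumption and may not match the course's rule exactly.

```python
import statistics

# Hypothetical data set (made-up values).
data = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
data_range = max(data) - min(data)
sd = statistics.stdev(data)  # sample SD: divides by n - 1

# Quartiles: the "inclusive" method is one common convention (an assumption here).
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

five_number_summary = (min(data), q1, median, q3, max(data))

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", data_range, "IQR:", iqr, "SD:", round(sd, 3))
print("five-number summary:", five_number_summary)
```

Here the mean (6) sits above the median (5), a hint that the larger values pull the distribution slightly to the right.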
Boxplots
Visualize the five-number summary.
Show center, spread, and potential outliers.
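Boxplots usually flag potential outliers with a fence rule. The notes do not state which rule the course uses, so the sketch below assumes the common convention of fences at 1.5 IQRs beyond the quartiles. The data, the function names, and the quartile method are all illustrative assumptions.

```python
import statistics

def boxplot_fences(values, k=1.5):
    """Return (lower fence, upper fence) under the assumed k * IQR convention."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    low, high = boxplot_fences(values, k)
    return [v for v in values if v < low or v > high]

# Hypothetical data with one unusually large value.
data = [2, 4, 4, 5, 7, 9, 30]

print(boxplot_fences(data))
print(flag_outliers(data))
```

A flagged value is only a candidate outlier; whether it is an error or a genuine extreme case depends on the context of the data.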
Understanding and Comparing Distributions (Ch. 4)
Comparing Groups
Use summary statistics and visualizations (side-by-side boxplots, histograms) to compare distributions.
Consider shape, center, spread, and outliers for each group.
Transforming Data
Sometimes, data are skewed and require transformation (e.g., log transformation) to make distributions more symmetric.
Log transformation: $y' = \log(y)$, defined only for positive values ($y > 0$).
Example: Taking the log of income data to reduce skewness.
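To see how taking logs can reduce skewness, the sketch below compares the gap between mean and median, measured in SD units, before and after the transformation. The incomes are made up and deliberately double at each step, so their logs are evenly spaced. The gap measure is only a rough indicator chosen for this example, not a standard course statistic.

```python
import math
import statistics

# Hypothetical right-skewed incomes in dollars (made-up; each value doubles the last).
incomes = [10_000, 20_000, 40_000, 80_000, 160_000, 320_000, 640_000]

log_incomes = [math.log10(x) for x in incomes]

def mean_median_gap(values):
    """Rough skewness indicator: (mean - median) / SD; close to 0 for symmetric data."""
    return (statistics.mean(values) - statistics.median(values)) / statistics.stdev(values)

raw_gap = mean_median_gap(incomes)
log_gap = mean_median_gap(log_incomes)

print(f"raw incomes: gap = {raw_gap:.3f}")
print(f"log incomes: gap = {log_gap:.3f}")
```

On the raw scale the mean is pulled well above the median by the largest incomes; on the log scale the two essentially coincide.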
The Standard Deviation as a Ruler and the Normal Model (Ch. 5)
Standardizing Data: z-scores
Standardizing allows comparison of values measured on different scales by converting them to a common scale.
z-score: $z = \frac{x - \bar{x}}{s}$, where $\bar{x}$ is the mean and $s$ is the standard deviation.
Indicates how many standard deviations a value is from the mean.
Example: A test score of 85 with a mean of 75 and SD of 5 has $z = \frac{85 - 75}{5} = 2$, so the score is 2 standard deviations above the mean.
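A small sketch of standardizing, using the test-score example above plus a made-up comparison of two exams graded on different scales. The function name `z_score` and the exam figures are assumptions for illustration.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (value - mean) / sd

# The test-score example: score 85, mean 75, SD 5.
z_test = z_score(85, mean=75, sd=5)

# Hypothetical comparison: which result is more impressive relative to its own exam?
z_exam_a = z_score(1300, mean=1050, sd=200)  # exam A, scored in the hundreds
z_exam_b = z_score(28, mean=21, sd=5)        # exam B, scored in the tens

print(z_test, z_exam_a, z_exam_b)
```

Although the raw scores are not comparable, the z-scores are: under these made-up figures the exam B result is slightly further above its mean than the exam A result.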
The Normal Model
The normal distribution is symmetric and bell-shaped, described by its mean ($\mu$) and standard deviation ($\sigma$).
Empirical Rule: Approximately 68% of the data fall within 1 SD of the mean, 95% within 2 SDs, and 99.7% within 3 SDs.
Standard Normal Table: Used to find percentiles and probabilities for normal distributions.
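The Empirical Rule percentages and standard normal table lookups can be checked numerically with `statistics.NormalDist` from Python's standard library (Python 3.8 or later). This is a sketch; the mean of 75 and SD of 5 reuse the test-score figures purely as an illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def proportion_within(k):
    """Proportion of a normal model lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1, within_2, within_3 = (proportion_within(k) for k in (1, 2, 3))
print(f"{within_1:.4f} {within_2:.4f} {within_3:.4f}")

# Table lookup: proportion of scores below 85 in a normal model with mean 75 and SD 5.
# This is the same as looking up z = 2 in a standard normal table.
percentile_of_85 = NormalDist(mu=75, sigma=5).cdf(85)

# Reverse lookup: the z-score that cuts off the lowest 90% of the model.
z_90 = standard_normal.inv_cdf(0.90)

print(f"{percentile_of_85:.4f} {z_90:.4f}")
```

The computed proportions are close to, but not exactly, 68%, 95%, and 99.7%, which is why the rule is stated as approximate.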
Rescaling and Shifting Data
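Adding a constant to every value shifts the mean (and median) by that constant but does not change the spread (range, IQR, SD).
Multiplying every value by a constant multiplies the mean by that constant and the spread by its absolute value.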
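The effect of shifting and rescaling on center and spread can be verified directly. The temperatures below are made up; the Celsius-to-Fahrenheit conversion is used because it combines a multiplication (by 9/5) with a shift (of 32).

```python
import statistics

# Hypothetical temperatures in degrees Celsius (made-up data).
celsius = [10.0, 15.0, 20.0, 25.0, 30.0]

shifted = [c + 5 for c in celsius]             # add a constant
scaled = [c * 2 for c in celsius]              # multiply by a constant
fahrenheit = [c * 9 / 5 + 32 for c in celsius] # rescale, then shift

for label, values in [("celsius", celsius), ("shifted", shifted),
                      ("scaled", scaled), ("fahrenheit", fahrenheit)]:
    print(f"{label:<10} mean = {statistics.mean(values):6.2f}  "
          f"SD = {statistics.stdev(values):6.3f}")
```

The shifted data keep the original SD, the doubled data have twice the SD, and the Fahrenheit SD is 9/5 of the Celsius SD because the added 32 has no effect on spread.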
Scatterplots, Association, and Correlation (Ch. 6)
Scatterplots
Scatterplots are used to display the relationship between two quantitative variables.
Direction: Positive or negative association.
Form: Linear or nonlinear.
Strength: How closely the points follow a clear form.
Outliers: Points that do not fit the general pattern.
Correlation
Correlation coefficient (r): Measures the strength and direction of a linear relationship.
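Range: $-1 \le r \le 1$.
Properties: Symmetric (the correlation of x with y equals the correlation of y with x), unitless (unchanged when either variable is shifted or rescaled by a positive constant), and sensitive to outliers.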
Example: Height and weight of individuals often show a positive correlation.
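One way to compute r is as the sum of products of z-scores divided by n - 1. The sketch below uses that form on made-up heights and weights and checks two of the listed properties. The data and the function name `correlation` are assumptions for this example.

```python
import statistics

def correlation(xs, ys):
    """Pearson r: sum of products of paired z-scores, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    n = len(xs)
    products = (((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Hypothetical heights (cm) and weights (kg) for five people (made-up data).
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 88]

r = correlation(heights_cm, weights_kg)

# Symmetric: swapping the roles of the two variables gives the same r.
r_swapped = correlation(weights_kg, heights_cm)

# Unitless: converting heights from centimeters to inches leaves r unchanged.
heights_in = [h / 2.54 for h in heights_cm]
r_inches = correlation(heights_in, weights_kg)

print(round(r, 4), round(r_swapped, 4), round(r_inches, 4))
```

In this made-up sample r is close to +1, reflecting a strong positive linear association between height and weight.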
Additional info: Some content from Chapter 9 (Lecture 6) is present in the source materials but is faint and incomplete. It appears to introduce sampling, populations, and possibly probability models, which are foundational for later chapters on probability and inference.