Essential Study Notes for Introductory Statistics: Data, Distributions, and Probability
Chapter 1: Data Collection
Sources of Bias in Sampling
Understanding bias is crucial for collecting reliable data. Bias occurs when a sample does not accurately represent the population, leading to systematic errors in results.
Selection Bias: Occurs when certain groups are systematically excluded from the sample.
Nonresponse Bias: Results when individuals selected for the sample do not respond, and their nonresponses are related to the variable of interest.
Response Bias: Arises from inaccurate or untruthful responses, often due to poorly worded questions or social desirability.
Sampling Method Bias: Can occur if the sampling method (e.g., convenience sampling) does not produce a random sample.
Example: Surveying only morning class students about campus dining preferences may exclude students with different schedules, introducing selection bias.
Chapter 2: Organizing and Summarizing Data
Reading and Understanding a Histogram
A histogram is a graphical representation of the distribution of numerical data, where data is grouped into bins or intervals.
Each bar's height represents the frequency or relative frequency of data within that interval.
Histograms are useful for visualizing the shape, center, and spread of data.
Example: A histogram of exam scores can reveal whether most students scored in the middle range or if scores are spread out.
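The binning step behind a histogram can be sketched in a few lines. The scores and the bin width of 10 below are hypothetical values chosen for illustration; they do not come from these notes.

```python
from collections import Counter

# Hypothetical exam scores, invented for illustration only.
scores = [52, 58, 61, 64, 67, 70, 71, 73, 75, 76, 78, 79, 82, 85, 88, 93]
bin_width = 10  # assumed bin width

# Map each score to the lower edge of its bin, e.g. 67 -> 60.
counts = Counter((s // bin_width) * bin_width for s in scores)

# Print one text "bar" per bin with its frequency and relative frequency.
for lower in sorted(counts):
    freq = counts[lower]
    rel = freq / len(scores)
    print(f"{lower}-{lower + bin_width - 1}: {'#' * freq} ({freq}, {rel:.2f})")
```

With these made-up scores the tallest bar is the 70-79 bin, which is the "most students scored in the middle range" pattern described in the example.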
Identifying the Shape of a Distribution
Symmetric: Both sides of the histogram are approximately mirror images.
Skewed Right (Positive Skew): The right tail is longer; most data is concentrated on the left.
Skewed Left (Negative Skew): The left tail is longer; most data is concentrated on the right.
Uniform: All intervals have roughly the same frequency.
Misleading Graphs and How to Fix Them
Non-zero Baselines: Starting the y-axis above zero can exaggerate differences.
Inconsistent Intervals: Unequal bin widths distort the data's appearance.
3D Effects: Can make it difficult to interpret values accurately.
Fix: Use consistent scales, start axes at zero, and avoid unnecessary visual effects.
Example: A bar graph with a y-axis starting at 50 instead of 0 may make small differences appear large.
Chapter 3: Numerically Summarizing Data
Comparing and Contrasting Normal Curves
Normal curves are bell-shaped and symmetric. They are defined by their mean ($\mu$) and standard deviation ($\sigma$).
Larger Mean: Shifts the curve to the right along the horizontal axis without changing its shape.
Larger Standard Deviation: Makes the curve wider and flatter.
Example: SAT scores with a higher mean indicate better overall performance; a larger standard deviation indicates more variability among students.
Using the Empirical Rule to Calculate Probabilities
The Empirical Rule applies to normal distributions:
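About 68% of data falls within one standard deviation of the mean ($\mu \pm 1\sigma$).
About 95% within two standard deviations ($\mu \pm 2\sigma$).
About 99.7% within three standard deviations ($\mu \pm 3\sigma$).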
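Formula: $\mu \pm k\sigma$
where $\mu$ is the mean, $\sigma$ is the standard deviation, and $k = 1, 2, 3$ is the number of standard deviations from the mean.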
Example: If test scores are normally distributed with mean $\mu = 70$ and standard deviation $\sigma = 10$, about 95% of scores are between 50 and 90.
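As a rough check on the Empirical Rule, the sketch below computes the three intervals and their exact normal probabilities. It assumes a mean of 70 and a standard deviation of 10, values consistent with the 50-to-90 interval in the example, and uses `NormalDist` from Python's standard `statistics` module.

```python
from statistics import NormalDist

# Assumed parameters, consistent with the 50-to-90 example above.
mu, sigma = 70, 10
dist = NormalDist(mu, sigma)

results = {}
for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    # Probability that a score lands between low and high.
    prob = dist.cdf(high) - dist.cdf(low)
    results[k] = (low, high, prob)
    print(f"within {k} standard deviation(s): {low} to {high}, probability {prob:.4f}")
```

The exact probabilities come out slightly different from 68%, 95%, and 99.7%, which is why the rule is stated with "about".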
Five Number Summary and Boxplots
The five number summary consists of:
Minimum
First Quartile ($Q_1$)
Median
Third Quartile ($Q_3$)
Maximum
Boxplots visually display the five number summary and help compare distributions.
Calculating Outliers Using the 1.5 IQR Rule
Interquartile Range (IQR): $\text{IQR} = Q_3 - Q_1$
Outlier Boundaries:
$Q_1 - 1.5 \times \text{IQR}$ (lower bound), $Q_3 + 1.5 \times \text{IQR}$ (upper bound)
Values outside these bounds are considered outliers.
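The 1.5 IQR rule can be sketched as follows, taking the IQR as the third quartile minus the first quartile. The data set and its quartiles (first quartile 20, third quartile 40) are hypothetical, and `outlier_fences` is an illustrative helper name rather than a library function.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) outlier boundaries from the 1.5 IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical data whose first and third quartiles are 20 and 40.
data = [12, 20, 25, 30, 35, 40, 95]
q1, q3 = 20, 40

lower, upper = outlier_fences(q1, q3)
outliers = [x for x in data if x < lower or x > upper]

print(lower, upper)  # boundaries
print(outliers)      # values flagged as outliers
```

Here the IQR is 20, so the boundaries are -10 and 70, and only 95 falls outside them.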
Percentiles and Quartiles
Percentile: The value below which a given percentage of observations falls.
Quartiles: Divide data into four equal parts: $Q_1$ (25th percentile), Median (50th), $Q_3$ (75th).
Example: If a score is at the 90th percentile, it is higher than 90% of the data.
Comparing and Contrasting Boxplots
Boxplots show the spread, center, and outliers of data.
Comparing boxplots helps identify differences in medians, variability, and skewness between groups.
Location of Median, Q1, and Q3
Median: Middle value (50th percentile).
Q1: Median of the lower half (25th percentile).
Q3: Median of the upper half (75th percentile).
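A minimal sketch of the "median of each half" approach is shown below. It assumes one common textbook convention, in which the overall median is left out of both halves when the number of observations is odd; other conventions (and some software) give slightly different quartiles. The function name `quartiles` and both data sets are illustrative.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    # For odd n, skip the middle value so it belongs to neither half.
    upper = values[half + 1:] if n % 2 else values[half:]
    return median(lower), median(values), median(upper)

print(quartiles([12, 20, 25, 30, 35, 40, 95]))     # odd number of values
print(quartiles([3, 5, 7, 9, 11, 13, 15, 17]))     # even number of values
```

For the odd-length list the halves are 12, 20, 25 and 35, 40, 95, so the quartiles are 20 and 40 around a median of 30. For the even-length list each half has four values, so each quartile is the average of two middle values.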
Chapter 4: Describing the Relation Between Two Variables
Correlation Coefficient (r) from a Scatter Plot
The correlation coefficient ($r$) measures the strength and direction of a linear relationship between two variables.
$r$ ranges from -1 (perfect negative) to +1 (perfect positive).
Values near 0 indicate weak or no linear relationship.
Example: A scatter plot with points closely following an upward line has $r$ near +1.
Coefficient of Determination ($r^2$)
$r^2$ represents the proportion of variance in the dependent variable explained by the independent variable.
Ranges from 0 to 1 (or 0% to 100%).
Higher $r^2$ indicates a stronger linear relationship.
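To make the correlation coefficient and the coefficient of determination concrete, the sketch below computes the correlation from its definition (the sum of products of deviations divided by the square root of the product of the sums of squared deviations) and then squares it. The study-hours and score values are hypothetical.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

r = correlation(hours, scores)
r_squared = r ** 2
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

Because these made-up points lie close to an upward line, the correlation is close to +1 and nearly all of the variation in scores is explained by hours studied.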
Interpreting Slope and Y-Intercept
Slope: The change in the dependent variable for a one-unit increase in the independent variable.
Y-Intercept: The predicted value when the independent variable is zero.
Example: In the equation $\hat{y} = 2x + 5$, the slope is 2 and the y-intercept is 5.
Calculating Predicted and Residual Values
Predicted Value: Substitute the independent variable into the regression equation.
Residual: The observed value minus the predicted value, $y - \hat{y}$. A positive residual means the observation lies above the regression line.
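The two calculations can be sketched with a regression line of slope 2 and y-intercept 5, as in the slope example earlier in this chapter. The observation (x = 4, observed y = 15) is hypothetical, and `predict` and `residual` are illustrative helper names.

```python
def predict(x, slope=2, intercept=5):
    """Predicted value from the regression line: slope * x + intercept."""
    return slope * x + intercept

def residual(x, observed_y):
    """Residual: observed value minus predicted value."""
    return observed_y - predict(x)

# Hypothetical observation.
x, observed_y = 4, 15

print(predict(x))               # predicted value at x = 4
print(residual(x, observed_y))  # observed minus predicted
```

The predicted value is 2(4) + 5 = 13, so the residual is 15 - 13 = 2: the observation sits 2 units above the line.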
Extrapolation
Extrapolation is predicting values outside the range of observed data, which can be unreliable.
Contingency Tables and Probability Calculations
Contingency tables display the frequency distribution of variables and are used to calculate probabilities.
| | Category A | Category B | Total |
|---|---|---|---|
| Group 1 | n11 | n12 | n1. |
| Group 2 | n21 | n22 | n2. |
| Total | n.1 | n.2 | n |
Marginal Probability: Probability of a single event (row or column total divided by grand total).
Conditional Probability: Probability of one event given another has occurred.
Example: Probability of being in Group 1 and Category A: $P(\text{Group 1 and Category A}) = \frac{n_{11}}{n}$
Chapter 5: Probability
General Addition Rule
Used to find the probability that at least one of two events occurs: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$.
General Multiplication Rule
Used to find the probability that both events occur: $P(A \text{ and } B) = P(A) \cdot P(B \mid A)$.
Conditional Probability
The probability of event A given that event B has occurred: $P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$, provided $P(B) > 0$.
Listing the Sample Space
The sample space is the set of all possible outcomes of an experiment.
Example: Flipping two coins: S = {HH, HT, TH, TT}
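The two-coin sample space can be listed programmatically, and the same outcomes can be used to check the addition and multiplication rules from this chapter. This is a sketch that assumes fair coins, so all four outcomes are equally likely; `prob` is an illustrative helper, not a library function.

```python
from fractions import Fraction
from itertools import product

# Every ordered outcome of two flips: HH, HT, TH, TT.
sample_space = ["".join(flips) for flips in product("HT", repeat=2)]

def prob(event):
    """Probability of an event (a set of outcomes), assuming equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

first_heads = {o for o in sample_space if o[0] == "H"}
second_heads = {o for o in sample_space if o[1] == "H"}

p_or = prob(first_heads | second_heads)            # at least one head
p_and = prob(first_heads & second_heads)           # both heads
p_second_given_first = p_and / prob(first_heads)   # conditional probability

print(sample_space)
print(p_or, p_and, p_second_given_first)
```

Counting directly gives 3/4 for "at least one head", which matches the addition rule: 1/2 + 1/2 - 1/4 = 3/4.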
Calculating Probabilities from a Contingency Table
Use the frequencies in the table to determine probabilities of events and combinations.
Marginal probabilities divide a row or column total by the grand total; conditional probabilities divide a cell count by the total of the row or column being conditioned on.
Example Table:
| | Success | Failure | Total |
|---|---|---|---|
| Treatment | 30 | 10 | 40 |
| Control | 20 | 40 | 60 |
| Total | 50 | 50 | 100 |
Example Calculation: Probability of success given treatment: $P(\text{Success} \mid \text{Treatment}) = \frac{30}{40} = 0.75$
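The marginal, joint, and conditional probabilities for the example table can be computed as in the sketch below. The counts are taken from the table; the nested-dictionary layout and the variable names are just one possible way to organize them.

```python
from fractions import Fraction

# Cell counts from the example table (rows: group, columns: outcome).
table = {
    "Treatment": {"Success": 30, "Failure": 10},
    "Control": {"Success": 20, "Failure": 40},
}

grand_total = sum(sum(row.values()) for row in table.values())
treatment_total = sum(table["Treatment"].values())
success_total = sum(row["Success"] for row in table.values())

# Marginal probabilities: a row or column total over the grand total.
p_treatment = Fraction(treatment_total, grand_total)
p_success = Fraction(success_total, grand_total)

# Joint probability: a single cell over the grand total.
p_treatment_and_success = Fraction(table["Treatment"]["Success"], grand_total)

# Conditional probability: a cell over the total of the given row.
p_success_given_treatment = Fraction(table["Treatment"]["Success"], treatment_total)

print(p_treatment, p_success, p_treatment_and_success, p_success_given_treatment)
```

The conditional probability is 30/40 = 3/4, and it agrees with the multiplication rule: (40/100) × (30/40) = 30/100.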