Essential Study Notes for Introductory Statistics: Data, Distributions, and Probability


Chapter 1: Data Collection

Sources of Bias in Sampling

Understanding bias is crucial for collecting reliable data. Bias occurs when a sample does not accurately represent the population, leading to systematic errors in results.

Selection Bias: Occurs when certain groups are systematically excluded from the sample.
Nonresponse Bias: Results when individuals selected for the sample do not respond and the nonrespondents differ from respondents on the variable of interest.
Response Bias: Arises from inaccurate or untruthful responses, often due to poorly worded questions or social desirability.
Sampling Method Bias: Can occur if the sampling method (e.g., convenience sampling) does not produce a random sample.

Example: Surveying only morning class students about campus dining preferences may exclude students with different schedules, introducing selection bias.

Chapter 2: Organizing and Summarizing Data

Reading and Understanding a Histogram

A histogram is a graphical representation of the distribution of numerical data, where data is grouped into bins or intervals.

Each bar's height represents the frequency or relative frequency of data within that interval.
Histograms are useful for visualizing the shape, center, and spread of data.

Example: A histogram of exam scores can reveal whether most students scored in the middle range or if scores are spread out.
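As a rough illustration of how bins turn raw values into bar heights, the sketch below counts made-up exam scores into equal-width intervals. The scores, the bin width of 10, and the helper name `bin_counts` are assumptions for illustration only.

```python
from collections import Counter

def bin_counts(values, width, start):
    """Count how many values fall in each interval [lo, lo + width)."""
    counts = Counter()
    for v in values:
        lo = start + ((v - start) // width) * width  # left edge of v's bin
        counts[lo] += 1
    return dict(sorted(counts.items()))

# Hypothetical integer exam scores (not from the notes)
scores = [52, 58, 61, 67, 70, 71, 74, 75, 78, 79, 83, 85, 88, 94]
freq = bin_counts(scores, width=10, start=50)

for lo, n in freq.items():
    # Each '#' stands for one score, so the row length plays the role of bar height
    print(f"{lo}-{lo + 9}: {'#' * n}")
```

Here the 70-79 bin has the longest row, which is the text version of "most students scored in the middle range".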

Identifying the Shape of a Distribution

Symmetric: Both sides of the histogram are approximately mirror images.
Skewed Right (Positive Skew): The right tail is longer; most data is concentrated on the left.
Skewed Left (Negative Skew): The left tail is longer; most data is concentrated on the right.
Uniform: All intervals have roughly the same frequency.
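One informal way to check the shape without a graph is to compare the mean and the median: a long right tail tends to pull the mean above the median, and a long left tail tends to pull it below. The sketch below applies that rule of thumb. The function name `skew_hint`, the 0.05 tolerance, and the data are assumptions, and the hint is not a substitute for looking at the histogram.

```python
from statistics import mean, median

def skew_hint(values, tolerance=0.05):
    """Rule-of-thumb hint only: compare mean and median relative to the range."""
    spread = max(values) - min(values)
    if spread == 0:
        return "no spread"
    gap = (mean(values) - median(values)) / spread
    if gap > tolerance:
        return "possibly skewed right"
    if gap < -tolerance:
        return "possibly skewed left"
    return "roughly symmetric"

# Hypothetical data sets
right_tail = [20, 22, 25, 27, 30, 32, 95]   # one large value stretches the right tail
left_tail = [5, 68, 70, 73, 75, 78, 80]     # one small value stretches the left tail
balanced = [10, 20, 30, 40, 50, 60, 70]     # mean equals median

for name, data in [("right_tail", right_tail), ("left_tail", left_tail), ("balanced", balanced)]:
    print(name, "->", skew_hint(data))
```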

Misleading Graphs and How to Fix Them

Non-zero Baselines: Starting the y-axis above zero can exaggerate differences.
Inconsistent Intervals: Unequal bin widths distort the data's appearance.
3D Effects: Can make it difficult to interpret values accurately.
Fix: Use consistent scales, start axes at zero, and avoid unnecessary visual effects.

Example: A bar graph with a y-axis starting at 50 instead of 0 may make small differences appear large.
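The size of the distortion can be put into numbers. The sketch below compares the ratio of two bar heights as drawn with a baseline of 50 against the ratio of the actual values. The values 52 and 55 and the helper name `drawn_ratio` are made up for illustration.

```python
def drawn_ratio(a, b, baseline):
    """Ratio of bar heights as drawn when the y-axis starts at `baseline`."""
    return (b - baseline) / (a - baseline)

value_a, value_b = 52, 55  # hypothetical values being compared

true_ratio = value_b / value_a                 # actual ratio of the two values
truncated = drawn_ratio(value_a, value_b, 50)  # bars drawn from a baseline of 50
honest = drawn_ratio(value_a, value_b, 0)      # bars drawn from a baseline of 0

print(f"actual ratio:            {true_ratio:.2f}")
print(f"drawn ratio, axis at 50: {truncated:.2f}")
print(f"drawn ratio, axis at 0:  {honest:.2f}")
```

With the axis starting at 50, the second bar is drawn 2.5 times as tall as the first even though the value is only about 6% larger.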

Chapter 3: Numerically Summarizing Data

Comparing and Contrasting Normal Curves

Normal curves are bell-shaped and symmetric. They are defined by their mean ($\mu$) and standard deviation ($\sigma$).

Larger Mean: Shifts the curve to the right along the horizontal axis without changing its shape.
Larger Standard Deviation: Makes the curve wider and flatter.

Example: SAT scores with a higher mean indicate better overall performance; a larger standard deviation indicates more variability among students.
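A quick numerical check of these two effects, using Python's standard-library `statistics.NormalDist`. The means and standard deviations below are illustrative assumptions, not real SAT figures.

```python
from statistics import NormalDist

# Hypothetical curves: same mean with different spread, and same spread with a larger mean
narrow = NormalDist(mu=500, sigma=50)
wide = NormalDist(mu=500, sigma=100)
shifted = NormalDist(mu=550, sigma=50)

# The height of a normal curve at its own mean is its peak
peak_narrow = narrow.pdf(500)
peak_wide = wide.pdf(500)
peak_shifted = shifted.pdf(550)

print(f"peak, sigma=50:        {peak_narrow:.5f}")
print(f"peak, sigma=100:       {peak_wide:.5f}")   # half as tall: wider and flatter
print(f"peak, mean shifted up: {peak_shifted:.5f}")  # same height, new location
```

Doubling the standard deviation halves the peak height, while changing only the mean moves the peak and leaves its height unchanged.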

Using the Empirical Rule to Calculate Probabilities

The Empirical Rule applies to normal distributions:

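About 68% of data falls within 1 standard deviation of the mean ($\mu \pm \sigma$).
About 95% within 2 standard deviations ($\mu \pm 2\sigma$).
About 99.7% within 3 standard deviations ($\mu \pm 3\sigma$).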

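Formula:

$$P(\mu - k\sigma \le X \le \mu + k\sigma)$$

where $\mu$ is the mean, $\sigma$ is the standard deviation, and $k = 1, 2, 3$ gives approximately 0.68, 0.95, and 0.997.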

Example: If test scores are normally distributed with mean $\mu = 70$ and standard deviation $\sigma = 10$, about 95% of scores are between 50 and 90.
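The percentages in the Empirical Rule are rounded values of exact normal probabilities. The sketch below checks them with `statistics.NormalDist`, assuming a mean of 70 and a standard deviation of 10, which is the pair of values consistent with the 50-to-90 interval in the example.

```python
from statistics import NormalDist

mu, sigma = 70, 10  # assumed values, consistent with the 50-to-90 example
scores = NormalDist(mu, sigma)

def prob_within(k):
    """Probability that a score lies within k standard deviations of the mean."""
    return scores.cdf(mu + k * sigma) - scores.cdf(mu - k * sigma)

for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    print(f"within {k} sd ({low} to {high}): {prob_within(k):.4f}")
```

The three probabilities come out near 0.68, 0.95, and 0.997, matching the rule.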

Five Number Summary and Boxplots

The five number summary consists of:

Minimum
First Quartile ($Q_1$)
Median
Third Quartile ($Q_3$)
Maximum

Boxplots visually display the five number summary and help compare distributions.
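A minimal sketch of computing the five number summary, assuming the "median of each half" convention for quartiles (software packages often use slightly different conventions and may give slightly different quartiles). The function name and sample data are made up for illustration.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method.

    Assumes at least two values.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + n % 2:]    # skips the middle value when n is odd
    return (data[0], median(lower), median(data), median(upper), data[-1])

sample = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]  # hypothetical data
summary = five_number_summary(sample)
print(summary)
```

For this sample the minimum is 7, the first quartile is 36, the median is 40.5, the third quartile is 43, and the maximum is 49.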

Calculating Outliers Using the 1.5 IQR Rule

Interquartile Range (IQR): $\text{IQR} = Q_3 - Q_1$
Outlier Boundaries:

$Q_1 - 1.5 \times \text{IQR}$ (lower bound)
$Q_3 + 1.5 \times \text{IQR}$ (upper bound)

Values outside these bounds are considered outliers.
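The rule can be sketched in a few lines: compute the IQR as the third quartile minus the first quartile, place the bounds 1.5 IQRs beyond each quartile, and flag anything outside. The quartile values and data below are hypothetical, and `outlier_fences` is an illustrative name.

```python
def outlier_fences(q1, q3):
    """Return (lower bound, upper bound) from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

q1, q3 = 36, 43  # hypothetical quartiles of the data below
lower, upper = outlier_fences(q1, q3)

data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
outliers = [x for x in data if x < lower or x > upper]

print(f"bounds: {lower} to {upper}")
print(f"outliers: {outliers}")
```

With an IQR of 7 the bounds are 25.5 and 53.5, so 7 and 15 are flagged as outliers.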

Percentiles and Quartiles

Percentile: The value below which a given percentage of observations falls.
Quartiles: Divide data into four equal parts: $Q_1$ (25th percentile), Median (50th), $Q_3$ (75th).

Example: If a score is at the 90th percentile, it is higher than 90% of the data.
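A small sketch of finding the percentile of a given score, using the "percentage of values strictly below" convention. Some textbooks count values at or below instead, so treat the convention, the function name, and the scores as assumptions.

```python
def percentile_rank(values, score):
    """Percentage of observations strictly below `score`."""
    below = sum(1 for v in values if v < score)
    return 100 * below / len(values)

scores = [55, 60, 62, 68, 70, 74, 77, 81, 85, 93]  # hypothetical class scores
rank = percentile_rank(scores, 93)                 # 9 of the 10 scores are lower

print(f"A score of 93 is at the {rank:.0f}th percentile")
```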

Comparing and Contrasting Boxplots

Boxplots show the spread, center, and outliers of data.
Comparing boxplots helps identify differences in medians, variability, and skewness between groups.

Location of Median, Q1, and Q3

Median: Middle value (50th percentile).
Q1: Median of the lower half (25th percentile).
Q3: Median of the upper half (75th percentile).

Chapter 4: Describing the Relation Between Two Variables

Correlation Coefficient (r) from a Scatter Plot

The correlation coefficient ($r$) measures the strength and direction of a linear relationship between two variables.

$r$ ranges from -1 (perfect negative) to +1 (perfect positive).
Values near 0 indicate weak or no linear relationship.

Example: A scatter plot with points closely following an upward line has $r$ near +1.
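Reading the correlation coefficient from a plot is an estimate; it can also be computed directly. The sketch below uses the standard Pearson formula on made-up study-hours data. The function name `pearson_r` and the data are assumptions for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours studied and exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
r = pearson_r(hours, scores)

print(f"r = {r:.3f}")  # close to +1: points follow an upward line closely
```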

Coefficient of Determination ($R^2$)

$R^2$ represents the proportion of variance in the dependent variable explained by the independent variable.

Ranges from 0 to 1 (or 0% to 100%).
Higher $R^2$ indicates a stronger relationship.
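One way to see the "proportion of variance explained" idea is to compute the coefficient of determination (written $R^2$ here) as one minus the ratio of unexplained variation to total variation around a least-squares line. The helper names and data below are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys):
    """1 - (sum of squared residuals) / (total sum of squares)."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

# Hypothetical data: hours studied and exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
r2 = r_squared(hours, scores)

print(f"R^2 = {r2:.3f}")  # share of the variation in scores explained by hours
```

For a regression with a single explanatory variable, this value equals the square of the correlation coefficient.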

Interpreting Slope and Y-Intercept

Slope: The change in the dependent variable for a one-unit increase in the independent variable.
Y-Intercept: The predicted value when the independent variable is zero.

Example: In the equation $\hat{y} = 2x + 5$, the slope is 2 and the y-intercept is 5.

Calculating Predicted and Residual Values

Predicted Value: Substitute the independent variable into the regression equation.
Residual: The observed value minus the predicted value, $y - \hat{y}$. A positive residual means the observation lies above the regression line.
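A short worked sketch, assuming a regression line with slope 2 and y-intercept 5 and a made-up observation at x = 4 with an observed value of 15.

```python
def predict(x, slope=2, intercept=5):
    """Predicted value from the assumed regression line."""
    return slope * x + intercept

def residual(x, observed_y):
    """Observed value minus predicted value."""
    return observed_y - predict(x)

x, observed = 4, 15          # hypothetical data point
predicted = predict(x)       # 2 * 4 + 5
res = residual(x, observed)  # observed minus predicted

print(f"predicted = {predicted}, residual = {res}")
```

The prediction is 13 and the residual is 2, so the observed point sits 2 units above the line.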

Extrapolation

Extrapolation is predicting values outside the range of observed data, which can be unreliable because the linear pattern may not hold beyond that range.
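A simple guard against accidental extrapolation is to check whether a new x-value lies inside the range of the observed x-values before trusting a prediction. The function name and data below are illustrative assumptions.

```python
def is_extrapolation(x_new, observed_xs):
    """True when x_new lies outside the range of the observed x-values."""
    return x_new < min(observed_xs) or x_new > max(observed_xs)

observed_hours = [1, 2, 3, 4, 5]  # hypothetical x-values used to fit the line

for x_new in (3.5, 12):
    label = "extrapolation" if is_extrapolation(x_new, observed_hours) else "within observed range"
    print(f"x = {x_new}: {label}")
```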

Contingency Tables and Probability Calculations

Contingency tables display the joint frequency distribution of two categorical variables and are used to calculate probabilities.

	Category A	Category B	Total
Group 1	n11	n12	n1.
Group 2	n21	n22	n2.
Total	n.1	n.2	n

Marginal Probability: Probability of a single event (row or column total divided by grand total).

Conditional Probability: Probability of one event given another has occurred.

Example: Probability of being in Group 1 and Category A: $P(\text{Group 1 and Category A}) = n_{11} / n$

Chapter 5: Probability

General Addition Rule

Used to find the probability that at least one of two events occurs.

$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$$
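The rule adds the two individual probabilities and subtracts the overlap so that outcomes in both events are not counted twice. The sketch below checks this on one roll of a fair die, with A = "even" and B = "greater than 3" chosen purely for illustration.

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # one roll of a fair six-sided die
A = {2, 4, 6}                # even number
B = {4, 5, 6}                # greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

p_union_rule = prob(A) + prob(B) - prob(A & B)  # addition rule
p_union_direct = prob(A | B)                    # count the union directly

print(p_union_rule, p_union_direct)
```

Both approaches give 2/3, since the union {2, 4, 5, 6} contains four of the six outcomes.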

General Multiplication Rule

Used to find the probability that both events occur.

$$P(A \text{ and } B) = P(A) \cdot P(B \mid A)$$
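The rule multiplies the probability of the first event by the probability of the second event given that the first has occurred. As an illustrative case (not from the notes), the sketch below finds the probability of drawing two aces in a row from a standard 52-card deck without replacement.

```python
from fractions import Fraction

p_first_ace = Fraction(4, 52)               # 4 aces among 52 cards
p_second_ace_given_first = Fraction(3, 51)  # 3 aces left among 51 cards

# Multiply the first probability by the conditional probability of the second
p_both_aces = p_first_ace * p_second_ace_given_first

print(p_both_aces)         # exact fraction
print(float(p_both_aces))  # decimal approximation
```

The result is 1/221, or roughly 0.0045.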

Conditional Probability

The probability of event A given that event B has occurred.

$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}, \quad P(B) > 0$$
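Conditioning on B restricts attention to the outcomes in B, so the probability of A given B is the probability of both events divided by the probability of B. The card example below is an illustrative assumption: A = "the card is a king" and B = "the card is a face card" in a standard 52-card deck.

```python
from fractions import Fraction

p_face = Fraction(12, 52)          # B: jack, queen, or king (12 cards)
p_king_and_face = Fraction(4, 52)  # A and B: every king is a face card

# Conditional probability of a king, given that the card is a face card
p_king_given_face = p_king_and_face / p_face

print(p_king_given_face)
```

The answer is 1/3, which matches a direct count: 4 of the 12 face cards are kings.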

Listing the Sample Space

The sample space is the set of all possible outcomes of an experiment.

Example: Flipping two coins: S = {HH, HT, TH, TT}
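For larger experiments, listing outcomes by hand gets error-prone. The sketch below builds the two-coin sample space with `itertools.product` and uses it to find the probability of an event, assuming all outcomes are equally likely. The event "at least one head" is chosen for illustration.

```python
from itertools import product

# Every ordered pair of H/T results for two coin flips
sample_space = ["".join(flips) for flips in product("HT", repeat=2)]

# Event: at least one head
at_least_one_head = [outcome for outcome in sample_space if "H" in outcome]
p_at_least_one_head = len(at_least_one_head) / len(sample_space)

print(sample_space)
print(p_at_least_one_head)
```

Changing `repeat=2` to `repeat=3` would list the eight outcomes for three flips.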

Calculating Probabilities from a Contingency Table

Use the frequencies in the table to determine probabilities of events and combinations.
Marginal probabilities use totals; conditional probabilities use relevant row or column totals.

Example Table:

	Success	Failure	Total
Treatment	30	10	40
Control	20	40	60
Total	50	50	100

Example Calculation: Probability of success given treatment: $P(\text{Success} \mid \text{Treatment}) = \frac{30}{40} = 0.75$
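The same table supports marginal, joint, and conditional probabilities. The sketch below stores the example counts in a nested dictionary (one possible layout, chosen for illustration) and computes one probability of each kind.

```python
# Counts from the example table: rows are groups, columns are outcomes
table = {
    "Treatment": {"Success": 30, "Failure": 10},
    "Control": {"Success": 20, "Failure": 40},
}

grand_total = sum(sum(row.values()) for row in table.values())
treatment_total = sum(table["Treatment"].values())
success_total = sum(row["Success"] for row in table.values())

# Marginal: column total divided by grand total
p_success = success_total / grand_total

# Joint: single cell divided by grand total
p_treatment_and_success = table["Treatment"]["Success"] / grand_total

# Conditional: single cell divided by the relevant row total
p_success_given_treatment = table["Treatment"]["Success"] / treatment_total

print(p_success, p_treatment_and_success, p_success_given_treatment)
```

The conditional probability divides by the Treatment row total of 40 rather than the grand total of 100, which is why it is 0.75 while the joint probability is 0.3.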