Statistics Review: Five Number Summary, Boxplots, Empirical Rule, Z-scores, Regression, and More

Study Guide - Smart Notes

Q1. Calculate the Five Number Summary for the house prices dataset.

Background

Topic: Descriptive Statistics – Five Number Summary

This question tests your ability to summarize a dataset using five key statistics: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.

Key Terms and Formulas:

Minimum: The smallest value in the dataset.
Q1 (First Quartile): The value below which 25% of the data fall.
Median: The middle value when the data are ordered.
Q3 (Third Quartile): The value below which 75% of the data fall.
Maximum: The largest value in the dataset.

Step-by-Step Guidance

Order the data from smallest to largest (already done in this case).
Identify the minimum and maximum values directly from the ordered list.
Find the median (the middle value). For 13 data points, the median is the 7th value.
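Find Q1: This is the median of the lower half (excluding the overall median if the number of data points is odd). With 13 data points the lower half has 6 values, so Q1 is the average of the 3rd and 4th values.
Find Q3: This is the median of the upper half (again, excluding the overall median if the number of data points is odd). With 13 data points, Q3 is the average of the 10th and 11th values.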
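
The steps above can be sketched in Python. This is a minimal illustration of the median-of-halves method only; the function name five_number_summary and the prices list are assumptions made for the example, not the actual house prices dataset. Other quartile conventions (such as those built into some software) can give slightly different Q1 and Q3 values.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]  # values below the middle position
    # When n is odd, skip the overall median itself
    upper = data[half + 1:] if n % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]

# Hypothetical prices in $1000s (13 values), NOT the actual dataset
prices = [110, 125, 140, 150, 160, 175, 180, 195, 210, 230, 250, 300, 480]
print(five_number_summary(prices))  # (110, 145.0, 180, 240.0, 480)
```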

Try solving on your own before revealing the answer!

Q2. Sketch the boxplot for the house prices.

Background

Topic: Boxplots (Box-and-Whisker Plots)

This question asks you to visually represent the five number summary using a boxplot, which helps identify the spread and potential outliers in the data.

Key Terms:

Boxplot: A graphical summary of the five number summary.
Whiskers: Lines extending from the box to the minimum and maximum values. If there are outliers, each whisker instead stops at the most extreme value that still lies inside the fences.

Step-by-Step Guidance

Draw a number line that covers the range of your data.
Draw a box from Q1 to Q3, with a line at the median.
Extend whiskers from the box to the minimum and maximum values. If outliers are present, stop each whisker at the most extreme value inside the fences.
If you have calculated fences (see next question), mark any outliers with a symbol (like a dot or asterisk).

Try sketching the boxplot before checking the answer!

Q3. Calculate the upper fence and lower fence for the house prices.

Background

Topic: Outlier Detection Using Fences

This question tests your ability to calculate the boundaries (fences) used to identify outliers in a dataset.

Key Formulas:

Interquartile Range (IQR): $\text{IQR} = Q_3 - Q_1$
Lower Fence: $Q_1 - 1.5 \times \text{IQR}$
Upper Fence: $Q_3 + 1.5 \times \text{IQR}$

Step-by-Step Guidance

Calculate the IQR by subtracting Q1 from Q3.
Multiply the IQR by 1.5.
Subtract this value from Q1 to get the lower fence.
Add this value to Q3 to get the upper fence.
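
As a rough sketch of the same arithmetic, the fences can be computed from the two quartiles. The function name fences and the quartile values passed to it are illustrative assumptions, not results from the actual house prices data.

```python
def fences(q1, q3):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR rule."""
    iqr = q3 - q1
    step = 1.5 * iqr
    return q1 - step, q3 + step

# Hypothetical quartiles in $1000s, NOT taken from the actual dataset
lower, upper = fences(145, 240)
print(lower, upper)  # 2.5 382.5
```

With these illustrative numbers, a maximum of 480 would lie above the upper fence of 382.5 and would be flagged as an outlier.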

Try calculating the fences before checking the answer!

Q4. Is the maximum price an outlier? Why?

Background

Topic: Outlier Identification

This question asks you to use the fences calculated above to determine if the maximum value is an outlier.

Key Concept:

If a value is greater than the upper fence or less than the lower fence, it is considered an outlier.

Step-by-Step Guidance

Compare the maximum value to the upper fence you calculated previously.
If the maximum is greater than the upper fence, it is an outlier; otherwise, it is not.
Explain your reasoning based on the comparison.

Try reasoning through this before checking the answer!

Q5. Interpret the summary statistics for family.income.median from statedata2008.

Background

Topic: Interpreting Summary Statistics

This question tests your ability to interpret measures such as mean, median, standard deviation, minimum, and maximum for a variable.

Key Terms:

Mean: The average value.
Median: The middle value.
Standard Deviation: A measure of spread.
Minimum/Maximum: The smallest/largest values.

Step-by-Step Guidance

Compare the mean and median to assess symmetry or skewness: a mean well above the median suggests right skew, and a mean well below it suggests left skew.
Consider the standard deviation to understand variability.
Check the minimum and maximum for possible outliers or range.
Summarize what these statistics tell you about the distribution of family income.
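
One way to produce and read these statistics is sketched below using Python's standard statistics module. The incomes list is made up for illustration and is not the family.income.median column from statedata2008; the helper name describe is also an assumption.

```python
from statistics import mean, median, stdev

def describe(values):
    """Return basic summary statistics as a dict."""
    return {
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),  # sample standard deviation (n - 1 denominator)
        "min": min(values),
        "max": max(values),
    }

# Hypothetical median family incomes in $1000s, NOT the real data
incomes = [48, 52, 55, 57, 60, 61, 63, 66, 70, 88]
stats = describe(incomes)
# A mean above the median hints at right skew (here caused by the value 88)
print(stats["mean"], stats["median"], stats["mean"] > stats["median"])
```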

Try interpreting the summary statistics before checking the answer!

Q6. Describe the distribution of GDP using a histogram.

Background

Topic: Describing Distributions

This question asks you to interpret the shape, center, and spread of a distribution based on a histogram.

Key Terms:

Shape: Symmetric, skewed left/right, unimodal, bimodal, etc.
Center: Where most values cluster (mean or median).
Spread: Range, IQR, or standard deviation.

Step-by-Step Guidance

Examine the histogram for symmetry or skewness.
Identify the center of the distribution.
Describe the spread and note any outliers or unusual features.

Try describing the histogram before checking the answer!

Q7. Boxplot for median_age: Any outliers? Who?

Background

Topic: Boxplots and Outlier Identification

This question asks you to use a boxplot to identify outliers and specify which data points are outliers.

Key Concepts:

Use the five number summary and fences to identify outliers.
Outliers are points outside the fences.

Step-by-Step Guidance

Calculate the IQR and fences for median_age.
Identify any data points outside these fences.
List the specific data points (or states) that are outliers.
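
To name the outlying points rather than just count them, the fence check can be applied to labeled data. The sketch below assumes the median-of-halves quartile method; the labels "A" to "J" and their ages are placeholders, not real states or real median_age values.

```python
from statistics import median

def find_outliers(labeled):
    """Return {label: value} for points outside the 1.5 * IQR fences."""
    data = sorted(labeled.values())
    n = len(data)
    half = n // 2
    q1 = median(data[:half])
    q3 = median(data[half + 1:] if n % 2 else data[half:])
    step = 1.5 * (q3 - q1)
    low, high = q1 - step, q3 + step
    return {k: v for k, v in labeled.items() if v < low or v > high}

# Hypothetical median ages keyed by placeholder labels, NOT the real data
median_age = {"A": 28.5, "B": 35.1, "C": 36.0, "D": 36.4, "E": 37.0,
              "F": 37.3, "G": 37.9, "H": 38.2, "I": 39.0, "J": 42.5}
print(find_outliers(median_age))  # {'A': 28.5, 'J': 42.5}
```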

Try identifying outliers before checking the answer!

Q8. Create a bar plot of pizza size and a pie chart of pizza maker (Pizza.xls).

Background

Topic: Categorical Data Visualization

This question tests your ability to represent categorical data using bar plots and pie charts.

Key Terms:

Bar Plot: Shows frequency of each category (e.g., pizza size).
Pie Chart: Shows proportion of each category (e.g., pizza maker).

Step-by-Step Guidance

Count the frequency of each pizza size and each pizza maker.
Draw a bar plot for pizza size (bars for each size, height = frequency).
Draw a pie chart for pizza maker (each slice = proportion of total).
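
The counting step can be sketched as follows. The sizes list is invented for illustration and is not read from Pizza.xls; frequency_table is a hypothetical helper name. The counts give the bar heights and the proportions give the pie-slice sizes.

```python
from collections import Counter

def frequency_table(categories):
    """Return {category: (count, proportion)} for a list of labels."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: (n, n / total) for cat, n in counts.items()}

# Hypothetical pizza sizes, NOT read from Pizza.xls
sizes = ["Large", "Medium", "Large", "Small", "Large", "Medium", "Large", "Small"]
for size, (count, share) in frequency_table(sizes).items():
    # Text bar of '#' marks for the count, then the share as a percentage
    print(f"{size:<7}{'#' * count}  {share:.0%}")
```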

Try creating the plots before checking the answer!

Q9. Is the US murder rate data normal? Why? (μ = 5.9, σ = 5.4)

Background

Topic: Assessing Normality Using the Empirical Rule

This question asks you to determine if a dataset is approximately normal by applying the empirical rule.

Key Concepts:

Empirical Rule: In a normal distribution, about 68% of data fall within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3.

Step-by-Step Guidance

Check if the data is symmetric and bell-shaped (if a histogram is available).
Apply the empirical rule to see if the data fits the expected percentages.
Consider any deviations from the rule as evidence against normality.
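
A quick way to apply the rule is to compute the three intervals from the given mean and standard deviation. The helper name empirical_intervals is an assumption; the values 5.9 and 5.4 come from the question.

```python
def empirical_intervals(mu, sigma):
    """Return {k: (low, high)} for k = 1, 2, 3 standard deviations."""
    return {k: (round(mu - k * sigma, 2), round(mu + k * sigma, 2))
            for k in (1, 2, 3)}

# Values from the question: mean 5.9, standard deviation 5.4
intervals = empirical_intervals(5.9, 5.4)
for k, (low, high) in intervals.items():
    print(k, low, high)
```

The 2 and 3 standard deviation intervals extend below zero, which a rate cannot do. A normal model with these parameters would place a noticeable share of values in that impossible region, which is one reason to doubt normality here.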

Try reasoning through this before checking the answer!

Q10. Empirical Rule for Vocabulary Sizes (μ = 14000, σ = 3000)

Background

Topic: Empirical Rule (68-95-99.7 Rule) for Normal Distributions

This question asks you to apply the empirical rule to a normal distribution of vocabulary sizes.

Key Formulas:

68%: $\mu \pm \sigma$
95%: $\mu \pm 2\sigma$
99.7%: $\mu \pm 3\sigma$

Step-by-Step Guidance

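For 68%, calculate $\mu - \sigma$ and $\mu + \sigma$.
For 95%, calculate $\mu - 2\sigma$ and $\mu + 2\sigma$.
For 99.7%, calculate $\mu - 3\sigma$ and $\mu + 3\sigma$.
For the percentage less than 17000, calculate the z-score: $z = \frac{17000 - 14000}{3000}$, then use the standard normal table.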
For the percentage between 5000 and 14000, calculate the z-scores for both values and use the standard normal table to find the area between them.
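
The table lookups in the last two steps can be checked with Python's statistics.NormalDist, used here in place of a printed standard normal table. The parameters come from the question; the variable names and the z_score helper are illustrative.

```python
from statistics import NormalDist

vocab = NormalDist(mu=14000, sigma=3000)  # parameters from the question

def z_score(x, dist):
    """Number of standard deviations x lies from the mean."""
    return (x - dist.mean) / dist.stdev

below_17000 = vocab.cdf(17000)                # area left of z = 1
between = vocab.cdf(14000) - vocab.cdf(5000)  # area from z = -3 to z = 0
print(z_score(17000, vocab), round(below_17000, 4), round(between, 4))
```

These agree with the empirical rule: 50% + 34% = 84% below 17000, and half of 99.7%, or about 49.85%, between 5000 and 14000.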

Try applying the empirical rule before checking the answer!

Q11. Z-scores and Outliers for Vocabulary Sizes

Background

Topic: Z-scores and Outlier Detection

This question asks you to calculate z-scores and interpret whether a data point is extreme.

Key Formulas:

Z-score: $z = \frac{x - \mu}{\sigma}$
To find x from z: $x = \mu + z\sigma$

Step-by-Step Guidance

For Mary, plug her value into the z-score formula.
Interpret whether her z-score is extreme (a common rule of thumb treats |z| > 2 as unusual and |z| > 3 as very unusual).
For Phillip, use his z-score to solve for x (number of words).
Interpret if Phillip's z-score is considered extreme.
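
Both directions of the conversion can be sketched in a few lines. The mean and standard deviation come from the vocabulary question, but Mary's word count and Phillip's z-score are not stated in these notes, so the values below are hypothetical placeholders; substitute the ones from your assignment.

```python
MU, SIGMA = 14000, 3000  # from the vocabulary question

def z_from_x(x, mu=MU, sigma=SIGMA):
    """Standardize a raw value."""
    return (x - mu) / sigma

def x_from_z(z, mu=MU, sigma=SIGMA):
    """Convert a z-score back to a raw value."""
    return mu + z * sigma

# HYPOTHETICAL inputs: not given in the source notes
mary_words = 21500
phillip_z = -1.5
print(z_from_x(mary_words))  # 2.5
print(x_from_z(phillip_z))   # 9500.0
```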

Try calculating the z-scores before checking the answer!

Q12. What vocabulary size is at the 90th and 10th percentiles? How could we calculate this without a calculator?

Background

Topic: Percentiles in a Normal Distribution

This question asks you to find the value corresponding to a given percentile using the normal distribution.

Key Concepts:

Percentiles correspond to specific z-scores in the standard normal distribution.
Use the formula: $x = \mu + z\sigma$

Step-by-Step Guidance

Find the z-score that corresponds to the desired percentile (e.g., z ≈ 1.28 for the 90th percentile, z ≈ -1.28 for the 10th percentile).
Plug the z-score into the formula to find the vocabulary size.
Explain how to estimate z-scores for common percentiles without a calculator (using tables or memorized values).
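
As a check on the table-based approach, the sketch below uses statistics.NormalDist().inv_cdf to look up the z-score for a percentile and then converts it to a vocabulary size. The helper name percentile_value is an assumption; the mean and standard deviation come from the vocabulary question.

```python
from statistics import NormalDist

MU, SIGMA = 14000, 3000  # from the vocabulary question

def percentile_value(p, mu=MU, sigma=SIGMA):
    """Return the value below which a proportion p of a normal model falls."""
    z = NormalDist().inv_cdf(p)  # z-score for the requested percentile
    return mu + z * sigma

print(round(percentile_value(0.90)), round(percentile_value(0.10)))
```

By hand, the memorized value z ≈ 1.28 gives 14000 + 1.28 × 3000 = 17840 and 14000 − 1.28 × 3000 = 10160, within a few words of the computed results.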

Try finding the percentile values before checking the answer!

Q13. Conditions to Check for Regression

Background

Topic: Regression Assumptions

This question asks you to recall and explain the conditions that must be met before performing regression analysis.

Key Conditions:

Quantitative Variable Condition
Straight Enough Condition
Outlier Condition

Step-by-Step Guidance

Explain why each condition is important for regression analysis.
Describe how to check each condition (e.g., scatterplots for straightness, residual plots for outliers).
Discuss what to do if a condition is not met.

Try listing and explaining the conditions before checking the answer!

Q14. SAT, GPA dataset: What variables might be related? Can high school GPA predict college GPA? Do a complete analysis.

Background

Topic: Regression Analysis and Variable Relationships

This question asks you to identify related variables and perform a regression analysis to see if one can predict the other.

Key Steps:

Identify potential predictor and response variables.
Check regression conditions.
Fit a regression model and interpret coefficients.
Assess model fit (R²) and check residuals for outliers or patterns.

Step-by-Step Guidance

Identify which variables are quantitative and could be related (e.g., high school GPA and college GPA).
Check the regression conditions (see previous question).
Fit a linear regression model: $\hat{y} = b_0 + b_1 x$
Interpret the slope and intercept in context.
Calculate and interpret R², and examine residuals for outliers or non-random patterns.
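
The fitting and R² steps can be sketched with ordinary least squares written out by hand. The GPA pairs below are invented for illustration and are not the SAT/GPA dataset; fit_line is a hypothetical helper name. In practice you would also plot the data and the residuals, which this sketch does not do.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (b0, b1, r_squared) for the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return b0, b1, 1 - ss_res / ss_tot

# Hypothetical (high school GPA, college GPA) pairs, NOT the actual dataset
hs_gpa = [2.4, 2.8, 3.0, 3.3, 3.6, 3.9]
col_gpa = [2.1, 2.6, 2.7, 3.1, 3.2, 3.7]
b0, b1, r2 = fit_line(hs_gpa, col_gpa)
print(f"col_gpa = {b0:.2f} + {b1:.2f} * hs_gpa, R^2 = {r2:.2f}")
```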

Try outlining the analysis before checking the answer!

Q15. Predictions for students with a 3.1 high school GPA; R² and interpretation; Residuals and outliers?

Background

Topic: Regression Predictions and Model Assessment

This question asks you to use a regression equation to make predictions, interpret R², and assess residuals for outliers.

Key Formulas:

Prediction: $\hat{y} = b_0 + b_1 x$
R²: Proportion of variance in the response variable explained by the predictor.
Residual: $e = y - \hat{y}$

Step-by-Step Guidance

Plug 3.1 into the regression equation to get the predicted college GPA.
Interpret R² in the context of the data (e.g., what percent of college GPA variation is explained by high school GPA).
Calculate residuals for individual students and check for large or unusual values (potential outliers).
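
The prediction and residual steps can be sketched as below. The coefficients B0 and B1 and the observed college GPA are hypothetical placeholders, not values fitted from the actual dataset; replace them with the intercept and slope from your own regression output.

```python
B0, B1 = 0.5, 0.8  # HYPOTHETICAL intercept and slope, not from the real data

def predict(hs_gpa, b0=B0, b1=B1):
    """Predicted college GPA from the regression line."""
    return b0 + b1 * hs_gpa

def residual(hs_gpa, college_gpa, b0=B0, b1=B1):
    """Observed minus predicted college GPA."""
    return college_gpa - predict(hs_gpa, b0, b1)

print(round(predict(3.1), 2))        # 2.98
# Hypothetical student: high school GPA 3.1, observed college GPA 2.2
print(round(residual(3.1, 2.2), 2))  # -0.78
```

A negative residual means the line overpredicted for that student; residuals that are large relative to the others are candidates for outliers.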

Try making the prediction and interpreting R² before checking the answer!

Q16. Cautions before using correlation or regression analysis

Background

Topic: Limitations and Pitfalls of Regression and Correlation

This question asks you to recall important cautions and best practices before performing regression or correlation analysis.

Key Points:

Always plot data first (scatterplots, residual plots).
Do not fit a linear regression to inappropriate data (e.g., variables that are not quantitative, or a relationship that is not linear).
Check for patterns in residuals and presence of outliers.
Beware of lurking variables and remember that correlation does not imply causation.

Step-by-Step Guidance

List the steps to take before running regression/correlation (e.g., plotting, checking assumptions).
Explain why each step is important for valid analysis.
Discuss the dangers of ignoring these cautions (e.g., spurious results, misinterpretation).