Statistics Study Guide: Variables, Data Displays, Outliers, and Correlation
Understanding Variables in Statistics
Types of Variables
In statistics, variables are characteristics or properties that can take on different values. They are classified as either qualitative (categorical) or quantitative (numerical):
Qualitative Variables: Describe qualities or categories. Examples: zip code, living with parents, employment status, fraternity/sorority membership.
Quantitative Variables: Represent measurable quantities. Examples: annual income, undergraduate GPA.
Note: Some variables, like zip code, are coded numerically but are still qualitative because the numbers do not represent meaningful quantities.
Graphical Displays and Summary Statistics
Histograms and Boxplots
Histograms and boxplots are graphical tools used to summarize and visualize the distribution of quantitative data.
Histogram: Shows the frequency of data within specified intervals (bins).
Boxplot: Summarizes data using the five-number summary: minimum, Q1, median, Q3, and maximum. Outliers may be indicated as points beyond the whiskers.
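As a rough sketch, the five-number summary behind a boxplot can be computed with Python's standard library. The data set here is hypothetical, and the quartiles use the default "exclusive" method of `statistics.quantiles`; textbooks and calculators use slightly different quartile conventions, so results on small samples may differ from a given course's method.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, med, q3 = quantiles(ordered, n=4)
    return ordered[0], q1, med, q3, ordered[-1]

# Hypothetical data, chosen only to illustrate the calculation.
summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)  # (1, 2.5, 5.0, 7.5, 9)
```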
Measures of Center: Mean and Median
Mean: The arithmetic average of a data set.
Median: The middle value when data are ordered from smallest to largest.
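Skewness: If the mean is greater than the median, the distribution is usually right-skewed (positively skewed), because a few large values pull the mean upward. If the mean is less than the median, the distribution is usually left-skewed (negatively skewed). This comparison is a rule of thumb, not a guarantee, so confirm it against a histogram or boxplot.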
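The mean-versus-median comparison can be sketched in a few lines of Python. The income figures are hypothetical, and the helper name `skew_hint` is made up for this illustration; it only reports what the comparison suggests, not a formal measure of skewness.

```python
from statistics import mean, median

def skew_hint(values):
    """Suggest a skew direction by comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right-skewed"
    if m < med:
        return "left-skewed"
    return "roughly symmetric"

# Hypothetical annual incomes in thousands; one large value pulls the mean up.
incomes = [30, 32, 35, 38, 40, 120]
print(median(incomes))     # 36.5
print(skew_hint(incomes))  # right-skewed
```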
Example: Interpreting a Histogram
Given a histogram of calcium concentrations in water, you can estimate the percentage of locations within a certain range by summing the frequencies in the relevant bins and dividing by the total number of observations.
Calculating Percentages from Histograms
To find the percentage of observations in a certain range:
Add the frequencies for all bins in the range.
Divide by the total number of observations.
Multiply by 100 to get a percentage.
Example: If 65 out of 105 patients had fewer than 180 days of depression, the percentage is $\frac{65}{105} \times 100 \approx 61.9\%$.
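The three steps can be sketched as a small Python function. The function name and the two-bin split are assumptions made for illustration; only the counts 65 and 105 come from the example above.

```python
def percent_in_range(bin_counts, selected_bins):
    """Percentage of observations that fall in the selected bins.

    bin_counts: dict mapping a bin label to its frequency.
    selected_bins: labels of the bins that make up the range of interest.
    """
    total = sum(bin_counts.values())
    in_range = sum(bin_counts[label] for label in selected_bins)
    return 100 * in_range / total

# 65 of 105 patients had fewer than 180 days of depression.
# Collapsing the histogram into two bins is a simplification for this sketch.
depression_days = {"under 180": 65, "180 or more": 40}
pct_under_180 = percent_in_range(depression_days, ["under 180"])
print(round(pct_under_180, 1))  # 61.9
```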
Frequency Tables and the Median
Using Frequency Tables
Frequency tables summarize data by grouping values into intervals and counting occurrences.
| Number of calls made | Frequency |
|---|---|
| 1 – 4 | 16 |
| 5 – 8 | 11 |
| 9 – 12 | 5 |
| 13 – 16 | 3 |
| 17 – 20 | 2 |
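To find the number of people making more than 8 calls, sum the frequencies for the intervals above 8: 5 + 3 + 2 = 10.
To find the median interval, determine the position of the median (middle value) and see which interval contains it. Here there are 37 observations, so the median is the 19th value. The cumulative frequencies are 16 after the first interval and 27 after the second, so the median lies in the 5 – 8 interval.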
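Both questions can be answered with a short Python sketch over the table above. The helper names are invented for this example, and for an even number of observations the code uses the lower of the two middle positions, which is a simplifying assumption.

```python
# Frequency table from the guide: (low, high, frequency) for each interval.
call_table = [
    (1, 4, 16),
    (5, 8, 11),
    (9, 12, 5),
    (13, 16, 3),
    (17, 20, 2),
]

def count_above(table, threshold):
    """Total frequency of the intervals lying entirely above the threshold."""
    return sum(freq for low, _high, freq in table if low > threshold)

def median_interval(table):
    """Return the (low, high) interval that contains the median observation."""
    n = sum(freq for _low, _high, freq in table)
    position = (n + 1) // 2  # 19th value when n = 37
    cumulative = 0
    for low, high, freq in table:
        cumulative += freq
        if cumulative >= position:
            return (low, high)

more_than_8 = count_above(call_table, 8)   # 5 + 3 + 2
median_bin = median_interval(call_table)
print(more_than_8, median_bin)  # 10 (5, 8)
```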
Outliers and the 1.5*IQR Criterion
Identifying Outliers
An outlier is a value that lies far outside the range of the rest of the data. The 1.5*IQR rule is commonly used:
IQR (Interquartile Range): $IQR = Q_3 - Q_1$
Lower Fence: $Q_1 - 1.5 \times IQR$
Upper Fence: $Q_3 + 1.5 \times IQR$
Values outside these fences are considered outliers.
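Example: Given the quartiles $Q_1$ and $Q_3$, compute $IQR = Q_3 - Q_1$ and then the upper fence $Q_3 + 1.5 \times IQR$. When the upper fence works out to 67.5, any value above 67.5 is an outlier.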
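A minimal Python sketch of the 1.5*IQR rule follows. The quartiles (20 and 30) and the candidate values are hypothetical, chosen only to make the arithmetic easy to follow; they are not the numbers behind the 67.5 fence in the example. Quartiles are passed in directly so the sketch does not depend on any particular quartile convention.

```python
def iqr_fences(q1, q3):
    """Return (lower_fence, upper_fence) under the 1.5*IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Return the values that fall outside the fences."""
    lower, upper = iqr_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

# Hypothetical quartiles: Q1 = 20, Q3 = 30, so IQR = 10.
lower_fence, upper_fence = iqr_fences(20, 30)
# Hypothetical values to test against those fences.
outliers = find_outliers([4, 12, 22, 25, 31, 44, 50], 20, 30)
print(lower_fence, upper_fence, outliers)  # 5.0 45.0 [4, 50]
```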
Scatterplots, Correlation, and Regression
Scatterplots and Correlation
A scatterplot displays the relationship between two quantitative variables. The correlation coefficient ($r$) measures the strength and direction of a linear relationship:
$r$ ranges from -1 (perfect negative) to +1 (perfect positive).
The sign indicates direction; the magnitude indicates strength.
$r$ is unitless and does not change with changes in measurement units.
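To make these properties concrete, here is a sketch that computes the Pearson correlation coefficient from its usual definition and checks that converting units leaves it unchanged. The height and weight values are hypothetical, and `pearson_r` is a hand-written helper for illustration, not a library function.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical heights (inches) and weights (pounds).
heights_in = [60, 62, 65, 68, 70]
weights_lb = [115, 120, 140, 155, 160]
heights_cm = [h * 2.54 for h in heights_in]  # same data, different units

r_inches = pearson_r(heights_in, weights_lb)
r_cm = pearson_r(heights_cm, weights_lb)
print(abs(r_inches - r_cm) < 1e-9)  # True: changing units does not change r

# Points that fall exactly on a downward line give r = -1.
r_perfect_negative = pearson_r([1, 2, 3], [6, 4, 2])
```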
Linear Regression
Linear regression models the relationship between an explanatory variable ($X$) and a response variable ($Y$) using the equation:
$$\hat{Y} = b_0 + b_1 X$$
Slope ($b_1$): Change in the predicted $Y$ for a one-unit increase in $X$.
Intercept ($b_0$): Predicted value of $Y$ when $X = 0$.
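Example: If the fitted regression equation has a slope of -491, the predicted response drops by 491 for each one-unit increase in $X$, indicating a negative relationship.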
Making Predictions
To predict $Y$ for a given $X$, substitute that value of $X$ into the regression equation.
Example: For a specific value $X = x_0$, the predicted response is the intercept plus the slope times $x_0$.
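The fit-then-predict workflow can be sketched as follows. This assumes the line is fitted by ordinary least squares, which is the usual choice but is not stated in the guide; the data points and the helper names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, x):
    """Substitute x into the fitted equation: intercept + slope * x."""
    return intercept + slope * x

# Hypothetical data lying exactly on a downward-sloping line.
xs = [1, 2, 3, 4]
ys = [10, 8, 6, 4]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)              # 12.0 -2.0
print(predict(intercept, slope, 5))  # 2.0
```

A negative slope, as in this sketch, means the predicted response falls as the explanatory variable rises. Predictions far outside the range of the observed data should be treated with caution.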
Interpreting Correlation in Context
A moderate to strong negative correlation (an $r$ well below 0, toward -1) indicates that as $X$ increases, $Y$ tends to decrease.
Correlation does not imply causation.
Summary Table: Key Statistical Concepts
| Concept | Definition | Example |
|---|---|---|
| Qualitative Variable | Describes a category or quality | Zip code, Employment status |
| Quantitative Variable | Describes a measurable quantity | Annual income, GPA |
| Mean | Arithmetic average | Sum of values / Number of values |
| Median | Middle value in ordered data | Value at position $(n + 1)/2$ |
| Outlier | Value below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$ | Value above 67.5 when the upper fence is 67.5 |
| Correlation ($r$) | Strength and direction of linear relationship | A mid-range $r$ between 0 and +1 (moderate positive) |
| Regression Slope | Change in $Y$ per unit change in $X$ | Slope of -491 (negative relationship) |
Additional info:
When interpreting histograms, always check the scale and bin widths.
Boxplots are useful for comparing distributions and identifying outliers.
Correlation is only appropriate for linear relationships between quantitative variables.