Statistical Inference: Regression, Correlation, and Categorical Data Analysis

Describing the Relation Between Two Variables

Simple Linear Regression and Correlation

This section explores how two quantitative variables are related using regression analysis and correlation. The goal is to model and interpret the relationship between variables such as study time and exam scores, or fertilizer amount and crop yield.

  • Scatter Plot: A graphical representation of paired data points to visualize the relationship between two variables.

  • Regression Equation: The equation of the best-fit line is typically written as $\hat{y} = b_0 + b_1 x$, where $\hat{y}$ is the predicted value, $b_0$ is the y-intercept, and $b_1$ is the slope.

  • Interpretation of Slope: The slope represents the average change in the response variable for each one-unit increase in the explanatory variable.

  • Correlation Coefficient ($r$): Measures the strength and direction of the linear relationship between two variables. Values range from -1 to 1.

  • Coefficient of Determination ($R^2$): Represents the proportion of the variance in the dependent variable that is predictable from the independent variable.

Example: If the regression equation for study time (hours) and exam score is $\hat{y} = b_0 + 5.98x$, then for each additional hour studied, the predicted exam score increases by approximately 5.98 points on average.
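The quantities defined above can be computed directly from paired data. The sketch below fits the least-squares line to the study-time table given later in these notes and reports the intercept, slope, and correlation. The function name `fit_line` is an illustrative choice, not part of the source. Note that the table records study time in minutes, so the fitted slope is per minute and is not directly comparable to the per-hour figure of 5.98 quoted in the example.

```python
# Study-time data from the "Study Time and Exam Score" table (minutes, score).
minutes = [20, 30, 40, 50, 60, 70, 80, 90, 100, 110]
scores = [65, 70, 72, 75, 80, 85, 90, 95, 98, 100]


def fit_line(x, y):
    """Return (intercept b0, slope b1, correlation r) by least squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products about the means.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept: the line passes through (x_bar, y_bar)
    r = s_xy / (s_xx * s_yy) ** 0.5
    return b0, b1, r


b0, b1, r = fit_line(minutes, scores)
print(f"y-hat = {b0:.2f} + {b1:.4f} * x")
print(f"r = {r:.4f}, R^2 = {r ** 2:.4f}")
```

In simple linear regression the coefficient of determination equals the square of the correlation coefficient, which is why the sketch reports `r ** 2` rather than computing it separately.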

Statistical Significance in Regression

To determine if the observed relationship is statistically significant, hypothesis tests are conducted on the slope parameter.

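  • Null Hypothesis ($H_0$): $\beta_1 = 0$ (no linear relationship)

  • Alternative Hypothesis ($H_1$): $\beta_1 \neq 0$ (linear relationship exists)

  • Test Statistic: $t = \frac{b_1}{s_{b_1}}$, where $s_{b_1}$ is the standard error of the slope; the statistic has $n - 2$ degrees of freedom.

  • Decision Rule: Compare the p-value to the significance level $\alpha$ (commonly 0.05). If $p < \alpha$, reject $H_0$.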

Example: In a study of fertilizer and tomato yield, a p-value below the significance level provides evidence of a linear relationship between fertilizer amount and yield.
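As a rough illustration of the slope test, the sketch below computes the t statistic for the fertilizer table given later in these notes. It uses only the standard library, so instead of a p-value it compares the statistic to the two-sided critical value 2.776 (from a t table at the 0.05 level with 4 degrees of freedom). The function name `slope_t_statistic` is hypothetical.

```python
import math

# Fertilizer data from the "Fertilizer and Tomato Yield" table (lbs, lbs).
fertilizer = [0, 2, 4, 6, 8, 10]
yield_lbs = [8, 9, 10, 12, 13, 15]


def slope_t_statistic(x, y):
    """Return (slope b1, standard error of b1, t statistic) for H0: beta1 = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Residual sum of squares and standard error of estimate (n - 2 df).
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    se_b1 = s_e / math.sqrt(s_xx)
    return b1, se_b1, b1 / se_b1


b1, se_b1, t_stat = slope_t_statistic(fertilizer, yield_lbs)

# Two-sided critical value for alpha = 0.05 with df = n - 2 = 4 (t table).
T_CRITICAL = 2.776
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"b1 = {b1:.3f}, SE(b1) = {se_b1:.4f}, t = {t_stat:.2f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Comparing against a critical value gives the same decision as comparing the p-value to the significance level; a statistics library would be needed to report the p-value itself.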

Prediction and Confidence Intervals

  • Prediction: Use the regression equation to estimate the response variable for a given value of the explanatory variable.

  • Standard Error of Estimate ($s_e$): Measures the typical distance that the observed values fall from the regression line.

  • Confidence Interval for Slope: Provides a range of plausible values for the true slope parameter.

Example: If the 95% confidence interval for the slope is (6.60, 10.57), we are 95% confident that the true increase in sales per customer is between $6.60 and $10.57.
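The three ideas in this section can be tied together in a few lines. The sketch below, again using the fertilizer table from later in these notes, makes a point prediction, computes the standard error of estimate, and builds a 95% confidence interval for the slope as the estimate plus or minus a t critical value times the slope's standard error. The critical value 2.776 (4 degrees of freedom) is taken from a t table, and the choice of 5 lbs as the prediction point is only for illustration.

```python
import math

# Fertilizer data from the "Fertilizer and Tomato Yield" table (lbs, lbs).
fertilizer = [0, 2, 4, 6, 8, 10]
yield_lbs = [8, 9, 10, 12, 13, 15]

n = len(fertilizer)
x_bar = sum(fertilizer) / n
y_bar = sum(yield_lbs) / n
s_xx = sum((x - x_bar) ** 2 for x in fertilizer)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(fertilizer, yield_lbs))

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar


def predict(x):
    """Point prediction of yield from the fitted regression line."""
    return intercept + slope * x


# Standard error of estimate: typical size of a residual (n - 2 df).
residuals = [y - predict(x) for x, y in zip(fertilizer, yield_lbs)]
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# 95% confidence interval for the slope: b1 +/- t* x SE(b1).
T_CRITICAL = 2.776  # two-sided, alpha = 0.05, df = 4 (t table)
margin = T_CRITICAL * s_e / math.sqrt(s_xx)
ci_lower, ci_upper = slope - margin, slope + margin

print(f"Predicted yield at 5 lbs of fertilizer: {predict(5):.2f} lbs")
print(f"s_e = {s_e:.4f}")
print(f"95% CI for slope: ({ci_lower:.3f}, {ci_upper:.3f})")
```

Because the interval excludes zero, it leads to the same conclusion as rejecting the hypothesis of a zero slope at the 0.05 level.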

Inference on Categorical Data

Chi-Square Tests for Independence and Goodness-of-Fit

Chi-square tests are used to analyze categorical data, testing hypotheses about distributions or relationships between categorical variables.

  • Goodness-of-Fit Test: Determines whether the observed category frequencies in a sample are consistent with a hypothesized population distribution.

  • Test Statistic: $\chi^2 = \sum \frac{(O - E)^2}{E}$, where $O$ is the observed frequency and $E$ is the expected frequency.

  • Degrees of Freedom: For goodness-of-fit, $df = k - 1$, where $k$ is the number of categories.

  • Critical Value: Compare the calculated $\chi^2$ to the critical value from the chi-square table at the chosen significance level.

Example: Testing if a die is fair by comparing observed and expected frequencies for each face.
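A minimal sketch of the die example follows. The observed counts are hypothetical (they do not come from these notes) and stand for 60 rolls; under a fair die each face is expected 10 times. The critical value 11.070 is the chi-square table entry for 5 degrees of freedom at the 0.05 level.

```python
# Hypothetical counts for faces 1..6 from 60 rolls (illustrative only).
observed = [8, 12, 9, 11, 10, 10]


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


n_rolls = sum(observed)
# A fair die assigns probability 1/6 to each face.
expected = [n_rolls / 6] * len(observed)

chi2 = chi_square_statistic(observed, expected)
df = len(observed) - 1  # k - 1 categories

CHI2_CRITICAL = 11.070  # alpha = 0.05, df = 5 (chi-square table)
reject_h0 = chi2 > CHI2_CRITICAL

print(f"chi-square = {chi2:.3f}, df = {df}")
print("Reject H0: die appears unfair" if reject_h0
      else "Fail to reject H0: no evidence the die is unfair")
```

With these made-up counts the statistic is far below the critical value, so the data give no reason to doubt that the die is fair.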

Chi-Square Test for Independence

  • Purpose: To determine if two categorical variables are independent.

  • Hypotheses:

    • $H_0$: The variables are independent.

    • $H_1$: The variables are dependent.

  • Degrees of Freedom: $df = (r - 1)(c - 1)$, where $r$ is the number of rows and $c$ is the number of columns.

Example: Testing if income and happiness are independent using a contingency table and the chi-square statistic.
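For the test of independence, the expected count in each cell is (row total × column total) / grand total. The sketch below applies this to a hypothetical 2 × 3 contingency table of income level by happiness level; the counts are invented for illustration and are not data from these notes. The critical value 5.991 is the chi-square table entry for 2 degrees of freedom at the 0.05 level.

```python
# Hypothetical contingency table (illustrative only).
# Rows: income (lower, higher); columns: happiness (not, pretty, very).
table = [
    [20, 30, 10],
    [10, 30, 20],
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [
    [r_tot * c_tot / grand_total for c_tot in col_totals]
    for r_tot in row_totals
]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(table, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(table) - 1) * (len(table[0]) - 1)  # (rows - 1)(columns - 1)

CHI2_CRITICAL = 5.991  # alpha = 0.05, df = 2 (chi-square table)
reject_h0 = chi2 > CHI2_CRITICAL

print(f"chi-square = {chi2:.3f}, df = {df}")
print("Reject H0: variables appear dependent" if reject_h0
      else "Fail to reject H0: no evidence of dependence")
```

The same expected-count and statistic calculation is used for the test of homogeneity; only the sampling design and the wording of the hypotheses differ.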

Chi-Square Test for Homogeneity

  • Purpose: To compare the distribution of a categorical variable across several populations.

  • Hypotheses:

    • $H_0$: The proportions are the same across groups.

    • $H_1$: At least one proportion is different.

Example: Testing if the proportion of a certain characteristic is the same in different groups.

Tables

Example: Study Time and Exam Score Data Table

| Study Time (minutes) | Score |
|----------------------|-------|
| 20                   | 65    |
| 30                   | 70    |
| 40                   | 72    |
| 50                   | 75    |
| 60                   | 80    |
| 70                   | 85    |
| 80                   | 90    |
| 90                   | 95    |
| 100                  | 98    |
| 110                  | 100   |

Purpose: To analyze the relationship between study time and exam score using regression and correlation.

Example: Fertilizer and Tomato Yield Data Table

| Amount of Fertilizer (lbs) | Yield (lbs) |
|----------------------------|-------------|
| 0                          | 8           |
| 2                          | 9           |
| 4                          | 10          |
| 6                          | 12          |
| 8                          | 13          |
| 10                         | 15          |

Purpose: To determine if there is a significant linear relationship between fertilizer amount and tomato yield.

Key Formulas

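  • Regression Line: $\hat{y} = b_0 + b_1 x$

  • Correlation Coefficient: $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$

  • Coefficient of Determination: $R^2 = r^2$ (for simple linear regression)

  • Chi-Square Statistic: $\chi^2 = \sum \frac{(O - E)^2}{E}$

  • Standard Error of Estimate: $s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$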

Additional info:

  • Some context and explanations were expanded for clarity and completeness.

  • Tables were reconstructed based on visible data and standard statistical practice.