Regression Analysis, Correlation, and Statistical Inference: Study Notes for Statistics

Regression Analysis and Correlation

Scatter Plots

Scatter plots are graphical representations used to visualize the relationship between two quantitative variables. Each point on the plot represents an observation with coordinates corresponding to the values of the two variables.

  • Purpose: To identify patterns, trends, and possible correlations between variables.

  • Interpretation: Patterns may suggest linear, nonlinear, or no relationship.

  • Example: Plotting height versus weight for a group of individuals to observe if taller people tend to weigh more.

Correlation

Correlation measures the strength and direction of a linear relationship between two quantitative variables.

  • Correlation Coefficient (r): Ranges from -1 to 1.

  • Positive Correlation: As one variable increases, the other tends to increase.

  • Negative Correlation: As one variable increases, the other tends to decrease.

  • No Correlation: No discernible linear relationship.

  • Example: A correlation coefficient of 0.85 indicates a strong positive relationship.

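Formula: $r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$, where $\bar{x}$ and $\bar{y}$ are the sample means of the two variables and $n$ is the number of observations.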
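
As a minimal sketch of how the correlation coefficient can be computed, the Python function below applies the usual sample Pearson formula using only the standard library. The function name `pearson_r` and the height and weight values are hypothetical, invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either variable is constant.
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights (cm) and weights (kg), not taken from any real data set.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 88]
r = pearson_r(heights, weights)
print(f"r = {r:.3f}")  # close to +1: strong positive linear relationship
```

A value near +1 here reflects that the invented weights rise almost linearly with height; real data would usually be noisier.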

Regression Analysis

Regression analysis is a statistical method for modeling the relationship between a dependent variable and one or more independent variables.

  • Simple Linear Regression: Models the relationship between two variables.

  • Multiple Regression: Models the relationship between one dependent variable and two or more independent variables.

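  • Regression Equation: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$, where $\hat{y}$ is the predicted value of the dependent variable, $b_0$ is the intercept, and $b_1, \dots, b_k$ are the coefficients of the independent variables $x_1, \dots, x_k$.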

  • Interpretation of Coefficients: Each coefficient represents the expected change in the dependent variable for a one-unit change in the corresponding independent variable, holding other variables constant.

  • Example: In a model predicting movie revenue, the coefficient for "Genre" indicates the average difference in revenue between genres.
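
For the simple linear regression case, the least-squares intercept and slope can be computed directly. The sketch below assumes one independent variable; the function name `fit_line` and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                      # expected change in y per one-unit change in x
    intercept = mean_y - slope * mean_x    # the fitted line passes through (mean_x, mean_y)
    return intercept, slope

# Hypothetical data, invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
b0, b1 = fit_line(xs, ys)
print(f"y-hat = {b0:.2f} + {b1:.2f} * x")
```

With these invented points the slope is about 1.97, so each one-unit increase in x is associated with a predicted increase of roughly 1.97 in y.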

R-squared

R-squared ($R^2$) is the proportion of the variance in the dependent variable that is explained by the independent variables in the regression model.

  • Range: 0 to 1

  • Interpretation: Higher values indicate a better fit of the model to the data.

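  • Formula: $R^2 = 1 - \dfrac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \dfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $\hat{y}_i$ are the model's predicted values and $\bar{y}$ is the mean of the observed values.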

  • Example: An $R^2$ of 0.66 means 66% of the variation in the dependent variable is explained by the model.
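
A short sketch of the R-squared calculation, computed as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the observed and predicted values are hypothetical; the predictions are assumed to come from some already-fitted model.

```python
def r_squared(observed, predicted):
    """Proportion of variance in `observed` explained by `predicted`."""
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in observed)                # total variation
    return 1 - ss_res / ss_tot

# Hypothetical values, invented for illustration.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
r2 = r_squared(observed, predicted)
print(f"R^2 = {r2:.2f}")
```

The 0 to 1 range applies to a least-squares fit that includes an intercept; arbitrary predictions that fit worse than the mean would give a negative value from this formula.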

Statistical Inference in Regression

Confidence Intervals and p-values

Statistical inference in regression involves using confidence intervals and p-values to assess the significance of regression coefficients.

  • Confidence Interval: Provides a range of plausible values for a regression coefficient.

  • p-value: Tests the null hypothesis that a coefficient is zero (no effect).

  • Significance: If the p-value is less than the chosen significance level (e.g., 0.05), the coefficient is considered statistically significant.

  • Example: If the 95% confidence interval for a coefficient does not include zero and the p-value is less than 0.05, the effect is statistically significant.

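Formula for Confidence Interval: $b_j \pm t^{*} \cdot SE(b_j)$, where $b_j$ is the estimated coefficient, $SE(b_j)$ is its standard error, and $t^{*}$ is the critical value of the t distribution with $n - k - 1$ degrees of freedom ($n$ observations, $k$ independent variables) for the chosen confidence level.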

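Formula for p-value: $p = 2 \, P(T \ge |t|)$ with test statistic $t = b_j / SE(b_j)$, where $T$ follows a t distribution with $n - k - 1$ degrees of freedom ($n$ observations, $k$ independent variables).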
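
The sketch below illustrates both calculations for a single coefficient. It rests on a loud assumption: the sample is large enough that the t distribution can be approximated by the standard normal, so the critical value is about 1.96 for 95% confidence. With a small sample, t critical values should be used instead. The function name `coefficient_inference`, the estimate, and the standard error are all hypothetical.

```python
from statistics import NormalDist

def coefficient_inference(estimate, std_error, confidence=0.95):
    """Large-sample CI and two-sided p-value for H0: coefficient = 0.

    Assumption: normal approximation to the t distribution (large sample).
    """
    z = NormalDist()  # standard normal, mean 0 and standard deviation 1
    critical = z.inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = critical * std_error
    statistic = estimate / std_error
    p_value = 2 * (1 - z.cdf(abs(statistic)))
    return (estimate - margin, estimate + margin), p_value

# Hypothetical coefficient estimate and standard error.
(ci_low, ci_high), p = coefficient_inference(estimate=2.0, std_error=0.5)
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f}), p = {p:.5f}")
```

Here the interval excludes zero and the p-value is well below 0.05, so the two criteria agree that the hypothetical coefficient is statistically significant.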

Interpreting Regression Output

Multiple Regression Table Example

The following table summarizes regression coefficients, p-values, and confidence intervals for a model predicting movie revenue:

| Variable | Coefficient | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| Intercept | 7.36 | 0.06 | -0.54 | 15.27 |
| Genre | 0.61 | 0.58 | -1.74 | 2.96 |
| MPAA rating | 8.57 | 0.07 | -0.88 | 18.02 |
| Opening Screens | 0.03 | 0.01 | 0.01 | 0.04 |
| Review | 1.77 | 0.06 | -0.08 | 3.62 |

Additional info: This table is reconstructed from the provided regression output. It demonstrates how to interpret coefficients, p-values, and confidence intervals in a multiple regression context.

Interpreting Coefficients and Statistical Significance

  • Genre: The coefficient (0.61) is not statistically significant (p = 0.58 > 0.05).

  • MPAA rating: The coefficient (8.57) is not statistically significant (p = 0.07 > 0.05).

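  • Opening Screens: The coefficient (0.03) is statistically significant (p = 0.01 < 0.05), and the 95% CI does not include zero.

  • Review: The coefficient (1.77) is not statistically significant (p = 0.06 > 0.05), and the 95% CI (-0.08 to 3.62) includes zero.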

  • Interpretation: Only "Opening Screens" has a statistically significant effect on movie revenue at the 0.05 level.
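
The same decision rule can be applied mechanically to every row of the regression table. In the sketch below the numbers are copied from the table; the helper name `is_significant` is hypothetical.

```python
# Rows from the regression table: (coefficient, p-value, CI lower, CI upper).
table = {
    "Intercept": (7.36, 0.06, -0.54, 15.27),
    "Genre": (0.61, 0.58, -1.74, 2.96),
    "MPAA rating": (8.57, 0.07, -0.88, 18.02),
    "Opening Screens": (0.03, 0.01, 0.01, 0.04),
    "Review": (1.77, 0.06, -0.08, 3.62),
}
ALPHA = 0.05  # significance level used in the notes

def is_significant(p_value, ci_low, ci_high, alpha=ALPHA):
    """True when p < alpha and the confidence interval excludes zero."""
    excludes_zero = ci_low > 0 or ci_high < 0
    return p_value < alpha and excludes_zero

significant = [
    name for name, (_, p, lo, hi) in table.items() if is_significant(p, lo, hi)
]
print(significant)  # ['Opening Screens']
```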

Review Problems and Applications

Correlation Interpretation

  • Strong Positive Correlation: A value of $r$ close to +1 indicates that as X increases, Y tends to increase.

  • Scatterplot Patterns: Different values of $r$ produce different scatterplot shapes (e.g., upward sloping for positive $r$, downward sloping for negative $r$, and no linear pattern for $r$ near 0).

Regression Model Interpretation

  • Predicted Value Calculation: Use the regression equation to estimate the dependent variable for given values of independent variables.

  • Example: For SAT scores and GPA, a 10-point decrease in SAT score corresponds to a predicted decrease in GPA of $10 b_1$, where $b_1$ is the (positive) estimated slope of the regression line.
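
The prediction step can be sketched as follows. The intercept and slope below are hypothetical values chosen only for illustration; they are not the coefficients from the SAT and GPA example, and the function names are likewise invented.

```python
# Hypothetical fitted line: predicted GPA = INTERCEPT + SLOPE * SAT score.
INTERCEPT = 0.5   # assumed value, for illustration only
SLOPE = 0.002     # assumed predicted GPA change per SAT point

def predict_gpa(sat_score):
    """Predicted GPA for a given SAT score under the assumed line."""
    return INTERCEPT + SLOPE * sat_score

def predicted_change(delta_sat):
    """Predicted change in GPA for a change of `delta_sat` SAT points."""
    return SLOPE * delta_sat

gpa_at_1200 = predict_gpa(1200)
change_for_minus_10 = predicted_change(-10)
print(f"Predicted GPA at SAT 1200: {gpa_at_1200:.2f}")
print(f"Predicted change for a 10-point drop: {change_for_minus_10:.3f}")
```

The predicted change depends only on the slope, not on the intercept or the starting SAT score, which is why a slope can be read as a per-unit effect.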

Multiple Regression and Indicator Variables

  • Indicator Variables: Used to represent categorical variables in regression models (e.g., genre, rating).

  • Interpretation: The coefficient for an indicator variable shows the expected difference in the dependent variable between categories.
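
A small sketch of how an indicator variable works: a two-level categorical variable is coded as 0 or 1, and in a regression with that indicator as the only predictor, the slope equals the difference between the two group means. The genre labels, revenue figures, and helper names are hypothetical. In a multiple regression, the coefficient would be the difference holding the other variables constant.

```python
def indicator(values, category):
    """Code each value as 1 if it equals `category`, else 0."""
    return [1 if v == category else 0 for v in values]

def mean(xs):
    return sum(xs) / len(xs)

def slope(xs, ys):
    """Least-squares slope of y on a single predictor x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data, invented for illustration.
genres = ["comedy", "drama", "comedy", "drama", "drama"]
revenue = [10.0, 14.0, 12.0, 16.0, 18.0]

is_drama = indicator(genres, "drama")
drama_mean = mean([r for r, d in zip(revenue, is_drama) if d == 1])
comedy_mean = mean([r for r, d in zip(revenue, is_drama) if d == 0])
difference = drama_mean - comedy_mean
b_drama = slope(is_drama, revenue)
print(difference, b_drama)  # the two values agree up to rounding
```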

Vocabulary Terms

  • Scatterplot

  • Correlation

  • Regression Line

  • Intercept

  • Slope

  • Extrapolation

  • Regression Model

  • R-squared

  • Indicator Variables

Summary Table: Correlation Coefficient and Scatterplot Patterns

| Correlation (r) | Pattern | Interpretation |
|---|---|---|
| +0.9 | Strong upward linear | Strong positive relationship |
| 0 | No pattern | No linear relationship |
| -0.6 | Downward linear | Moderate negative relationship |
| +0.5 | Upward linear, moderate | Moderate positive relationship |

Additional info: This table summarizes how the value of the correlation coefficient relates to the pattern observed in a scatterplot.

Key Formulas

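  • Simple Linear Regression: $\hat{y} = b_0 + b_1 x$

  • Multiple Regression: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$

  • R-squared: $R^2 = 1 - \dfrac{SS_{\text{res}}}{SS_{\text{tot}}}$

  • Confidence Interval for Coefficient: $b_j \pm t^{*} \cdot SE(b_j)$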

Applications

  • Regression Analysis: Used in business, economics, health sciences, and social sciences to predict outcomes and assess relationships between variables.

  • Correlation: Used to assess the strength of association between variables before modeling.

  • Statistical Inference: Used to determine whether observed effects are likely to be real or due to random chance.
