Regression analysis allows us to predict values of a dependent variable y based on an independent variable x, but these predictions are estimates rather than exact values. To quantify the uncertainty around these predictions, we use prediction intervals, which are similar to confidence intervals but specifically apply to predicted y values from a regression model.
Consider a scenario where a public transportation association studies the relationship between temperature and the number of bus riders. The data show a strong linear correlation, with \(r^2 = 0.874\) and a standard error of the estimate \(s_e = 2.97\). The regression line equation is given, allowing us to predict the number of riders at a specific temperature.
To predict the number of riders when the temperature is 35 degrees, substitute \(x_0 = 35\) into the regression equation:
\[\hat{y}_0 = b_0 + b_1 x_0\]For example, if the regression equation is \(\hat{y} = 79.143 - 1.0459x\), then:
\[\hat{y}_0 = 79.143 - 1.0459 \times 35 = 42.54\]This means the best estimate for the number of riders at 35 degrees is approximately 42.54.
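As a quick arithmetic check, the point estimate can be reproduced in a few lines of Python, using the coefficients from the example equation above:

```python
# Point prediction y-hat_0 = b0 + b1 * x0, with the example's coefficients
b0, b1 = 79.143, -1.0459   # intercept and slope of the fitted line
x0 = 35                    # temperature in degrees

y_hat_0 = b0 + b1 * x0
print(y_hat_0)             # ≈ 42.54 riders
```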
Before constructing a 95% prediction interval for this estimate, two conditions must be verified: there must be a strong linear correlation between x and y, and the prediction point \(x_0\) must lie within the range of observed data to avoid unreliable extrapolation. In this case, both conditions are satisfied.
The prediction interval is calculated as:
\[\hat{y}_0 \pm t_{\alpha/2, n-2} \times s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2}}\]Here, \(t_{\alpha/2, n-2}\) is the critical value from the t-distribution with \(n-2\) degrees of freedom, corresponding to the desired confidence level (e.g., 95%), \(s_e\) is the standard error of the estimate, \(n\) is the number of data points, \(\bar{x}\) is the mean of the observed x values, and the denominator is the sum of squared deviations of x values from their mean.
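This formula can be wrapped in a small function. The following is a sketch rather than a library routine: the name `prediction_interval` and the use of `np.polyfit` to fit the line are illustrative choices, and `scipy.stats.t.ppf` supplies the critical value.

```python
import numpy as np
from scipy import stats

def prediction_interval(x, y, x0, confidence=0.95):
    """Prediction interval for a new observation of y at x0,
    from a simple linear regression of y on x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    b1, b0 = np.polyfit(x, y, 1)                   # slope, intercept
    residuals = y - (b0 + b1 * x)
    s_e = np.sqrt(np.sum(residuals**2) / (n - 2))  # standard error of the estimate
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 2)
    margin = t_crit * s_e * np.sqrt(
        1 + 1 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
    )
    y0_hat = b0 + b1 * x0
    return y0_hat - margin, y0_hat + margin
```

Given raw data, this computes every quantity in the formula directly; the worked example below instead uses summary statistics, since the individual observations are not listed here.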
To find the critical t-value for a 95% prediction interval with 13 data points, use the t-distribution with \(n - 2 = 11\) degrees of freedom:
\[t_{0.025, 11} \approx 2.201\]Using the data, the mean temperature is \(\bar{x} = 389/13 \approx 29.92\), with \(\sum x_i = 389\) and \(\sum x_i^2 = 12{,}255\). Multiplying the top and bottom of the last term under the square root by \(n\) (valid because \(\sum (x_i - \bar{x})^2 = \sum x_i^2 - (\sum x_i)^2 / n\)) gives a form that avoids computing each deviation individually; the numerator and denominator inside the square root are then:
\[\text{Numerator} = n (x_0 - \bar{x})^2 = 13 \times (35 - 29.92)^2 \approx 335.1\]\[\text{Denominator} = n \sum x_i^2 - \left(\sum x_i\right)^2 = 13 \times 12{,}255 - 389^2 = 7{,}994\]Here the numerator is evaluated with the unrounded mean \(\bar{x} = 389/13\); rounding to 29.92 first would give 335.5 instead. Substituting these values into the margin of error formula:
\[E = t_{\alpha/2, n-2} \times s_e \times \sqrt{1 + \frac{1}{n} + \frac{\text{Numerator}}{\text{Denominator}}} = 2.201 \times 2.97 \times \sqrt{1 + \frac{1}{13} + \frac{335.1}{7,994}} \approx 6.914\]Finally, the 95% prediction interval for the number of riders at 35 degrees is:
\[( \hat{y}_0 - E, \hat{y}_0 + E ) = (42.54 - 6.914, 42.54 + 6.914) \approx (35.62, 49.45)\]This interval means we are 95% confident that the actual number of bus riders when the temperature is 35 degrees will fall between approximately 35.62 and 49.45.
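The entire calculation can be reproduced from the summary statistics alone. This sketch uses scipy only for the critical t-value; the numbers are the ones given in the worked example.

```python
from scipy import stats

n = 13
sum_x, sum_x2 = 389, 12_255    # sum of x values and sum of their squares
s_e = 2.97                     # standard error of the estimate
b0, b1 = 79.143, -1.0459       # fitted intercept and slope
x0 = 35                        # temperature at which to predict

x_bar = sum_x / n                                   # ≈ 29.92
y0 = b0 + b1 * x0                                   # ≈ 42.54
t_crit = stats.t.ppf(0.975, df=n - 2)               # ≈ 2.201
num = n * (x0 - x_bar) ** 2                         # ≈ 335.1
den = n * sum_x2 - sum_x ** 2                       # 7,994
E = t_crit * s_e * (1 + 1 / n + num / den) ** 0.5   # ≈ 6.914
print(y0 - E, y0 + E)                               # ≈ (35.62, 49.45)
```

Because the code carries the unrounded mean throughout, it avoids the small rounding discrepancies that creep in when intermediate values are truncated by hand.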
Understanding prediction intervals enhances the interpretation of regression predictions by quantifying the expected variability around predicted values. Tools like Excel simplify the calculation process, especially for complex formulas involving sums and critical values, making it easier to apply these concepts in practical data analysis.