Understanding Regression Output
(or at least finding what you want!)
Suppose we are interested in investigating the relationship between two variables, X and Y, where X is the independent variable and Y is the dependent variable. We believe there is a linear relationship between the two variables. Therefore, we are considering the following model:
y = β0 + β1x + ε or E(y) = β0 + β1x
A sample provides the following:
| X | Y |
| 2 | 18 |
| 5 | 14 |
| 8 | 9 |
| 9 | 5 |
| 7 | 11 |
| 4 | 14 |
When we use Excel's Regression procedure under Tools, Data Analysis, we obtain output that will be explained in three parts below. The first part gives what Excel refers to as Regression Statistics. For this example, the first two columns were provided by Excel and the third column explains what is reported:
| Regression Statistics | | Explanation |
| Multiple R | 0.971873 | |
| R Square | 0.944538 | Coefficient of Determination: proportion of variation in y that can be explained by using x to estimate y. Calculated as SSR/SSTotal (where SSR is Sum of Squares Regression and SSTotal is Sum of Squares Total) |
| Adjusted R Square | 0.930672 | |
| Standard Error | 1.194084 | Estimate of the standard deviation of the points around the line (i.e., a measure of how much the ys will vary for a given value of x). Calculated as the square root of MSE (where MSE is Mean Square Error) |
| Observations | 6 | Number of observations in the sample |
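As a rough check on the Regression Statistics, the quantities can be recomputed from the six sample points. This is only a sketch in Python, not how Excel does it internally. The variable names are chosen for illustration, and the formulas used for Multiple R (the square root of R Square) and Adjusted R Square are the usual textbook definitions rather than anything stated in the explanation column.

```python
import math

# Sample data from the X and Y table
x = [2, 5, 8, 9, 7, 4]
y = [18, 14, 9, 5, 11, 14]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross products about the means
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_total = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

ssr = s_xy ** 2 / ss_x          # Sum of Squares Regression
sse = ss_total - ssr            # Sum of Squares Error (residual)

r_square = ssr / ss_total                                    # SSR/SSTotal
multiple_r = math.sqrt(r_square)                             # assumed definition
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)   # assumed definition
standard_error = math.sqrt(sse / (n - 2))                    # square root of MSE

print("Multiple R        ", round(multiple_r, 6))
print("R Square          ", round(r_square, 6))
print("Adjusted R Square ", round(adjusted_r_square, 6))
print("Standard Error    ", round(standard_error, 6))
print("Observations      ", n)
```

The printed values should agree with the Excel column to the precision shown there.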
The second part of the output provides the Analysis of Variance (ANOVA) Table. To explain this output, the ANOVA table will be provided followed by another table that describes the corresponding cells:
| ANOVA | df | SS | MS | F | Significance F |
| Regression | 1 | 97.12998 | 97.12998 | 68.12136 | 0.001175532 |
| Residual | 4 | 5.703349 | 1.425837 | | |
| Total | 5 | 102.8333 | | | |
| ANOVA | df (degrees of freedom) | SS (Sum of Squares) | MS (Mean Square) | F (value to look up from the F distribution) | Significance F (actually a p value) |
| Regression | number of βs in the model (not counting β0) | SSR = (b1)² × SSx | MSR = SSR/dfR | MSR/MSE | Corresponds to the area under the F curve to the right of the value calculated in the previous column. (Use dfR, dfe) |
| Residual (ERROR is a more common name for this row) | obtained by subtraction | obtained by subtraction | MSE = SSE/dfe. This provides s², an estimate of the variance of the ys for a given value of x | | |
| Total | n-1 | SSy | | | |
Also note that df(regression) + df(error) = df(total) and SS(regression) + SS(error) = SS(total). Significance F (the p value) can be found using Excel and the function =fdist(F, df(regression), df(error)). So for this example, we could type =fdist(68.12136,1,4) and Excel would return 0.001175532.
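The ANOVA entries can also be rebuilt by hand from the sample, which makes the "obtained by subtraction" cells concrete. The following Python sketch uses illustrative variable names and covers only the df, SS, MS and F columns. It does not compute Significance F, which needs the F distribution (Excel's =fdist in the text above).

```python
# Sample data from the X and Y table
x = [2, 5, 8, 9, 7, 4]
y = [18, 14, 9, 5, 11, 14]
n = len(x)
k = 1  # number of betas in the model, not counting beta0

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / ss_x

# Total row
df_total = n - 1
ss_total = sum((yi - y_bar) ** 2 for yi in y)   # SSy

# Regression row
df_reg = k
ssr = b1 ** 2 * ss_x
msr = ssr / df_reg

# Residual (error) row: df and SS both obtained by subtraction
df_err = df_total - df_reg
sse = ss_total - ssr
mse = sse / df_err

f_stat = msr / mse

print("Regression", df_reg, round(ssr, 5), round(msr, 5), round(f_stat, 5))
print("Residual  ", df_err, round(sse, 6), round(mse, 6))
print("Total     ", df_total, round(ss_total, 4))
```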
The F statistic and the associated p value are used to test a hypothesis. The associated hypotheses are:
H0: all of the βs (with the exception of β0) are 0
Ha: at least one of the βs is something other than 0
For simple linear regression these hypotheses can be written more simply as:
H0: β1 = 0
Ha: β1 ≠ 0
For any hypothesis test, the result will be one of two things: Reject the null hypothesis or Fail to reject the null hypothesis. For this set of hypotheses, rejecting the null hypothesis would say that we have enough evidence to conclude that β1 is not 0. Notice in our model (y = β0 + β1x + ε), if β1 is not 0, different values of x will produce different values of y. Therefore, rejecting the null hypothesis equates to concluding that there is a linear relationship between x and y.
We determine if we should reject the null hypothesis or not by looking at the p value. The lower the p value, the more willing we are to reject the null hypothesis. The p value gives us a measure of how unusual the sample results would be if the null hypothesis were true.
If we are given a value for α, we can compare α to the p value. If α is larger than the p value, we reject the null hypothesis. Recall that α is selected by the decision maker to reflect the amount of risk he/she is willing to take.
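The comparison of α with the p value can be written as a one-line rule. The sketch below is in Python. The function name decide and the list of α values are made up for illustration, and the p value is the Significance F that Excel reported for this example.

```python
def decide(p_value, alpha):
    """Reject H0 when alpha is larger than the p value."""
    return "Reject H0" if alpha > p_value else "Fail to reject H0"

p_value = 0.001175532  # Significance F from the ANOVA table

# Illustrative alpha values (assumed, not from the original example)
for alpha in (0.10, 0.05, 0.01, 0.001):
    print(alpha, decide(p_value, alpha))
```

With this p value, a decision maker using any of the first three α values would reject the null hypothesis, while one using α = 0.001 would not.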
The third part of the output gives information about the equation for the line that "best fits" the sample. (Note: this provides the straight line with the lowest sum of squared residuals; this does not say that a straight line would be the best model!) The part of the table that we will use is reproduced below, followed by another table that describes the quantities provided:
| | Coefficients | Standard Error | t Stat | P-value |
| Intercept | 21.57416 | 1.276911 | 16.89558 | 7.19E-05 |
| X | -1.66986 | 0.202319 | -8.25357 | 0.001176 |
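| | Coefficients | Standard Error | t Stat | P-value |
| Intercept | b0 = ybar − b1 × xbar | | | |
| X | b1 = Sxy/SSx | sb1 = s/√SSx (where s is the Standard Error from the Regression Statistics) | (b1 − 0)/sb1 | 2 × P(t > abs(t calc)), the two-tailed area (use dfe) |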
The equation for the line is read from this part of the output. Since the regression equation is
yhat = b0 + b1x
we can substitute the values of b0 and b1 from the printout and write:
yhat = 21.57416 - 1.66986x
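The two coefficients can be reproduced from the sample with the formulas b1 = Sxy/SSx and b0 = ybar − b1 × xbar. The Python sketch below does that and then uses the fitted line for one prediction. The helper name predict and the input x = 6 are hypothetical, chosen only because 6 lies inside the range of the sampled x values.

```python
# Sample data from the X and Y table
x = [2, 5, 8, 9, 7, 4]
y = [18, 14, 9, 5, 11, 14]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

ss_x = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / ss_x            # slope
b0 = y_bar - b1 * x_bar     # intercept

def predict(x_new):
    """Point estimate yhat = b0 + b1 * x for a given x."""
    return b0 + b1 * x_new

print("b0 =", round(b0, 5))
print("b1 =", round(b1, 5))
print("yhat at x = 6:", round(predict(6), 3))
```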
The t statistic and the associated p value are used to test a hypothesis. The associated hypotheses are:
H0: β1 = 0 given the other βs are in the model
Ha: β1 ≠ 0 given the other βs are in the model
For simple linear regression these hypotheses can be written more simply as:
H0: β1 = 0
Ha: β1 ≠ 0
Note that for simple linear regression, this test and the test from the ANOVA table are the same; this will not be true when we move to multiple regression. All of the general statements about rejecting the null hypothesis that we made above still apply. Therefore, if the amount of risk that we are willing to take (α) is more than the p value, we will reject the null hypothesis and conclude that there is a linear relationship between x and y.
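The claim that the t test and the ANOVA F test coincide in simple linear regression can be checked numerically: the t statistic for X squared should equal F, and the two-tailed t p value should equal Significance F. The Python sketch below is one way to do this with the standard library only. The p value is approximated by integrating the t density with Simpson's rule, which is a stand-in for Excel's built-in distribution functions, and the helper names t_pdf and two_tailed_p are illustrative.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t_calc, df, steps=20000):
    """Two-tailed area beyond +/- |t_calc|, via Simpson's rule (steps must be even)."""
    upper = abs(t_calc)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total   # density is symmetric about zero
    return 1 - central_area

# Sample data from the X and Y table
x = [2, 5, 8, 9, 7, 4]
y = [18, 14, 9, 5, 11, 14]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_total = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / ss_x
ssr = b1 ** 2 * ss_x
mse = (ss_total - ssr) / (n - 2)

s_b1 = math.sqrt(mse) / math.sqrt(ss_x)   # standard error of the slope
t_stat = (b1 - 0) / s_b1
f_stat = ssr / mse                        # MSR/MSE with dfR = 1
p_value = two_tailed_p(t_stat, n - 2)

print("t Stat    ", round(t_stat, 5))
print("t squared ", round(t_stat ** 2, 5), " F ", round(f_stat, 5))
print("P-value   ", round(p_value, 6))
```

If the sketch is right, t squared matches F and the p value matches both the P-value for X and Significance F, to the precision Excel displays.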