Understanding Regression Output
(or at least finding what you want!)

Suppose we are interested in investigating the relationship between two variables, X and Y, where X is the independent variable and Y is the dependent variable. We believe there is a linear relationship between the two variables. Therefore, we are considering the following model:

y = β0 + β1x + ε     or        E(y) = β0 + β1x

A sample provides the following:

| X | Y |
|---|---|
| 2 | 18 |
| 5 | 14 |
| 8 | 9 |
| 9 | 5 |
| 7 | 11 |
| 4 | 14 |

When we use Excel’s Regression procedure (under Tools, Data Analysis), we obtain output that is explained in three parts below. The first part gives what Excel refers to as Regression Statistics. For this example, the first two columns were provided by Excel and the third column explains what is reported:

| Regression Statistics | Value | Explanation |
|---|---|---|
| Multiple R | 0.971873 | |
| R Square | 0.944538 | Coefficient of Determination: proportion of variation in y that can be explained by using x to estimate y. Calculated as SSR/SSTotal (where SSR is Sum of Squares Regression and SSTotal is Sum of Squares Total) |
| Adjusted R Square | 0.930672 | |
| Standard Error | 1.194084 | Estimate of the standard deviation of the points around the line (i.e., a measure of how much the y’s will vary for a given value of x). Calculated as square root of MSE (where MSE is Mean Square Error) |
| Observations | 6 | Number of observations in the sample |
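As a cross-check, the Regression Statistics can be recomputed from the six sample points. The following is a minimal sketch in Python (the tutorial itself uses Excel, so the language and variable names are our own choices). It uses the slope formula b1 = Sxy/SSx and SSR = b1² · SSx from later in this tutorial. The Adjusted R Square formula for a single predictor is a standard one that the tutorial does not state, so treat that line as an assumption.

```python
import math

# Sample from the tutorial
x = [2, 5, 8, 9, 7, 4]
y = [18, 14, 9, 5, 11, 14]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_total = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / ss_x               # slope of the fitted line
ssr = b1 ** 2 * ss_x           # Sum of Squares Regression
sse = ss_total - ssr           # Sum of Squares Error

r_square = ssr / ss_total
multiple_r = math.sqrt(r_square)
# Adjusted R Square for one predictor (formula not given in the tutorial)
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)
standard_error = math.sqrt(sse / (n - 2))   # square root of MSE

print("Multiple R       ", round(multiple_r, 6))
print("R Square         ", round(r_square, 6))
print("Adjusted R Square", round(adj_r_square, 6))
print("Standard Error   ", round(standard_error, 6))
print("Observations     ", n)
```

The printed values should agree with the Excel output above up to rounding.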

The second part of the output provides the Analysis of Variance (ANOVA) Table. To explain this output, the ANOVA table will be provided followed by another table that describes the corresponding cells:

| ANOVA | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 97.12998 | 97.12998 | 68.12136 | 0.001175532 |
| Residual | 4 | 5.703349 | 1.425837 | | |
| Total | 5 | 102.8333 | | | |

 

| ANOVA | df (degrees of freedom) | SS (Sum of Squares) | MS (Mean Square) | F (value to look up from the F distribution) | Significance F (actually a p value) |
|---|---|---|---|---|---|
| Regression | number of β’s in the model (not counting β0) | SSR = b1² · SSx | MSR = SSR/dfR | MSR/MSE | Corresponds to the area under the F curve to the right of the value calculated in the previous column. (Use dfR, dfe) |
| Residual (ERROR is a more common name for this row) | obtained by subtraction | obtained by subtraction | MSE = SSE/dfe. This provides s², an estimate of the variance of the y’s for a given value of x | | |
| Total | n-1 | SSy | | | |

Also note that df(regression) + df(error) = df(total) and SS(regression) + SS(error) = SS(total).  Significance F (the p value) can be found using Excel and the function =fdist(F, df(regression), df(error)).  So for this example, we could type =fdist(68.12136,1,4) and Excel would return 0.001175532.
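The ANOVA cells can be rebuilt step by step from the sample, following the descriptions in the table above. This is a minimal Python sketch, not part of the Excel workflow; the variable names are ours. It stops at the F statistic, since the right-tail area itself comes from the F distribution (Excel's =fdist in the text above).

```python
# Sample from the tutorial
x = [2, 5, 8, 9, 7, 4]
y = [18, 14, 9, 5, 11, 14]
n = len(x)
k = 1  # number of β's in the model, not counting β0

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / ss_x

# Degrees of freedom: the error row is obtained by subtraction
df_regression = k
df_total = n - 1
df_error = df_total - df_regression

# Sums of squares: the error row is again obtained by subtraction
ss_regression = b1 ** 2 * ss_x
ss_total = ss_y
ss_error = ss_total - ss_regression

# Mean squares and the F statistic
msr = ss_regression / df_regression
mse = ss_error / df_error
f_stat = msr / mse

print("Regression", df_regression, round(ss_regression, 5), round(msr, 5), round(f_stat, 5))
print("Residual  ", df_error, round(ss_error, 6), round(mse, 6))
print("Total     ", df_total, round(ss_total, 4))
```

Because the error row is computed by subtraction, df(regression) + df(error) = df(total) and SS(regression) + SS(error) = SS(total) hold by construction.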

The F statistic and the associated p value are used to test a hypothesis. The associated hypotheses are:

H0: all of the β’s (with the exception of β0) are 0
Ha: at least one of the β’s is something other than 0

For simple linear regression these hypotheses can be written more simply as:

H0: β1 = 0
Ha: β1 ≠ 0

For any hypothesis test, the result will be one of two things: Reject the null hypothesis or Fail to reject the null hypothesis. For this set of hypotheses, rejecting the null hypothesis would say that we have enough evidence to conclude that β1 is not 0. Notice in our model (y = β0 + β1x + ε), if β1 is not 0, different values of x will produce different values of y. Therefore, rejecting the null hypothesis equates to concluding that there is a linear relationship between x and y.

We determine if we should reject the null hypothesis or not by looking at the p value. The lower the p value, the more willing we are to reject the null hypothesis. The p value gives us a measure of how unusual the sample results would be if the null hypothesis were true.

If we are given a value for α, we can compare α to the p value. If α is larger than the p value, we reject the null hypothesis. Recall that α is selected by the decision maker to reflect the amount of risk he/she is willing to take.
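The decision rule can be written as a single comparison. In this small Python sketch, the function name and the α values are illustrative assumptions; the p value is the Significance F reported by Excel for this example.

```python
def decide(p_value, alpha):
    """Apply the decision rule: reject H0 when α is larger than the p value."""
    if alpha > p_value:
        return "Reject the null hypothesis"
    return "Fail to reject the null hypothesis"

significance_f = 0.001175532  # p value from the ANOVA table

# Illustrative risk levels; the decision maker chooses α
for alpha in (0.05, 0.01, 0.001):
    print(alpha, "->", decide(significance_f, alpha))
```

With a p value this small, only a very strict α (such as the 0.001 used here) leads to failing to reject the null hypothesis.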

The third part of the output gives information about the equation for the line that "best fits" the sample. (Note: this provides the straight line with the lowest sum of squared residuals; it does not say that a straight line would be the best model!) The part of the table that we will use is reproduced below, followed by another table that describes the quantities provided:

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 21.57416 | 1.276911 | 16.89558 | 7.19E-05 |
| X | -1.66986 | 0.202319 | -8.25357 | 0.001176 |

 

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | b0 = ybar – b1xbar | | | |
| X | b1 = Sxy/SSx | sb1 = s/(square root of SSx) | (b1-0)/sb1 | Two-tailed area: 2 × P(t > absolute value of t calc), where t calc is the number in the previous column (Use dfe) |

The equation for the line is read from this part of the output. Since the regression equation is

yhat = b0 + b1x

we can substitute the values of b0 and b1 from the printout and write:

yhat = 21.57416 - 1.66986x
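The coefficients can be reproduced from the sample with the formulas b1 = Sxy/SSx and b0 = ybar – b1xbar. The Python sketch below does that and then uses the fitted line to estimate y; the helper names fit_line and predict, and the value x = 6, are hypothetical choices for illustration.

```python
# Sample from the tutorial
x = [2, 5, 8, 9, 7, 4]
y = [18, 14, 9, 5, 11, 14]

def fit_line(x, y):
    """Return (b0, b1) for the least-squares line yhat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / ss_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = fit_line(x, y)

def predict(x_new):
    """Estimate y for a given x using the fitted line."""
    return b0 + b1 * x_new

# Residuals: observed y minus the value on the line
residuals = [yi - predict(xi) for xi, yi in zip(x, y)]

print("b0 =", round(b0, 5), " b1 =", round(b1, 5))
print("yhat at x = 6:", round(predict(6), 4))
print("sum of residuals:", round(sum(residuals), 10))
```

The residuals of a least-squares line with an intercept sum to zero (up to floating-point error), which is a quick sanity check on the fit.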

The t statistic and the associated p value are used to test a hypothesis. The associated hypotheses are:

H0: β1 = 0 given the other β’s are in the model
Ha: β1 ≠ 0 given the other β’s are in the model

For simple linear regression these hypotheses can be written more simply as:

H0: β1 = 0
Ha: β1 ≠ 0

Note that for simple linear regression, this test and the test from the ANOVA table are the same; this will not be true when we move to multiple regression. All of the general statements about rejecting the null hypothesis that we made above still apply. Therefore, if the amount of risk that we are willing to take is more than the p value, we will reject the null hypothesis and conclude that there is a linear relationship between x and y.
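The equivalence of the two tests can be checked numerically. The Python sketch below computes sb1 and the t statistic, then approximates the two-tailed p value by integrating the t density with Simpson's rule. The numerical integration is our own stand-in for a statistical table or an Excel function, and the helper names are hypothetical.

```python
import math

# Sample from the tutorial
x = [2, 5, 8, 9, 7, 4]
y = [18, 14, 9, 5, 11, 14]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / ss_x
ssr = b1 ** 2 * ss_x
sse = ss_y - ssr
df_error = n - 2

s = math.sqrt(sse / df_error)      # standard error of the estimate
sb1 = s / math.sqrt(ss_x)          # standard error of the slope
t_stat = (b1 - 0) / sb1
f_stat = ssr / (sse / df_error)    # F from the ANOVA table (dfR = 1)

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=20000):
    """Area in both tails beyond |t|, using Simpson's rule on [0, |t|]."""
    upper = abs(t)
    h = upper / steps              # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (total * h / 3)   # P(-|t| < T < |t|)
    return 1 - central_area

p_value = two_tailed_p(t_stat, df_error)

print("sb1     =", round(sb1, 6))
print("t Stat  =", round(t_stat, 5))
print("P-value =", round(p_value, 6))
print("t^2 =", round(t_stat ** 2, 5), " F =", round(f_stat, 5))
```

The squared t statistic equals the F statistic, and the two-tailed p value should agree with Significance F from the ANOVA table up to rounding.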
