Linear regression is a statistical tool for fitting a straight line to a set of paired data and assessing how well that line describes the data. The straight line that best fits the data is called the least squares regression line. One use of this line is to estimate the value of a response variable for a given value of an explanatory variable. Closely related to this idea is that of a residual.

Residuals are obtained by subtraction. For a particular *x*, we subtract the predicted value of *y* from the observed value of *y*. The result is called a residual.

## Formula for Residuals

The formula for residuals is straightforward:

Residual = observed *y* – predicted *y*

It is important to note that the predicted value comes from our regression line. The observed value comes from our data set.
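
As a minimal sketch, the formula translates directly into a one-line function. The function name `residual` is a hypothetical choice made here for illustration, and the sample numbers (an observed *y* of 9 and a predicted *y* of 10) are borrowed from the worked example later in this article.

```python
def residual(observed_y, predicted_y):
    """Residual = observed y - predicted y."""
    return observed_y - predicted_y


# Observed y comes from the data set; predicted y comes from the regression line.
example = residual(observed_y=9, predicted_y=10)
print(example)  # -1
```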

## Examples

We will illustrate this formula with an example. Suppose that we are given the following set of paired data:

(1, 2), (2, 3), (3, 7), (3, 6), (4, 9), (5, 9)

By using software we can see that the least squares regression line is *y* = 2*x*. We will use this to predict values for each value of *x*.
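
The article does not say which software was used, so the following is only one possible way to check the result. It is a sketch that applies the standard closed-form least squares formulas (slope = Sxy / Sxx, intercept = mean of *y* minus slope times mean of *x*) in plain Python. The helper name `least_squares_line` is an assumption made for this illustration.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the least squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


data = [(1, 2), (2, 3), (3, 7), (3, 6), (4, 9), (5, 9)]
slope, intercept = least_squares_line(data)
print(slope, intercept)  # 2.0 0.0
```

A slope of 2 and an intercept of 0 agree with the line *y* = 2*x* stated above.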

For example, when *x* = 5 we see that 2(5) = 10. This gives us the point along our regression line that has an *x* coordinate of 5.

To calculate the residual at *x* = 5, we subtract the predicted value from our observed value. Since the *y* coordinate of our data point was 9, this gives a residual of 9 – 10 = -1.

In the following table we see how to calculate all of our residuals for this data set:

| *x* | Observed *y* | Predicted *y* | Residual |
|---|---|---|---|
| 1 | 2 | 2 | 0 |
| 2 | 3 | 4 | -1 |
| 3 | 7 | 6 | 1 |
| 3 | 6 | 6 | 0 |
| 4 | 9 | 8 | 1 |
| 5 | 9 | 10 | -1 |
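
The same table can be produced programmatically. The sketch below assumes the regression line *y* = 2*x* from the text. The helper name `predict` is hypothetical.

```python
data = [(1, 2), (2, 3), (3, 7), (3, 6), (4, 9), (5, 9)]


def predict(x):
    """Predicted y from the regression line y = 2x given in the text."""
    return 2 * x


# Residual = observed y - predicted y, computed for every data point
residuals = [y - predict(x) for x, y in data]

print("x | observed y | predicted y | residual")
for (x, y), r in zip(data, residuals):
    print(f"{x} | {y} | {predict(x)} | {r}")
```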

## Features of Residuals

Now that we have seen an example, there are a few features of residuals to note:

- Residuals are positive for points that fall above the regression line.
- Residuals are negative for points that fall below the regression line.
- Residuals are zero for points that fall exactly along the regression line.
- The greater the absolute value of the residual, the farther the point lies, measured vertically, from the regression line.
- For a least squares regression line, the sum of all of the residuals is zero. In practice, the computed sum is sometimes not exactly zero because roundoff errors can accumulate.
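
These features can be checked against the example data. The sketch below again assumes the line *y* = 2*x*. The helper `position` is a hypothetical name, and `math.isclose` with an absolute tolerance is used for the sum so that the check would still pass if roundoff crept in.

```python
import math

data = [(1, 2), (2, 3), (3, 7), (3, 6), (4, 9), (5, 9)]
residuals = [y - 2 * x for x, y in data]  # regression line y = 2x from the text


def position(r):
    """Describe where a point lies relative to the line, based on its residual."""
    if r > 0:
        return "above"
    if r < 0:
        return "below"
    return "on"


positions = [position(r) for r in residuals]
total = sum(residuals)
# Index of the point farthest from the line (first one found in case of ties)
farthest = max(range(len(data)), key=lambda i: abs(residuals[i]))

print(positions)
print(math.isclose(total, 0.0, abs_tol=1e-9))
```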

## Uses of Residuals

There are several uses for residuals. One use is to help us to determine if we have a data set that has an overall linear trend, or if we should consider a different model. The reason for this is that residuals help to amplify any nonlinear pattern in our data. What can be difficult to see by looking at a scatterplot can be more easily observed by examining the residuals, and a corresponding residual plot.
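
To make this concrete, here is a small sketch using made-up data that follow *y* = *x*² exactly. The data set is hypothetical and is not taken from the example above. A least squares line is fitted with the usual closed-form formulas, and the signs of the residuals reveal the curvature.

```python
# Hypothetical data lying exactly on the curve y = x**2
curved = [(x, x ** 2) for x in range(5)]

n = len(curved)
mean_x = sum(x for x, _ in curved) / n
mean_y = sum(y for _, y in curved) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in curved)
         / sum((x - mean_x) ** 2 for x, _ in curved))
intercept = mean_y - slope * mean_x

# Residual = observed y - predicted y
residuals = [y - (slope * x + intercept) for x, y in curved]
signs = ["+" if r > 0 else "-" if r < 0 else "0" for r in residuals]

print(slope, intercept)
print(residuals)
print(signs)
```

The residuals come out positive at both ends and negative in the middle. A systematic U-shaped pattern like this, rather than random scatter around zero, suggests that a straight line is not the right model for these data.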

Another reason to consider residuals is to check that the conditions for inference for linear regression are met. After verification of a linear trend (by checking the residuals), we also check the distribution of the residuals. In order to be able to perform regression inference, we want the residuals about our regression line to be approximately normally distributed. A histogram or stemplot of the residuals will help to verify that this condition has been met.
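
As a rough sketch of the mechanics, a text histogram of the residuals from the earlier example can be built with `collections.Counter`. With only six data points this is far too small a sample to judge normality, so the output only illustrates how such a display is produced.

```python
from collections import Counter

data = [(1, 2), (2, 3), (3, 7), (3, 6), (4, 9), (5, 9)]
residuals = [y - 2 * x for x, y in data]  # regression line y = 2x from the text

# Count how often each residual value occurs
counts = Counter(residuals)

# One row per distinct residual value, one asterisk per occurrence
for value in sorted(counts):
    print(f"{value:>3} | {'*' * counts[value]}")
```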