A scatterplot is a type of graph that is used to represent paired data. The explanatory variable is plotted along the horizontal axis and the response variable is graphed along the vertical axis. One reason for using this type of graph is to look for relationships between the variables.

The most basic pattern to look for in a set of paired data is that of a straight line. Through any two points, we can draw a straight line. If there are more than two points in our scatterplot, most of the time we will no longer be able to draw a line that goes through every point. Instead, we will draw a line that passes through the midst of the points and displays the overall linear trend of the data.

As we look at the points in our graph and wish to draw a line through these points, a question arises. Which line should we draw? There is an infinite number of lines that could be drawn. By using our eyes alone, it is clear that each person looking at the scatterplot could produce a slightly different line. This ambiguity is a problem. We want to have a well-defined way for everyone to obtain the same line. The goal is to have a mathematically precise description of which line should be drawn. The least squares regression line is one such line through our data points.

## Least Squares

The name of the least squares line explains what it does. We start with a collection of points with coordinates given by $(x_i, y_i)$. Any straight line will pass among these points and will either go above or below each of these. We can calculate the distances from these points to the line by choosing a value of *x* and then subtracting the observed *y* coordinate that corresponds to this *x* from the *y* coordinate of our line.

Different lines through the same set of points would give a different set of distances. We want these distances to be as small as we can make them. But there is a problem. Since our distances can be either positive or negative, adding them up lets them cancel each other out. The total can equal zero even for a line that fits the data poorly, so the plain sum of distances cannot tell us which line is best.
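To see the cancellation concretely, here is a minimal Python sketch. The five data points and the two candidate lines are made up for illustration, and `signed_deviations` is a hypothetical helper name. Following the description above, each deviation is the *y* coordinate of the line minus the observed *y* coordinate.

```python
def signed_deviations(xs, ys, slope, intercept):
    """Line's y value minus the observed y value, for each data point."""
    return [(slope * x + intercept) - y for x, y in zip(xs, ys)]

# Illustrative data only (not from any real study)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Two very different candidate lines
flat = signed_deviations(xs, ys, 0.0, 4.0)     # y = 4
steep = signed_deviations(xs, ys, -2.0, 10.0)  # y = -2x + 10

print(flat, sum(flat))
print(steep, sum(steep))
```

Both totals come out to zero, even though the second line slopes downward through data that trends upward. The signed total alone cannot distinguish a good line from a bad one.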

The solution to this problem is to eliminate all of the negative numbers by squaring the distances between the points and the line. This gives a collection of nonnegative numbers. The goal we had of finding a line of best fit is the same as making the sum of these squared distances as small as possible. Calculus comes to the rescue here. The process of differentiation makes it possible to find the slope and intercept that minimize the sum of the squared distances. This explains the phrase “least squares” in our name for this line.
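The calculus step can be sketched as follows. The notation here is an assumption made for illustration: the line is written as $y = a + bx$, there are $n$ points, and $\bar{x}$ and $\bar{y}$ denote the means of the *x* and *y* coordinates. The quantity to minimize is

$$
S(a, b) = \sum_{i=1}^{n} \bigl(a + b x_i - y_i\bigr)^2 .
$$

Setting both partial derivatives equal to zero gives

$$
\frac{\partial S}{\partial a} = 2 \sum_{i=1}^{n} \bigl(a + b x_i - y_i\bigr) = 0,
\qquad
\frac{\partial S}{\partial b} = 2 \sum_{i=1}^{n} x_i \bigl(a + b x_i - y_i\bigr) = 0 .
$$

Solving this pair of equations (assuming the $x_i$ are not all equal) yields

$$
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x} .
$$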

## Line of Best Fit

Since the least squares line minimizes the sum of the squared distances between the line and our points, we can think of this line as the one that best fits our data. This is why the least squares line is also known as the line of best fit. Of all of the possible lines that could be drawn, the least squares line is closest to the set of data as a whole, in the sense of having the smallest sum of squared distances. Even so, the line may not pass through any of the individual points in our set of data.
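As a rough numerical check of the "best fit" idea, the sketch below fits a line to a small made-up data set and compares its sum of squared distances against lines with slightly different slopes and intercepts. The helper names `least_squares_line` and `sum_squared_deviations` are hypothetical, and the slope and intercept come from the standard closed-form solution for simple linear regression.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line for paired data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def sum_squared_deviations(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))

# Illustrative data only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(xs, ys)
best = sum_squared_deviations(xs, ys, slope, intercept)

# Nudge the slope and intercept; no nudged line should beat the fitted one
nearby = [
    sum_squared_deviations(xs, ys, slope + ds, intercept + di)
    for ds in (-0.1, 0.0, 0.1)
    for di in (-0.5, 0.0, 0.5)
]

print(slope, intercept, best)
print(all(total >= best for total in nearby))
```

A handful of nearby lines is not a proof, but it illustrates what the minimization guarantees: any change to the slope or intercept increases the sum of squares.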

## Features of the Least Squares Line

There are a few features that every least squares line possesses. The first item of interest deals with the slope of our line. The slope has a connection to the correlation coefficient *r* of our data. In fact, the slope of the line is equal to $r \, (s_y / s_x)$. Here $s_x$ denotes the standard deviation of the *x* coordinates and $s_y$ the standard deviation of the *y* coordinates of our data. Since standard deviations are never negative, the slope of our least squares line always has the same sign as the correlation coefficient.
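The relationship between the slope and the correlation coefficient can be checked numerically. The sketch below uses made-up data and computes the slope two ways: directly from the closed-form least squares formula, and as the correlation coefficient times the ratio of standard deviations. The variable names are illustrative. Sample standard deviations from `statistics.stdev` are used here, and the population versions would give the same ratio.

```python
from math import sqrt
from statistics import mean, stdev

# Illustrative data only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Slope straight from the least squares formula
slope_direct = sxy / sxx

# Correlation coefficient, then slope as r * (s_y / s_x)
r = sxy / sqrt(sxx * syy)
slope_from_r = r * stdev(ys) / stdev(xs)

print(slope_direct, slope_from_r, r)
```

The two slopes agree up to floating-point rounding, and both share the sign of `r`.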

Another feature of the least squares line concerns a point that it passes through. While the *y* intercept of a least squares line may not be interesting from a statistical standpoint, there is one point that is. Every least squares line passes through the middle point of the data. This middle point has an *x* coordinate that is the mean of the *x* values and a *y* coordinate that is the mean of the *y* values.
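Taken together, the two features determine the line completely. As a sketch, writing $\bar{x}$ and $\bar{y}$ for the means of the *x* and *y* values (notation assumed here for illustration), the least squares line can be written in point-slope form:

$$
y - \bar{y} = r \, \frac{s_y}{s_x} \, (x - \bar{x}) .
$$

Substituting $x = \bar{x}$ makes the right-hand side zero, so $y = \bar{y}$, which confirms that the line passes through the point $(\bar{x}, \bar{y})$.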