Linear Regression Analysis

Linear Regression And Multiple Linear Regression

Linear regression is a statistical technique used to learn more about the relationship between an independent (predictor) variable and a dependent (criterion) variable. When you have more than one independent variable in your analysis, the technique is referred to as multiple linear regression. In general, regression allows the researcher to ask, "What is the best predictor of…?"

For example, let's say we were studying the causes of obesity, measured by body mass index (BMI). In particular, we wanted to see if the following variables were significant predictors of a person's BMI: number of fast food meals eaten per week, number of hours of television watched per week, number of minutes spent exercising per week, and parents' BMI. Linear regression would be a good methodology for this analysis.

The Regression Equation

When you are conducting a regression analysis with one independent variable, the regression equation is Y = a + b*X, where Y is the dependent variable, X is the independent variable, a is the constant (or intercept), and b is the slope of the regression line. For example, let's say that GPA is best predicted by the regression equation 1 + 0.02*IQ. If a student had an IQ of 130, then his or her predicted GPA would be 3.6 (1 + 0.02*130 = 3.6).
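The prediction step above is just arithmetic, and can be sketched in a few lines of Python. The intercept (a = 1) and slope (b = 0.02) are the values from the GPA/IQ example; in practice they would be estimated from data.

```python
# Prediction from a simple linear regression equation Y = a + b*X.
# The intercept and slope below come from the GPA/IQ example in the
# text; real coefficients would be estimated from observed data.

def predict_gpa(iq, a=1.0, b=0.02):
    """Return the predicted GPA for a given IQ score."""
    return a + b * iq

print(predict_gpa(130))  # 1 + 0.02*130 = 3.6
```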

When you are conducting a regression analysis in which you have more than one independent variable, the regression equation is Y = a + b1*X1 + b2*X2 + … + bp*Xp.

For example, if we wanted to add more variables to our GPA analysis, such as measures of motivation and self-discipline, we would use this equation.
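The multiple regression equation generalizes the single-predictor case in an obvious way. The sketch below applies Y = a + b1*X1 + b2*X2 + … + bp*Xp to one observation; the coefficients for IQ, motivation, and self-discipline are invented purely for illustration.

```python
# A sketch of prediction with multiple predictors.
# The intercept and coefficients below are made-up illustrative
# values, not estimates from real data.

def predict(intercept, coefficients, values):
    """Apply a multiple regression equation to one observation."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))

# Hypothetical equation: GPA = 0.5 + 0.015*IQ + 0.3*motivation + 0.2*discipline
gpa = predict(0.5, [0.015, 0.3, 0.2], [120, 3.0, 2.5])
print(gpa)
```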


R-Square

R-square, also known as the coefficient of determination, is a commonly used statistic to evaluate the model fit of a regression equation. That is, how well do all of your independent variables predict your dependent variable?

The value of R-square ranges from 0.0 to 1.0 and can be multiplied by 100 to obtain a percentage of variance explained. For example, going back to our GPA regression equation with only one independent variable (IQ), let's say that the R-square for the equation was 0.4. We could interpret this to mean that 40% of the variance in GPA is explained by IQ. If we then add our other two variables (motivation and self-discipline) and the R-square increases to 0.6, this means that IQ, motivation, and self-discipline together explain 60% of the variance in GPA scores.
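R-square can be computed directly from observed and predicted values as 1 minus the ratio of residual to total sum of squares. The small data set below is invented purely to illustrate the calculation.

```python
# R-square = 1 - SS_residual / SS_total, where SS_total measures the
# variance of the observed values around their mean, and SS_residual
# measures the prediction errors. Data below are invented.

def r_square(observed, predicted):
    """Coefficient of determination for one set of predictions."""
    mean_y = sum(observed) / len(observed)
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    ss_resid = sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted))
    return 1 - ss_resid / ss_total

observed = [2.8, 3.2, 3.6, 3.9]
predicted = [2.9, 3.1, 3.5, 4.0]
print(r_square(observed, predicted))
```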

Regression analyses are typically done using statistics software, such as SPSS or SAS, so the R-square is calculated for you.

Interpreting The Regression Coefficients (b)

The b coefficients from the equations above represent the strength and direction of the relationship between the independent and dependent variables. If we look at the GPA and IQ equation, Y = 1 + 0.02*IQ, 0.02 is the regression coefficient for the variable IQ. This tells us that the direction of the relationship is positive: as IQ increases, GPA also increases. If the equation were Y = 1 - 0.02*IQ, then the relationship between IQ and GPA would be negative.
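For a single predictor, the slope b and intercept a can be estimated by least squares with the closed-form formulas b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. The tiny invented data set below is chosen so the estimates come out near the a = 1, b = 0.02 equation used earlier; the sign of b gives the direction of the relationship.

```python
# Closed-form least-squares fit for one predictor. The IQ/GPA data
# are invented for illustration; a positive b means GPA rises with IQ.

def fit_line(xs, ys):
    """Return (intercept, slope) minimizing squared prediction error."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

iq = [100, 110, 120, 130]
gpa = [3.0, 3.2, 3.4, 3.6]
a, b = fit_line(iq, gpa)
print(a, b)  # b > 0: the relationship is positive
```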


Assumptions Of Linear Regression

There are several assumptions about the data that must be met in order to conduct a linear regression analysis:

  • Linearity: It is assumed that the relationship between the independent and dependent variables is linear. Though this assumption can never be fully confirmed, looking at a scatterplot of your variables can help make this determination. If a curvature in the relationship is present, you may consider transforming the variables or explicitly allowing for nonlinear components.
  • Normality: It is assumed that the residuals of your model are normally distributed. That is, the errors in the prediction of the value of Y (the dependent variable) are distributed in a way that approaches the normal curve. You can look at histograms or normal probability plots to inspect the distribution of your variables and their residual values.
  • Independence: It is assumed that the errors in the prediction of the value of Y are all independent of one another (not correlated).
  • Homoscedasticity: It is assumed that the variance around the regression line is the same for all values of the independent variables.
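Several of these assumptions are checked by examining the residuals (observed minus predicted values). The sketch below, with invented data, computes residuals and their summary statistics; in practice you would also plot them (e.g. a histogram for normality, residuals versus fitted values for homoscedasticity).

```python
# A rough sketch of residual checks for the assumptions above.
# Observed and predicted values are invented for illustration.
import statistics

observed = [2.8, 3.2, 3.6, 3.9, 3.1, 3.4]
predicted = [2.9, 3.1, 3.5, 4.0, 3.0, 3.5]
residuals = [y - yhat for y, yhat in zip(observed, predicted)]

# Residuals from a least-squares fit should center on zero; their
# distribution (normality) and spread across fitted values
# (homoscedasticity) would normally be judged from plots.
print(statistics.mean(residuals))
print(statistics.stdev(residuals))
```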

