The number of degrees of freedom for a chi-square test of independence of two categorical variables is given by a simple formula: (*r* - 1)(*c* - 1). Here *r* is the number of rows and *c* is the number of columns in the two-way table of the values of the categorical variables. Read on to learn more about this topic and to understand why this formula gives the correct number.

## Background

One step in many hypothesis tests is determining the number of degrees of freedom. This number matters because some probability distributions, such as the chi-square distribution, are really families of distributions, and the number of degrees of freedom pinpoints the exact member of the family that we should use in our hypothesis test.

Degrees of freedom represent the number of free choices that we can make in a given situation. One of the hypothesis tests that requires us to determine the degrees of freedom is the chi-square test for independence for two categorical variables.

## Tests for Independence and Two-Way Tables

The chi-square test for independence calls for us to construct a two-way table, also known as a contingency table. This type of table has *r* rows and *c* columns, representing the *r* levels of one categorical variable and the *c* levels of the other categorical variable. Thus, if we do not count the row and column in which we record totals, there are a total of *rc* cells in the two-way table.

The chi-square test for independence allows us to test the hypothesis that the categorical variables are independent of one another. As we mentioned above, the *r* rows and *c* columns in the table give us (*r* - 1)(*c* - 1) degrees of freedom. But it may not be immediately clear why this is the correct number of degrees of freedom.
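As a minimal sketch of the formula, the following Python function reads *r* and *c* off a table stored as a list of rows. The function name `degrees_of_freedom` and the counts in `observed` are illustrative assumptions, not part of any particular library.

```python
def degrees_of_freedom(table):
    """Return (r - 1)(c - 1) for a two-way table given as a list of rows.

    The table holds only the cell counts; the totals row and the
    totals column must not be included.
    """
    r = len(table)          # number of rows
    c = len(table[0])       # number of columns
    if any(len(row) != c for row in table):
        raise ValueError("every row must have the same number of columns")
    return (r - 1) * (c - 1)


# Illustrative table with 3 rows and 2 columns (counts are made up)
observed = [[80, 20], [50, 150], [70, 230]]
print(degrees_of_freedom(observed))  # (3 - 1) * (2 - 1) = 2
```

Note that only the shape of the table enters the formula; the counts themselves play no role in the number of degrees of freedom.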

## The Number of Degrees of Freedom

To see why (*r* - 1)(*c* - 1) is the correct number, we will examine this situation in more detail. Suppose that we know the marginal totals for each of the levels of our categorical variables. In other words, we know the total for each row and the total for each column. For the first row, there are *c* columns in our table, so there are *c* cells. Once we know the values of all but one of these cells, then because we know the row total it is a simple algebra problem to determine the value of the remaining cell. If we were filling in these cells of our table, we could enter *c* - 1 of them freely, but then the remaining cell is determined by the total of the row. Thus there are *c* - 1 degrees of freedom for the first row.

We continue in this manner for the next row, and there are again *c* - 1 degrees of freedom. This process continues until we reach the penultimate row. Each of the rows except for the last one contributes *c* - 1 degrees of freedom to the total. Once every row but the last has been filled in, the last row contributes no further free choices: because we know the column totals, each of its entries is the column total minus the entries already in that column. This gives us *r* - 1 rows with *c* - 1 degrees of freedom in each of these, for a total of (*r* - 1)(*c* - 1) degrees of freedom.
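The same count can be cross-checked by counting constraints rather than free cells. This is an alternative way of reaching the number, offered as a sketch. There are *rc* cells, and the known totals impose *r* row conditions and *c* column conditions. One of these conditions is redundant, because the row totals and the column totals must both add up to the same grand total, so only *r* + *c* - 1 of them are independent:

$$
rc - (r + c - 1) = rc - r - c + 1 = (r - 1)(c - 1)
$$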

## Example

We see this with the following example. Suppose that we have a two-way table with two categorical variables. The row variable has three levels and the column variable has two. Furthermore, suppose that we know the row and column totals for this table:

| | Level A | Level B | Total |
|---|---|---|---|
| Level 1 | | | 100 |
| Level 2 | | | 200 |
| Level 3 | | | 300 |
| Total | 200 | 400 | 600 |

The formula predicts that there are (3-1)(2-1) = 2 degrees of freedom. We see this as follows. Suppose that we fill in the upper left cell with the number 80. This will automatically determine the entire first row of entries:

| | Level A | Level B | Total |
|---|---|---|---|
| Level 1 | 80 | 20 | 100 |
| Level 2 | | | 200 |
| Level 3 | | | 300 |
| Total | 200 | 400 | 600 |

Now if we know that the first entry in the second row is 50, then the rest of the table is filled in, because we know the total of each row and column:

| | Level A | Level B | Total |
|---|---|---|---|
| Level 1 | 80 | 20 | 100 |
| Level 2 | 50 | 150 | 200 |
| Level 3 | 70 | 230 | 300 |
| Total | 200 | 400 | 600 |

The table is entirely filled in, but we only had two free choices. Once these values were known, the rest of the table was completely determined.
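The filling-in procedure from this example can be sketched in Python. The function `complete_table` is a hypothetical helper written for illustration: it takes the row totals, the column totals, and the freely chosen cells (the first *c* - 1 entries of each of the first *r* - 1 rows), and derives every other cell from the totals.

```python
def complete_table(row_totals, col_totals, free_cells):
    """Fill in a two-way table from its marginal totals and its free cells.

    free_cells holds the first c - 1 entries of each of the first r - 1
    rows; every remaining entry is forced by the totals.
    """
    r, c = len(row_totals), len(col_totals)
    if sum(row_totals) != sum(col_totals):
        raise ValueError("row and column totals must share one grand total")
    if len(free_cells) != r - 1 or any(len(row) != c - 1 for row in free_cells):
        raise ValueError("free_cells must have shape (r - 1) x (c - 1)")

    table = []
    for i in range(r - 1):
        row = list(free_cells[i])
        # The last cell of the row is the row total minus the free cells.
        row.append(row_totals[i] - sum(row))
        table.append(row)

    # Each cell of the last row is the column total minus the cells above it.
    last_row = [col_totals[j] - sum(row[j] for row in table) for j in range(c)]
    table.append(last_row)
    return table


# The two free choices from the example: 80 and 50
table = complete_table([100, 200, 300], [200, 400], [[80], [50]])
print(table)  # [[80, 20], [50, 150], [70, 230]]
```

The sketch does not check that the derived cells are non-negative, so a poor choice of free cells can produce an impossible table of counts. The free choices are free only within the range that the totals allow.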

Although we do not typically need to know why there are this many degrees of freedom, it is good to know that we are really just applying the concept of degrees of freedom to a new situation.