Confidence intervals are one part of inferential statistics. The basic idea behind this topic is to estimate the value of an unknown population parameter by using a statistical sample. Not only can we estimate the value of a single parameter, but we can also adapt our methods to estimate the difference between two related parameters. For example, we may want to find the difference between the percentage of the male U.S. voting population that supports a particular piece of legislation and the corresponding percentage of the female voting population.

We will see how to do this type of calculation by constructing a confidence interval for the difference of two population proportions. In the process we will examine some of the theory behind this calculation. We will see some similarities in how we construct a confidence interval for a single population proportion as well as a confidence interval for the difference of two population means.

## Generalities

Before looking at the specific formula that we will use, let's consider the overall framework that this type of confidence interval fits into. The form of the type of confidence interval that we will look at is given by the following formula:

Estimate +/- Margin of Error

Many confidence intervals are of this type. There are two numbers that we need to calculate. The first of these values is the estimate for the parameter. The second value is the margin of error. This margin of error accounts for the fact that we have only an estimate, not the exact value of the parameter. The confidence interval provides us with a range of possible values for our unknown parameter.

## Conditions

We should make sure that all of the conditions are satisfied before doing any calculation. To find a confidence interval for the difference of two population proportions, we need to make sure that the following hold:

- We have two simple random samples from large populations. Here "large" means that the population is at least 20 times larger than the size of the sample. The sample sizes will be denoted by *n*_{1} and *n*_{2}.
- Our individuals have been chosen independently of one another.
- There are at least ten successes and ten failures in each of our samples.
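The numerical conditions in this list can be checked mechanically. The following is a minimal sketch, not a standard routine: the function name `conditions_met`, its parameters, and the counts are hypothetical, and the code cannot verify that the samples are random or that individuals were chosen independently.

```python
def conditions_met(k1, n1, k2, n2, pop1=None, pop2=None):
    """Check the numerical conditions for the two-proportion interval.

    k1, k2: numbers of successes; n1, n2: sample sizes.
    pop1, pop2: population sizes, if known (hypothetical optional inputs).
    """
    # At least ten successes and ten failures in each sample.
    counts_ok = all(c >= 10 for c in (k1, n1 - k1, k2, n2 - k2))
    # Each known population is at least 20 times the size of its sample.
    pops_ok = all(
        pop is None or pop >= 20 * n
        for pop, n in ((pop1, n1), (pop2, n2))
    )
    return counts_ok and pops_ok


# Made-up counts: 120 of 200 in the first sample, 81 of 180 in the second.
print(conditions_met(120, 200, 81, 180, pop1=50_000, pop2=50_000))
# Only 4 successes in the first sample, so the check fails.
print(conditions_met(4, 200, 81, 180))
```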

If the last item in the list is not satisfied, then there may be a way around this. We can modify the plus-four confidence interval construction and obtain robust results. As we go forward we assume that all of the above conditions have been met.

## Samples and Population Proportions

Now we are ready to construct our confidence interval. We start with the estimate for the difference between our population proportions. Both of these population proportions are estimated by a sample proportion. These sample proportions are statistics that are found by dividing the number of successes in each sample by the respective sample size.

The first population proportion is denoted by *p*_{1}. If the number of successes in our sample from this population is *k*_{1}, then we have a sample proportion of *k*_{1} / *n*_{1}.

We denote this statistic by p̂_{1}. We read this symbol as "p_{1}-hat" because it looks like the symbol p_{1} with a hat on top.

In a similar way we can calculate a sample proportion from our second population. The parameter from this population is *p*_{2}. If the number of successes in our sample from this population is *k*_{2}, then our sample proportion is p̂_{2} = *k*_{2} / *n*_{2}.

These two statistics become the first part of our confidence interval. The estimate of *p*_{1} is p̂_{1}. The estimate of *p*_{2} is p̂_{2}. So the estimate for the difference *p*_{1} - *p*_{2} is p̂_{1} - p̂_{2}.
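As a small illustration of the estimate, the sketch below computes both sample proportions and their difference. The counts are made up for the example, and the helper name `sample_proportion` is hypothetical.

```python
def sample_proportion(k, n):
    """Return the sample proportion: successes divided by sample size."""
    return k / n


# Made-up data: 120 successes out of 200, and 81 successes out of 180.
k1, n1 = 120, 200
k2, n2 = 81, 180

p_hat_1 = sample_proportion(k1, n1)  # estimates p_1
p_hat_2 = sample_proportion(k2, n2)  # estimates p_2
estimate = p_hat_1 - p_hat_2         # estimates p_1 - p_2

print(p_hat_1, p_hat_2, round(estimate, 4))
```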

## Sampling Distribution of the Difference of Sample Proportions

Next we need to obtain the formula for the margin of error. To do this we will first consider the sampling distribution of p̂_{1}. The number of successes in the first sample follows a binomial distribution with probability of success *p*_{1} and *n*_{1} trials, and p̂_{1} is this count divided by *n*_{1}. The mean of the sampling distribution of p̂_{1} is the proportion *p*_{1}. Its variance is *p*_{1}(1 - *p*_{1})/*n*_{1}, and its standard deviation is the square root of this variance.

The sampling distribution of p̂_{2} is similar to that of p̂_{1}. Simply change all of the indices from 1 to 2 and we have a sampling distribution, built from a binomial count, with mean of *p*_{2} and variance of *p*_{2}(1 - *p*_{2})/*n*_{2}.

We now need a few results from mathematical statistics in order to determine the sampling distribution of p̂_{1} - p̂_{2}. The mean of this distribution is *p*_{1} - *p*_{2}. Because the two samples are independent, the variances add together, so the variance of the sampling distribution is *p*_{1}(1 - *p*_{1})/*n*_{1} + *p*_{2}(1 - *p*_{2})/*n*_{2}. The standard deviation of the distribution is the square root of this formula.
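The mean and variance stated above can be checked numerically. The sketch below, under the assumption that the two samples are independent, enumerates every possible pair of success counts for two small samples and compares the exact mean and variance of the difference with the formulas. The parameter values and function names are hypothetical choices for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_mean_and_variance_of_difference(n1, p1, n2, p2):
    """Exact mean and variance of (k1/n1 - k2/n2) for independent samples."""
    pmf1 = [binom_pmf(k, n1, p1) for k in range(n1 + 1)]
    pmf2 = [binom_pmf(k, n2, p2) for k in range(n2 + 1)]
    mean = 0.0
    second_moment = 0.0
    for k1, w1 in enumerate(pmf1):
        for k2, w2 in enumerate(pmf2):
            diff = k1 / n1 - k2 / n2
            weight = w1 * w2  # joint probability, using independence
            mean += weight * diff
            second_moment += weight * diff * diff
    return mean, second_moment - mean**2


# Hypothetical parameter values, chosen only for illustration.
n1, p1, n2, p2 = 20, 0.4, 30, 0.3
mean, variance = exact_mean_and_variance_of_difference(n1, p1, n2, p2)
formula_variance = p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2

print(mean, p1 - p2)               # the two values agree up to rounding
print(variance, formula_variance)  # likewise
```

Exact enumeration is used here instead of simulation so that the comparison does not depend on random draws.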

There are a couple of adjustments that we need to make. The first is that the formula for the standard deviation of p̂_{1} - p̂_{2} uses the unknown parameters *p*_{1} and *p*_{2}. Of course if we really knew these values, then it would not be an interesting statistical problem at all. We would not need to estimate the difference between *p*_{1} and *p*_{2}. Instead we could simply calculate the exact difference.

This problem can be fixed by calculating a standard error rather than a standard deviation. All that we need to do is to replace the population proportions by sample proportions. Standard errors are calculated from statistics instead of parameters. A standard error is useful because it effectively estimates a standard deviation. What this means for us is that we no longer need to know the value of the parameters *p*_{1} and *p*_{2}. Since the sample proportions are known, the standard error is given by the square root of the following expression:

p̂_{1}(1 - p̂_{1})/*n*_{1} + p̂_{2}(1 - p̂_{2})/*n*_{2}
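The standard error built from this expression can be sketched in a few lines. The function name `standard_error_diff` and the counts are hypothetical and serve only as an example.

```python
from math import sqrt


def standard_error_diff(k1, n1, k2, n2):
    """Standard error of the difference of two sample proportions."""
    p_hat_1 = k1 / n1
    p_hat_2 = k2 / n2
    # Sample proportions stand in for the unknown population proportions.
    return sqrt(p_hat_1 * (1 - p_hat_1) / n1 + p_hat_2 * (1 - p_hat_2) / n2)


# Made-up data: 120 successes out of 200, and 81 successes out of 180.
se = standard_error_diff(120, 200, 81, 180)
print(round(se, 4))
```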

The second item that we need to address is the particular form of our sampling distribution. It turns out that we can use a normal distribution to approximate the sampling distribution of p̂_{1} - p̂_{2}. The reason for this is somewhat technical, but is outlined in the next paragraph.

Both p̂_{1} and p̂_{2} have a sampling distribution that is based on a binomial distribution. When the conditions above hold, each of these may be approximated quite well by a normal distribution. The difference p̂_{1} - p̂_{2} is a linear combination of these two independent random variables, and a linear combination of independent normal random variables is itself normal. Therefore the sampling distribution of p̂_{1} - p̂_{2} is also approximately normal.

## Confidence Interval Formula

We now have everything we need to assemble our confidence interval. The estimate is (p̂_{1} - p̂_{2}) and the margin of error is *z*\* [p̂_{1}(1 - p̂_{1})/*n*_{1} + p̂_{2}(1 - p̂_{2})/*n*_{2}]^{0.5}. The value that we enter for *z*\* is dictated by the level of confidence *C*. Commonly used values for *z*\* are 1.645 for 90% confidence and 1.96 for 95% confidence. Each value of *z*\* marks the point on the standard normal distribution for which *C* percent of the distribution lies between -*z*\* and *z*\*.
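The critical values quoted above can be recovered from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `critical_value` is hypothetical.

```python
from statistics import NormalDist


def critical_value(confidence):
    """Return z* such that the central `confidence` share of the
    standard normal distribution lies between -z* and z*."""
    # A central area of `confidence` leaves (1 - confidence) / 2 in each
    # tail, so z* is the quantile at (1 + confidence) / 2.
    return NormalDist().inv_cdf((1 + confidence) / 2)


for level in (0.90, 0.95):
    print(level, round(critical_value(level), 3))
```

Rounded to three decimal places, these agree with the 1.645 and 1.96 given in the text.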

The following formula gives us a confidence interval for the difference of two population proportions:

(p̂_{1} - p̂_{2}) +/- *z*\* [p̂_{1}(1 - p̂_{1})/*n*_{1} + p̂_{2}(1 - p̂_{2})/*n*_{2}]^{0.5}
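Putting the pieces together, the formula can be sketched as one short function. This is an illustrative sketch that assumes the conditions listed earlier have been met. The function name `two_proportion_ci` and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ci(k1, n1, k2, n2, confidence=0.95):
    """Confidence interval for p_1 - p_2: estimate +/- z* times standard error."""
    p_hat_1 = k1 / n1
    p_hat_2 = k2 / n2
    estimate = p_hat_1 - p_hat_2
    standard_error = sqrt(
        p_hat_1 * (1 - p_hat_1) / n1 + p_hat_2 * (1 - p_hat_2) / n2
    )
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z_star * standard_error
    return estimate - margin, estimate + margin


# Made-up data: 120 successes out of 200, and 81 successes out of 180.
low, high = two_proportion_ci(120, 200, 81, 180, confidence=0.95)
print(f"{low:.4f} to {high:.4f}")
```

With these made-up counts the whole interval lies above zero, which would suggest that the first population proportion is the larger of the two.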