Understanding Quantiles

Summary statistics such as the median, first quartile and third quartile are measurements of position. This is because these numbers indicate where a specified proportion of the distribution of data lies. For instance, the median is the middle position of the data under investigation. Half of the data have values less than the median. Similarly, 25% of the data have values less than the first quartile and 75% of the data have values less than the third quartile.

This concept can be generalized. One way to do this is to consider percentiles. The 90th percentile indicates the point where 90% of the data have values less than this number. More generally, the pth percentile is the number n for which p% of the data have values less than n.
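As a rough illustration, the following Python sketch finds a percentile of a small data set. The scores and the function names are made up for this example. The nearest-rank rule used here is only one of several conventions in common use: it counts values at or below the cutoff, so for small samples it can differ slightly from the "less than" wording above.

```python
import math

def fraction_below(data, value):
    """Proportion of observations strictly less than value."""
    return sum(1 for x in data if x < value) / len(data)

def nearest_rank_percentile(data, p):
    """Smallest observation with at least p% of the data at or below it."""
    ordered = sorted(data)
    # Rank of the observation to report, counting from 1.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical exam scores, used only for illustration.
scores = [55, 61, 64, 68, 70, 73, 77, 82, 88, 95]

p90 = nearest_rank_percentile(scores, 90)
median = nearest_rank_percentile(scores, 50)
below_p90 = fraction_below(scores, p90)
```

With ten values, nine of them are at or below the reported 90th percentile, while only eight are strictly below it. This gap shrinks as the sample grows.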

Continuous Random Variables

Although the order statistics of median, first quartile and third quartile are typically introduced in a setting with a discrete set of data, these statistics can also be defined for a continuous random variable. Since we are working with a continuous distribution, we use an integral rather than a count. The pth percentile is a number n such that:

$$\int_{-\infty}^{n} f(x)\,dx = \frac{p}{100}.$$

Here f(x) is a probability density function. Thus we can obtain any percentile that we want for a continuous distribution.
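To make the integral condition concrete, here is a minimal numerical sketch in Python. It assumes an exponential density with rate 1, chosen only because its percentiles have a simple closed form to compare against. The midpoint rule, the bisection search, and all of the function names are illustrative choices rather than a prescribed method.

```python
import math

def pdf(x):
    """Density of an exponential distribution with rate 1 (illustrative choice)."""
    return math.exp(-x) if x >= 0 else 0.0

def area_left_of(n, steps=2000):
    """Midpoint-rule approximation of the integral of pdf up to n.

    The density is zero for negative x, so integration starts at 0.
    """
    if n <= 0:
        return 0.0
    width = n / steps
    return sum(pdf((i + 0.5) * width) for i in range(steps)) * width

def percentile(p, lo=0.0, hi=50.0, tol=1e-6):
    """Bisection search for the number n with area_left_of(n) = p/100."""
    target = p / 100
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if area_left_of(mid) < target:
            lo = mid   # too little area to the left, move right
        else:
            hi = mid   # enough area to the left, move left
    return (lo + hi) / 2

median = percentile(50)
p90 = percentile(90)
```

For this density the exact answers are ln 2 for the median and ln 10 for the 90th percentile, and the numerical results should agree with them closely.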

Quantiles

A further generalization is to note that our order statistics are splitting the distribution that we are working with.

The median splits the data set in half, and the median, or 50th percentile, of a continuous distribution splits the distribution in half in terms of area. The first quartile, median and third quartile partition our data into four pieces with the same count in each. We can use the above integral to obtain the 25th, 50th and 75th percentiles, and split a continuous distribution into four portions of equal area.

We can generalize this procedure. The question that we start with is this: given a natural number n, how can we split the distribution of a variable into n equally sized pieces? This speaks directly to the idea of quantiles.

The n quantiles for a data set are found approximately by ranking the data in order and then splitting this ranking at n - 1 equally spaced points, so that each of the n pieces contains about the same number of values.
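A short Python sketch of this idea uses the standard library function statistics.quantiles, which returns the n - 1 cut points for a data set. The data below are made up for illustration, and the function's default interpolation rule is just one of several reasonable conventions.

```python
from statistics import quantiles

# Hypothetical data set: the integers 1 through 11.
data = list(range(1, 12))

# Splitting into n = 4 pieces requires n - 1 = 3 cut points (the quartiles).
n = 4
cut_points = quantiles(data, n=n)

# Count how many observations fall strictly inside each of the n pieces.
edges = [float("-inf")] + cut_points + [float("inf")]
counts = [
    sum(1 for x in data if low < x < high)
    for low, high in zip(edges, edges[1:])
]
```

Here the cut points land on the values 3, 6 and 9, leaving two observations strictly inside each of the four pieces.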

If we have a probability density function for a continuous random variable, we use the above integral to find the quantiles. For n quantiles, we want:

  • The first to have 1/n of the area of the distribution to the left of it.
  • The second to have 2/n of the area of the distribution to the left of it.
  • The rth to have r/n of the area of the distribution to the left of it.
  • The last to have (n - 1)/n of the area of the distribution to the left of it.

We see that for any natural number n, the n quantiles correspond to the 100r/nth percentiles, where r can be any natural number from 1 to n - 1.
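The correspondence between quantiles and percentiles can be checked with a small Python sketch. It assumes a standard normal distribution as the model, purely for illustration, and uses the standard library class statistics.NormalDist. The helper name quantile_percentile_ranks is hypothetical.

```python
from statistics import NormalDist

def quantile_percentile_ranks(n):
    """Percentile ranks 100r/n for r = 1, ..., n - 1."""
    return [100 * r / n for r in range(1, n)]

n = 5  # quintiles
ranks = quantile_percentile_ranks(n)

# Standard normal model, chosen only as an example distribution.
model = NormalDist(mu=0.0, sigma=1.0)

# Cut points found as percentiles of the model ...
from_percentiles = [model.inv_cdf(p / 100) for p in ranks]

# ... and the same cut points requested directly as n quantiles.
from_quantiles = model.quantiles(n=n)

# Area to the left of each cut point, which should be r/n.
areas = [model.cdf(x) for x in from_quantiles]
```

For quintiles the ranks are the 20th, 40th, 60th and 80th percentiles, and the two lists of cut points should agree.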

Common Quantiles

Certain types of quantiles are used commonly enough to have specific names. Below is a list of these:

  • The 2 quantile is called the median
  • The 3 quantiles are called terciles
  • The 4 quantiles are called quartiles
  • The 5 quantiles are called quintiles
  • The 6 quantiles are called sextiles
  • The 7 quantiles are called septiles
  • The 8 quantiles are called octiles
  • The 10 quantiles are called deciles
  • The 12 quantiles are called duodeciles
  • The 20 quantiles are called vigintiles
  • The 100 quantiles are called percentiles
  • The 1000 quantiles are called permilles

Of course other quantiles exist beyond the ones in the list above. Many times the specific quantile used matches the size of the sample from a continuous distribution.

Use of Quantiles

Besides specifying the position of a set of data, quantiles are helpful in other ways. Suppose we have a simple random sample from a population, and the distribution of the population is unknown. To help determine if a model, such as a normal distribution or Weibull distribution, is a good fit for the population we sampled from, we can look at the quantiles of our data and the model.

Matching the quantiles from our sample data to the quantiles from a particular probability distribution produces a collection of paired data. We plot these data in a scatterplot, known as a quantile-quantile plot or q-q plot. If the resulting scatterplot is roughly linear, this suggests that the model is a good fit for our data.
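The pairing step can be sketched in Python as follows. The sample here is simulated from a normal distribution with an arbitrary mean and standard deviation, so that the expected outcome is known. The function names are hypothetical, and the correlation coefficient is used only as a rough numerical stand-in for judging by eye whether the plot looks linear; it is an assumption of this sketch, not a formal test.

```python
import random
from statistics import NormalDist, mean, quantiles

def qq_pairs(sample, model, n):
    """Pair the model's n quantiles with the sample's n quantiles."""
    return list(zip(model.quantiles(n=n), quantiles(sample, n=n)))

def pearson_r(pairs):
    """Correlation coefficient of the paired points."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Simulated sample (seeded for repeatability); mean 10 and sd 2 are arbitrary.
rng = random.Random(0)
sample = [rng.gauss(10, 2) for _ in range(500)]

# Compare against a standard normal model using 20 quantiles (vigintiles).
pairs = qq_pairs(sample, NormalDist(), n=20)
r = pearson_r(pairs)
```

Each pair is one point of the q-q plot, with the model quantile on the horizontal axis and the sample quantile on the vertical axis. A plotting library could draw these points directly; a value of r close to 1 corresponds to a scatterplot that is roughly linear.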