# Cluster Analysis and How It's Used in Research

Cluster analysis is a statistical technique used to identify how various units, such as people, groups, or societies, can be grouped together because of characteristics they have in common. Also known as clustering, it is an exploratory data analysis tool that aims to sort objects into groups so that objects in the same group have a maximal degree of association and objects in different groups have a minimal one. Unlike some other statistical techniques, cluster analysis offers no explanation or interpretation of the structures it uncovers: it discovers structure in the data without explaining why that structure exists.

## What Is Clustering?

Clustering exists in almost every aspect of our daily lives. Take, for example, items in a grocery store. Items of the same type are displayed in the same or nearby locations: meat, vegetables, soda, cereal, paper products, etc. Researchers often want to do the same with data and group objects or subjects into clusters that make sense.

To take an example from social science, let’s say we are looking at countries and want to group them into clusters based on characteristics such as division of labor, militaries, technology, or educated population. We would find that Britain, Japan, France, Germany, and the United States have similar characteristics and would be clustered together. Uganda, Nicaragua, and Pakistan would also be grouped together in a different cluster because they share a different set of characteristics, including low levels of wealth, simpler divisions of labor, relatively unstable and undemocratic political institutions, and low technological development.

Cluster analysis is typically used in the exploratory phase of research when the researcher does not have any pre-conceived hypotheses. It is commonly not the only statistical method used, but rather is done in the early stages of a project to help guide the rest of the analysis. For this reason, significance testing is usually neither relevant nor appropriate.

There are several different types of cluster analysis. The two most commonly used are K-means clustering and hierarchical clustering.

## K-means Clustering

K-means clustering treats the observations in the data as objects having locations and distances from each other (note that the distances used in clustering often do not represent spatial distances). It partitions the objects into K mutually exclusive clusters, where K is a number the researcher chooses in advance, so that objects within each cluster are as close to each other as possible and as far from objects in other clusters as possible. Each cluster is then characterized by its mean or center point.
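
The partitioning described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it assumes Euclidean distance, uses the common iterative procedure often called Lloyd's algorithm, and runs on made-up two-dimensional data. The function name `kmeans` and the sample points are hypothetical.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Sketch of Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    # Start from k observations chosen at random (one simple initialization).
    centers = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers will not move
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, labels


# Hypothetical data: two loose groups of (x, y) observations.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(data, k=2)
print(labels)   # one cluster label per observation
print(centers)  # the mean point of each cluster
```

Each observation ends up with exactly one label, which is what "mutually exclusive clusters" means in practice. The result can depend on the starting centers, so analysts often rerun the procedure from several random starts.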

## Hierarchical Clustering

Hierarchical clustering is a way to investigate groupings in the data simultaneously over a variety of scales and distances. It does this by creating a cluster tree with various levels. Unlike K-means clustering, the tree is not a single set of clusters. Rather, the tree is a multi-level hierarchy where clusters at one level are joined as clusters at the next higher level. The algorithm that is used starts with each case or variable in a separate cluster and then combines clusters until only one is left. This allows the researcher to decide what level of clustering is most appropriate for his or her research.
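
A small Python sketch can make the bottom-up merging concrete. It assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members); the text above does not specify a linkage rule, so that choice is an assumption. The function names `agglomerate` and `cut` and the sample points are hypothetical.

```python
import math


def agglomerate(points):
    """Bottom-up clustering; returns merges in order as (members_a, members_b, distance)."""
    # Start with every observation in its own cluster (stored as lists of indices).
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Single linkage (an assumption): distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters and join them.
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: linkage(*pair),
        )
        merges.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return merges


def cut(points, merges, n_clusters):
    """Replay the merges until n_clusters remain, i.e. cut the tree at that level."""
    clusters = [[i] for i in range(len(points))]
    for a, b, _ in merges[: len(points) - n_clusters]:
        clusters = [c for c in clusters if c != a and c != b] + [a + b]
    return sorted(sorted(c) for c in clusters)


# Hypothetical data: two loose groups of (x, y) observations.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
merges = agglomerate(data)
print(cut(data, merges, 2))  # a coarse level of the tree
print(cut(data, merges, 3))  # a finer level of the same tree
```

The list of merges is the tree: it is computed once, and `cut` reads off a grouping at whichever level suits the research question, without rerunning the clustering.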

## Performing a Cluster Analysis

Most statistics software programs can perform cluster analysis. In SPSS, select Analyze from the menu, then Classify, and then the cluster analysis option you need. In SAS, the PROC CLUSTER procedure can be used.

Updated by Nicki Lisa Cole, Ph.D.
