ADVANCED CLUSTERING TOOLS IN R
R offers several useful clustering tools, but one stands out for its combination of method and output display: hierarchical clustering integrated with a heatmap of its results. The term “heatmap” is often confusing, because it is used for two different things. Is it a “colorful visual representation of data in a matrix”, or “a (thematic) map in which areas are represented in patterns (“heat” colors) that are proportionate to the measurement of some information being displayed on the map”? For clustering purposes the former meaning is the appropriate one; the latter is better called a choropleth.
The reason to pair a heatmap with hierarchical clustering (HC) is that a heatmap represents the HC output clearly, so that it is easily understood and more visually appealing. The “heatmap.2” function (part of the gplots package in R, not a package itself) also applies HC to both the rows and the columns of a data matrix, so that it yields meaningful groups that share certain features (within the same group) and are differentiated from each other (across different groups). It is richer than the usual clustering methods, since it groups observations or rows (e.g. customers) and features or columns (e.g. demographics) at the same time. The row and column clustering tells us which observations belong together most closely, and on which features. While this is generally what we expect from a clustering output (say, k-means), the tool does it in one step, so that one doesn’t need to “profile” clusters afterwards the way one ends up doing with k-means-like methods. Another useful feature is that hierarchical clustering only needs a dissimilarity matrix, so in principle it is not tied to means of numeric data the way k-means is. Note, however, that “heatmap.2” itself expects a numeric matrix, so non-numeric columns have to be dropped or encoded before plotting.
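What the heatmap does under the hood can be sketched with base R alone: standardize the columns, cluster the rows, then cluster the columns by running the same procedure on the transposed matrix. The small matrix below is made up purely for illustration, and the choice of two groups in each direction is an assumption.

```r
# Toy data (made up for illustration): 6 observations, 4 numeric features.
x <- matrix(
  c(1.0, 1.2, 9.0, 8.5,
    0.9, 1.1, 9.2, 8.8,
    1.1, 0.8, 8.7, 9.1,
    8.0, 8.3, 1.0, 1.4,
    8.4, 7.9, 1.2, 0.9,
    7.8, 8.1, 0.8, 1.1),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("obs", 1:6), c("f1", "f2", "f3", "f4"))
)

# Standardize each column so that no single feature dominates the distances.
xs <- scale(x)

# Cluster the rows (observations) and, separately, the columns (features).
# dist() defaults to Euclidean distance, hclust() to complete linkage.
row_hc <- hclust(dist(xs))
col_hc <- hclust(dist(t(xs)))

# Cut each tree into two groups (k = 2 is an assumption for this toy case).
row_groups <- cutree(row_hc, k = 2)
col_groups <- cutree(col_hc, k = 2)

print(row_groups)
print(col_groups)
```

A clustered heatmap is essentially this matrix redrawn with its rows and columns reordered according to the two trees.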
Consider the following simple example (an R dataset), “States”, which has 51 rows (the 50 U.S. states plus DC) and the following features:
EXAMPLE: “STATES” (R DATASET ON STATE EDUCATION PROFILES)
region: U. S. Census regions. A factor with levels: ENC, East North Central; ESC, East South Central; MA, Mid-Atlantic; MTN, Mountain; NE, New England; PAC, Pacific; SA, South Atlantic; WNC, West North Central; WSC, West South Central.
pop: Population: in 1,000s.
SATV: Average score of graduating high-school students in the state on the verbal component of the Scholastic Aptitude Test (a standard university admission exam).
SATM: Average score of graduating high-school students in the state on the math component of the Scholastic Aptitude Test.
percent: Percentage of graduating high-school students in the state who took the SAT exam.
dollars: State spending on public education, in $1000s per student.
pay: Average teacher’s salary in the state, in $1000s.
We want to use all but the first column (region) to create groups of states that are similar with respect to the different pieces of information we have about them. For instance, which states are similar in terms of exam scores versus state education spending? Instead of running a hierarchical clustering on its own, we can do both the HC and the visualization in one step, using the “heatmap.2” function.
R CODE (output = “initial_plot.png”)
States[1:3,] # look at the data scaled
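A minimal sketch of how such an initial heatmap might be produced, assuming the “States” data come from the carData package and “heatmap.2” from gplots. The helper `prep_matrix` and the argument choices are illustrative assumptions, not the original listing.

```r
# Assumed packages: carData (States data) and gplots (heatmap.2).
# install.packages(c("carData", "gplots"))

# Hypothetical helper: keep the numeric columns and standardize them.
prep_matrix <- function(df) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]
  scale(as.matrix(num))
}

if (requireNamespace("carData", quietly = TRUE) &&
    requireNamespace("gplots", quietly = TRUE)) {
  States <- carData::States
  print(States[1:3, ])            # look at the data
  scaled <- prep_matrix(States)   # region (a factor) is dropped here

  png("initial_plot.png", width = 900, height = 1200)
  gplots::heatmap.2(
    scaled,
    scale = "none",         # columns were already standardized above
    trace = "none",         # no trace lines drawn through the cells
    density.info = "none",  # no histogram inside the color key
    margins = c(8, 10)      # room for the column and row labels
  )
  dev.off()
}
```

With its defaults, “heatmap.2” clusters both rows and columns using Euclidean distance and complete linkage, and draws both dendrograms.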
This initial heatmap gives us a lot of information about the potential state groupings. We have a classic HC dendrogram on the far left of the plot (the output we would have gotten from plotting an “hclust()” result). However, to get an even cleaner look, and have the groups fall right out of the plot, we can add row and column separators, giving an “all-the-information-in-one-glance” look. The placement of the separators comes from the HC dendrograms (both row and column). Let’s also play around with the colors to get a “red-yellow-green” effect for the scaling, which will show the underlying information even more clearly. Finally, we’ll also drop the dendrograms, so we simply have a clean color plot with the underlying groups (this option can easily be undone in the code below).
R CODE (output = “final_plot.png”)
# Use color brewer library(RColorBrewer) my_palette
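One way to get such a plot, sketched under a few assumptions: the number of row and column groups (4 and 2 here) is a guess rather than a value from the original listing, `group_breaks` is a hypothetical helper, and the palette is built with base R's `colorRampPalette` (a ColorBrewer palette such as "RdYlGn" would be an alternative). The matrix is reordered by the clustering first, so the separators can be placed where group membership changes.

```r
# Hypothetical helper: positions along a tree's leaf order where the
# group label changes. hc is an hclust object, k the number of groups.
group_breaks <- function(hc, k) {
  groups_in_order <- cutree(hc, k = k)[hc$order]
  which(diff(groups_in_order) != 0)
}

if (requireNamespace("carData", quietly = TRUE) &&
    requireNamespace("gplots", quietly = TRUE)) {
  States <- carData::States
  scaled <- scale(as.matrix(States[, -1]))   # drop region, standardize

  row_hc <- hclust(dist(scaled))             # cluster the states
  col_hc <- hclust(dist(t(scaled)))          # cluster the features

  # Put clustered rows and columns next to each other.
  ordered <- scaled[row_hc$order, col_hc$order]

  # Red-yellow-green ramp with 100 shades.
  my_palette <- colorRampPalette(c("red", "yellow", "green"))(100)

  png("final_plot.png", width = 900, height = 1200)
  gplots::heatmap.2(
    ordered,
    Rowv = FALSE, Colv = FALSE,            # keep the order computed above
    dendrogram = "none",                   # clean plot without the trees
    scale = "none", trace = "none", density.info = "none",
    col = my_palette,
    rowsep = group_breaks(row_hc, k = 4),  # k = 4 is an assumption
    colsep = group_breaks(col_hc, k = 2),  # k = 2 is an assumption
    sepcolor = "white",
    margins = c(8, 10)
  )
  dev.off()
}
```

Changing the two `k` values moves the separators, so it is worth trying a few and comparing them with the dendrograms from the initial plot.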
This plot gives us a nice, clear picture of the groups that come out of the HC implementation, in the context of the column (attribute) groups as well. For instance, while Idaho, Oklahoma, Missouri and Arkansas perform well on the verbal and math SAT components, their state spending on education and average teacher salary are much lower than in the other states. These attributes are reversed for Connecticut, New Jersey, DC, New York, Pennsylvania and Alaska. We see how the states and their features combine to form groups that yield helpful information.
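A reading of the plot like this can be checked numerically by listing the members of each row group and their average feature values. The sketch below assumes the “States” data from the carData package and four row groups; `cluster_profile` is a hypothetical helper, and the groups it produces need not match the plot exactly if a different number of groups or a different linkage was used there.

```r
# Hypothetical helper: mean of every feature within each row cluster.
# x is a numeric data frame or matrix, hc an hclust on its rows,
# k the number of groups to cut the tree into.
cluster_profile <- function(x, hc, k) {
  groups <- cutree(hc, k = k)
  aggregate(as.data.frame(x), by = list(cluster = groups), FUN = mean)
}

if (requireNamespace("carData", quietly = TRUE)) {
  States <- carData::States
  num <- States[, -1]                   # drop region
  hc <- hclust(dist(scale(num)))        # cluster the standardized rows

  # Which states fall into which group (k = 4 is an assumption).
  print(split(rownames(States), cutree(hc, k = 4)))

  # Group means in the original units (SAT points, $1000s, and so on).
  print(cluster_profile(num, hc, k = 4))
}
```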
This hierarchical-clustering/heatmap partnership is a useful, productive one, especially when one is digging through massive data, trying to glean some useful cluster-based conclusions, *and* render the conclusions in a clean, pretty, easily interpretable fashion, all in really just one step. It’s a step one can take to learn more about a dataset before analyzing it in detail, without spending too much time on translating the clustering results or on communicating them to different audiences.