# Correlation Matrix and Heatmap: R and Excel

A quick way to discover relationships between  pairs of quantitative variables in a dataset is a heatmap based on pair-wise correlations.

Here we do that in a variety of ways with the dataset StudentSurvey.csv

## In Excel with the Data Analysis Add-In

To activate the Data Analsysis Add-in, see this site

• Select Data–Data Analysis–Correlation
• Select the input range..in our case columns F:Q
• Check the box for “Labels in first row”
• Select output…either a new worksheet or a location in the current sheet

Fun take-aways:  The strongest positive correlations are between various SAT scores, Height/Weight, and Siblings/BirthOrder.  The strongest negative correlations are between Piercings and Height/Weight.  Do heavier, taller people really get fewer piercings?  Maybe there’s something else behind that 🙂

## In Excel without the Data Analysis plug-in

Some of my students have Office 2011 running on their MacBooks, and the Data Analysis add-in isn’t available for that version of Excel.  So here’s an excel add-in I made in Visual Basic following the many examples provided by the online excel community. To use it:

• open the file linked above in excel, then save as “Excel Add-in”…if all goes well, it will go into the proper directory to allow you to do the next step
• In Excel, load the custom add-in cm:  Tools—Add ins—cm
• make row and column labels:  select the labels for columns F thru Q, paste where you want the correlation matrix
• paste special with “transpose”  to make the row labels
• select the 13 by 13 array of cells and type =corrmatrix(f2:q363)   CTRL-SHIFT-ENTER
• you’ll know you’re on the right track if the autocomplete suggests “CorrMatrix” when you begin typing in the previous step.

## In LibreOffice Calc:

• Select the (contiguous) columns
• Select Data–Statistics–Correlation to make the matrix of correlations
• Use Format–Conditional Formatting to color the cells of the resulting matrix

The “get-what-you-pay-for” feature in LibreOffice (here version 5.2) is that the columns are labeled by a meaningless collection of numbers:  Column 1, Column 2, etc  which refer to the location of the column in relation to the selection, not the spreadsheet.  In fact, if the column headings are included in the selection, the correlation method complains about non numerical data.  To get the headings, you can copy and paste/paste special as in the second method above.

## R with base graphics:

m=StudentSurvey[6:17]
cm=cor(m,use=”na.or.complete”)
heatmap(cm)

The treelike network of lines is called a dendrogram — it seems to come by default with heatmap().  Notice the pairs connected at the first level of the dendrogram: Height/Weight, SATs, Siblings/BirthOrder.

## R with ggplot2

m=StudentSurvey[6:17]
cm=cor(m,use=”na.or.complete”)
library(“ggplot2”)
library(“reshape2”)  #we need this to “melt” matrix  cm into a dataframe
meltcm=melt(cm)
G=ggplot(meltcm)
G+geom_tile(aes(Var2,Var1,fill=value))

## Fancier with ggplot2

rcm=round(cm,2)
meltrcm=melt(rcm)
G=ggplot(meltrcm)
G+geom_tile(aes(Var2,Var1,fill=value))+geom_text(aes(Var2,Var1,label=value))