(In)Consistencies with base R functions and their input arguments.

My Introductory Statistics students are slightly annoyed that base R functions don't all share the same calling protocol. Some do, conveniently, e.g. the frequently paired plot() and lm():

plot(response_variable ~ explanatory_variable, data = MyDataFrame)
linear_model = lm(response_variable ~ explanatory_variable, data = MyDataFrame)
abline(linear_model)

but then…they find that cor() accepts neither a formula x ~ y as an argument nor the data frame via data=.

Similarly, after getting over the shock of seeing NA as the output of mean(x) and learning that mean(x, na.rm=TRUE) gives them control over the treatment of NA in their data, they find that cor() doesn’t recognize na.rm and instead needs something like use="na.or.complete"!
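The mismatch is easy to reproduce with a couple of made-up vectors (not our class data):

```r
# Synthetic vectors with missing values
x <- c(1, 2, 3, NA, 5)
y <- c(2, 4, 5, 7, NA)

mean(x)                # NA -- a single missing value poisons the result
mean(x, na.rm = TRUE)  # 2.75 -- na.rm gives control back

# cor() doesn't know na.rm; it wants its own 'use' vocabulary instead:
cor(x, y, use = "na.or.complete")
```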

I’ve written the experimental R package bRth, which provides wrappers around a number of commonly used base R functions (the ones named above and others) with the aim of providing a uniform calling protocol.  The names of the included functions are of the form <base_r_function_name>K: meanK(), corK(), histK(), etc.

To install the package from GitHub and see the contents, type the following in R (this requires that the devtools package is installed):

devtools::install_github("ejbarth/bRth/bRth")
library("bRth")
package?bRth

The performance cost doesn’t seem to be too severe.  The functions themselves do just a little argument handling and then call the base R function.  Here are timings for 10,000 calls to base::cor and corK; the wrapper roughly doubles the time of this fast base call, which suggests its overhead is comparable to the cost of cor() itself:

> system.time(replicate(10000,corK(Height~Weight,data=StudentSurvey,na.rm=TRUE)))
 user system elapsed 
 0.82 0.00 0.81 
> system.time(replicate(10000,cor(StudentSurvey$Height,StudentSurvey$Weight,use="na.or.complete")))
 user system elapsed 
 0.46 0.00 0.45
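For the curious, a wrapper of this kind needs only a few lines. Here is a minimal sketch of how a corK-style function could translate a formula plus na.rm into the base call; this is my guess at the shape, not the package's actual source, and corK_sketch is a hypothetical name:

```r
# Hypothetical sketch of a corK-style wrapper; the real bRth code may differ.
corK_sketch <- function(formula, data, na.rm = FALSE) {
  vars <- all.vars(formula)   # Height ~ Weight  ->  c("Height", "Weight")
  use  <- if (na.rm) "na.or.complete" else "everything"
  cor(data[[vars[1]]], data[[vars[2]]], use = use)
}

# Usage mirrors plot() and lm():
# corK_sketch(Height ~ Weight, data = StudentSurvey, na.rm = TRUE)
```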

A funny thing happened with table() while making bootstrap samples in R

In my intro stats class, we spend some time computing bootstrap distributions for constructing confidence intervals.

One day, we gathered data from the 29 students in attendance about their class rank:  first year, sophomore, junior, senior.  I coded those categories with 1,2,3,4 and here are the results:

[Figure: table of the class-rank counts for the 29 students]

You can see that this is basically a sophomore-level class.  We set about using bootstrap sampling to construct a 95% confidence interval for the proportion of sophomores.

I try to teach R using as few commands as possible, reusing what we know as we learn new techniques.  In this case, I used table() again, extracting the 2nd entry with

table( … )[2]

to get the proportion of sophomores in each bootstrap sample as below.  The funny thing we noticed is that the bootstrap distribution was skewed to the left.  The resulting 95% confidence interval, computed using percentiles, surprised us: it captured 25%, and it seemed to us that the data didn’t support the possibility that a class such as this one was a quarter sophomores.

[Figure: bootstrap code and the resulting left-skewed histogram]

What is going on above?  It turns out that with so few first-year students in our original sample, occasionally a bootstrap sample occurred with no first-years at all, like this:

[Figure: table of a bootstrap sample containing no first-years]

In that case, my code to extract the 2nd entry actually grabbed the proportion of juniors, which will almost always be much smaller than the proportion of sophomores, producing those unwanted left-lying samples we see above.

In this case, with categories coded as 1,2,3,4, replacing table()[2] with tabulate()[2] fixes the problem, since tabulate fills the missing category with a zero count:

[Figure: tabulate() output showing a zero for the missing category]
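The difference is easy to reproduce with a made-up resample that happens to contain no 1s (this isn't the actual class data; I pass nbins = 4 to be explicit about the number of categories):

```r
s <- c(2, 2, 2, 3, 4)        # a resample with no first-years
table(s)                      # names "2" "3" "4": entry [2] now counts juniors
tabulate(s, nbins = 4)        # 0 3 1 1: position [2] still means sophomores

table(s)[2]                   # 1  (juniors -- the wrong thing)
tabulate(s, nbins = 4)[2]     # 3  (sophomores -- what we wanted)
```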

That isn’t very general though, especially because we’d like a technique that applies when the categorical variable has entries coded as strings.  My solution was to use

length(which( … == 2)).

Here’s the much more believable result with that approach:

[Figure: bootstrap code using length(which(…)) and the resulting histogram]
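Putting it together, here is a self-contained sketch of the whole bootstrap with the length(which(…)) extraction. The ranks vector below is synthetic, chosen only to resemble a sophomore-heavy class of 29; the real class data isn't reproduced here:

```r
set.seed(1)                                              # for reproducibility
ranks <- c(rep(1, 2), rep(2, 20), rep(3, 5), rep(4, 2))  # 29 synthetic students

boot <- replicate(10000, {
  s <- sample(ranks, replace = TRUE)          # one bootstrap resample
  length(which(s == 2)) / length(s)           # proportion of sophomores
})

hist(boot)                                    # no more left-lying stragglers
quantile(boot, c(0.025, 0.975))               # percentile 95% CI
```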

Dotplots in R: Base Graphics, ggplot, more Base Graphics


The textbook I use in my intro stats course makes extensive use of dotplots as an intuitive alternative to histograms when the number of data points is small enough to visualize each case as a single dot.  Here are two base graphics solutions and one ggplot solution.  I especially like the last option below.

Here the base graphics function stripchart() takes some experimentation to get workable values for the parameters offset and at.  By contrast, qplot and ggplot handle this as easily as a histogram.

Note however that the vertical axis label “counts” in the qplot/ggplot dotplot clearly doesn’t match the scale of the axis, which looks more like a density, though it can’t quite be that either, since I think the values on that axis would sum to more than 1.  In fact the documentation of geom_dotplot admits to this error: “When binning along the x axis and stacking along the y axis, the numbers on y axis are not meaningful, due to technical limitations of ggplot2.”  Argh…..

We should probably hide the y axis with +scale_y_continuous(NULL, breaks = NULL).

I learned from stackoverflow of another method using the base graphics function plot() together with sort(), sequence(), and table().  After some needed attention to the axis labels, this option looks to me like the winner of the bunch.

stripchart(StudentSurvey$Height, method = "stack", offset = .5,
           at = .1, pch = 19)

qplot(Height,data=StudentSurvey,geom="dotplot")

G=ggplot(data=StudentSurvey)
G+geom_dotplot(aes(x=Height))

#Another way to get the job done using base graphics
x=StudentSurvey$Height
plot(sort(x), sequence(table(x)))

#Here's a custom function to make this last thing happen with appropriate labels
dotty=function(x){plot(sort(x), sequence(table(x)),ylab="count",xlab=deparse(substitute(x)))}

Correlation Matrix and Heatmap: R and Excel

A quick way to discover relationships between  pairs of quantitative variables in a dataset is a heatmap based on pair-wise correlations.

Here we do that in a variety of ways with the dataset StudentSurvey.csv

In Excel with the Data Analysis Add-In

To activate the Data Analysis Add-in, see this site

  • Select Data–Data Analysis–Correlation
  • Select the input range…in our case columns F:Q
  • Check the box for “Labels in first row”
  • Select output…either a new worksheet or a location in the current sheet

[Figures: Excel Data Analysis correlation dialog and the resulting correlation matrix]

Fun take-aways:  The strongest positive correlations are between various SAT scores, Height/Weight, and Siblings/BirthOrder.  The strongest negative correlations are between Piercings and Height/Weight.  Do heavier, taller people really get fewer piercings?  Maybe there’s something else behind that 🙂

In Excel without the Data Analysis plug-in

Some of my students have Office 2011 running on their MacBooks, and the Data Analysis add-in isn’t available for that version of Excel.  So here’s an Excel add-in I made in Visual Basic, following the many examples provided by the online Excel community. To use it:

  • open the file linked above in Excel, then save it as “Excel Add-in”…if all goes well, it will go into the proper directory to allow you to do the next step
  • In Excel, load the custom add-in cm:  Tools—Add ins—cm
  • make row and column labels:  select the labels for columns F thru Q and paste them where you want the correlation matrix
  • paste special with “transpose” to make the row labels
  • select the 12 by 12 array of cells and type =corrmatrix(f2:q363), then CTRL-SHIFT-ENTER
  • you’ll know you’re on the right track if the autocomplete suggests “CorrMatrix” when you begin typing in the previous step

[Figure: correlation matrix produced by the CorrMatrix VBA add-in]


In LibreOffice Calc:

  • Select the (contiguous) columns
  • Select Data–Statistics–Correlation to make the matrix of correlations
  • Use Format–Conditional Formatting to color the cells of the resulting matrix

The “get-what-you-pay-for” feature in LibreOffice (here version 5.2) is that the columns are labeled with a meaningless collection of numbers: Column 1, Column 2, etc., which refer to the location of each column within the selection, not the spreadsheet.  In fact, if the column headings are included in the selection, the correlation method complains about non-numerical data.  To get the headings, you can copy and paste/paste special as in the second method above.

[Figures: LibreOffice Data–Statistics–Correlation menu, conditional formatting dialog, and the resulting matrix]

R with base graphics:

m=StudentSurvey[6:17]
cm=cor(m,use="na.or.complete")
heatmap(cm)

[Figure: heatmap of the correlation matrix with dendrogram]

The treelike network of lines is called a dendrogram — it seems to come by default with heatmap().  Notice the pairs connected at the first level of the dendrogram: Height/Weight, SATs, Siblings/BirthOrder.

R with ggplot2

m=StudentSurvey[6:17]
cm=cor(m,use="na.or.complete")
library("ggplot2")
library("reshape2")  #we need this to "melt" matrix cm into a dataframe
meltcm=melt(cm)
G=ggplot(meltcm)
G+geom_tile(aes(Var2,Var1,fill=value))

[Figure: ggplot2 tile heatmap of the correlation matrix]

Fancier with ggplot2

rcm=round(cm,2)
meltrcm=melt(rcm)
G=ggplot(meltrcm)
G+geom_tile(aes(Var2,Var1,fill=value))+geom_text(aes(Var2,Var1,label=value))

[Figure: ggplot2 heatmap with the correlation values overlaid as text]

Making plots in R: Base Graphics, qplot and ggplot

I won’t wade into the active debate about whether base graphics are better or worse than ggplot2 and qplot for R beginners (or maybe beginneRs…).  Rather, I’ll document side-by-side those plots that we find useful in my intro stats class.  In each case, I’ve let the default behavior speak for itself.  The dataset StudentSurvey.csv comes from the Lock5 text “Unlocking the Power of Data”.

Notice that I initialize the ggplot object with each example below,

G=ggplot(data=StudentSurvey)

which doesn’t fully take advantage of the graphics grammar: if I were exploring this dataset with ggplot, I would do that initialization just once and use the object in various ways with G + geom_xxx.

In each case below, the results of the qplot and ggplot commands are identical, so I’ll show only two plots:  the base graphics and the qplot/ggplot.

After working these examples, maybe I start to see the point of those who claim that ggplot really isn’t more difficult for beginners.  I think that if you have one plot to draw and you know how to do it with base graphics, there’s no reason to call upon the richer features and more appealing logic of ggplot.  On the other hand, if you are doing work with a dataset that requires exploring with pictures, ggplot takes you much further.

Follow this link to another very helpful side-by-side comparison.

install.packages("ggplot2")
library("ggplot2")

Scatter plots:

plot(StudentSurvey$Height,StudentSurvey$Weight)

qplot(Height,Weight,data=StudentSurvey,geom="point")

G=ggplot(data=StudentSurvey)
G+geom_point(aes(x=Height,y=Weight ))

Histograms:

hist(StudentSurvey$Height)

qplot(Height,data=StudentSurvey,geom="histogram")

G=ggplot(data=StudentSurvey)
G+geom_histogram(aes(x=Height))

Box plots:

boxplot(StudentSurvey$Height~StudentSurvey$Gender)

qplot(Gender,Height,data=StudentSurvey,geom="boxplot")

G=ggplot(data=StudentSurvey)
G+geom_boxplot(aes(x=Gender,y=Height))

Side by Side Bar plots:

This one is harder.  For base graphics, we need to make a table of counts for barplot to consume.  The legend in base graphics isn’t drawn automatically, so you need to read the table correctly to enter the legend entries by hand.

I don’t know how to do this in qplot.

Notice that the table and barchart ignore missing data, while ggplot includes it.

t=table(StudentSurvey$Year,StudentSurvey$Gender)
barplot(t,beside=TRUE,legend=c("First Year","Junior","Senior","Sophomore"))
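One way to avoid typing the legend labels by hand is to take them from the table's own row names, which can't fall out of sync with the bars. A hedged sketch with made-up vectors so it stands alone (any two categorical vectors work the same way):

```r
year   <- c("First Year", "Sophomore", "Sophomore", "Junior", "Senior", "Sophomore")
gender <- c("F", "M", "F", "M", "F", "M")

t <- table(year, gender)
barplot(t, beside = TRUE, legend = rownames(t))  # labels come from the table itself
```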

G=ggplot(data=StudentSurvey)
G+geom_bar(aes(x=Gender,fill=Year),position="dodge")

Scatter plots with regression line:

#Base Graphics
plot(StudentSurvey$Weight~StudentSurvey$Height)
regmodel=lm(StudentSurvey$Weight~StudentSurvey$Height)
abline(regmodel)

# ggplot
G=ggplot(data=StudentSurvey)
Gp=G+geom_point(aes(x=Height,y=Weight))
regmodel=lm(Weight~Height,data=StudentSurvey)
Gp+geom_abline(intercept=coef(regmodel)[1],slope=coef(regmodel)[2])

#a ggplot alternative that doesn't require calling lm
G=ggplot(data=StudentSurvey,aes(x=Height,y=Weight))
G+geom_point()+geom_smooth(method=lm,se=FALSE)

Dot plots

Here the base graphics function stripchart() takes some experimentation to get workable values for the parameters offset and at.  By contrast, qplot and ggplot handle this as easily as a histogram.

Note however that the vertical axis label “counts” in the qplot/ggplot dotplot clearly doesn’t match the scale of the axis, which looks more like a density, though it can’t quite be that either, since I think the values on that axis would sum to more than 1.  In fact the documentation of geom_dotplot admits to this error: “When binning along the x axis and stacking along the y axis, the numbers on y axis are not meaningful, due to technical limitations of ggplot2.”  Argh…..

We should probably hide the y axis with +scale_y_continuous(NULL, breaks = NULL)

I learned from stackoverflow of another method using the base graphics function plot() together with sort(), sequence(), and table().  After some needed attention to the axis labels, this option looks to me like the winner of the bunch.

stripchart(StudentSurvey$Height, method = "stack", offset = .5,
           at = .1, pch = 19)

qplot(Height,data=StudentSurvey,geom="dotplot")

G=ggplot(data=StudentSurvey)
G+geom_dotplot(aes(x=Height))

#Another way to get the job done using base graphics
x=StudentSurvey$Height
plot(sort(x), sequence(table(x)))

#Here's a custom function to make this last thing happen
dotty=function(x){plot(sort(x), sequence(table(x)),ylab="count",xlab=deparse(substitute(x)))}