A funny thing happened with table() while making bootstrap samples in R

In my intro stats class, we spend some time computing bootstrap distributions for constructing confidence intervals.

One day, we gathered data from the 29 students in attendance about their class rank:  first year, sophomore, junior, senior.  I coded those categories with 1,2,3,4 and here are the results:

[Image: the class-rank data]

You can see that this is basically a sophomore-level class. We set about using bootstrap sampling to construct a 95% confidence interval for the proportion of sophomores.

I try to teach R using as few commands as possible, reusing what we know as we learn new techniques. In this case, I used table() again, extracting the 2nd column with

table( … )[2]

to get the proportion of sophomores in each bootstrap sample, as below. The funny thing we noticed is that the bootstrap distribution was skewed to the left. The resulting 95% confidence interval, computed using percentiles, surprised us because it included 25%, and it seemed to us that the data didn't support the possibility that a class such as this one was only a quarter sophomores.

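[Images: R code for the bootstrap using table( … )[2], and a histogram of the resulting bootstrap distribution]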

What is going on above?  Turns out that with so few first-year students in our original sample, occasionally a bootstrap sample occurred with no first-years, like this:

[Image: table() output for a bootstrap sample with no first-years]

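In that case, table() produces no column at all for the first-years, so every other category shifts one position to the left. My code to extract the proportion in the 2nd column then actually grabbed the proportion of juniors, which will almost always be much smaller than the proportion of sophomores, resulting in the unwanted samples in the left tail that we see above.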
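Here is a minimal sketch of that failure. The sample below is made up for illustration (it is not one of our actual bootstrap samples); it simply has 29 ranks and no first-years.

```r
# Hypothetical bootstrap sample of 29 class ranks with no first-years (code 1)
samp <- c(rep(2, 21), rep(3, 5), rep(4, 3))

tab <- table(samp) / length(samp)

names(tab)   # "2" "3" "4": no column exists for category 1
tab[2]       # position 2 now holds category 3, the juniors (5/29)
tab["2"]     # indexing by name, rather than position, still finds the sophomores (21/29)
```

Indexing by name is only an aside here; it returns NA if a resample happens to contain no sophomores at all.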

In this case, with categories coded as 1,2,3,4, replacing table()[2] with tabulate()[2] fixes the problem, because tabulate() fills in the missing column with a zero:

[Image: tabulate() output for a bootstrap sample with no first-years]
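A small sketch of the same idea, again with a made-up sample rather than our class data. It also shows one caveat worth knowing: tabulate() only counts up to the largest code it sees, unless you pass its nbins argument.

```r
# Hypothetical bootstrap sample of 29 class ranks with no first-years (code 1)
samp <- c(rep(2, 21), rep(3, 5), rep(4, 3))

counts <- tabulate(samp)
counts                      # 0 21 5 3: category 1 is kept, with a count of zero
counts[2] / length(samp)    # proportion of sophomores, 21/29

# Caveat: if the highest category is missing, the result is shorter
short <- tabulate(c(1, 2, 2, 3))             # 1 2 1 (no slot for category 4)
full  <- tabulate(c(1, 2, 2, 3), nbins = 4)  # 1 2 1 0
```

For the sophomore column this caveat doesn't bite, since position 2 is always present as long as some code of 2 or higher appears.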

That isn't very general though, especially because we'd like a technique that applies when the categorical variable has entries coded as strings. My solution was to use

length(which( … == 2)).

Here’s the much more believable result with that approach:

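[Images: R code for the bootstrap using length(which( … == 2)), and a histogram of the resulting bootstrap distribution]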
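Since the code above survives only as an image, here is a sketch of what the whole procedure might look like. The counts, the seed, the number of resamples, and the helper name boot_prop are all my assumptions for illustration, not the values from class.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical class data: 29 students, mostly sophomores (NOT the real counts)
ranks <- c(rep(1, 2), rep(2, 20), rep(3, 4), rep(4, 3))

# Proportion of entries equal to `value` in one bootstrap resample,
# counted directly so a missing category cannot shift anything
boot_prop <- function(x, value) {
  resample <- sample(x, length(x), replace = TRUE)
  length(which(resample == value)) / length(resample)
}

props <- replicate(5000, boot_prop(ranks, 2))
quantile(props, c(0.025, 0.975))   # 95% percentile interval
hist(props)

# The same function works unchanged when the categories are strings
rank_names <- c("first year", "sophomore", "junior", "senior")[ranks]
props_chr <- replicate(5000, boot_prop(rank_names, "sophomore"))
quantile(props_chr, c(0.025, 0.975))
```

Because the count comes from a comparison rather than a column position, a resample with no first-years just contributes an ordinary sophomore proportion.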

Dotplots in R: Base Graphics, ggplot, more Base Graphics


The textbook I use in my intro stats course makes extensive use of dotplots as an intuitive alternative to histograms when the number of data points is small enough to visualize each case as a single dot.  Here are two base graphics and one ggplot solution.  I especially like the last option below.

The base graphics function stripchart() takes some experimentation to get workable values for the parameters offset and at. By contrast, qplot and ggplot handle this as easily as a histogram. Note, however, that the vertical axis label "count" in the qplot/ggplot dotplot clearly doesn't match the scale of the axis. The scale looks more like that of a distribution, but it probably isn't one, because I think the sum over that axis would be greater than 1. In fact the documentation of geom_dotplot admits to this error: "When binning along the x axis and stacking along the y axis, the numbers on y axis are not meaningful, due to technical limitations of ggplot2." Argh...

We should probably hide the y axis with + scale_y_continuous(NULL, breaks = NULL). I learned from Stack Overflow of another method using base graphics plot() together with sort(), sequence(), and table(). After some needed attention to the axis labels, this option looks to me like the winner of the bunch.
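For the ggplot version, hiding the meaningless y axis might look like the sketch below. The heights are made up so that the example runs without the StudentSurvey data, and the binwidth of 1 is just a choice that suits whole-inch values.

```r
library(ggplot2)

# Hypothetical heights in inches, standing in for StudentSurvey$Height
heights <- data.frame(
  Height = c(62, 64, 64, 65, 66, 66, 66, 67, 68, 68, 70, 71, 71, 74)
)

p <- ggplot(heights, aes(x = Height)) +
  geom_dotplot(binwidth = 1) +               # binwidth chosen for illustration
  scale_y_continuous(NULL, breaks = NULL)    # drop the y axis title and tick labels
print(p)
```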

stripchart(StudentSurvey$Height, method = "stack", offset = .5, 
                      at = .1, pch = 19)

qplot(Height,data=StudentSurvey,geom="dotplot")

G=ggplot(data=StudentSurvey)
G+geom_dotplot(aes(x=Height))

#Another way to get the job done using base graphics
x=StudentSurvey$Height
plot(sort(x), sequence(table(x)))

#Here's a custom function to make this last thing happen with appropriate labels
dotty=function(x){
  plot(sort(x), sequence(table(x)), ylab="count", xlab=deparse(substitute(x)))
}
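To see why the sort() / sequence() / table() combination makes a dotplot, it helps to look at the pieces on a tiny made-up vector (these are not values from StudentSurvey):

```r
x <- c(66, 64, 66, 70, 64, 66)   # hypothetical heights

xs <- sort(x)                    # 64 64 66 66 66 70: horizontal positions
tab <- table(x)                  # counts 2, 3, 1 for the values 64, 66, 70
stack <- sequence(tab)           # 1 2 1 2 3 1: height of each dot in its stack

# Each sorted value is paired with its position in the stack for that value
plot(xs, stack, pch = 19, xlab = "Height", ylab = "count")
```

table() reports its counts in sorted order of the values, which is why its stack positions line up with sort(x).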