A funny thing happened with table() while making bootstrap samples in R

In my intro stats class, we spend some time computing bootstrap distributions for constructing confidence intervals.

One day, we gathered data from the 29 students in attendance about their class rank: first year, sophomore, junior, or senior. I coded those categories as 1, 2, 3, 4, and here are the results:

[Table of class-rank counts for the 29 students]
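To make this concrete, here is a sketch in R of what that setup looks like; the variable name rank and the individual counts below are invented for illustration (the real class was mostly sophomores with only a couple of first-years):

# hypothetical data: 29 students, coded 1 = first year, 2 = sophomore,
# 3 = junior, 4 = senior (counts made up for illustration)
rank <- c(rep(1, 2), rep(2, 21), rep(3, 4), rep(4, 2))

table(rank)                  # counts in each category
table(rank) / length(rank)   # proportions in each category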

You can see that this is basically a sophomore-level class.  We set about using bootstrap sampling to construct a 95% confidence interval for the proportion of sophomores.

I try to teach R using as few commands as possible, reusing what we know as we learn new techniques. In this case, I used table() again, extracting the 2nd column with

table( … )[2]

to get the proportion of sophomores in each bootstrap sample, as shown below. The funny thing we noticed is that the bootstrap distribution was skewed to the left. The resulting 95% confidence interval, computed using percentiles, surprised us because it captured 25%, and it seemed to us that the data didn’t support the possibility that a class such as this one was only a quarter sophomores.

[Code for the bootstrap using table()[2], and a histogram of the resulting left-skewed bootstrap distribution]
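Roughly, the code looked something like this sketch (the object names and the number of resamples are assumptions, not the exact code we typed in class):

set.seed(1)
boot_props <- replicate(10000, {
  resample <- sample(rank, length(rank), replace = TRUE)
  # take position 2 of the table, intended to be the proportion of sophomores
  table(resample)[2] / length(resample)
})

hist(boot_props)                       # skewed to the left, to our surprise
quantile(boot_props, c(0.025, 0.975))  # percentile 95% confidence interval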

What is going on above?  It turns out that with so few first-year students in our original sample, occasionally a bootstrap sample contained no first-years at all, like this:

[table() output for a bootstrap sample with no first-years]
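A minimal illustration of the problem, with an invented resample that happens to contain no first-years:

resample <- c(rep(2, 22), rep(3, 5), rep(4, 2))   # hypothetical resample, no 1s
table(resample)
# resample
#  2  3  4
# 22  5  2
table(resample)[2]   # returns 5, the count of juniors (code 3), not sophomores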

In that case, my code to extract the proportion in the 2nd column actually grabbed the proportion of juniors, which will almost always be much smaller than the proportion of sophomores, resulting in those unwanted left-lying values we see above.

In this case, with categories coded as 1, 2, 3, 4, replacing table()[2] with tabulate()[2] fixes the problem, since tabulate() counts occurrences of 1, 2, 3, … and records a zero for any missing category:

[tabulate() output for the same sample, with a zero in the first position]
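A quick sketch of the difference, reusing the invented resample from above; because tabulate() counts how many 1s, 2s, 3s, and 4s appear, an absent category shows up as a zero instead of being dropped:

tabulate(resample, nbins = 4)      # 0 22 5 2: the zero marks the empty first-year slot
tabulate(resample, nbins = 4)[2]   # 22, the sophomore count, as intended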

That isn’t very general though, especially because we’d like a technique that applies when the categorical variable has entries coded as strings.  My solution was to use

length(which( … == 2)).

Here’s the much more believable result with that approach:

[Code: bootstrap using length(which( … == 2))]

[Histogram: the resulting bootstrap distribution, now roughly symmetric]
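For completeness, here is a hedged sketch of that final version (again, the object names and the number of resamples are assumptions); the same counting works unchanged when the ranks are stored as strings, with something like resample == "sophomore" in place of resample == 2:

set.seed(1)
boot_props <- replicate(10000, {
  resample <- sample(rank, length(rank), replace = TRUE)
  length(which(resample == 2)) / length(resample)   # proportion of sophomores
})

hist(boot_props)                       # roughly symmetric this time
quantile(boot_props, c(0.025, 0.975))  # percentile 95% confidence interval

For what it’s worth, sum(resample == 2) counts the sophomores equally well; length(which(...)) just sticks with the reuse-what-we-know approach.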