In my intro stats class, we spend some time computing bootstrap distributions for constructing confidence intervals.
One day, we gathered data from the 29 students in attendance about their class rank: first year, sophomore, junior, senior. I coded those categories with 1,2,3,4 and here are the results:
You can see that this is basically a sophomore-level class. We set about using bootstrap sampling to construct a 95% confidence interval for the proportion of sophomores.
I try to teach R using as few commands as possible, reusing what we know as we learn new techniques. In this case, I used table() again, extracting the 2nd column with
table( … )
to get the proportion of sophomores in each bootstrap sample as below. The funny thing we noticed is that the bootstrap distribution was skewed to the left. The resulting 95% confidence interval computed using percentiles surprised us because it captured 25%, and it seemed to us that the data didn’t support the possibility that a class such as this one had a quarter sophomores.
What is going on above? Turns out that with so few first-year students in our original sample, occasionally a bootstrap sample occurred with no first-years, like this:
In that case, my code to extract the proportion in the 2nd column actually grabbed the proportion of juniors, which will almost always be much smaller than the proportion of sophomores, resulting in those unwanted left-lying samples we see above.
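To make the failure concrete, here is a minimal sketch. The vector `boot_sample` and its values are made up for illustration; they are not the class data. It shows that table() only reports categories that actually occur, so position 2 can end up pointing at a different category.

```r
# Hypothetical bootstrap sample that happens to contain no first-years (no 1s).
# These values are invented for illustration only.
boot_sample <- c(2, 2, 2, 2, 2, 3, 3, 4)

counts <- table(boot_sample)
category_names <- names(counts)   # only "2", "3", "4": there is no "1" entry

# Indexing by position: slot 2 now holds the juniors (3), not the sophomores.
by_position <- as.integer(counts[2])

# Indexing by name: this is the sophomore count we actually wanted.
by_name <- as.integer(counts["2"])
```

In this made-up sample, `by_position` picks up the two juniors while `by_name` picks up the five sophomores, which is the same mix-up that pulled some of our bootstrap proportions far to the left.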
In this case, with categories coded as 1,2,3,4, replacing table() with tabulate() fixes the problem — tabulate populates the missing column with zero:
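As a small sketch of that behavior, again with an invented sample rather than the class data, tabulate() keeps one slot per integer code, so slot 2 is always the sophomores. The names `boot_sample` and `bin_counts` are just placeholders.

```r
# Hypothetical bootstrap sample with no first-years (values invented).
boot_sample <- c(2, 2, 2, 2, 2, 3, 3, 4)

# One bin per code 1..4; codes that never occur get a count of zero.
bin_counts <- tabulate(boot_sample, nbins = 4)

# Slot 2 is the sophomore count regardless of which other codes appear.
prop_sophomores <- bin_counts[2] / length(boot_sample)
```

Setting `nbins = 4` is optional for this purpose, but it keeps the result the same length even when a sample contains no seniors.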
That isn’t very general though, especially because we’d like a technique that applies when the categorical variable has entries coded as strings. My solution was to use
length(which( … == 2)).
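Here is one way the whole bootstrap might look with that counting idea, sketched for string-coded data. Everything specific here is an assumption for illustration: the category counts in `ranks` are made up (only the total of 29 matches our class), and `prop_of`, the seed, and the 5000 resamples are arbitrary choices of mine.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical class of 29 students; the per-category counts are invented.
ranks <- rep(c("first-year", "sophomore", "junior", "senior"),
             times = c(2, 17, 7, 3))

# Proportion of one category, counted directly, so a category that is
# missing from a resample simply contributes a count of zero.
prop_of <- function(x, category) {
  length(which(x == category)) / length(x)
}

# Resample the class with replacement and record the sophomore proportion.
boot_props <- replicate(
  5000,
  prop_of(sample(ranks, replace = TRUE), "sophomore")
)

# 95% percentile confidence interval from the bootstrap distribution.
ci <- quantile(boot_props, probs = c(0.025, 0.975))
```

Because the count never depends on where a category sits in a table, it does not matter which other categories happen to show up in a given resample. When there are no missing values, `mean(x == category)` gives the same proportion in one step.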
Here’s the much more believable result with that approach: