Averages vs histograms

With graphics being so easy to add to documents these days, why don’t we show more histograms instead of representing very complicated data with one or two numbers (e.g., average and standard deviation)? Sure, if your data is normally distributed, then those two numbers really are a great distillation of the data. However, lots of things aren’t normally distributed, and I’m lobbying for more use of histograms instead of (or, I suppose, in conjunction with) numeric summaries of the data set.

Here’s the example that got me thinking about this today. At my school, student evaluations of instructors are very important. We use a seven-point Likert scale on questions such as “The instructor encourages me to learn actively” and “This course was a valuable learning experience.” Quite often, reviews of faculty are peppered with means, and occasionally standard deviations, of the evaluation data for the faculty member under review. However, the data is not normally distributed at all! It can be bimodal (some hate me, some love me) or highly skewed in other ways.

I’ve been working lately on an interface for our evaluations to help people on the tenure and promotion committee make wise recommendations. Instead of making them click through to each course, I’ve made a table that shows the class average on each question, with one row for each course the faculty member has taught. But while thinking about showing histograms in addition to averages, I hit upon using PHP to dynamically create SVGs of the histograms. Here’s what it looks like:

5 courses for an anonymous faculty member. Each column is a different question on our standard evaluation.

I feel like you learn a lot by looking at the (tiny) histograms. Take the three “4.44”s that are in the third class. The middle one is much more bimodal than the other two.
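To make the point concrete, here is a minimal sketch in Python using made-up responses (these are not the real evaluation data). The helper name `bin_counts` and both response lists are assumptions chosen so that the two sets share the same 4.44 average while having very different shapes.

```python
from collections import Counter


def bin_counts(responses, points=7):
    """Return how many responses landed on each choice from 1 to `points`."""
    tally = Counter(responses)
    return [tally.get(choice, 0) for choice in range(1, points + 1)]


# Invented responses from nine students; not real evaluation data.
clustered = [4, 4, 4, 4, 4, 5, 5, 5, 5]
split = [2, 2, 2, 2, 4, 7, 7, 7, 7]

for name, responses in (("clustered", clustered), ("split", split)):
    average = sum(responses) / len(responses)
    print(f"{name}: mean {average:.2f}, counts {bin_counts(responses)}")
```

Both lists average 40/9, which rounds to 4.44, yet the first set sits entirely on the 4 and 5 choices while the second is split between the 2 and 7 choices. Only the per-choice counts reveal that.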

What am I lobbying for? I’d love it if many more reports, journal articles, and newspaper stories did this kind of thing. Generating and including the graphics is really not that hard, and I think it communicates the whole story, not just a distilled version.
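As a rough illustration of how little code is involved, here is a minimal sketch in Python rather than the PHP mentioned earlier. The function name `svg_histogram`, the pixel dimensions, the fill color, and the example counts are all assumptions for illustration; this is not the code behind the table above.

```python
def svg_histogram(counts, width=70, height=30, gap=1):
    """Return an SVG string with one bar per bin, scaled to the tallest bin.

    Assumes `counts` is a non-empty list of non-negative integers.
    """
    tallest = max(counts) or 1  # avoid dividing by zero when every bin is empty
    bar_width = (width - gap * (len(counts) - 1)) / len(counts)
    bars = []
    for i, count in enumerate(counts):
        bar_height = height * count / tallest
        x = i * (bar_width + gap)
        y = height - bar_height  # SVG's y axis points down, so bars hang from the top
        bars.append(
            f'<rect x="{x:.1f}" y="{y:.1f}" '
            f'width="{bar_width:.1f}" height="{bar_height:.1f}" fill="steelblue"/>'
        )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        + "".join(bars)
        + "</svg>"
    )


# Invented counts for the seven choices, lowest to highest.
svg = svg_histogram([0, 4, 0, 1, 0, 0, 4])
print(svg)
```

The returned string can be written to a `.svg` file or pasted inline into an HTML table cell, which is one way to get a tiny histogram next to each average.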

One downside is that the data becomes hard to describe. I was showing this to my partner and trying to say “this one is different than that one,” and I had to point to them because I couldn’t easily describe them. So I resorted to saying “the 4.44 one . . .” and so on. I suppose this backs up my point that the data sets are complex and resist easy description, but I know my colleagues on the tenure and promotion committee like to discuss these evaluations at length.

Here’s another interesting point from a friend of mine (who’ll remain anonymous):

Averages and SDs are **NOT** appropriate for categorical data. They assume the “distance” between each category is equal, as if the numerical choices were locations on a spatial scale. They are not. You’ve got two choices: Report number of responses in each bin (as you’re playing with); or turn to Rasch analysis, which is designed for exactly this problem. But it’s not for the faint of heart…

Interesting, huh?

Your thoughts? Here are some starters for you:

  • This is great. I totally agree that representing all of the data is much better than any distillations. I would even go further by suggesting . . .
  • This is dumb. We use the distillations for several very good reasons . . .
  • Why do you use evaluation data at all? They’ve clearly been shown to be problematic.
  • Why a 7-point Likert scale? How about a 2-point Love-ert scale?
  • How did you make those SVG histograms in PHP?
  • PHP?!!? I’m never reading this blog again.
  • Wait, I thought you only knew how to use Mathematica.

About Andy Rundquist

Professor of physics at Hamline University in St. Paul, MN

6 Responses to Averages vs histograms

  1. bretbenesh says:

    I give a big thumbs up to your anonymous friend: categorical data should definitely not be averaged.

    I think that your histograms are preferable, although I might even prefer a basic list of frequency distributions if I were on Rank and Tenure (two notes: this is probably a personal preference, and I definitely prefer just seeing the numerical frequencies if there are only five possibilities on the Likert scale, as I am used to, but I might start preferring the histograms if I had to deal with seven possibilities).

    And that is some pretty nice PHP.

  2. Mr. John says:

    Hello Andy. I am a former student of yours. I have a few thoughts on this piece. Based on your description, it seems that Hamline puts way too much stock in student evals. I just read an article the other day about how student evals are actually one of the worst indicators of teaching quality, whereas a peer evaluation system would be much preferred.

    Nonetheless, I love your point that the 4.44 averages are not all the same, as the histograms showed.

    As a side note, I think that if the evals are so high-stakes, teachers should have some say in how questions are worded and what gets asked, so you have more input in the evaluation process.

    What I learned in your class cannot be quantified in an evaluation process. I learned the value of struggling. I learned what it is like to delve into really difficult material and try to make it work. Having had that experience has really helped me better understand the students I am working with this year.

    Finally, I think your classes really foster self-knowledge and reflection, which again does not really get captured in the student eval process.

  3. Joss Ives says:

    Ignoring the categorical data issue, and the fact that I would almost always take a visual representation over numerical summary representation, wouldn’t the standard deviation of the bimodal 4.44 be so large that it could only represent bimodal data? More data is usually better, but sometimes one just wants a summary (hello Metacritic).

    • andrewkbennett says:

      It makes sense that most of the time, the standard deviation of a bimodal distribution will be larger, but is it necessarily? Those shape-related features can still get lost in translation, I think. Consider {2,2,2,3,3,4,4,5,5,6,6,6} and {1,2,2,3,3,4,4,5,5,6,6,7}. Not a great example, perhaps, but it does show that the standard deviation of the more “bimodal-looking” distribution can be smaller.
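The two sets in the reply above can be checked directly. This is a minimal sketch using Python's standard `statistics` module; the variable names are assumptions, and it uses the population standard deviation (`pstdev`), though the ordering is the same with the sample version (`stdev`).

```python
from statistics import mean, pstdev

# The two data sets quoted in the reply above.
bimodal_looking = [2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6]
flatter = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7]

for name, data in (("bimodal-looking", bimodal_looking), ("flatter", flatter)):
    print(f"{name}: mean {mean(data)}, SD {pstdev(data):.2f}")
```

Both sets have a mean of 4. The squared deviations sum to 28 for the first set and 38 for the second, so the population standard deviations are about 1.53 and 1.78: the more bimodal-looking set has the smaller spread.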

  5. tandrsc says:

    I’ve just come across this. I had a similar problem in my supplement poll where I wanted to show which supplements most people found most helpful.

    I’ve used smilies to indicate where there are more good than bad (or vice versa) and a rollover shows the actual numbers.

    I used Python and JavaScript.
