I tried an experiment yesterday in my Introduction to Computational Data Science course. We have been working on doing analysis of Kaggle data sets, with each student having picked what they want to work on that will also lead to some web page scraping later in the semester. They have to analyze the data in multiple ways and eventually tell a story with it. We’ve been learning lots of python and pandas tricks to do all this work, but I wanted to help them deal with the sorts of questions they constantly have to think about. So I decided that together we’d all tackle a brand new (to us) data set in class and see how far we could get. I didn’t know what to expect, but I knew we’d hit a bunch of tough points and I planned to have some meta conversations about them when we did. It went pretty well and I thought I’d get some notes down here about it.
We started with each small group (which, of course, I set up using my “Would you rather” approach) proposing a search term we could use on Kaggle. Then they voted on the proposed terms and I typed the winner (“wealth”) into the search box. Then we voted on which search result we should use. (Note that I’m not going to give the details about which data set we used. See below for why.)
Next we loaded the data set into python (using Google’s Colab, if you must know) and started looking at it. A quick shout-out to the pandas library: the data had 1,000,000 rows and we were able to work with it with no problems at all (imagine opening it in google sheets or that upstart other spreadsheet software – I think it’s called excel or something like that).
We hit our first snag when we were trying to figure out what types of things were in a particular column. Students suggested just printing that column but they saw that we’d still have to scroll through a million rows to really see what’s in there. They asked me to run the unique() function on that column but that still had 1000 items in it (still tough to scroll through and get a good sense of what’s going on). We settled on value_counts() to see the most popular items in that column, but then we hit another snag.
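For anyone following along, here is a minimal sketch of the difference between unique() and value_counts(). The data frame and the column name vehicle are made up for illustration; they are not the data set we used in class.

```python
import pandas as pd

# Hypothetical stand-in for the real data; "vehicle" is an invented column name.
df = pd.DataFrame({"vehicle": ["car", "truck", "car", "bike", "car", "truck"]})

# unique() returns the distinct values in order of first appearance.
distinct = df["vehicle"].unique()

# value_counts() returns a Series whose index holds the distinct values
# and whose values are the counts, sorted from most to least common.
counts = df["vehicle"].value_counts()

print(len(distinct))   # how many distinct values there are
print(counts.head(3))  # the three most common values
```

Note that the distinct values end up in the index of the result, not in a regular column, which matters for what happens next.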
We couldn’t tell if the unique values in that column told the whole story. We were trying to see, for example, if a single row might have two things in that column, kind of like “car, truck” when describing accidents. Do car and truck each show up in their own rows, or, if an accident involves both, are they together in one row? Looking at the unique elements we saw “car” but we couldn’t be sure whether “car, truck” might also be somewhere in those 1000 items. That’s when a student said “just use df.column.str.contains(".*car.*") and we’ll be in business!” Excellent, just what we’d learned in the last two weeks – a combo of pandas and regular expression jujitsu! But, alas, it didn’t work.
You see, I knew as soon as I saw the result of value_counts() that we were in trouble. I know all you pandas ninjas out there are laughing at me right now, but neither I nor my students knew how to filter the index values of a pandas series. Every suggestion they had got slapped down because it would only work on regular columns, not index columns.
I’m super happy to report that I wasn’t faking it. I literally didn’t know how to do it. However, and this I reported to the class, I knew it could be done. That’s one of my learning outcomes: I want my students to have confidence in what’s possible, even if they don’t know how to do it. So I asked them what they wanted to do. Did they want to google how to do it in pandas or follow another student’s suggestion and just do it straight in python using a loop? We had a great conversation about what they might put into a google search to help out and it was clear that they’d always add “pandas” to the search terms. We tried a few but then I had a brainstorm. I said “This will seem stupid and overkill, but I know it will work. Watch this.” And then I typed pd.Series(df.column.value_counts().index.values).str.contains(".*car.*"). Yep, I took the unique results from value_counts() and recast them as a new and different pandas series just so that the index column would be a normal column. Super overkill. But it worked. The students groaned and said it was likely a really dumb way to do it.
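Here is a hedged sketch of that recast-the-index workaround on made-up data (the vehicle column and its values are invented). It also shows one possible shortcut, since a pandas Index has its own .str accessor. That is just one option, not necessarily the elegant answer the students were hoping for.

```python
import pandas as pd

# Invented data: one row deliberately holds two items in the same cell.
df = pd.DataFrame({"vehicle": ["car", "truck", "car", "car, truck", "bike"]})
counts = df["vehicle"].value_counts()

# The workaround from class: turn the index labels into an ordinary Series
# so the usual .str methods apply.
labels = pd.Series(counts.index.values)
mask_workaround = labels.str.contains("car")

# A possible shortcut: Index objects also expose .str, so the labels can be
# tested directly and the resulting mask used to filter the counts.
mask_direct = counts.index.str.contains("car")
matches = counts[mask_direct]

print(matches)
```

Because str.contains does a regular expression search anywhere in the string, the plain pattern "car" finds the same labels as ".*car.*".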
So we stopped to talk about it. And I think it was my favorite part of the day.
I said “come on, it works, who cares?” Some responded saying that it can’t possibly be the elegant solution that surely exists. They talked about how they hate it when they do something dumb like this and later learn a much better way. For me this is one of the key things about algorithmic thinking. Helping students see and discuss issues like this is what I love. Yep it was overkill. Yep it’s not elegant. But it works. Is that the end of the story? It doesn’t seem like my students think so.
By the way, after all that we learned that nothing like “car, truck” exists so we were in business! Next we wanted to get a visual on the data for the most popular item in that column. I’ll call it “cars” for now. Basically we wanted to know how another numeric column behaves for the rows about “cars”. We decided on a histogram and we were surprised by the result. Essentially it only had one bar way on the left and then a bunch of empty space all the way to the right. I reminded them that if it showed that much empty space it must be that there were some really small bars that just were hard to see. What the heck? That’s when I showed them that while df.column.hist() and plt.hist(df.column) both show the same graph, the latter also returns the raw counts and edges of the histogram bins, which the notebook prints out. That’s super useful when you’re trying to see what’s going on with weird data. Sure enough the first bar had a count of 60,000 and the next 8 bars had counts of zero and the last bar had a count of 9 (I’d say 9! but that’s actually much bigger than 60,000).
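A small sketch of reading the numbers behind a histogram like that one. The data is simulated to mimic the shape described above (a big cluster plus 9 far-off values); the actual values and the 1,000,000 outlier are assumptions for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data: 60,000 modest values plus 9 wildly large ones.
rng = np.random.default_rng(0)
values = np.concatenate([rng.uniform(0, 100, 60_000), np.full(9, 1_000_000.0)])

# plt.hist draws the plot and also returns the bin counts, the bin edges,
# and the drawn bars. The default is 10 bins.
counts, edges, _ = plt.hist(values)

print(counts)  # one number per bar
print(edges)   # one more edge than there are bars
```

With data shaped like this, the first bar holds the whole cluster, the last bar holds the 9 stragglers, and everything in between is empty.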
Looking at the value of the bins one student shouted “typo!” meaning that those 9 must be due to data entry problems. They had good reason to say that (sorry, still not going to give away the details, see below). We did some quick calculations to see if there could possibly be 9 counts that far away from the rest of the data and we’re pretty convinced that they’re typos.
But now time was running out and we wanted to see much more detail from the 60,000. I said we could try to get rid of the 9 or just zoom in on the graph. It was interesting to see that no one had immediate ideas for doing either of those, though I’m sure they could see how to filter out the 9. Instead I just ran the histograms with 100 bins instead of 10 and that first bar split up a little. I again told them it was a dumb move but at least we knew there was some cool structure to the data.
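For what it’s worth, here is a rough sketch of both options, dropping the stragglers or zooming in, again on simulated data. The amount column, the outlier value, and the cutoff of 1,000 are all hypothetical; a real cutoff would have to come from knowing the data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulated stand-in: 60,000 modest values plus 9 far-off ones.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"amount": np.concatenate([rng.uniform(0, 100, 60_000), np.full(9, 1_000_000.0)])}
)

cutoff = 1_000  # hypothetical threshold separating the cluster from the outliers

# Option 1: filter out the far-off rows with a boolean mask, then plot the rest.
trimmed = df.loc[df["amount"] < cutoff, "amount"]
counts_trimmed, edges_trimmed, _ = plt.hist(trimmed, bins=100)

# Option 2: keep every row but restrict the histogram to a range,
# which ignores anything outside it.
plt.figure()
counts_zoom, edges_zoom, _ = plt.hist(df["amount"], bins=100, range=(0, cutoff))
```

Option 1 changes the data being plotted while option 2 only changes the view, which seems like a distinction worth a meta conversation of its own.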
Since time had effectively run out, I gave them a choice. I could either do the usual and go back to my office to make a screencast that finished the work, and give them all the proper syntax to use. Or I could do nothing and we could pick it back up on Monday, including the meta conversations. They really liked the second option, so I’m going for it.
Because of that choice, I am forcing myself to not dig into the data set. I know they want to eventually be able to put the data into a really cool visual that I don’t know how to do, but I’m making sure I don’t cheat and look all that up right now. It’s also why I’m not telling you, dear reader, the details of what we’re up to. If I did I’m afraid one of you would tell us all how to do what we want to do. But that would take the fun out of it!
I’d love to hear your thoughts about this. Here are some starters for you:
- I’m in this class and I really think the meta conversations helped me a lot. In particular . . .
- I’m in this class and you keep describing our boring work as “interesting discussions.” Please stop.
- I stopped reading when you said you weren’t going to give any details. This is just clickbait.
- What do you have against Excel?
- I think you should have just scrolled through the million rows. Surely their eyes would catch all the cool patterns right away!
- I like this live coding. Were you worried that it would go off the rails?
- I think if I did live coding I’d do a lot of practicing first. Did you do that?
- What is your deal with that dumb factorial joke?
- Do the students know when you shift to meta discussion? Is there a signal or are you explicit about it?
- I think this was possibly a cool class but maybe their vote for more indicates their enjoyment instead of their learning. Are those decoupled in your class?
- I’m in this class and my enjoyment and learning are the same!
- I’m in this class and my enjoyment and learning are completely uncorrelated.
- I can’t believe you think I’d drop everything and just do your work for you.
- I can’t believe you’re not giving me the details. I want to do all your work for you.