Computational Data Science early semester thoughts

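I’m back in the classroom! At least for a semester anyways. In the dean’s office I teach one class per year, and last year’s was fully online, so this is a fun adventure (so far at least). This post just captures some of the things that have piqued my interest so far. Here’s a quick (linked) list:

  • Grouping students with “Would you rather . . .” questions
  • Major projects (Twitter mining, web scraping, Digital footprint)
  • Weekly work
  • Describing data analysis to a third grader
  • Video coding assignments
  • Coding for themselves and not others
  • Google learning outcomes
  • Your thoughts?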

A quick description of the class: This is Introduction to Computational Data Science, a course that comes at the beginning of our new CDS major/minor program (though our python programming course is a prerequisite). The entire program aims to help students find and gather data, analyze it, and tell a story or make a decision with it. This course has them take their python skills and focus them on data. After this class students should be able to use tools like pandas, web scraping, and APIs to collect, analyze, and tell stories about data.

Grouping students with “Would you rather . . .” questions

I’ve used this before, but I’d forgotten how fun it can be. Nearly daily I’ll put students into work groups for things like brainstorming ideas for analyzing a particular Kaggle data set. There are only 16 students in the class, but I know from experience that they don’t always get to know each other even in such a small class. Even though I’m using Canvas as the LMS for this course, I’ve decided to load up my old Rundquist-hates-blackboard-so-he-wrote-his-own-LMS LMS with the class roster so I could use my old group randomizer.

It shows all the students with checkboxes next to them. I check who’s present, then indicate the max group size I’m interested in. It randomizes them and ensures that no group has more than the max and no group has fewer than the max minus one. But the part that’s fun is that it also displays a randomized “Would you rather . . .” question from this site. I encourage the students to find their group, introduce themselves, and then answer the wyr question. Then I tell them what the group is supposed to do for the class.

I find it to be an easy way to build community, and it seems to be working pretty well.
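
For anyone who wants to try the grouping idea, here is a minimal sketch of the logic described above. It is an illustration under my own assumptions, not the code from the old LMS: the function name `make_groups`, the `seed` argument, and the placeholder roster are all made up for this example. It uses the fewest groups that keep everyone at or under the max, then deals students out round-robin so group sizes never differ by more than one.

```python
import math
import random


def make_groups(present, max_size, seed=None):
    """Shuffle the students who are present and split them into groups.

    Uses the fewest groups that keep every group at or below max_size,
    then deals students out round-robin so sizes differ by at most one.
    """
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    students = list(present)
    if not students:
        return []
    rng = random.Random(seed)  # seed is only here to make a run repeatable
    rng.shuffle(students)
    n_groups = math.ceil(len(students) / max_size)
    # Student i goes to group i % n_groups
    return [students[i::n_groups] for i in range(n_groups)]


# Hypothetical roster: 16 placeholder names, max group size of 3
roster = [f"student_{i}" for i in range(1, 17)]
groups = make_groups(roster, max_size=3, seed=1)
sizes = sorted(len(g) for g in groups)
```

With 16 students and a max of 3 this gives six groups (four of three and two of two), so nobody is left on their own. One caveat: for some combinations the "max minus one" floor simply can't be met (5 students with a max of 4 comes out as a 3 and a 2 here), so treat the floor as a goal rather than a guarantee.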

Major projects (Twitter mining, web scraping, Digital footprint)

We have three major projects for this course.

Twitter

Students will identify a topic they’d like to use twitter data to look at. It could be a hashtag, a topic, a famous user, whatever. They need to craft their research question and learn and use tools that will allow them to analyze hundreds of tweets and thousands of users.

Twitter data is open, and Twitter has a robust API that the python tweepy library makes easy to access. The free development accounts have some data limitations but should provide plenty of data for my students.

Web Scraping

Students need to find a topic that has both a Kaggle data set and web pages that contain data that could extend the data set. Kaggle is great because it has data on tons of topics. So far in class we’ve explored the Olympic medals data set pretty extensively. The reason we’re not stopping there is that for this program students need to know more than just how to deal with well-formatted data. Scraping data from web pages is a really useful skill in those times when such well-formatted data doesn’t exist. Of course it’s interesting that even the clean Kaggle data often needs some more work to refine the format, so it’s really nice to start with that. The example we’ve done in class is to look at the most popular first names for winning Olympic medals. First name is not a column in the data set so we had to learn how to extract it from the full name column. Pretty straightforward stuff, but it’s already been fun to brainstorm different things to do with the data. We’ll get to the web scraping part later in class.
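
To make the first-name extraction concrete, here is a small sketch in pandas. The rows are made up, and the column names ("Athlete", "Medal") and the "LASTNAME, Firstname" layout are assumptions for illustration; the real data set may be organized differently.

```python
import pandas as pd

# Made-up rows; the column names and the "LASTNAME, Firstname" format
# are assumptions, not necessarily what the Kaggle file uses.
medals = pd.DataFrame({
    "Athlete": ["SMITH, Anna", "JONES, Anna Marie", "LEE, Bo",
                "DIAZ, Anna", "KIM, Bo"],
    "Medal": ["Gold", "Silver", "Bronze", "Gold", "Gold"],
})

# Keep the text after the first comma, trim spaces, take the first word
after_comma = medals["Athlete"].str.split(",", n=1).str[1].str.strip()
medals["FirstName"] = after_comma.str.split().str[0]

# Count how many medal rows each first name appears in
top_first_names = medals["FirstName"].value_counts()
```

Note that "Anna Marie" is counted as "Anna" here. Whether that is the right call is exactly the sort of decision worth talking through with the data in front of you.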

Digital Footprint

This course satisfies the “Diversity” requirement of our general education requirements. It does so by having students look at their own digital footprint from a privilege perspective. They’re going to seek out their own digital presence and compare and contrast it with those who are different from themselves. At first I thought it might be interesting to compare with each other, but I was smartly warned off from that. Instead they’ll write a report about their research into themselves and other cultures/countries/etc.

Weekly work

When I was setting up the calendar for the course I was trying to think about the best ways to infuse the major projects. I settled on Fridays. I figured we’d spend Mondays and Wednesdays working on tools and skills and then find ways to apply them in class on Fridays. Of course the bulk of the work they should do on these projects will be outside of class, but I want to make sure I’m modeling some approaches they should be taking.

I think I’m happiest with the Mondays and Wednesdays right now, as the focus on tools and skills is pretty straightforward. Fridays can feel a little at loose ends but I’m still working on it.

Take this past week: On Monday I asked the students to brainstorm (in groups) things to search for on Kaggle. Each group came up with a suggestion and then we voted. Olympics won and we landed on a data set that lists all the medals won from 1896 to 2012. It lists the sport, the athlete, the year, the location, and the medal. It has 31,000 rows (which I immediately asked the students to gut check). They then worked in their groups to brainstorm interesting questions to ask of the data and by the end of the class we had a great set of questions.

On Wednesday we took that set of questions and voted on the top three. I sampled only 15 rows from the data set so that they could be viewed without scrolling and asked the three groups to manually do the task they were assigned to. The topics were:

  • Which first names have won the most medals?
  • What is the connection between length of name (character count) and medals won?
  • Which repeating initials have won medals?

You can see that these are all pretty similar, but each group had slightly different things to do.
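
As a rough illustration of where the second and third questions could end up in pandas, here is one possible reading of each. The data is made up, the "Athlete" column and its "LASTNAME, Firstname" format are assumptions, and each group may well have defined its question differently (for example, whether spaces count toward name length).

```python
import pandas as pd

# Made-up medal rows, one row per medal won; column name is an assumption
medals = pd.DataFrame({
    "Athlete": ["SMITH, Sam", "SMITH, Sam", "JONES, Anna", "LEE, Lucy",
                "DIAZ, Bo", "LEE, Lucy", "LEE, Lucy"],
})

# One row per athlete with a medal count
per_athlete = medals.groupby("Athlete").size().reset_index(name="Medals")

# Split "LASTNAME, Firstname" into two columns
parts = per_athlete["Athlete"].str.split(",", n=1, expand=True)
per_athlete["Last"] = parts[0].str.strip()
per_athlete["First"] = parts[1].str.strip()

# Name length: letters only, ignoring the comma and any spaces
full_name = per_athlete["First"] + per_athlete["Last"]
per_athlete["NameLength"] = full_name.str.replace(" ", "", regex=False).str.len()
medals_by_length = per_athlete.groupby("NameLength")["Medals"].mean()

# Repeating initials: first and last name start with the same letter
same_initial = (per_athlete["First"].str[0].str.upper()
                == per_athlete["Last"].str[0].str.upper())
repeating = per_athlete.loc[same_initial, ["Athlete", "Medals"]]
```

Averaging medals per name length is only a starting point for the "connection" question; a scatter plot or a correlation would be a natural next step.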

You’ll see below that I modified my instructions a little in interesting ways, but ultimately the groups were able to think and talk about how to go from manually dealing with a small data set to getting a computer to do it on a larger scale.

Class came to a close and I asked them a question about the type of resources I should provide to help them out. Should I abstract the skills they were talking about and make some screencasts that show how to do those tasks in pandas/python, or should I just go do the three projects for them? I warned them that fully doing the projects would involve some things they haven’t learned yet (namely regular expressions), but I had a suspicion that it might be more helpful to them since they’d already invested some time thinking about these problems. They voted for the finished projects, and we had a brief discussion about how I’ve only ever really learned how to use software tools when I really wanted to get something done. I think I’ll keep that in mind when producing resources for them in the future.

Finally on Friday they worked in pairs to brainstorm their own webscraping/kaggle project and I did some live coding for them that went a little sideways.

Describing data analysis to a third grader

Above I talked about how I asked students in their groups to manually determine how to analyze a small set of data. Specifically I asked them to carefully determine what they, the humans, are doing and to write down the steps. I warned them that depending on the verbs they chose it might be easy or hard to later translate those steps into an algorithm for a computer to do the work. Easy verbs include “read”, “scan”, and “count” and a hard verb example is “figure out.”

They got to work and I was meandering among the groups. I noticed that the “first name” group’s first instruction was “split the string at the comma.” There’s nothing wrong with that from an algorithmic perspective, but I was worried it might be too computer-centric for all the students in class (not all have actually taken the programming class because we are trying to find ways to grow the program).

That’s when I had a great idea. I encouraged them instead to write instructions that third graders would have to follow. I picked that age/grade quickly and seemingly randomly but I think it was a decent choice. We talked about making assumptions about what they’d know and realized that we could assume some things, like how a lastname, firstname list would likely be recognizable to third graders even though they likely almost never write their name that way. It also helped remove phrases like “split the string at the comma” from their instructions. The ultimate idea was to get the students to understand that the types of things they’re interested in can be often explained at a simple level, and then it’s their task to find out how to translate that for a computer. I think I’ll keep going with that approach with other similar skill development days in the future.
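
Here is a small sketch of what that translation step might look like, using plain python with no pandas. The third-grader steps in the comments are my own hypothetical wording, not what the students actually wrote, and the names are made up.

```python
from collections import Counter

# Made-up names in the "lastname, firstname" style
names = ["Smith, Anna", "Jones, Bo", "Lee, Anna", "Diaz, Carla"]

tally = Counter()
for name in names:                       # "Read each name on the list, one at a time."
    after_comma = name.split(",", 1)[1]  # "Find the comma and look at what comes after it."
    first_name = after_comma.split()[0]  # "Circle the first word you see there."
    tally[first_name] += 1               # "Make a tally mark next to that word."

# "Which word has the most tally marks?"
most_common_name, count = tally.most_common(1)[0]
```

Funnily enough, “split the string at the comma” comes right back once you translate, but by then everyone in the group knows why it’s there.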

Video coding assignments

As I’ve done with so many of my physics classes, I’m grading students describing their work instead of their work product. I’m finding that in a coding class that’s super interesting. I see a lot of videos with code that look quite similar (I don’t mind at all if they work with each other or find code online) but I never get two identical submissions. The students walk me through their code and it provides me an opportunity to send vids back to them asking clarifying questions. I’m doing Standards-Based Grading with this class so that feedback process continues through the semester.

I really like hearing the students describe their code. You can tell what they came up with themselves versus what they found elsewhere. You can tell what they really understand and what they’re just copying by rote from other work. You can tell when they haven’t thought about a particular case of inputs and when they’re thinking about how to extend the code. You can also occasionally hear their joy when it works!

Quick SBG note: I’m using my one week rule (that I used to call my 2-week rule) where if they let a standard sit for a week the score solidifies. I think that will work well in a skills-based class like this.

Coding for themselves and not others

There’s one aspect of the text we’re using that I really don’t like. It’s constantly asking students to use input() and print() commands when doing things. At first I thought I just didn’t like it because I like jupyter’s notebook approach better (just do myfunction(4) instead of input(“hello dear user please input an integer”)) but I realized there’s something more subtle: For data analysis coding you’re often coding just for yourself. That’s one of the big things that distinguishes this class from the previous programming course. There you might be learning how to write code for others to interact with. In this class you’re using a tool to solve a problem. Often for yourself. Your audience comes in later when you give them a report.

Also, you do work with others during the coding, but nearly always that means you’re writing code that they’re going to (re)use. Hence a function that returns a list is almost always going to be more useful than a function that has a loop with a print statement in it.
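
A tiny sketch of the difference, with made-up function names and data. Both functions find the same names, but only one of them gives you something you can keep working with.

```python
def print_long_names(names, min_length):
    """Print-style version: shows results on screen and returns nothing."""
    for name in names:
        if len(name) >= min_length:
            print(name)


def long_names(names, min_length):
    """Return-style version: hands back a list the caller can keep using."""
    return [name for name in names if len(name) >= min_length]


# Hypothetical data
athletes = ["Bo Diaz", "Anna Jones", "Lucy Lee", "Samantha Smith"]

found = long_names(athletes, min_length=10)
how_many = len(found)                    # the result can be counted,
longest_first = sorted(found, key=len, reverse=True)  # sorted, or passed along
```

In a notebook you can still just evaluate `long_names(athletes, 10)` in a cell to see the result, so nothing is lost by returning instead of printing.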

I’ll be curious to hear your thoughts on this.

Google learning outcomes

On Friday we had a really interesting conversation about things I want students to take away from this class. I had just finished what I knew was going to be an unfulfilling (for all of us) live python coding session to show them how to investigate a kaggle data set about deaths from disease. I screwed up the syntax and ended up using the wrong functions a bunch of times. Once I was trying to add up deaths due to cancer and ended up just counting how many years were in the data set. Yep, super wrong, but it led to this cool conversation:

I pointed out that a really important thing in a class like this is to realize that you can’t possibly memorize all the various python/pandas/etc commands we’ll be learning. They’ll need to figure out what system they’ll use to ensure that they can always figure that sort of thing out. I gave the example of keeping good notes somewhere but then admitted that I just don’t really do that myself. I asked what they thought I did instead and they nailed it: trust google.

What they (and I) mean by that is that you can quite easily find good syntax help by just doing something like googling “pandas sum column filter”.
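
For reference, here is roughly what that search tends to turn up, shown on made-up numbers. The column names ("Year", "Cause", "Deaths") are assumptions and not the layout of the actual data set. It also shows how close the right answer sits to the counting mistake from the live coding session.

```python
import pandas as pd

# Made-up numbers; the column names are assumptions for illustration
deaths = pd.DataFrame({
    "Year": [2000, 2000, 2001, 2001, 2002, 2002],
    "Cause": ["Cancer", "Flu", "Cancer", "Flu", "Cancer", "Flu"],
    "Deaths": [100, 10, 110, 12, 120, 9],
})

cancer = deaths[deaths["Cause"] == "Cancer"]    # filter to the rows of interest
total_cancer_deaths = cancer["Deaths"].sum()    # adds the values: 330
rows_of_cancer_data = cancer["Deaths"].count()  # counts the rows: 3
```

Swapping `.sum()` for `.count()` runs without any error, which is why a quick gut check on the answer matters.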

But then I told them about another major thing I need them to take away from the class: confidence that things are possible. When I reflect on times when I’ve done a crappy job of teaching a software tool to someone, it’s when I fail to get the person to buy into that mantra: it’s possible! I think a class like this can succeed if it puts students in situations where they don’t know how to do something but they develop confidence that they can figure it out. This is so similar to physics teaching that I’m feeling dumb for not articulating that much earlier in my career. In physics we have tools (conservation of momentum, conservation of that human-invented, not strictly necessary idea – energy, etc.) and we want students to think of how they can put those to use in solving problems. In coding, we have tools (libraries, software, APIs, etc.) and we want students to think of how they can put those to use in solving problems!

Your thoughts?

Ok, that was a lot but I’m happy I got it down. It’ll help me as I continue to reflect on how to improve this class. Any thoughts/comments/questions? Here are some starters for you:

  • I’m in this class and I think it’s going pretty well. Here’s what has helped me . . .
  • I’m in this class and I honestly think it’s crap. Here’s why . . .
  • Why in the world did you let students in who haven’t taken a programming class?
  • My 3rd grader writes their name lastname, firstname all the time now thanks to this dumb post.
  • I think the one week rule is dumb and here’s why . . .
  • Another super long boring post from you. But at least you figured out how to do anchors so I could jump around – thanks!
  • I think all deans should teach at least one class, that’s a great idea!
  • I don’t think deans should be allowed near a classroom. This is dumb.
  • Why are you giving Canvas a chance?
  • I think if you want 3 person groups and you have 16 people you should have 5 3-person groups and a random person who’s screwed.
  • That “would you rather” site has some weird ones. Do you use them?
  • Because you sometimes capitalized Kaggle and sometimes not I assume you mean that there are two sites with similar capabilities. You only linked to one, though, so I’m totally confused
  • I don’t understand how this class can satisfy a diversity requirement. Can you say more?
  • What do you mean by “gut checking” data?
  • Here are some more hard verb examples for you . . .
  • What do you mean when you say energy isn’t strictly necessary?

About Andy Rundquist

Professor of physics at Hamline University in St. Paul, MN
This entry was posted in sbar, sbg, screencasting, teaching, technology.

9 Responses to Computational Data Science early semester thoughts

  1. bretbenesh says:

    Are there any seats left in this course? Perhaps I could join.

    • Andy Rundquist says:

      haha, it would be so fun to have you in class! Ultimately this class should enroll at 35-40 but I think nearly everything I talked about here should scale decently.

  2. Emma says:

    – Ditto on @bretbenesh. I bet if you did podcasts/videocasts there would be a great demand for it online!
    – “I’ve only ever really learned how to use software tools when I really wanted to get something done”… so true!!
    – On the Google factor, I’d say trust Google, but go to the documentation first! Python is almost always right, if difficult to interpret. Following some guy on stackoverflow gets you started, but then you might be stuck with his interpretation of how to accomplish something, often there are multiple approaches!
    – User inputs stink for a coder who is reading another person’s code, just unnecessary mess. I find them most helpful for debugging Python script run from the command line.
    – Last — love the Diversity component! I think an important skill for future data scientists is finding built-in biases in statistical practices. Great book on this: “Weapons of Math Destruction” by data scientist Cathy O’Neil!
    Thanks again for sharing, it’s really cool to see Hamline offering this class!

    • Andy Rundquist says:

      Thanks, Emma! I’ve heard a lot about that book by O’Neil. In fact we almost picked it for a common read for the incoming students last year!

  4. Pingback: Cold live coding in class | SuperFly Physics

  5. Steven Wolf says:

    I like the, “it’s possible” mantra. That being said, I do wonder if something like, “it’s worth it” would be a worthwhile addition. I can think of many times that I tried to do something and was spending all sorts of time trying to figure it out. I could see someone saying, “well, I’m sure it’s possible, but I don’t have time to pull it off right now,” or some other excuse. I think “it’s possible” needs a friend to keep me motivated on the task I want to accomplish. That way, the message that I can send myself is, “I’m sure it’s possible, and when I figure it out, it’ll be worth it,” or something like that.

    • Andy Rundquist says:

      Ooh, I really like adding “it’s worth it” especially if I can tie it to how the code can be repurposed and used again.

  6. Pingback: Web App Dev with GAS course | SuperFly Physics
