The presentation schedule for Thursday, May 3rd at 3:30pm will be as follows:

 #  Team              Members
 1  Team Gold         Hannah, Sandra, Sarah
 2  Sporg             Bradley, Igor, Jacob
 3  Beautiful Soup    Nikki, Brandon K, Katie M, Zoe
 4  mythonPython      Ashley, Katie, Mark
 5  Team Mikaela      Mikaela
 6  Death By Data™    Andrew, Bryce, Jorge, Karp
 7  Team Uhh          Colin, Kline
 8  /dev/null         Chad, David, Zach
 9  tsne              Brandon, Clare, Harrison
10  Ryallison         Allison, Ryan

Looking forward to it!

Quiz #8, the last one of the semester, has been posted to Canvas and is due on May 1st at midnight. Warning: it’s pretty graphic in parts.

May Day! May Day! …

If you are not yet paired up with anyone for the final “exam,” please contact me immediately so I can pair up the last few! Trying to work with this DataFest data set alone is like going swimming in the ocean alone — you should not do it.

Seven different people asked me for a last-minute extension on Homework #4, and a couple of people turned it in only half-finished in order to meet the deadline. So I’ve decided to grant a stay of execution until Sunday, April 22nd at midnight. (Apologies to those who worked diligently to turn it in on time if they feel miffed that their tardy classmates are getting a reprieve. It’s always a balancing act, which I confess I don’t always balance perfectly.)

Hey gang, the Data Mavens & friends are going to the movies (Paragon Theatres) to see the hot new Steven Spielberg film Ready Player One, based on Ernest Cline’s classic sci-fi novel! Join us at the Bell Tower at 6:30pm Saturday April 21st and catch a ride to the flick. Meet some cool new people, including Stephen’s trophy wife! :-O



Apparently, the Yelp API only returns “up to three” reviews for a business whenever you access the reviews endpoint. Oh well. You’ll only get 300 data points per city, then, which I guess we’ll call “enough.”

Quiz #7 has been posted to Canvas, and is due at midnight, April 23rd. Unlike most of the previous quizzes, this one is open-Python.

Good luck!

The DataFest data set has been posted to Canvas, and can be downloaded at any time. It’s large (153 MB) and is a .zip file that contains a single .csv file. They showed a video from the data donor on Friday night, which I’m in the process of locating and will post when I get hold of it. It doesn’t give a ton of information, but it does help orient you to the original purpose of the data.

Also in Canvas is the “data dictionary” which concisely describes the different fields.
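
Since the download is a .zip wrapping a single .csv, here is one possible way to get it into a DataFrame without unzipping by hand. This is only a sketch: the function name load_single_csv_from_zip is made up for illustration, and the actual file name on Canvas is not assumed here.

```python
import zipfile

import pandas as pd


def load_single_csv_from_zip(zip_path):
    """Read the only .csv inside a .zip archive into a DataFrame.

    Hypothetical helper, not part of any course-provided code.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Find the .csv member(s); we expect exactly one.
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if len(csv_names) != 1:
            raise ValueError(f"expected exactly one .csv, found {csv_names}")
        # pandas can read directly from the open archive member.
        with archive.open(csv_names[0]) as handle:
            return pd.read_csv(handle)
```

You would call it with whatever path you saved the download to. With a file this size, it may be worth peeking at a few rows and the column types before doing anything heavier.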

I’m available to answer questions on all of this, to the best of my ability. Again, the task is, quote, “analyze this data and find some interesting things in it.” Your presentation should explain what those interesting things are, motivate why they are interesting, and back them up with appropriate statistics, plots, etc.

The rules for teams for the final exam are these:

  1. People who went to the Penn State DataFest must be teamed up with their Penn State DataFest teams. People who didn’t go to the Penn State DataFest must be teamed up only with other people who didn’t go to the Penn State DataFest.
  2. Non-Penn-State-DataFesters must be on a team of either 2 or 3 people. (No more, no less.)
  3. The restriction on data challenges this semester (“you can’t be on a team with anyone you’ve previously been on a team with”) does not apply for the final. You are free to choose anyone for your team, subject to items 1 and 2 above.

A couple of students have rightly noted that the .csv file I’m asking for in homework #4 is un-normalized out the wazoo. In particular, the city and the price are going to be repeated in the table along with the business name for every one of that business’s reviews.

This is true, and is okay. But if you do want to normalize the table, you are permitted to instead submit two .csv files for the plumbers (and one for your other business type) called “plumber_businesses.csv” and “plumber_reviews.csv”. The former should have business, city, and price columns, whereas the latter should have business and review columns. Obviously, business will be a primary key in the first table and a foreign key in the second. (This assumes that business names are globally unique, even across cities; if you feel this is unlikely, you could add a city column to your plumber_reviews DataFrame as well.)
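
If you go the normalized route, the split might look roughly like the sketch below. It assumes your flat DataFrame has columns named business, city, price, and review; the function name normalize_reviews and the sample rows are invented purely for illustration.

```python
import pandas as pd


def normalize_reviews(flat):
    """Split a flat business/city/price/review table into two tables.

    Hypothetical helper; assumes business names are globally unique.
    """
    # One row per business: collapse the repeated city/price values.
    businesses = (
        flat[["business", "city", "price"]]
        .drop_duplicates()
        .reset_index(drop=True)
    )
    # One row per review, tied back to its business by name (the foreign key).
    reviews = flat[["business", "review"]].reset_index(drop=True)
    return businesses, reviews


# Made-up sample rows, just to show the shape of the data.
flat = pd.DataFrame({
    "business": ["Acme Plumbing", "Acme Plumbing", "Drain Kings"],
    "city": ["Springfield", "Springfield", "Shelbyville"],
    "price": ["$$", "$$", "$"],
    "review": ["Fast and friendly.", "Showed up late.", "Fixed it first try."],
})
businesses, reviews = normalize_reviews(flat)
```

From there, businesses.to_csv("plumber_businesses.csv", index=False) and reviews.to_csv("plumber_reviews.csv", index=False) would write the two files. If you decide names aren’t unique across cities, keep city in the reviews table too and treat (business, city) as the key.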