The Academy Awards: Building a Data Set & Category Viz

The Academy Awards are nearing, and all the trailers are now reminding us how many Oscars this or that movie has been nominated for. After enough of these ads, it’s hard not to wonder what each movie’s chances of winning are. The amateur data junkie in me wanted to find out, and the Oscars are a manageable enough problem that some visualization and analysis should be feasible.

In the next post, we’ll go over the various findings. For now, there’s a chunk of data conditioning to cover. I’ll show you where I got the data, how I parsed it, and what choices I made to get an apples-to-apples Oscar win analysis underway!

The Data Source

What better source for data on the Oscars than their own website? Many sites (such as filmaffinity) have lists of Oscar nominations (ONs from here on out), and I had started to scrape those. However, inconsistent formatting made that a pain, so I went to the source. The Oscars have an official Academy Awards Database that is nicely searchable. You can also just give it a year range and it’ll spit out ALL the ONs and winners.

I had it give me the ONs from 1940 on, which ended up in a list like this:

Argo 
Stage 16 Pictures Production; Warner Bros.
2012 (85th) 
ACTOR IN A SUPPORTING ROLE -- Alan Arkin {"Lester Siegel"}
* 
FILM EDITING -- William Goldenberg
MUSIC (Original Score) -- Alexandre Desplat
* 
BEST PICTURE -- Grant Heslov, Ben Affleck and George Clooney, Producers
SOUND EDITING -- Erik Aadahl and Ethan Van der Ryn
SOUND MIXING -- John Reitz, Gregg Rudloff and Jose Antonio Garcia
* 
WRITING (Adapted Screenplay) -- Screenplay by Chris Terrio

The raw data needed some cleanup before I even brought it into Python. There are several notes throughout the data, many of them indicating that a nomination was revoked; those nominations were deleted. Special awards and non-standard formatting were corrected. Also, many movies were nominated two years in a row (most commonly foreign films in the mid-1900s). Those were split into separate years.

This left me with a nicely structured file, like the one above. The first two lines of every group are the movie title, then the production company. The third is the year. After that, every line is <CATEGORY> -- <PEOPLE OR WHATEVER IN CATEGORY>. If an asterisk precedes a category line, that nomination won.
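To make the layout concrete, here is a minimal parsing sketch, not the code actually used for this post. The names parse_group and parse_text are made up for illustration, and it assumes that groups are separated by blank lines and that a lone asterisk line flags the category line that follows it, as in the Argo sample.

```python
import re

SAMPLE = """Argo
Stage 16 Pictures Production; Warner Bros.
2012 (85th)
ACTOR IN A SUPPORTING ROLE -- Alan Arkin {"Lester Siegel"}
*
FILM EDITING -- William Goldenberg
MUSIC (Original Score) -- Alexandre Desplat
*
BEST PICTURE -- Grant Heslov, Ben Affleck and George Clooney, Producers
SOUND EDITING -- Erik Aadahl and Ethan Van der Ryn
SOUND MIXING -- John Reitz, Gregg Rudloff and Jose Antonio Garcia
*
WRITING (Adapted Screenplay) -- Screenplay by Chris Terrio
"""


def parse_group(block):
    """Parse one movie's group of lines into a dict (hypothetical helper)."""
    lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
    title, company, year_line = lines[0], lines[1], lines[2]
    year = int(year_line.split()[0])  # "2012 (85th)" -> 2012
    nominations = []
    won = False
    for ln in lines[3:]:
        if ln == "*":  # a lone asterisk marks the next category as a win
            won = True
            continue
        if ln.startswith("*"):  # also accept "* CATEGORY -- ..." on one line
            won = True
            ln = ln[1:].strip()
        category, _, people = ln.partition(" -- ")
        nominations.append({"category": category, "people": people, "won": won})
        won = False
    return {"title": title, "company": company, "year": year,
            "nominations": nominations}


def parse_text(text):
    """Split the file into groups and parse each one.

    Assumption: groups are separated by one or more blank lines.
    """
    blocks = re.split(r"\n\s*\n", text.strip())
    return [parse_group(block) for block in blocks]


movies = parse_text(SAMPLE)
winners = [n["category"] for n in movies[0]["nominations"] if n["won"]]
```

Keeping the win flag on each nomination, rather than in a separate list, makes it easy to count nominations and wins per movie later.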

Merging Categories

The Oscar categories have been fluid for quite some time. The categories in use today aren’t even the same as the ones from 5 to 10 years ago! As the file stood, there were 91 unique categories since 1940. However, many categories were renamed or phased out (such as Cinematography for black and white film). To decide which categories to merge, I used a combination of Wikipedia and plots of the years in which each category appeared.
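A text-only stand-in for those category-year plots might look like the sketch below. It is an illustration rather than the plotting code behind this post: category_year_spans is a made-up name, the input is assumed to be (year, category) pairs, and the example pairs are placeholders, not real Oscar data.

```python
def category_year_spans(pairs):
    """Map each category name to the (first, last) year it appears in."""
    spans = {}
    for year, category in pairs:
        first, last = spans.get(category, (year, year))
        spans[category] = (min(first, year), max(last, year))
    return spans


def print_timeline(spans):
    """Print categories ordered by first year, so hand-offs line up."""
    for category, (first, last) in sorted(spans.items(), key=lambda kv: kv[1]):
        print(f"{first}-{last}  {category}")


# Placeholder input purely for illustration (not real nomination data).
example_pairs = [
    (1950, "OLD NAME"), (1951, "OLD NAME"),
    (1952, "NEW NAME"), (1953, "NEW NAME"),
]
example_spans = category_year_spans(example_pairs)
print_timeline(example_spans)
```

When one category's last year sits right before another's first year, that is a hint that the first was renamed into the second, which is worth confirming against Wikipedia before merging.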

The goal of this is to condense all the previous categories into the ones that existed in 2013. Let’s go through the major category groups and talk about each one.

Music Categories

[Plot: music categories by year]

The only two music categories today are for a song and for the score. You can see a lot of variability in the past, with some pretty specific categories. The music categories of MUSIC (SONG), MUSIC (ORIGINAL SONG), and MUSIC (SONG ORIGINAL FOR THE PICTURE) were combined into what I call ORIGINAL SONG. The rest were combined into ORIGINAL SCORE.
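The music rule can be sketched as a small lookup. This is an illustration under assumptions: merge_music is a made-up name, category names are compared in upper case with whitespace collapsed, and the three song names are spelled exactly as written in this post, which may not match the raw file character for character.

```python
# Song categories as named in this post; everything else under MUSIC
# is treated as a score category.
SONG_CATEGORIES = {
    "MUSIC (SONG)",
    "MUSIC (ORIGINAL SONG)",
    "MUSIC (SONG ORIGINAL FOR THE PICTURE)",
}


def merge_music(category):
    """Return the merged name for a music category; leave others unchanged."""
    name = " ".join(category.upper().split())  # normalize case and spacing
    if not name.startswith("MUSIC"):
        return category
    if name in SONG_CATEGORIES:
        return "ORIGINAL SONG"
    return "ORIGINAL SCORE"
```

For example, merge_music("MUSIC (Original Score)") falls through to ORIGINAL SCORE because it is a MUSIC category that is not in the song set.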

Notice how MUSIC (SONG) and MUSIC (ORIGINAL SONG) are the primary song categories over time (look at where one ends and the other begins), with lots of smaller categories coming in and out of the picture. Cool.

Visual Effects Categories

[Plot: visual effects categories by year]

All of the visual effects categories were combined into one final category, named VISUAL EFFECTS.

Sound Categories

[Plot: sound categories by year]

Two categories exist for sound today (MIXING and EDITING), and with a category like “SOUND” it isn’t obvious which one it belongs to, so this is where we check history. SOUND is what became SOUND MIXING, and everything else became SOUND EDITING. The nice part about the above graph is that we can easily see where those transitions were made. The special achievements and anything with ‘EDITING’ were put into the SOUND EDITING category.
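Read literally, the sound rule could be sketched like this. The function name merge_sound is made up, and the test for "belongs to the sound group" (the word SOUND appears in the name) is my assumption; it is what keeps FILM EDITING out of the sound bucket even though it contains 'EDITING'.

```python
def merge_sound(category):
    """Return the merged name for a sound category; leave others unchanged."""
    name = " ".join(category.upper().split())  # normalize case and spacing
    if "SOUND" not in name:
        return category  # not part of the sound group (assumed test)
    if name in ("SOUND", "SOUND MIXING"):
        return "SOUND MIXING"
    # Everything else in the sound group, including special achievements
    # that mention sound, goes to SOUND EDITING.
    return "SOUND EDITING"
```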

Writing Categories

[Plot: writing categories by year]

Writing gets split between ORIGINAL and ADAPTED SCREENPLAY. The labels for some of these are very long. An example is:

WRITING (SCREENPLAY WRITTEN DIRECTLY FOR THE SCREEN BASED ON FACTUAL MATERIAL OR ON STORY MATERIAL NOT PREVIOUSLY PUBLISHED OR PRODUCED)

Yikes! Everything that was written directly for the screen (or similar) went to ORIGINAL, and the rest went to ADAPTED.
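A keyword-based sketch of the writing split is below. The name merge_writing and the marker list are assumptions of mine: the post only says "directly for the screen (or similar)", so which phrases count as "similar" is a guess that would need checking against the actual list of writing labels.

```python
# Phrases assumed to signal an original screenplay. The second marker is
# a guess at what "or similar" covers.
ORIGINAL_MARKERS = ("DIRECTLY FOR THE SCREEN", "ORIGINAL")


def merge_writing(category):
    """Return the merged name for a writing category; leave others unchanged."""
    name = " ".join(category.upper().split())  # normalize case and spacing
    if not name.startswith("WRITING"):
        return category
    if any(marker in name for marker in ORIGINAL_MARKERS):
        return "ORIGINAL SCREENPLAY"
    return "ADAPTED SCREENPLAY"
```

The long label quoted above contains "DIRECTLY FOR THE SCREEN", so it lands in ORIGINAL SCREENPLAY, while WRITING (Adapted Screenplay) matches neither marker and lands in ADAPTED SCREENPLAY.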

The Rest of the Categories

[Plot: remaining categories by year]

Take a minute to look at the above plot. Most of the category splits here are pretty obvious. The one to look out for is that ART DIRECTION became PRODUCTION DESIGN.

The Final Category List

The final category list, based on the above, is (drum-rolls are appropriate):

  • ORIGINAL SCORE
  • VISUAL EFFECTS
  • ACTOR
  • CINEMATOGRAPHY
  • ADAPTED SCREENPLAY
  • SUPP ACTRESS
  • ORIGINAL SCREENPLAY
  • PRODUCTION DESIGN
  • SHORT
  • SOUND MIXING
  • BEST PICTURE
  • ACTRESS
  • COSTUME DESIGN
  • DOCUMENTARY
  • FOREIGN FILM
  • ANIMATED SHORT
  • SUPP ACTOR
  • SHORT DOCUMENTARY
  • MAKEUP AND HAIR
  • ORIGINAL SONG
  • DIRECTING
  • FILM EDITING
  • SOUND EDITING
  • ANIMATED FEATURE FILM

Now that we have a workable list of movies with their processed (and intelligible) categories, we can explore the data! That’s in the next post.
