The other night, while watching Wheel of Fortune, I got curious whether the WoF staff made RSTLNE less common during the final round of the game. I found some data, spent too little time on it, and wound up with a decent-sized post on the DataIsBeautiful subreddit. There were lots of good comments and criticisms (I had made some dumb errors). These were well addressed (graphically and otherwise) in Mr. Ingraham’s WaPo article, which I commend to you.
I wanted to re-do my work here as a way to correct my mistakes and also ask some questions not yet asked of the data.
Goals of the Project
The goal, initially, was only to see if the final round of WoF had an irregular distribution of letters. I especially wanted to see if they under-used RSTLNE as a way to keep the game challenging. I accomplished that in a rough manner pretty quickly. I hadn’t spent enough time building an interesting story with the data, though. There were conclusions to draw about which letters should be chosen, but I only showed the relative frequency of the letters. The WaPo article did that quite well, and I’ll reproduce that below.
The additional goal, which many people have mentioned but which I haven’t seen anyone carry out, is to measure the value of letters based on their information content. In other words, instead of measuring the value of a letter choice by how many letters get revealed, measure it by how many words the letter makes feasible or infeasible for the puzzle. We’ll cover that in the next post.
Gathering the Data
I had originally gone to this site for the data, since it appeared to be updated frequently and had easily scraped final round clues. All of the answers are in a <span class="BonusPuzzleBlock"> element, and the site has used that element from 2011 on. Scrapy did all the hard work, and I had all the clues. As others pointed out, there are some issues with that data: there isn’t much of it, we don’t know what the contestants guessed, and it was full of repeats. The repeats ended up not mattering, because there were roughly the same number of repeats as originals.
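As a minimal sketch of that extraction step (not the Scrapy spider I actually ran), the same idea can be shown with only the Python standard library. The class name `BonusPuzzleParser` and the sample HTML fragment are made up for illustration; the only thing taken from the site is the `BonusPuzzleBlock` class attribute.

```python
from html.parser import HTMLParser


class BonusPuzzleParser(HTMLParser):
    """Collect the text inside every <span class="BonusPuzzleBlock">."""

    def __init__(self):
        super().__init__()
        self.puzzles = []   # finished puzzle strings
        self._depth = 0     # > 0 while inside a matching span
        self._buffer = []   # text pieces of the current puzzle

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth > 0:
            self._depth += 1          # nested span inside a puzzle block
        elif "BonusPuzzleBlock" in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace before storing the puzzle.
                self.puzzles.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


# Hypothetical page fragment, made up for illustration.
sample_html = """
<p>Bonus round: <span class="BonusPuzzleBlock">QUICK  FIX</span></p>
<p>Other: <span class="Other">IGNORED</span></p>
<p>Bonus round: <span class="BonusPuzzleBlock">HIGH <b>VOLTAGE</b></span></p>
"""

parser = BonusPuzzleParser()
parser.feed(sample_html)
puzzles = parser.puzzles
```

A real scraper would also need to fetch the pages and deal with whatever markup the site used before 2011, which this sketch ignores.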
The one thing that site does have is all the puzzles for the whole show (not just the final rounds). That data is useful for deciding if it’s WoF puzzles in general that have non-standard letter distributions, or if a non-standard letter distribution just occurs for the final puzzles. More on that later.
The site that the WaPo article used is this angelfire site. It has complete listings going back to 2007, and it also keeps the listings in an easily-scraped table form. It also has the contestant letter guesses! I scraped that with the help of this excellent stackoverflow post that dealt with the multirows in the tables.
Getting Letter Frequencies
Once we have the data, the rest is just counting. First, we look at the letter frequency in terms of how many times the letter showed up relative to the total number of letters.
Next, we look at the letter appearance frequency. This is how many puzzles the letter showed up in (at least once).
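The two counts are easy to mix up, so here is a small sketch of both. The function names and the toy puzzles are mine, made up for illustration; they are not the real data.

```python
from collections import Counter


def letter_frequency(puzzles):
    """Share of all letters (across every puzzle) taken up by each letter."""
    counts = Counter(ch for p in puzzles for ch in p.upper() if ch.isalpha())
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}


def appearance_frequency(puzzles):
    """Share of puzzles in which each letter appears at least once."""
    counts = Counter()
    for p in puzzles:
        # A set counts each letter at most once per puzzle.
        counts.update({ch for ch in p.upper() if ch.isalpha()})
    return {letter: n / len(puzzles) for letter, n in counts.items()}


# Made-up puzzles, for illustration only.
toy_puzzles = ["GOOD JOB", "HOT DOG"]
freq = letter_frequency(toy_puzzles)
appear = appearance_frequency(toy_puzzles)
```

In the toy example, O makes up 5 of the 13 letters but appears in both puzzles, so its appearance frequency is 1.0 while its letter frequency is about 0.38.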
The second chart gives a better idea of which letters to guess, since it tells you how likely a letter is to show up in a puzzle at all. The first one will help us answer our original question.
We can, using Wikipedia to get letter frequencies in the English language, make a plot that shows how common the WoF letters are in relation to the natural language. The next plot is based on the WaPo article. I had originally shown it as a bar chart, but that stripped some important information. Here we can see regions that show which letters are more or less common than their natural frequency, as well as view the relative magnitudes of these ratios.
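The comparison itself is just a per-letter ratio. A minimal sketch, assuming both sets of frequencies are held as dictionaries keyed by letter; the function name is mine and the numbers below are illustrative placeholders, not values from the puzzles or from Wikipedia.

```python
def frequency_ratio(puzzle_freq, english_freq):
    """Ratio of puzzle frequency to English frequency for each letter.

    A ratio below 1 means the letter is under-represented in the puzzles;
    a ratio above 1 means it is over-represented.
    """
    return {
        letter: puzzle_freq.get(letter, 0.0) / english_freq[letter]
        for letter in english_freq
        if english_freq[letter] > 0
    }


# Placeholder numbers, for illustration only.
toy_puzzle_freq = {"E": 0.06, "G": 0.04}
toy_english_freq = {"E": 0.12, "G": 0.02}
ratios = frequency_ratio(toy_puzzle_freq, toy_english_freq)
```

With these placeholders, E comes out at half its natural frequency and G at twice its natural frequency.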
At this point, we’ve answered the first question we wanted to know about. The letters RSTLNE show up less frequently in WoF final puzzles than they do in regular language. The letter E is the most under-represented, while L is pretty close to normal. We can compare this to the rest of the WoF puzzles in all the shows and see how distinct that difference is. Doing so gives us the following chart, where it appears that the main round puzzles are behaving ‘naturally’.
Picking the Best Letters
The best letters to pick, at least from this analysis, are the ones that show up in the most puzzles. This is a rough proxy for information gain, since picking a letter that is more likely to show up gives the player a good shot at figuring out what the words are. We’ll do it the way WaPo did, then try a different metric and see what happens. First, let’s look at what letters the contestants picked the most.
We have CDMA vs. DGHO (recalling the appearance frequency above) as the most commonly picked vs. the most common to show up. If we use relative frequency, GHPO is the most common. The distribution of how many letters we uncover is shown below.
There’s not much difference between DGHO and GHPO, so it seems as long as you have GHO, you’re pretty well set. Let’s check that by adding the GHO strategy to the same plot above:
GHO gets better letter-revealing performance than CDMA, with one fewer letter! Judging by the previous charts, the power of CDMA comes only from the A. If it were CDMO, it would probably do a little better.
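For anyone who wants to try other letter sets, here is a rough sketch of how a strategy can be scored. The function names are mine, and the toy puzzles are made up to show the mechanics; they say nothing about the real results.

```python
from collections import Counter


def letters_revealed(puzzle, picks):
    """Count the puzzle tiles that a set of picked letters would uncover."""
    picks = set(picks.upper())
    return sum(1 for ch in puzzle.upper() if ch in picks)


def reveal_distribution(puzzles, picks):
    """Map 'number of tiles uncovered' to 'number of puzzles'."""
    return Counter(letters_revealed(p, picks) for p in puzzles)


def mean_revealed(puzzles, picks):
    """Average number of tiles uncovered per puzzle."""
    return sum(letters_revealed(p, picks) for p in puzzles) / len(puzzles)


# Made-up puzzles, for illustration only.
toy_puzzles = ["GOOD JOB", "HOT DOG", "QUICK FIX"]
gho_dist = reveal_distribution(toy_puzzles, "GHO")
cdma_mean = mean_revealed(toy_puzzles, "CDMA")
```

Running `reveal_distribution` over the real final-round puzzles for each candidate letter set gives the kind of distribution plotted above.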
I am curious, as were many commenters, whether it would be more appropriate to judge letters based on the information they convey. Keep in mind that information is a ‘no’ to some of the possible options for what the state of things could be. For WoF, the information provided by a letter is the number of words that we know are no longer feasible.
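To make that concrete, here is one possible reading of the idea, sketched on a single word. It is only an illustration of "information as eliminated words", not necessarily the method the next post will use. The function names and the candidate list are made up.

```python
def reveal_pattern(word, letter):
    """Positions where `letter` occurs in `word` (empty tuple if absent)."""
    return tuple(i for i, ch in enumerate(word) if ch == letter)


def words_eliminated(candidates, answer, letter):
    """How many candidate words a guess rules out, given the true answer.

    A candidate survives only if the letter would appear in exactly the
    same positions as it does in the answer.
    """
    target = reveal_pattern(answer, letter)
    return sum(1 for w in candidates if reveal_pattern(w, letter) != target)


# Made-up candidate list, for illustration only.
toy_candidates = ["GOAT", "BOAT", "COAT", "MOAT", "GOLD"]
eliminated_by_g = words_eliminated(toy_candidates, "GOAT", "G")
eliminated_by_a = words_eliminated(toy_candidates, "GOAT", "A")
eliminated_by_d = words_eliminated(toy_candidates, "GOAT", "D")
```

In this toy case G rules out three candidates while A rules out only one, even though each reveals a single tile. D is not in the answer at all, yet it still rules out GOLD, so a miss carries information too.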
We’ll look at evaluating WoF letters under that mindset in the next post.