Races (the running kind) are a great resource for anyone looking to play with a moderately sized data set. It’s not hard to make some descriptive and pretty charts, and you can do some simple “who ran fastest?” breakdowns by various factors.
In this post, I’ll look at a race that took place recently here in Atlanta: the Hot Chocolate 15K/5K. The best part of the results they posted is that they include split times for each third of the 15K. That means we can do some analyses on runner pace consistency and time trends!
We’ll start with some quick visualization of the participants and the results to orient ourselves to the data, and then get into analyzing (spoiler: with clustering) runner pacing structures.
Let’s take a quick look at the two pieces of demographic information available in the data: Age and Sex.
The number of women racing far exceeds the number of men (5473 women vs 1833 men, almost exactly 3 times as many). The distribution of ages is about the same between the two, though.
Now, let’s see how fast the runners were. First, we’ll look at the distribution of paces for all the runners:
This kind of skew is common in race distributions (from the data I’ve seen). The right (slower) tail is always much longer than the left (faster) tail. That holds for open races, though; I haven’t seen what it looks like for races with time requirements for entry.
Next we’ll do a breakdown by sex:
This breakdown suffers from the relative lack of men who ran. With many fewer data points, it’s quite likely that there’s more self-selection in speed for male runners than for female runners.
Let’s get into some 2D data. We can look at the density of Age vs. Race Time as a 2D histogram:
A runner’s performance is only weakly related to age (rho = 0.121), which is clear from the chart. Age and Race Time probably tell us little about each other here, because races are self-selected events.
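To put a number like 0.121 in perspective, here is a minimal sketch of computing a correlation between age and finish time. It assumes the rho above is a Pearson correlation (the post doesn’t say which kind), and the `ages` and `finish_minutes` arrays are made-up stand-ins, not the race data. If it is a Pearson correlation, squaring 0.121 gives about 0.015, so age would account for roughly 1.5% of the variance in finish time.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical stand-in data (NOT the race results): ages in years and
# finish times in minutes with only a slight dependence on age.
rng = np.random.default_rng(0)
ages = rng.integers(18, 70, size=5000)
finish_minutes = 95 + 0.15 * (ages - 40) + rng.normal(0, 15, size=5000)

r = pearson_r(ages, finish_minutes)
print(f"r = {r:.3f}, share of variance explained (r^2) = {r * r:.3f}")
```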
The race very nicely had bib scanners set up at the 5K and 10K marks (and of course the 15K) so racers could see some rough splits of their time. Before we go any further, it needs to be mentioned that this course is HILLY. While running those hills, I kept wondering whether the hilliness would have a significant impact on runners’ pace splits. Here’s an elevation map that I made with a site called Veloroutes:
In general, we might expect pace splits to remain relatively constant (on a flat course). Either that, or start to get slower as the race goes on (for runners like me, who don’t train often enough to be consistent). For a really hilly course, it made sense to hypothesize that paces might degrade quite a bit, especially for the last 5K, which had the worst of the hills (really, just one long hill to huff up).
Let’s use the data to test these ideas. The first thing we want to do is get all the pace data to be comparable somehow. If person A ran each 5K split at 8 min/mile, and person B ran it at 9 min/mile, and we compared those for all 7,300 racers, the distributions of pace splits would be muddled by the magnitude of the pace.
The solution here is simple: we compare each runner to their own 15K average pace. That way, if runner A had three splits of [8.1, 7.8, 8.1], we’d subtract the average (8 min/mile) from each to get three values of [0.1, -0.2, 0.1]. Those values are comparable between runners, since they reflect a shift relative to each runner’s own average.
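The normalization can be sketched in a few lines. This is a minimal illustration, not the code used for the post; the function name `pace_deviations` is my own, and it assumes the three splits cover equal distances (5K each), so the plain mean of the split paces equals the overall average pace.

```python
def pace_deviations(split_paces):
    """Return each split pace minus the runner's average pace (min/mile).

    Assumes equal-length splits, so the mean of the split paces is the
    runner's overall average pace.
    """
    average = sum(split_paces) / len(split_paces)
    return [pace - average for pace in split_paces]

# Runner A from the example above.
runner_a = [8.1, 7.8, 8.1]
deviations = pace_deviations(runner_a)
print([round(d, 2) for d in deviations])  # → [0.1, -0.2, 0.1]
```

Note that the three values always sum to zero for a given runner, which matters when reading the correlations between splits later on.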
Remember that negative values mean faster than the runner’s own race average. We’ll look at the splits in two ways: densities and a scatter plot. Whenever you see a “PX” value, it just means the pace for the 1st, 2nd, or 3rd 5K split. Here are the densities:
The densities show that runners tend to start faster than their own average, get a little faster still in the 2nd 5K, and then get much slower in the last 5K. Notice how the middle section doesn’t have much skew, but the first and third sections are skewed fast and slow, respectively. This is in line with what I hypothesized earlier.
The first scatter plot is a detail of P1 vs P3, where some detail is added to show what the positive and negative pace values mean:
Next, let’s look at all of them together:
The first and second splits don’t look very correlated, but there is a strong negative correlation between the first and last splits. The faster you ran the first split, the slower you ran the last one. This is likely due to both fatigue and the nature of displaying differences from average: if one segment is much faster than average, at least one of the others must be slower. It’s interesting that the relationship isn’t as strong for P1 and P2, or P2 and P3.
For those curious about the correlations, here’s the correlation matrix:
|    | P1 | P2 | P3 |
|----|----|----|----|
| P1 | 1.000000 | -0.226036 | -0.806403 |
| P2 | -0.226036 | 1.000000 | -0.393723 |
| P3 | -0.806403 | -0.393723 | 1.000000 |
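As a rough baseline for reading those correlations, here is a small simulation of the “differences from average” effect on its own. It uses made-up runners whose three split paces are completely independent (no fatigue, no hills), then subtracts each runner’s mean. Because the three deviations must sum to zero, every pair comes out negatively correlated at about -0.5. Against that baseline, the P1/P3 value (-0.81) is stronger than arithmetic alone would give, and the P1/P2 value (-0.23) is weaker. The pace mean and spread below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_runners = 100_000

# Hypothetical runners: three independent split paces (min/mile) each.
paces = rng.normal(loc=9.0, scale=0.5, size=(n_runners, 3))

# Subtract each runner's own average pace, as described in the post.
deviations = paces - paces.mean(axis=1, keepdims=True)

# Columns are P1, P2, P3; rowvar=False treats columns as variables.
corr = np.corrcoef(deviations, rowvar=False)
print(np.round(corr, 2))
```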
Seeing the densities and the scatter plots helps us understand some of the overall structure of the paces, but it doesn’t quite answer the question of what kinds of pace splits the runners had. There are several options:
- Fast – Fast – Slow
- Fast – Slow – Fast
- Slow – Fast – Fast
- Reverse the Fast and Slow above
- All the same
- Generally getting slower
- Generally getting faster
Which ones are the most common? How about the least common? Answering those kinds of questions means finding groups of common structures in the data. Clustering sounds like a pretty good fit for that task (and it makes nice-looking charts!). To do the clustering, I started with K-Means, with K set to 20. I set it to 20 because there are 9 scenarios listed above, but each could come in varying magnitudes.
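For readers who want to try this, here is a minimal sketch of that step using scikit-learn’s `KMeans`. The post doesn’t say which library or settings were used, so treat the library choice, `n_init`, and `random_state` as my assumptions. The input is synthetic stand-in data, built the same way as the normalized splits (each row is a runner’s P1, P2, P3 deviation from their own average).

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the normalized splits (NOT the race data).
rng = np.random.default_rng(0)
paces = rng.normal(loc=9.0, scale=0.4, size=(2000, 3))
X = paces - paces.mean(axis=1, keepdims=True)

# K = 20, as in the post; n_init and random_state are arbitrary choices.
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)

centers = kmeans.cluster_centers_           # one (P1, P2, P3) shape per cluster
sizes = np.bincount(kmeans.labels_, minlength=20)

# Show the five largest clusters and the pace shape at their centers.
for idx in np.argsort(sizes)[::-1][:5]:
    print(sizes[idx], np.round(centers[idx], 2))
```

Each row of `centers` can be drawn as a three-point line (P1 to P3) to get the kind of cluster-shape graphic described next.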
The results of the clustering will be shown in two ways. One is just color-coded scatter plots. The other is a graphic of the cluster centers, where you can see how that cluster’s paces progress over the race (from left to right). The top of each cluster graphic has the number of runners that fit into that cluster. Let’s look at the centers first:
And the scatter plot of where they are (the cluster centers are to the right, with the appropriate colors):
K-Means does what it does best (carves the data into groups of roughly equal volume) and gives us 20 different “kinds” of runners. The most common runner is pretty average, but slows down in the last third. After that, the runners start increasing their separation from their averages in various ways. Notice that many of the shapes are repeated, just with more extreme magnitudes (such as Fast-Average-Slow).
I’ve attempted to show (and since I’m colorblind, I do this for my own benefit) where each cluster shape is on the P1 vs P3 scatter plot. If a cluster doesn’t have an arrow, that means that its center is outside the inner plot’s range. I kept it zoomed in for readability.
Here’s that same kind of plot, but for when we look for only 10 clusters:
I wanted to try out some other clustering methods, especially ones that can pick the number of clusters for you. MeanShift is a good option (it selects by density, and I want to find dense regions of runners), and it doesn’t crash like the methods that need 7000 x 7000 distance matrices (I’m using Python, so that bogs things down).
MeanShift gave 16 clusters, only 5 of which had more than 10 members. The others are extreme outliers. The top 5 cluster centers follow:
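Here is a minimal sketch of what a MeanShift run can look like with scikit-learn. The bandwidth settings are my assumptions (the post doesn’t give them), and the data is again a synthetic stand-in, so the number of clusters it finds will not match the race results.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Synthetic stand-in for the normalized splits (NOT the race data).
rng = np.random.default_rng(1)
paces = rng.normal(loc=9.0, scale=0.4, size=(1500, 3))
X = paces - paces.mean(axis=1, keepdims=True)

# MeanShift needs a kernel bandwidth rather than a cluster count.
# quantile=0.2 is an arbitrary choice; smaller values give more clusters.
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500, random_state=0)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)

n_found = len(ms.cluster_centers_)
sizes = np.bincount(ms.labels_, minlength=n_found)

# Keep only clusters with more than 10 members, as done in the post.
big = np.flatnonzero(sizes > 10)
print(f"{n_found} clusters found, {len(big)} with more than 10 members")
```

The bandwidth is the main knob here: it controls how wide a neighborhood counts as one dense region, so it is worth trying a few values before reading much into the cluster count.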
Here are their positions in the data:
Since MeanShift works by finding high-density regions, there’s a lot of aggregation in the biggest cluster. It managed to find some clusters farther out of that dense middle, and those clusters represent runners that have more extreme splits in their paces. It’s interesting to note that the extreme behaviors don’t include a “Slow-Fast-Slow” pattern. Those patterns do show up to some extent in the K-Means clusters.
Now we can see how the top runners sorted into those clusters, to check whether any particular cluster ‘creates’ more winners. The top 200 runners were selected, and their cluster membership recorded. The following chart shows those clusters, ordered by how often they occur among the winners. Two numbers are given above each cluster center. The first is the percentage of the top 200 runners that ended up in the cluster. The second is the percentage of all the runners in that cluster. A winner percentage higher than the all-runner percentage indicates a structure that might be more likely to win.
About 86% of the winners belonged to clusters of highly consistent paces. It’s not that the extreme clusters can’t win (some do!), it’s that maintaining a fast enough average pace while your pace varies wildly might be taxing, which makes it harder to produce a winner.
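The two percentages can be computed along these lines. This is a sketch under assumptions: the function name `cluster_shares` is mine, cluster labels are taken to be integers starting at 0, and the demo inputs are random placeholders rather than the race results.

```python
import numpy as np

def cluster_shares(labels, finish_times, top_n=200):
    """Percent of the top_n fastest runners, and of all runners, per cluster."""
    labels = np.asarray(labels)
    finish_times = np.asarray(finish_times, dtype=float)
    n_clusters = int(labels.max()) + 1

    # Indices of the top_n smallest finish times (the fastest runners).
    fastest = np.argsort(finish_times)[:top_n]

    top_pct = 100.0 * np.bincount(labels[fastest], minlength=n_clusters) / len(fastest)
    all_pct = 100.0 * np.bincount(labels, minlength=n_clusters) / len(labels)
    return top_pct, all_pct

# Placeholder inputs: 1000 runners randomly assigned to 5 clusters.
rng = np.random.default_rng(3)
labels = rng.integers(0, 5, size=1000)
finish_times = rng.normal(95, 15, size=1000)

top_pct, all_pct = cluster_shares(labels, finish_times, top_n=200)
for k in np.argsort(top_pct)[::-1]:
    print(f"cluster {k}: {top_pct[k]:.1f}% of top 200, {all_pct[k]:.1f}% of all runners")
```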
What do we know about different kinds of runners now? Mostly the same thing as was hypothesized, but now we have pictures! There are all kinds of different magnitudes and shapes of the pace structures, some more common than others. The K-Means showed us many different magnitudes for the same kind of shapes. The MeanShift did some major aggregation, leaving us with only a few distinct kinds of runners.
We saw that the top 200 runners were more likely to have highly consistent paces, but that there is room for more extreme pace structures.
Overall, the clustering proved insightful, and helped in seeing what kinds of pace structures existed. What do you think? Do you have suggestions on better ways to cluster or format the data before clustering? Let us know in the comments.