When journalist Nick Faris wrote last year that if Alexander Ovechkin catches Wayne Gretzy’s lifetime goal-scoring record, it could be credited to his “massive edge in OT goals,” I knew it was a bad take. First off, Ovi’s 24 goals currently represent 3% of his total goals scored, so there’s not a “massive edge” there. Second, there are so many ways in which hockey—all sports, really—have changed between Gretzky’s playing era (1979-1999) and Ovi’s (2005-present), that summarizing the difference with one statistic is misleading.

All I really wanted to do was find a time series of goalie height and weight. Goalie size seemed the most obvious difference between the two eras—it’s easily quantifiable and doesn’t rely on more nuanced discussions around hocky play or strategy. Alas, I was just unable to find a graph or even a table.

Here’s a run-down of the data and tools I used in today’s post on the Gretzky-Ovechkin Eras.

Goals Against Average (GAA)

This was, by far, the easiest series to get. The Hockey Reference website has the exact table I was looking for and includes, by year, total games played, games played, goals, power play goals, power play opportunities, power play percentage, penalty killing percentage, shots against, saves, save percentage, and goals against average.

Overtime Games

This one seemed like it should be easy to find. I mean, if Hockey Reference is going to provide those 10 series in a single table, then the number of overtime games should be easy to include there too, right? Nope.

So, I went over to NHL.com where I could download final team-level statistics for each season in separate Excel files. It was time-consuming and tedious, which is why I only went back to the 1977-1978 season. Prior to 2005, the file included the number of OT games, but starting with the 2005-2006 season, when the OT rules changed, the files didn’t include a simple count of the number of OT games. For this latest period, I calculated the number of OT games as ROW — RW + OT + SOWin, where ROW=Regulation Plus Overtime Wins, RW=Regulation Wins, OT=Overtime Losses, and SOWin=Shootout Games Won. I double-checked my numbers against a couple of blog posts that had tabulations to make sure this is correct.

Now, with 43 separate files, I read them into Stata, calculated the number of OT games, sorted, appended, and exported them to a new Excel file.

Goalie Height & Weight

This is where I started this entire exercise. I first found player data on when they started playing and thought I could use this as a proxy for height/weight in each year, but it wasn’t perfect. I then remembered that there is an API in the R programming language that pulls from NHL.com. Turns out that the NHLAPI has height and weight for each player.

Now I’m no expert at using R or the NHLAPI, but it seems like to get player data from the API, you either need to know their exact name or their ID number (presumably some number assigned in the data)—you can’t just download all goalies or all forwards. So, I went deeper in the Hockey Reference website and grabbed the name of every NHL goalie, which I could then load as a series of strings into an array in R to query the NHL API.

With all of these names, I could pull their ID from the main data file that contains their player statistics, and then use that to pull out their demographic information from a separate file. This is just a series of queries and a left_join. The last thing I needed to do was delete repeated names, which I think happens if a player gets traded mid-season, by using the duplicated argument. Finally, I created a dummy variable for the Gretzky and Ovechkin years of play.

There are a few caveats with these data:

  • Height and weight don’t change in the data. This probably doesn’t matter for height, but weight presumably changes, so it could conceivably matter. I don’t know if they have the most recent data and project backwards or the other way around.
  • The data include all years of play, including other leagues. I deleted all non-NHL-playing years.
  • I did not filter by number of games played—if the goalie was listed as having played in the NHL in that year, they are included in my data.

Visualizing the Results

I’m trying to learn Tableau these days, so I set about making the basic graphs in Tableau. But I also wanted to include single static graphs in the main post, so I also made them in Excel.

The graphs basically consist of three elements:

  • Main line graph of the data series. Creating this in both Tableau and Excel is pretty straightforward.
  • Lines for the average value in the two eras. In Excel, this is just a simple line, but it’s a little more complicated in Tableau because you have to make a separate line chart and then create a dual axis chart and synchronize the axes. Plus, the order in which you create all the graphs matters. I don’t know; that seemed tedious. Plus, adding the little annotation marks was tedious—in Excel, it’s a simple data label that I could easily edit manually.
  • Bands for the two eras. In Excel, I do this by combining a column chart with the line chart. In Tableau, I used the Reference Band tool and added the label at the bottom of each band—after learning how to remove the background behind the label, I actually like this approach in Tableau because Excel requires either a scatterplot hack or a text box, which is hard to align.

I also created a table with a little Gantt chart at the top of the post. I’ve already posted a video on how making Tables in Tableau is unnecessarily difficult, but I was mostly able to do it with two tables and a chart organized in a dashboard. In Excel, this version is most easily done by using Conditional Formatting as a sort of stripe chart. You can watch how I did this in both tools in this YouTube video.

Downloads

I’ve put everything in this zip file, including my raw data tabs, Stata and R code, Tableau file, and Excel graphs. Have at it but please cite PolicyViz.com if you publish anything.

Conclusion

In the end, this was a fun data collection and data visualization exercise. I enjoyed trying to find all the different data sources and it really just speaks to how sometimes if you want to answer a data-driven question, you need to do some work. But if you love the idea and the challenge, it can be fun and rewarding too.