Welcome to the 2014 Crunching The Numbers Year-End Review! Okay, so it's probably a little late for "year-end," but like Congress passing the budget, late is better than never. For those of you who may be new to the site, "Crunching The Numbers" is a series I do during the regular season where I try to use football metrics to predict which teams will succeed and which teams will fail (don't tell Patrick!).
In an effort to improve the formula, each year I conduct a simple linear regression analysis on my statistics to determine how well they relate to winning in the NFL. I then use that data to either add statistics that are relevant or remove ones that aren't. Last year (the first year that I conducted the analysis) I analyzed seventy-three stats from Team Rankings and found that about thirty of them actually correlate with winning. This year, for the sake of time, I analyzed only those thirty or so statistics. If it seems to you that I am introducing a bias by excluding the "losing" statistics from 2013, you're right. My rationale is that if there were a true shift towards the metrics I am excluding, the ones I am including would fail to correlate with winning. So until I see evidence that suggests I am analyzing the wrong statistics, I am going to stay with what I have.
But wait, there's more! I have also embarked on a long-term statistic trending project. Not only have I analyzed how the selected metrics correlated with winning in 2014, but I also looked into how they relate to winning for the past two seasons. To do this, I took the average of how each team performed in each category for the past two years (32 regular-season games) and then analyzed that against how many games each team has won since 2013. The idea is that if a statistic is not a fluke and truly correlates with winning, then that relationship will get stronger as more seasons are analyzed. Conversely, if the metric is a fluke (or if the league is moving in a different direction) then the relationship will get weaker. I'll get into the results in a bit, but first I will offer a brief background on the numbers themselves.
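The two-season pooling described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual spreadsheet behind the series: the team names and every number below are made-up placeholders, and `pool_two_seasons` and `pearson_r` are hypothetical helper names.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def pool_two_seasons(stat_a, stat_b, wins_a, wins_b):
    """Average a per-team stat over two seasons and total the wins."""
    teams = sorted(stat_a)
    avg_stat = [(stat_a[t] + stat_b[t]) / 2 for t in teams]
    total_wins = [wins_a[t] + wins_b[t] for t in teams]
    return teams, avg_stat, total_wins

# Placeholder data for four imaginary teams (NOT real NFL numbers).
stat_2013 = {"A": 27.0, "B": 21.5, "C": 18.0, "D": 24.0}
stat_2014 = {"A": 25.0, "B": 23.5, "C": 16.0, "D": 20.0}
wins_2013 = {"A": 12, "B": 8, "C": 4, "D": 9}
wins_2014 = {"A": 11, "B": 9, "C": 3, "D": 7}

teams, avg_stat, total_wins = pool_two_seasons(
    stat_2013, stat_2014, wins_2013, wins_2014
)
r_pooled = pearson_r(avg_stat, total_wins)
print(teams, avg_stat, total_wins, round(r_pooled, 3))
```

With real data there would be 32 entries per dictionary, one per team, and the pooled correlation would be compared against the single-season value to see whether the relationship strengthens or weakens.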
How to Read the Numbers
(Note: there is a TL;DR summary at the bottom of this section)
When a linear regression analysis is conducted, one of the results is a correlation coefficient, which is just a fancy term for a number that tells you whether there is a relationship between two sets of data, and in which direction. If the number is positive, then there is a positive correlation, meaning that both sets of data move in the same direction. The easiest example here is points scored - as a team scores more points, it wins more games, meaning that there is a positive correlation between the two. On the other hand, a negative number implies a negative correlation, meaning that the two sets of data move in opposite directions. The matching example is points allowed: as a team allows more points, it wins fewer games (or, put the other way, fewer points allowed means more games won).
In order to be confident that the relationship is meaningful and not just a coincidence, the value must reach a certain threshold. To make things easier, I have normalized this threshold to 1. Essentially, if the absolute value of a correlation coefficient is at least 1.00 (that is, 1.00 or higher for a positive relationship, -1.00 or lower for a negative one), then we can say with 95% confidence that the trend is real. For reference, the maximum absolute value the coefficient can have is 2.8623.
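For the curious, here is one way the normalization could work. The exact method isn't spelled out above, so treat this as an assumption: take the ordinary Pearson coefficient (which runs from -1 to 1) and divide it by the smallest value that is significant at 95% confidence for 32 teams. The critical t value below is the standard two-tailed table value for 30 degrees of freedom, and `critical_r` and `normalize` are hypothetical helper names.

```python
from math import sqrt

N_TEAMS = 32          # one data point per NFL team
T_CRIT_95 = 2.042272  # two-tailed 95% critical t for 30 degrees of freedom (table value)

def critical_r(n, t_crit):
    """Smallest |r| that is significant for n points, given the critical t."""
    df = n - 2
    return t_crit / sqrt(t_crit ** 2 + df)

def normalize(r, n=N_TEAMS, t_crit=T_CRIT_95):
    """Rescale a raw Pearson r so the 95% threshold sits at exactly 1.00.

    Assumption: this is a guess at the normalization, not a confirmed method.
    """
    return r / critical_r(n, t_crit)

r_crit = critical_r(N_TEAMS, T_CRIT_95)
max_normalized = normalize(1.0)  # a perfect correlation, rescaled
print(round(r_crit, 4), round(max_normalized, 4))
```

Under that assumption the threshold for a raw coefficient is about 0.35, and a perfect correlation of 1 rescales to roughly 2.86, in line with the 2.8623 maximum quoted above.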
- A positive number implies a positive relationship (for example, more points scored = more games won)
- A negative number implies a negative relationship (for example, fewer points allowed = more games won)
- In order to have 95% confidence that a number is not coincidence, the absolute value of the number must be at least 1.00
- The maximum absolute value a number can have is 2.8623
We're going to get into a lot of numbers in the next two posts in this three-part miniseries, so I'll start off slow with the overall results of the rankings (called the "Rank Index"). For reference, I compare them to the score differential, since that has by far the strongest connection to winning:
|Metric|Correlation with winning|
|Score Differential (Week 4)|2.08|
|Rank Index (Week 4)|1.96|
I included the correlation going back to Week 4 for two reasons. First, Week 4 is the first week of the season in which I calculate the Rank Index, since it takes that long to get a good sample size. Second, since the point of "Crunching The Numbers" is to predict if a team will succeed, it only makes sense to see how it fared early in the season. In fact, the entire exercise slowly loses value as the season progresses, so comparing my final Rank Index to the final Score Differential is essentially pointless; at that point, no "predictive model" will beat out what has already taken place.
But there is something to glean from the Week 4 results. As you can see, both the Rank Index and the score differential have a very strong correlation with winning right out of the gate. Unfortunately, the score differential is still a better indicator of future success. My goal with Crunching The Numbers is to develop a formula that correlates with winning more strongly in Week 4 than the score differential does. Remember, the maximum possible value here for the correlation is 2.8623, so there is plenty of room to improve upon the 2.08 that the score differential generates.
The Rank Index did a good job of predicting success, and my 2014 iteration of the formula was an improvement upon its 2013 predecessor (the correlation value for 2013 was 1.57, well below this year's value of 1.96). However, until it can consistently predict which teams will win better than the simple use of the score differential, the entire exercise is essentially pointless. I still believe that, like the Ark of the Covenant, that magic predictive formula is out there somewhere, and someday I will find it. Hopefully ancient spirits won't jump out of my computer screen and make my head explode when that happens.
Keep a lookout for my next post on this topic, where I will make observations on how the individual statistics performed and what they mean for modern professional football.