How Data Changes Preconceptions About NFL Football, The Weather And The Parallel Universe

Cloudera’s Jesse Anderson wanted to know if weather changes the outcome of NFL football games. He was also curious if player arrests were linked to their team winning or losing. And finally, he was curious how this might all play out in a parallel universe.

So he did what any data scientist would do: he collected data on 471,392 plays in 2,898 games played since 2002. Getting the weather for each game posed a problem. The stadiums could be found on Wikipedia, but not all the weather stations were labeled well enough to tell which were closest to where the games were played. He could, however, see the latitude and longitude of each weather station. Once he had the weather stations pinpointed, he downloaded the data from the National Climatic Data Center. He then joined all of that with player arrest data kept by the San Diego Union-Tribune.
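Matching a stadium to its closest weather station by latitude and longitude can be sketched roughly as follows. This is only an illustration, not Anderson's code: the haversine great-circle distance is one common choice for this kind of lookup, and the coordinates, station IDs and function names here are made up for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(stadium, stations):
    """Return the station dict closest to the (lat, lon) stadium tuple."""
    return min(
        stations,
        key=lambda s: haversine_km(stadium[0], stadium[1], s["lat"], s["lon"]),
    )


# Hypothetical coordinates, for illustration only.
stadium = (39.28, -76.62)
stations = [
    {"id": "STATION_A", "lat": 39.17, "lon": -76.68},
    {"id": "STATION_B", "lat": 40.78, "lon": -73.97},
]

closest = nearest_station(stadium, stations)
print(closest["id"])
```

A linear scan like this is plenty for a few dozen stadiums; a spatial index would only start to matter with far more points.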

He also used MapReduce, the technology pioneered by Google, to analyze the information and determine whether the data overturned or confirmed preconceived notions about how weather affects the outcome of a game. He applied the same data to the player arrests to learn which side gets arrested more, the winning team or the losing team.

He separated the data into 96 columns, taking into account a host of factors. For the stadium data he factored in the capacity of the stadium, the type of turf, the elevation and other data points. The weather data included the type of precipitation, the wind and the temperatures. Looking at player arrests, he had columns for the name of the player arrested, the team he played on and whether the arrest took place during a home or away game.

Querying the data with MapReduce requires a lot of custom work, so he imported the data into Hive, where he could query it more easily. He also used Cloudera Impala, a new data analytics technology that also builds on work done at Google.
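To give a feel for the difference, here is a small sketch that answers the same question twice: once with hand-written map and reduce steps, and once with a single SQL query. Everything in it is an assumption made for illustration. The table layout, column names and sample rows are invented, the "share of games with at least one fumble" is just one plausible reading of the statistics, and Python's built-in sqlite3 stands in for Hive so the example runs anywhere.

```python
import sqlite3
from collections import defaultdict

# Hypothetical rows: (game_id, inclement_weather, fumble_on_this_play)
plays = [
    ("g1", 1, 0),
    ("g1", 1, 1),
    ("g2", 0, 0),
    ("g2", 0, 0),
    ("g3", 1, 0),
]


def map_play(play):
    """Map step: emit ((game, weather), fumble flag) for each play."""
    game_id, inclement, fumble = play
    yield (game_id, inclement), fumble


def reduce_game(key, fumbles):
    """Reduce step: did this game have at least one fumble?"""
    return key, int(any(fumbles))


def fumble_share_mapreduce(rows):
    """Share of games with a fumble, per weather flag, done by hand."""
    grouped = defaultdict(list)  # the "shuffle": group values by key
    for play in rows:
        for key, value in map_play(play):
            grouped[key].append(value)
    totals = defaultdict(lambda: [0, 0])  # weather -> [fumble games, games]
    for key, values in grouped.items():
        (_, inclement), had_fumble = reduce_game(key, values)
        totals[inclement][0] += had_fumble
        totals[inclement][1] += 1
    return {k: hits / games for k, (hits, games) in totals.items()}


def fumble_share_sql(rows):
    """The same question as one declarative query (sqlite3 as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE plays (game_id TEXT, inclement INTEGER, fumble INTEGER)"
    )
    conn.executemany("INSERT INTO plays VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT inclement, AVG(had_fumble)
        FROM (SELECT game_id, inclement, MAX(fumble) AS had_fumble
              FROM plays
              GROUP BY game_id, inclement) AS per_game
        GROUP BY inclement
        """
    ).fetchall()
    conn.close()
    return dict(result)


print(fumble_share_mapreduce(plays))
print(fumble_share_sql(plays))
```

Both functions return the same numbers. The SQL version simply leaves the grouping and aggregation plumbing to the query engine, which is the convenience a SQL-style layer offers over writing each job by hand.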

His analysis confirmed some of his preconceptions and disproved others. Some of the results:

  • 1,105 games had some sort of inclement weather.
  • Games with inclement weather have a 93% chance of a fumble, compared with 56% for games where the weather was not a significant factor.
  • The Baltimore Ravens were the only team with a weather advantage, sporting a record of 22-14.
  • Generally, weather does not affect the outcome of games, but the data does show that it affects what happens on the field.

As for arrests:

When players on the home team, the away team or both have been arrested, the home team wins 57 percent of the time.

From 2002 to 2012, each team had many arrests, from a low of 56 percent in 2002 to a high of 91 percent in 2012.

Arrests were so numerous that Anderson could not determine if they were due to the team having a discipline issue or some other factor.

Football in a Parallel Universe

Beyond the research, Anderson wanted to know whether there is true randomness in a sport. At any given moment, an atom is in a certain state. So if you knew the state of every subatomic particle in and around the stadium, the outcome of the game could be determined by simulating all of that data. Things like thunderstorms could be predicted. So no, he concluded, there is no true randomness in a sport.

There is no way right now to analyze data at the subatomic level as Anderson describes. Nor is there proof that weather affects the outcome of a game. But there is proof that data-driven approaches can help dispel preconceived notions. And that can often be the difference between a win and a loss.