The Most Important Offseason Acquisition For The San Francisco Giants Could Be Hadoop

Editor’s note: Barry Eggers is a managing director at Lightspeed Venture Partners where he focuses on information technology infrastructure, with a specific interest in cloud computing, big data, storage, consumerization of IT, and networking. Follow him on Twitter.

Baseball, more so than other sports, is known for its massive data collection, complex statistics and informed managerial decisions. So it should be no surprise that, just as corporate enterprises are going through a big data revolution, so will baseball. While the technology that enables big data is quite technical and designed to operate behind the scenes, the direct impact of big data on the average consumer will be quite visible over time. Hadoop, with its ability to manage massive data sets, is about to change the game of baseball.

Evolution Of Data Collection In Baseball

In the late 1800s, baseball was about measuring balls, strikes, hits, runs, and wins. By the mid-1900s, percentages became all the rage: We saw the emergence of batting average (BA), earned run average (ERA), on-base percentage (OBP), slugging percentage (SLG), and fielding percentage (FLD). Then, during the 1970s and 1980s, Bill James wrote a series of Baseball Abstract books that provided a new perspective on evaluating players and measuring the true impact on their teams’ chances of winning.

James’ innovations include such formulas as runs created, (Total Bases x [Hits + Walks]) / (Plate Appearances); range factor, (Assists + Put Outs) / (Games Played); and the “Temperature Gauge” to measure how “hot” a player is. James’ original metrics have been refined over time. For example, runs created has largely given way to weighted runs created-plus (wRC+), which builds on weighted on-base average (wOBA) to compare a player’s run creation with the rest of the league while accounting for ballpark factors and run-scoring environments. These abstracts played a leading role in Michael Lewis’ best-selling book, and later Hollywood film, Moneyball. Clearly, there’s a lot more than spitting chew and emptying Gatorade coolers going on in the dugout.
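
The two formulas quoted above translate directly into a few lines of code. The following is a minimal Python sketch; the function names and the stat line in the usage example are invented for illustration and do not describe any real player.

```python
def runs_created(total_bases, hits, walks, plate_appearances):
    """Basic runs created: total bases x (hits + walks) / plate appearances."""
    if plate_appearances <= 0:
        raise ValueError("plate_appearances must be positive")
    return total_bases * (hits + walks) / plate_appearances


def range_factor(assists, putouts, games_played):
    """Range factor as quoted above: (assists + putouts) / games played."""
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    return (assists + putouts) / games_played


# Hypothetical season totals, chosen only to exercise the formulas.
rc = runs_created(total_bases=300, hits=180, walks=70, plate_appearances=650)
rf = range_factor(assists=400, putouts=250, games_played=150)
print(f"runs created: {rc:.1f}, range factor: {rf:.2f}")
```

Both functions are plain arithmetic, which is the point: the early sabermetric measures needed nothing more than a box score and a calculator.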

Today’s Game

In the modern game of baseball, everything is measured. The trajectory and location of every pitch are tracked in all 30 stadiums and the movements of every fielder are now being tracked in certain stadiums. The San Francisco Giants are early adopters; major league hitters now have a batted-ball spray chart and an associated heat map, which measures the effectiveness of each hit ball as it relates to every ballpark. The Oakland A’s have also earned recognition for their use of data – not only for in-game strategy but also to build their roster. Wondering why Billy Beane traded for Arizona’s Chris Young last month? This might provide some clues. Soon, the trajectory of every hit ball will be recorded by video cameras in major league ballparks. Big brother is watching the Panda.

Welcome To The “Big Data Era Of Baseball”

This, of course, is where things get interesting. Until now, baseball teams have been focused on measuring finite events, crunching complex statistics, and performing a basic type of tactical decision analysis. But now teams are beginning to gather unstructured data. So just as corporate enterprises have moved from structured to unstructured data to provide new insights and give them an advantage over the competition, so will baseball teams.

At least one major league team, and likely more, is evaluating a small Hadoop cluster. Hadoop is an open-source programming framework that supports the distributed processing of very large data sets. To give you an example, companies like Yahoo use it to analyze data from all over the web, and it grew out of techniques Google published for processing its search index.

So why would a baseball organization need a Hadoop cluster? Because unstructured data may unlock insights that are not apparent from the structured event data that is available to every team. Baseball managers, like CEOs, believe that the past is a great predictor of the future. By having his data scientist run a Hadoop job before every game, Bruce Bochy can not only make an informed decision about where to locate a 3-1 Matt Cain pitch to Prince Fielder, but he can also predict how and where the ball might be hit, how much ground his infielders and outfielders can cover on such a hit, and thus determine where to shift his defense.

Taken one step further, it’s not hard to imagine a day where managers like Bochy have their locker room data scientist run real-time, in-game analytics using technologies like Cassandra, HBase, Drill, and Impala.
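
To make the pre-game job a little more concrete, here is a rough sketch of the map, shuffle, and reduce steps such a job might perform. It runs in plain Python and only imitates the MapReduce pattern in a single process; it does not use Hadoop's actual API. The records, field zones, and function names are all hypothetical.

```python
from collections import defaultdict

# Invented batted-ball records: (batter, pitch count, field zone where the ball landed).
BATTED_BALLS = [
    ("Batter A", "3-1", "right field"),
    ("Batter A", "3-1", "right field"),
    ("Batter A", "3-1", "center field"),
    ("Batter A", "0-2", "left field"),
    ("Batter B", "3-1", "left field"),
]


def map_record(record):
    """Map step: emit ((batter, count, zone), 1) for one batted ball."""
    batter, count, zone = record
    yield (batter, count, zone), 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: total the batted balls seen for one key."""
    return key, sum(values)


def spray_distribution(records, batter, count):
    """Share of a batter's balls in play per zone for a given pitch count."""
    pairs = (pair for record in records for pair in map_record(record))
    totals = dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())
    zones = {z: n for (b, c, z), n in totals.items() if b == batter and c == count}
    total = sum(zones.values())
    return {z: n / total for z, n in zones.items()} if total else {}


dist = spray_distribution(BATTED_BALLS, "Batter A", "3-1")
likely_zone = max(dist, key=dist.get)  # the zone a shift would favor
print(likely_zone, dist)
```

On a real cluster the same counting logic would be spread across many machines and far richer inputs, such as pitch trajectories and fielder positions, but the shape of the computation is the same.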

Will Big Data Ruin Baseball?

This raises the question, will big data ruin baseball? Will tracking and analyzing this mountain of data take the enjoyment out of the game? I don’t think so. Our national pastime has survived the Black Sox scandal, the designated hitter, pullover uniforms, free agency, night games, multiple players’ strikes, the dead ball era, the live ball era, and of course steroids. Big data is not nearly as threatening.

In fact, big data might be the great neutralizer between large-market and small-market teams. Teams with the most advanced predictive algorithms would have an advantage. Bay Area teams should have an even larger advantage, since the region is the epicenter of big data. If you are an avid Giants fan and a data scientist, your dream job may soon be available. But move quickly, because the team across the Bay may already have a head start – they do, after all, have a Hadoop-like elephant for a mascot.