The company developed Kafka, an open source message streaming tool, to make it easier to move massive amounts of data around a network from application to application. It has become so essential that LinkedIn has dedicated 1,800 servers to moving more than 2 trillion transactions per day through Kafka, Jiangjie Qin, lead software engineer on the Cruise Control project, told TechCrunch.
With that kind of volume, keeping the Kafka clusters running has become mission-critical, so earlier this year the team decided to create a tool that would recognize when a cluster was going to break. Then, based on a set of predefined rules, it would automatically configure the cluster to use the correct number of resources, fix itself and keep running. That tool became Cruise Control.
Prior to creating Cruise Control, engineers had to manually reconfigure a cluster each time one went down, and Qin says this was a tricky proposition because a cluster that was configured incorrectly could have a cascading impact across other clusters. Putting the machine in charge of cluster management, with some human oversight, greatly simplified the process and allowed the team to scale cluster repair to meet the needs of its growing network in a way that just wasn’t possible when engineers had to do all of the work manually.
At its core, Qin explained, this was a load balancing problem: Did the cluster have the right number of resources to stay running without having a negative impact on other clusters in the network? He said this was a matter of identifying some common configurations and applying a set of goals to each one. The machine can very quickly assess the needs of the cluster and check them against the set of common configurations and the set of goals to choose the correct one.
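To make the idea of checking configurations against goals concrete, here is a minimal Python sketch. It is not Cruise Control's code or API: the `Goal` class, the `choose_config` function, the broker names and the load thresholds are all hypothetical, chosen only to illustrate picking the first candidate configuration that meets every goal.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical model: a configuration maps a broker name to its load (0.0 to 1.0).
Config = Dict[str, float]


@dataclass
class Goal:
    """A named rule that a configuration either satisfies or does not."""
    name: str
    satisfied: Callable[[Config], bool]


def choose_config(candidates: List[Config], goals: List[Goal]) -> Optional[Config]:
    """Return the first candidate that satisfies every goal, or None if none does."""
    for config in candidates:
        if all(goal.satisfied(config) for goal in goals):
            return config
    return None


# Illustrative goals; the thresholds are assumptions, not values from LinkedIn.
goals = [
    Goal("no broker above 80% load", lambda c: max(c.values()) <= 0.80),
    Goal("load spread within 20 points",
         lambda c: max(c.values()) - min(c.values()) <= 0.20),
]

# Illustrative candidate configurations for a three-broker cluster.
candidates = [
    {"broker-1": 0.95, "broker-2": 0.40, "broker-3": 0.45},  # one broker overloaded
    {"broker-1": 0.70, "broker-2": 0.45, "broker-3": 0.65},  # spread too wide
    {"broker-1": 0.62, "broker-2": 0.58, "broker-3": 0.60},  # meets both goals
]

chosen = choose_config(candidates, goals)
print(chosen)
```

In this toy data, only the third candidate passes both goals. A production system would presumably rank goals by priority and weigh the cost of moving data, which this sketch leaves out.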
To make sure it’s on track, it’s possible to put a human check in the workflow, where Cruise Control asks an engineer to review the optimization plan before continuing.
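A human check of this kind can be pictured as an approval gate in front of the step that applies changes. The Python sketch below is an illustration under assumptions, not Cruise Control's actual workflow: `apply_plan`, the `approve` callback and the plan steps are hypothetical names.

```python
from typing import Callable, List


def apply_plan(plan: List[str], approve: Callable[[List[str]], bool]) -> List[str]:
    """Run the plan's steps only if the reviewer approves; return the steps that ran."""
    if not approve(plan):
        # The reviewer rejected the plan, so nothing is changed.
        return []
    executed = []
    for step in plan:
        # A real system would perform the change here; this sketch only records it.
        executed.append(step)
    return executed


# Hypothetical optimization plan made up of partition moves.
plan = [
    "move partition 3 from broker-1 to broker-2",
    "move partition 7 from broker-1 to broker-3",
]

# The approve callback stands in for an engineer's decision.
approved_run = apply_plan(plan, approve=lambda p: len(p) <= 5)
rejected_run = apply_plan(plan, approve=lambda p: False)
print(approved_run)
print(rejected_run)
```

Passing the reviewer in as a callback keeps the gate optional: an automatic policy, or a prompt that waits for an engineer, can be swapped in without changing the code that applies the plan.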
If this seems like a tool that would have been nice to have earlier, Qin acknowledges that it would have been, but it took the scalability issues to drive the company to commit the engineering resources to solving the problem.
It took about half a year of tinkering to find the right solution where the machine could process the changes more efficiently than humans could. The company plans to release the tool to the open source community with the goal of not only improving the way it keeps Kafka clusters in balance, but also applying the same load balancing principles to other distributed systems, which should come in handy for a number of use cases, Qin says.