Davis Nguyen, Vice President, is a member of the Aladdin Product Group’s Business Operations Systems team.
As BlackRock continues to grow in terms of AUM, products, customers, and portfolios managed, trading volumes are increasing exponentially. Tasked with trying to scale an application ecosystem so that it remains responsive in the face of these growing demands, I embarked on a journey into scalable architecture design that would lead me to develop a generic library that achieves distributed consensus in a decentralized way.
The Need for Speed (and therefore Consensus)
One of the most common techniques for scaling an application or service is to run more instances of it. A fundamental problem with this, however, is achieving overall system reliability in the presence of faulty processes. There needs to be a way for each instance in a cluster to agree on some state that is needed for each computation (like which instance should be doing what piece of work). This is known as the consensus problem. Many distributed technologies address this problem by relying on a centralized service such as Apache ZooKeeper to keep track of and manage distributed application state.
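To make the problem concrete, here is a minimal sketch in Java of what goes wrong when instances do not agree on cluster membership. Everything in it is hypothetical and chosen for illustration: the `ShardView` class, the range-based split, and the instance names `a`, `b`, and `c` are assumptions, not part of ZooKeeper or of any BlackRock system. Each instance computes the owner of a partition from its own view of who is alive.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the consensus problem; not taken from any real library.
final class ShardView {

    // Splits partitions 0..partitionCount-1 into contiguous ranges, one per member.
    // Returns member -> {firstPartition, lastPartition}, both inclusive.
    static Map<String, int[]> assignRanges(List<String> members, int partitionCount) {
        if (members.isEmpty()) {
            throw new IllegalArgumentException("at least one member is required");
        }
        Map<String, int[]> ranges = new LinkedHashMap<>();
        int base = partitionCount / members.size();
        int extra = partitionCount % members.size(); // the first `extra` members get one more
        int start = 0;
        for (int i = 0; i < members.size(); i++) {
            int size = base + (i < extra ? 1 : 0);
            ranges.put(members.get(i), new int[] {start, start + size - 1});
            start += size;
        }
        return ranges;
    }

    // Returns the member whose range contains the given partition.
    static String ownerOf(int partition, List<String> members, int partitionCount) {
        for (Map.Entry<String, int[]> e : assignRanges(members, partitionCount).entrySet()) {
            int[] range = e.getValue();
            if (partition >= range[0] && partition <= range[1]) {
                return e.getKey();
            }
        }
        throw new IllegalArgumentException("partition out of range: " + partition);
    }

    public static void main(String[] args) {
        List<String> viewOnA = List.of("a", "b", "c"); // instance a believes all three are alive
        List<String> viewOnB = List.of("a", "b");      // instance b believes c has failed
        System.out.println(ownerOf(2, viewOnA, 6));    // prints "b"
        System.out.println(ownerOf(2, viewOnB, 6));    // prints "a"
    }
}
```

With six partitions, the two views disagree about who owns partition 2, so the work could be done twice or not at all. Consensus is what guarantees that every instance acts on the same view.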
Jody Kochansky, our Global Head of Software Development, called me into his office to discuss how I was planning to scale out one of the core application ecosystems in Aladdin. Given the ever-growing demands the business places on the Aladdin platform, I told him we were looking at using ZooKeeper as a centralized service to coordinate the sharding of requests, allowing the application to scale horizontally in a fault‑tolerant way.
While we had been using ZooKeeper in production for other applications for a while, Jody was concerned about adding yet another system dependency into the mix, along with the increased complexity and operational maintenance that goes with it. What if there were a way to achieve the same behavior directly in the application instances themselves? Jody challenged us to create a generic library that could be embedded not just into our application but into any cluster of applications, allowing them to distribute application state in a decentralized manner. By doing so, we could remove the operational cost and risk of maintaining a common piece of ZooKeeper infrastructure for a variety of different application suites.
Raft to the Rescue
In my quest to come up with a response to Jody’s challenge, I stumbled upon the Raft consensus algorithm. It can be thought of as a simplified alternative to Paxos. Raft breaks up the consensus logic into smaller components, making it more coherent and digestible. After reading the white paper and understanding the mechanics of Raft, I knew it was exactly what I was looking for.
I explored using open-source library implementations of Raft, but none could be plugged into my existing Java application and integrate seamlessly with BlackRock’s communication layer. Writing a wrapper layer on top of an existing library was an option, but it would have required a lot more code than writing the library natively from scratch and would ultimately have been harder to support and maintain. As such, I decided to write my own implementation of Raft, along with an automated distributed testing framework to mimic scenarios such as network latency.
Within a couple of days, I was able to write a small proof-of-concept library, which I aptly named libraft. It leverages BlackRock’s proprietary messaging system, BMS, using a dedicated BMS broadcast channel for service discovery and intracluster communication. It also hooks directly into our Production Operations team’s tools, giving application owners who use the library a self-managed way to support and monitor their application cluster.
Raft in a Nutshell
Raft achieves consensus via an elected leader. In our case, once a leader is elected, its main role is to monitor the cluster state and assign partition ranges to each of the servers in the cluster. The process of electing a leader in a Raft cluster follows a simple but elegant algorithm. Each instance starts off in a follower state, with a randomized timeout within which it expects a heartbeat message from the leader.
If no heartbeat is received during this time, either because the existing leader has become unresponsive or because there is no active leader, the follower becomes a candidate for a new election term. The candidate then asks the other servers in the Raft cluster to vote for it, and if it receives votes from a majority of the cluster it becomes the leader. It then assumes the normal duties of a leader and remains leader until it becomes unresponsive itself.
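The election steps above can be sketched as a small in-memory simulation in Java. This is a hedged illustration, not libraft’s code: the `RaftNode` class, its method names, the `reachable` flag, and the 150 to 300 ms timeout range are all assumptions made for the example. Direct method calls stand in for the network, and parts of Raft that the text does not cover (log replication, log comparison when granting votes, candidates stepping down on seeing a higher term) are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical in-memory sketch of Raft leader election; not libraft's implementation.
final class RaftNode {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    final String id;
    final List<RaftNode> peers = new ArrayList<>();
    Role role = Role.FOLLOWER;   // every instance starts as a follower
    long currentTerm = 0;
    String votedFor = null;      // a node grants at most one vote per term
    boolean reachable = true;    // false simulates an unresponsive instance

    RaftNode(String id) { this.id = id; }

    // Randomized timeout (assumed range) so followers rarely time out at the same moment.
    // A real implementation would schedule onElectionTimeout() with this delay.
    static long randomElectionTimeoutMillis() {
        return ThreadLocalRandom.current().nextLong(150, 301);
    }

    // Called when no heartbeat arrived within the election timeout.
    void onElectionTimeout() {
        role = Role.CANDIDATE;
        currentTerm++;           // start a new election term
        votedFor = id;           // vote for ourselves
        int votes = 1;
        for (RaftNode peer : peers) {
            if (peer.reachable && peer.handleVoteRequest(currentTerm, id)) votes++;
        }
        int clusterSize = peers.size() + 1;
        if (votes > clusterSize / 2) {       // strict majority of the whole cluster
            role = Role.LEADER;
            for (RaftNode peer : peers) {
                if (peer.reachable) peer.handleHeartbeat(currentTerm);
            }
        }
        // Otherwise the node stays a candidate; real Raft retries after a fresh timeout.
    }

    boolean handleVoteRequest(long term, String candidateId) {
        if (term < currentTerm) return false;                 // stale candidate
        if (term > currentTerm) { currentTerm = term; votedFor = null; role = Role.FOLLOWER; }
        if (votedFor == null || votedFor.equals(candidateId)) {
            votedFor = candidateId;
            return true;
        }
        return false;                                         // already voted this term
    }

    void handleHeartbeat(long term) {
        if (term < currentTerm) return;                       // ignore a stale leader
        if (term > currentTerm) { currentTerm = term; votedFor = null; }
        role = Role.FOLLOWER;
    }

    // Builds a fully connected cluster from the given ids.
    static List<RaftNode> cluster(String... ids) {
        List<RaftNode> nodes = new ArrayList<>();
        for (String id : ids) nodes.add(new RaftNode(id));
        for (RaftNode n : nodes)
            for (RaftNode m : nodes)
                if (n != m) n.peers.add(m);
        return nodes;
    }

    public static void main(String[] args) {
        List<RaftNode> nodes = cluster("a", "b", "c");
        nodes.get(0).onElectionTimeout();        // a times out first and asks for votes
        System.out.println(nodes.get(0).role);   // prints "LEADER"
        nodes.get(0).reachable = false;          // the leader becomes unresponsive
        nodes.get(1).onElectionTimeout();        // b stops hearing heartbeats
        System.out.println(nodes.get(1).role + " term " + nodes.get(1).currentTerm); // prints "LEADER term 2"
    }
}
```

Requiring a strict majority is what prevents two leaders in the same term: each node votes at most once per term, so two candidates cannot both collect more than half of the votes.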
The Proof is in the Pudding
We now have two live clusters running in production, one of which has been running for almost six months. These clusters have reliably served millions of requests a day since their inception and have achieved a maximum throughput increase of 800%. With the success of this milestone we are looking at onboarding more applications with libraft and continuing the journey of maturing the technology for wider adoption in the BlackRock ecosystem.