Unlimited throughput: how to increase auto-merge performance

«Find the beginning of everything, and you will understand a lot.» — Kozma Prutkov

My name is Ruslan and I’m a Release Engineer at Bumble — the parent company that operates the Badoo and Bumble apps. Recently I found that optimising the auto-merge system in our mobile projects had become unavoidable. The task turned out to be interesting, so I decided to share its solution with you. In this article, I’ll tell you how we implemented automatic Git branch merging in the past, and how we then increased auto-merge performance without compromising our existing high-reliability standards.

Custom auto-merge

Many programmers run git merge every day, resolve conflicts and verify their changes using tests. Some set up builds so that they run automatically on a separate server, though it’s still up to an individual to decide which branches to merge. Others go further and add automatic merging, the result being a Continuous Integration (CI) system.

For example, GitHub offers a semi-manual mode where a user with permission to write to the repository can check the “Allow auto-merge” box. Providing the conditions configured in the project settings are met, the branch will be merged into the target branch. Bitbucket is more sophisticated in this regard but it also imposes significant restrictions on the branching model, branch names and the number of merges.

Such automation may be sufficient for small projects. But as the number of developers and branches increases, the constraints imposed by services can have a significant impact on CI performance. For example, we used to have a merge system whereby a consistent merge strategy kept the main branch in a stable state. One of the prerequisites for a merge was a successful build with all the commits of the main branch merged into the developer branch. This strategy works reliably but it greatly depends on the build time. And we soon noticed that we were starting to approach the limit. For example, with a build time of 30 minutes, processing 100 merges per day would require more than two days. To avoid this limitation and get maximum freedom of choice in merge strategies and branching models, we opted to create our own auto-merge.

The result is that now we have a custom auto-merge, which we can adapt to the needs of each team. Let’s take a look at the implementation of one of the most interesting schemes that our Android and iOS teams use.

Terms

Main. This is how I shall refer to the main branch in the repository.

Build. This refers to a TeamCity build associated with a Git branch and a ticket in the Jira tracker. It runs all kinds of checks including static analysis, compilation and testing. A successful build on the latest branch revision combined with a «To Merge» ticket status is one of the prerequisites for auto-merge.

Example of a branching model

We tried different branching models in mobile projects and came up with the following simplified version.

The developer creates a new branch based on main with a name that has the identifier of the ticket in the tracker e.g. PRJ-k. When the ticket is ready, the developer changes its status to «Resolved». Using hooks built into the tracker, we run a build for the ticket branch. At a certain point, when the changes have passed review and the necessary autotests at different levels, the ticket receives the status «To Merge» and it gets picked up by the automation and sent to main.

Once a week we create a release_x.y.z branch based on main, run the final builds on it, fix bugs if necessary, and submit the result to the App Store or Google Play. All phases of the branches are reflected in the statuses and additional fields of Jira tickets. Our REST API client helps in communicating with Jira.

This simple model not only allowed us to build a reliable auto-merge but also proved to be convenient for everyone involved in the process. But the auto-merge implementation itself changed several times before we achieved high performance and minimised side-effects such as conflicts, ticket reopening and unnecessary rebuilds.

Version one: a greedy strategy

At first, we started with the simple and obvious approach. We took all the tickets in «To Merge» status, selected those with successful builds, and sent them to main with the git merge command, one at a time.

Note: I have simplified the description of the first version somewhat. In reality, there was a dev branch between main and the developer branches, where all the problems described below occurred. Before merging main into dev, we tried to stabilise the builds with special “integration” branches, which were automatically created from dev at 24-hour intervals.

We checked whether TeamCity had a current successful build using the getAllBuilds REST API method as follows (pseudocode):

haveFailed = False  # Are there any failed builds?
haveActive = False  # Are there any active builds?

# Get buildType builds for the branch at commit revision
builds = teamCity.getAllBuilds(buildType, branch, commit)

# Check each build
for build in builds:
    # Check each revision in the build
    for revision in build.revisions:
        if revision.branch is branch and revision.commit is commit:
            # The build is on the latest revision
            if build.isSuccessful:
                # The build is up to date and successful
                return True
            else if build.isRunning or build.isQueued:
                haveActive = True
            else if build.isFailed:
                haveFailed = True

if haveFailed:
    # Remove the ticket from the queue by reopening it
    ticket = Jira.getTicket(branch.ticketKey)
    ticket.reopen("Build Failed")
    return False

if not haveActive:
    # There are no active, failed or successful builds, so start a new one
    TriggerBuild(buildType, branch)

return False

«Revision» is TeamCity’s term for a Git commit. Revisions are displayed as 16-character sequences on the «Changes» tab of the build page in the TeamCity web interface. Using revisions, we can easily determine, for instance, whether a ticket branch needs to be rebuilt or whether the ticket is ready for merging.

It’s important to know that you can (and often should) specify the revision in the lastChanges parameter of your request to add a new build to the queue. Otherwise, TeamCity may select an obsolete branch revision when it runs the build. As will be shown below, specifying the revision is necessary when, for example, the logic outside TeamCity is based on looking for builds on specific commits (our case).

From the time a ticket is moved to a ready status (or in our case «Resolved») the branch associated with it is usually unchanged. Therefore, most often the build remains relevant. Moreover, the «To Merge» status indicates that the last build most likely didn’t fail, because we have a hook that reopens the ticket if the build fails.

At first glance, the next steps seem obvious: take all the resolved tickets and merge them into main one by one. This is what we did in the first version of auto-merge.

It seemed to work well, and it was certainly fast enough. But at the same time, it required a lot of attention. Sometimes changes in several tickets conflicted with each other. Being quite a common phenomenon, merge conflicts didn’t pose any particular challenges at first. Any that did arise were resolved by the developers on release duty. However, as the number of developers, tasks and, consequently, branches grew, putting the release in order started to require more and more effort. Delays in resolving conflicts began to hold up new tasks. There’s no need to labour this point; I’m sure you know what I mean. The main thing was that the conflicts had to be dealt with, and so prevented from entering the release.

Merge conflicts

When you change the same line of code in different branches and try to merge them into main, Git asks you to resolve the merge conflicts. For each conflict, you have to choose between the two versions of the code before committing the changes.

This should be familiar to almost every Version Control System (VCS) user. The CI process, like any VCS user, needs to resolve conflicts. It is also true that this has to be done somewhat blindly, with little or no understanding of the codebase.

If the git merge command fails and all the files in the git ls-files --unmerged list contain conflict markers, we parse the contents of each such file against the <<<<<<<, ======= and >>>>>>> markers. If the conflicts are only caused by an application version change, for example, we choose the later version of the code between the local and remote parts of the conflict.
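The idea can be sketched as follows. This is a minimal illustration rather than our actual parser: it assumes each conflict differs only in a dotted version number, and the function name is ours:

```python
import re

def resolve_version_conflicts(text):
    """Resolve conflict hunks by keeping the side with the later version.

    Illustrative sketch: assumes every conflict in the file differs only
    in a dotted version number, as in the app-version example above.
    """
    def version_key(lines):
        # Extract a dotted version like 1.2.3 and compare it as a tuple
        m = re.search(r"(\d+(?:\.\d+)+)", "\n".join(lines))
        return tuple(int(x) for x in m.group(1).split(".")) if m else ()

    out, local, remote = [], [], []
    state = "normal"
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            state = "local"          # start of the local (ours) part
        elif line.startswith("=======") and state == "local":
            state = "remote"         # start of the remote (theirs) part
        elif line.startswith(">>>>>>>") and state == "remote":
            # Keep whichever side carries the later version number
            out.extend(local if version_key(local) > version_key(remote) else remote)
            local, remote, state = [], [], "normal"
        elif state == "local":
            local.append(line)
        elif state == "remote":
            remote.append(line)
        else:
            out.append(line)
    return "\n".join(out)
```

In all other cases the conflict is treated as unresolvable and handled as described below.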

A merge conflict is one of the simplest types of conflict in CI. When a branch conflicts with main, CI must notify the developer of the problem and exclude the branch from subsequent auto-merge cycles until it receives new commits.

The solution is as follows: break at least one of the necessary merge conditions. Since the branch is associated with a tracker ticket, we can reopen the ticket by changing its status. This way, we exclude the ticket from auto-merge and notify the developer at the same time (since they’re subscribed to changes in the ticket). To be on the safe side, we also send them a message via messenger.

Logical conflicts

Is it possible that, even though each of a couple of branches builds successfully in isolation, the build on the main branch fails after they are merged into it? Experience shows that indeed it can. For example, if the sum of a and b does not exceed 5 in each of two branches, this does not guarantee that the combined changes to a and b in those branches will not produce a larger sum.

Let’s try to reproduce this using the following test.sh Bash script as an example:

#!/bin/bash

get_a() {
    printf '%d\n' 1
}

get_b() {
    printf '%d\n' 2
}

check_limit() {
    local -i value="$1"
    local -i limit="$2"
    if (( value > limit )); then
        printf >&2 '%d > %d\n' "$value" "$limit"
        exit 1
    fi
}

limit=5
a=$(get_a)
b=$(get_b)
sum=$(( a + b ))

check_limit "$a" "$limit"
check_limit "$b" "$limit"
check_limit "$sum" "$limit"

printf 'OK\n'

Let’s commit it and create a couple of branches: a and b.

Let get_a() return 3 in the first branch and get_b() return 4 in the second:

$ git diff main a
diff --git a/test.sh b/test.sh
index f118d07..39d3b53 100644
--- a/test.sh
+++ b/test.sh
@@ -1,7 +1,7 @@
#!/bin/bash

get_a() {
- printf '%d\n' 1
+ printf '%d\n' 3
}

get_b() {

$ git diff main b
diff --git a/test.sh b/test.sh
index f118d07..0bd80bb 100644
--- a/test.sh
+++ b/test.sh
@@ -5,7 +5,7 @@ get_a() {
}

get_b() {
- printf '%d\n' 2
+ printf '%d\n' 4
}

check_limit() {

In both cases, the sum does not exceed 5 — and our test passes:

$ git checkout a && bash test.sh
Switched to branch 'a'
OK
$ git checkout b && bash test.sh
Switched to branch 'b'
OK

But after merging both branches into main, the test fails, despite there being no obvious conflicts:

$ git merge a b
Fast-forwarding to: a
Trying simple merge with b
Simple merge did not work, trying automatic merge.
Auto-merging test.sh
Merge made by the 'octopus' strategy.
test.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
$ bash test.sh
7 > 5

«It would be easier if, instead of get_a() and get_b(), we used assignments: a=1; b=2», says the attentive reader, and they would be right. Yes, it would be easier. Perhaps that’s why Git’s built-in auto-merge algorithm would have successfully detected the conflict situation (which would not have stopped the logical-conflict problem from being demonstrated):

$ git merge a
Updating 4d4f90e..8b55df0
Fast-forward
test.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

$ git merge b
Auto-merging test.sh
CONFLICT (content): Merge conflict in test.sh
Recorded preimage for 'test.sh'
Automatic merge failed; fix conflicts and then commit the result.

In practice, conflicts can be far less obvious. For example, different branches may rely on the APIs of different versions of some dependency library, with the newer version not being backwards-compatible. Detecting this is hard without in-depth knowledge of the codebase (read: without the project’s developers). In fact, CI is exactly what you need to solve such problems.

Of course, we won’t get away with just resolving the conflict: someone then has to make the changes. But the sooner we discover a problem, the fewer people will be involved in resolving it. Ideally, only the developer of one of the conflicting branches will need to be involved; if there are two such branches, one of them may well already be merged into main.

Preventative measures

So, the main thing is to keep logical conflicts out of main. Otherwise, you face a long and painful search for the source of the errors, and then a search for the programmer who should, or can, solve the problem. And this needs to be done as quickly and accurately as possible, both to avoid delaying the release and to avoid new logical conflicts building on top of the one already identified. Such conflicts often prevent a large part of the application from working correctly, or prevent it from running at all.

We need to synchronise the branches so that their combined contribution to main doesn’t break the release build. Clearly, all the ready branches need to be merged one way or another, and tests need to be run on the merge result. There are many ways of doing this; let’s see what our particular path was.

Second version: a sequential strategy

It became clear that the existing auto-merge readiness conditions for a ticket were insufficient. Some means of synchronisation between branches was going to be required; some sort of order.

Git is supposed to be a synchronisation tool, but the order in which branches get into main, and the time at which they are updated from main, is up to us to decide. To determine exactly which branches cause problems in main, you can try sending them there one by one. Then you can queue them up and order them by the time the ticket reached «To Merge» status, in «First In, First Out» fashion.

So, now we’ve got the order, but how do we connect the branches? Suppose we merge the first ticket in the queue into main. Because main has changed, it may now conflict with the other tickets in the queue. So, before merging the next ticket, we need to make sure that the updated main is still compatible with it. To do this, we simply merge main into the ticket branch. But because the state of the branch after merging main differs from the state that was built, we have to restart the build. To maintain order, all the other tickets in the queue must wait for this build to complete and for the tickets ahead of them to be processed. This is roughly the reasoning that led us to the sequential auto-merge strategy.
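The sequential strategy can be sketched as a single pass over the ticket queue. This is an illustrative outline rather than our production code: the callback parameters stand in for the real Git, TeamCity and Jira operations:

```python
def sequential_merge_pass(queue, merge_main_into, build_ok, merge_into_main, reopen):
    """One pass of the sequential (FIFO) strategy over the ticket queue.

    Illustrative sketch: each callback stands in for a real Git/CI/Jira
    operation. Tickets are processed strictly in queue order; a ticket
    whose rebuild fails is reopened and dropped from the queue.
    """
    merged = []
    for ticket in list(queue):
        merge_main_into(ticket)       # sync the branch with the updated main
        if not build_ok(ticket):      # wait for the rebuild on the merge result
            reopen(ticket)            # conflict or red build: exclude the ticket
            queue.remove(ticket)
            continue
        merge_into_main(ticket)       # main changes for every following ticket
        queue.remove(ticket)
        merged.append(ticket)
    return merged
```

Each iteration blocks on a full build, which is exactly where the throughput problem described below comes from.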

The scheme works reliably and predictably. Thanks to the mandatory synchronisation with main and the subsequent rebuilds, conflicts between branches are detected immediately, before they get into main. Previously, we could only resolve a conflict after the release had been merged with many branches, most of which had nothing to do with it. In addition, the predictability of the algorithm allowed us to show the ticket queue in the web interface, so developers could roughly estimate when their branches would reach main.

But this method has one significant drawback: auto-merge throughput depends linearly on build time. With an average build time of 25 minutes for the iOS app, we can expect a maximum of 57 tickets per day. A build of the Android app, however, takes about 45 minutes, which limits auto-merge to 32 tickets per day. The latter is even fewer than the number of Android developers in our company.
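The arithmetic behind those limits is simple: with one full build per merge, the length of the day divided by the build time caps the daily merge count.

```python
def max_merges_per_day(build_minutes):
    """Upper bound on sequential merges per day: one full build per merge."""
    return (24 * 60) // build_minutes

# 25-minute iOS builds allow at most 57 merges a day,
# while 45-minute Android builds allow only 32.
```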

In practice, the waiting time for a ticket in «To Merge» status averaged 2 hours 40 minutes, with «spikes» of up to 10 hours! The need for optimisation became obvious: we had to increase the merge rate while maintaining the stability of the sequential strategy.

The final version: a combined sequential and greedy strategy

Damir Davletov, a developer from the iOS team, suggested returning to the greedy strategy while trying to retain the benefits of the sequential one.

Returning to the idea of the greedy strategy: we merged the branches of all ready tickets into main. The major problem was the lack of synchronisation between branches. Solve that problem, and we would have a fast and reliable auto-merge!

If you want to estimate the total contribution to main of all the tickets in «To Merge» status, why not merge all their branches into some intermediate branch, «Main Candidate» (MC), and run the build on that? If the build is successful, you can safely merge MC into main. Otherwise, you have to exclude some tickets from MC and restart the build.

How do you know which tickets to exclude? Suppose we have n tickets. In practice, the cause of a build failure is most often a single ticket, but we don’t know which one: all positions from 1 to n are equally likely. So we halve the queue to narrow down the failing ticket.

Since a ticket’s place in the queue is determined by the time it reaches «To Merge» status, it makes sense to take the half in which the tickets with the longest waiting time are located.

Following this algorithm, for k failing tickets, in the worst case we have to do O(k*log2(n)) builds before we can process all the failing tickets and get a successful build on the remaining ones.
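The halving step can be modelled like this. It is an illustrative sketch, with build_ok standing in for a real MC build and the function name being ours:

```python
def greedy_merge_candidates(queue, build_ok):
    """Shrink the candidate set until its combined build passes.

    Sketch of the GM fallback: on a failed build we keep the older half
    of the queue (the longest-waiting tickets); a single failing ticket
    is excluded (reopened, in real life). Returns the tickets that made
    it in and the number of builds spent.
    """
    candidates = list(queue)
    builds = 0
    while candidates:
        builds += 1
        if build_ok(candidates):
            return candidates, builds
        if len(candidates) == 1:
            return [], builds          # found the culprit: exclude it
        candidates = candidates[: len(candidates) // 2]
    return [], builds
```

For example, with ten tickets and a single bad one in position 6, the very first halving already yields a green build of tickets 1 to 5.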

The probability of a favourable outcome is high. Also, while the builds on the MC branch are failing, we can continue working with the sequential algorithm!

So, we have two autonomous auto-merge models: sequential (let’s call it Sequential Merge, or SM) and greedy (let’s call it Greedy Merge, or GM). To benefit from both, you have to let them run in parallel. But parallel processes, as we know, require synchronisation. You can achieve this either by means of interprocess communication or non-blocking synchronisation or by a combination of the two. Anyway, I am unaware of any other methods.

We implement the processes themselves as a queue of script commands. The commands can be either one-shot or periodic. Since auto-merge will never come to an end, and the queue controller is better suited to manage reruns, we will go for the second type.

All that remains is to prevent all possible cases of race conditions. There are lots of them, so we’ll just look at a couple of the most important ones:

  1. SM-SM and GM-GM: between the commands of the same type.
  2. SM-GM: between SM and GM within the same repository.

We can easily solve the first problem with a mutex based on a token, which includes the command name and the repository name. Example:

lock_${command}_${repository}.
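A minimal sketch of such a token mutex, assuming both workers run on a single host (a real multi-host setup might keep the token in a database or Redis instead; the class name is ours):

```python
import os

class CommandLock:
    """Hypothetical mutex keyed by command and repository name.

    Uses an atomically-created lock file, so two workers on the same
    host cannot run the same command on the same repository at once.
    """

    def __init__(self, command, repository, lock_dir="/tmp"):
        self.path = os.path.join(lock_dir, f"lock_{command}_{repository}")
        self.fd = None

    def acquire(self):
        try:
            # O_EXCL makes file creation atomic: only one process wins
            self.fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            return True
        except FileExistsError:
            return False

    def release(self):
        if self.fd is not None:
            os.close(self.fd)
            os.remove(self.path)
            self.fd = None
```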

But the second case is trickier. Let me explain. If the SM and GM act without coordination, the SM may merge the first ticket in the queue into main while the GM fails to notice it, i.e. the GM merges all the other tickets without taking the first one into account. For example, if the SM moves a ticket to «In Master» status while the GM always selects tickets by their «To Merge» status, the GM may never see the ticket merged by the SM. Meanwhile, that very first ticket may conflict with at least one of the others.

To avoid logical conflicts, the GM needs to handle all tickets in the queue without exceptions. For the same reason, the GM algorithm in conjunction with the SM must always maintain the same order of tickets in the queue as the SM, because this order determines which half of the queue will be chosen in the case of a failed build in the GM. Providing these conditions are met, the ticket handled by the SM will always be part of the GM build, which will mean that we achieve the correct synchronisation.

The result was that we managed a kind of non-blocking synchronisation.

A quick note on TeamCity

There are multiple nuances to the GM process but to explore them all would overload the article. However, there is one that deserves attention. In the course of development, I encountered a problem with the GM command looping: the process was constantly rebuilding the MC branch and creating a new build in TeamCity. The problem turned out to be that TeamCity had no time to download updates of the repository that contained the MC branch created by the GM process a few seconds earlier. Incidentally, the repository update interval in TeamCity was about 30 seconds.

As a hotfix, I introduced a floating build tag: I created a tag in TeamCity with a name similar to automerge_ios_repo_git, and moved it from build to build to see which build was up-to-date, what state it was in etc. But when I realised this approach was flawed, I set out to find out how to tell TeamCity about the new state of the MC branch and how to attach revisions to builds.

The solution might seem obvious to some but I confess that it took me a while to find it. It turns out that you can attach a revision to a build when adding it to the queue using the lastChanges parameter of the addBuildToQueue method:

<lastChanges>
<change
locator="version:{{revision}},buildType:(id:{{build_type}})"/>
</lastChanges>

In this example, {{revision}} is replaced with a 16-character commit sequence, and {{build_type}} is replaced with the build configuration identifier. But this is not enough, because without new commit information TeamCity may refuse our request.
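For illustration, here is how such a request body might be composed in code. This is a sketch under the assumption that the build is queued via TeamCity’s POST /app/rest/buildQueue endpoint; the helper name is ours:

```python
def add_build_payload(build_type, branch, revision):
    """Compose the XML body for queuing a build pinned to a revision.

    Sketch: the lastChanges element ties the build to an exact commit,
    so TeamCity cannot pick an obsolete state of the branch.
    """
    return (
        f'<build branchName="{branch}">'
        f'<buildType id="{build_type}"/>'
        "<lastChanges>"
        f'<change locator="version:{revision},buildType:(id:{build_type})"/>'
        "</lastChanges>"
        "</build>"
    )
```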

In order for the new commit to reach TeamCity, you either have to wait roughly as long as specified in the VCS root configuration settings, or you can ask TeamCity to check for changes in the repository («Pending Changes») using the requestPendingChangesCheck method, and then wait for TeamCity to download the changes containing the commit. This check is performed via the getChange method, to which at least the commit itself should be passed in the changeLocator as the version locator parameter. By the way, at the time of writing this article (and the code), there was no description of the version parameter on the ChangeLocator page in the official documentation. Maybe that’s why I didn’t immediately realise that it existed, nor that it takes a full 40-character hexadecimal commit hash.

Pseudocode:

teamCity.requestPendingChanges(buildType)

attempt = 1
while attempt <= 20:
    response = teamCity.getChange(commit, buildType)
    if response.commit == commit:
        return True  # Got it
    sleep(10)
    attempt += 1

return False

On the maximum merge rate

The greedy strategy has the disadvantage that it can take a long time to find a branch with a bug. For example, 6 builds for 20 tickets can take about three hours. Can this drawback be eliminated?

Suppose there are 10 tickets in the queue, of which only the 6th ticket causes the build to fail.

With the greedy strategy, we try to build all 10 tickets at once, which causes the build to fail. Then we successfully build the left half (tickets 1 to 5), because the failing ticket is in the right half.

If we had run the build immediately on the left half of the queue, we would have saved time. And if the problem was not the 6th ticket, but the 4th ticket, it would have been better to run the build on a quarter of the entire queue length, i.e. on tickets 1 to 3, for example.

Continuing this line of thought, we can conclude that the delays caused by failed builds can only be eliminated completely if we run builds on all ticket combinations in parallel:

Note that, in order to avoid conflicts here, the queue order must be respected, so combinations like «fifth and first» are not allowed. Then you can simply take the successful builds and merge their tickets into main. Failed builds would not cost any extra time.

It’s roughly the same algorithm implemented in the premium GitLab feature called «Merge Trains». A «Train» is a queue of «merge requests» to the main branch. For each such request, changes to the branch of the request itself are merged with changes to all the requests placed before it (i.e. the requests added to the train earlier). For example, for three merge requests A, B and C, GitLab creates the following builds:

  1. Changes from A are merged into the main branch.
  2. Changes from A and B are merged into the main branch.
  3. Changes from A, B and C are merged into the main branch.

If a build fails, the corresponding request is deleted from the queue and the builds of all previous requests are restarted (ignoring the deleted request).
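The set of builds a train creates is simply every prefix of its queue, which can be expressed in one line (an illustrative sketch; the function name is ours):

```python
def merge_train_builds(requests):
    """Branch sets GitLab would build for a merge train.

    Sketch: build i merges request i together with all requests queued
    before it, i.e. every prefix of the queue.
    """
    return [requests[: i + 1] for i in range(len(requests))]
```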

GitLab limits the number of builds running in parallel to twenty. All other builds wait in a queue outside the train. As soon as the build finishes, the next build in the waiting queue takes its place.

Running concurrent builds on all possible combinations of tickets in the queue allows for very high merge rates. By eliminating the waiting queue, you can even approach maximum speed.

But while there are no barriers to human thought, the limits of hardware resources can be seen quite clearly:

  1. Each build needs its own agent in TeamCity.
  2. In our case, a mobile application build has about 15–100 dependency builds, each of which needs to be run on an agent.
  3. The auto-merge mobile app builds in main are only a small fraction of the total number of builds in TeamCity.

Having weighed up the pros and cons, we decided to go with the SM + GM algorithm for now. At the current ticket queue growth rate, the algorithm shows good results. If we notice potential throughput issues in the future, we will probably go in the direction of «Merge Trains» and add a couple of parallel GM builds:

  1. The whole queue.
  2. The left half of the queue.
  3. Left quarter of the queue.

The results

What we have achieved with our combined auto-merge strategy:

  • reduced the average queue size by a factor of 2–3;
  • reduced the average waiting time by a factor of 4–5;
  • achieved merges of around 50 branches a day in each of the projects mentioned;
  • increased throughput while maintaining a high level of reliability. We have effectively removed the limit on the number of tickets per day.

Sample merge graphs over several days:

Number of tickets in the queue before and after implementation of the new algorithm:

The average number of tickets in the queue («AVG») decreased by a factor of 2.5 (3.95/1.55).

Ticket waiting time in minutes:

The average waiting time («AVG») decreased by a factor of 4.4 (155.5/35.07).


Unlimited throughput: how to increase auto-merge performance was originally published in Bumble Tech on Medium.
