Bring your monorepo down to size with sparse-checkout

This post was written by Derrick Stolee, a Git contributor since 2017 who focuses on performance. Some of his contributions include speeding up git log --graph and git push for large repositories. You can hear him speak at Git Merge in Los Angeles on March 4.


Git 2.25.0 includes a new experimental git sparse-checkout command that makes the existing feature easier to use, along with some important performance benefits for large repositories.

Does your repository have so many files at root that your source directory is growing out of control? Do commands like git checkout or git status slow to a crawl? These problems can be extremely frustrating, especially for developers who just need to modify a small fraction of the files available. Now, an improved and experimental sparse-checkout feature allows users to restrict their working directory to only the files they care about. Specifically, if you use the “microservices in a monorepo” pattern, you can ensure the developer workflow is as fast as possible while maintaining all the benefits of a monorepo.

Previously, in order to use the sparse-checkout feature you needed to manually edit a file in your .git directory, change a config setting, and run an obscure plumbing command. The instructions for getting back to normal were even more complicated. The new git sparse-checkout command makes this process much easier.

Using sparse-checkout with an existing repository

To restrict your working directory to a set of directories, run the following commands:

  1. git sparse-checkout init --cone
  2. git sparse-checkout set <dir1> <dir2> ...

If you get stuck, run git sparse-checkout disable to return to a full working directory.

The init subcommand sets the necessary Git config options and fills the sparse-checkout file with patterns that mean “only match files in the root directory”.

The set subcommand modifies the sparse-checkout file with patterns to match the files in the given directories. Further, any files that are immediately in a directory that’s a parent to a specified directory are also included. For example, if you ran git sparse-checkout set A/B, then Git would include files with names A/B/C.txt and A/D.txt as well as E.txt.
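To make the inclusion rule concrete, here is a minimal Python sketch of it. The function name in_cone is hypothetical and this is not how Git implements the check; it only restates the rule from the paragraph above: keep files at root, files anywhere under a requested directory, and files sitting immediately inside an ancestor of a requested directory.

```python
from pathlib import PurePosixPath
from typing import List


def in_cone(file_path: str, cone_dirs: List[str]) -> bool:
    """Return True if file_path would be kept for the requested directories.

    Illustrative restatement of the rule in the text, not Git's code.
    """
    parent = PurePosixPath(file_path).parent  # directory that holds the file
    if parent == PurePosixPath("."):
        return True  # files at root are always kept
    for d in cone_dirs:
        cone = PurePosixPath(d)
        # The file lives somewhere under a requested directory.
        if parent == cone or cone in parent.parents:
            return True
        # The file sits immediately inside an ancestor of a requested directory.
        if parent in cone.parents:
            return True
    return False


if __name__ == "__main__":
    for name in ["A/B/C.txt", "A/D.txt", "E.txt", "A/X/Y.txt"]:
        print(name, in_cone(name, ["A/B"]))
```

With A/B as the only requested directory, the first three paths are kept and A/X/Y.txt is not, because A/X is neither inside A/B nor an ancestor of it.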

Let’s explore an example to demonstrate how to change the included directories in your working directory.

Example: microservices in a monorepo

To follow along, I created a sample Git repository that you can clone and test yourself. This repository doesn’t actually build a real application, but imagine that it’s the monorepo for a photo storage and sharing application.

This repository includes the following file structure in the initial two levels:

client/
    android/
    electron/
    iOS/
service/
    common/
    identity/
    list/
    photos/
web/
    browser/
    editor/
    friends/
bootstrap.sh
LICENSE.md
README.md

Imagine that this repository is a monorepo containing many microservices and clients.

  • The client directory contains independent clients for three different platforms: Android, Desktop (using Electron), and iOS.
  • The service directory contains all of the server-side logic for several independently-deployable microservices.
  • The web directory contains the “serverless” static web pages that use JavaScript to communicate with those microservices.

While all of this code is in the monorepo, not every user of the repository needs every directory. Yet users are constantly updating all 1,557 files in the repo when they use git pull to bring their local copy up to date with the latest master. Note that this number is very small for a monorepo, so imagine adding an extra three zeroes to the end of the numbers.

$ ls
bootstrap.sh*  client/  LICENSE.md  README.md  service/  web/

$ find . -type f | wc -l
1590

To get started with the sparse-checkout feature, you can run git sparse-checkout init --cone to restrict the working directory to only the files at root (and in the .git directory).

$ git sparse-checkout init --cone

$ ls
bootstrap.sh*  LICENSE.md  README.md

$ find . -type f | wc -l
37

A developer can do very little with only these files. However, the team building the Android app can usually get away with only the files in client/android and run all integration testing with the currently-deployed services. The Android team needs a much smaller set of files as they work. This means they can use the git sparse-checkout set command to restrict to that directory:

$ git sparse-checkout set client/android

$ ls
bootstrap.sh*  client/  LICENSE.md  README.md

$ ls client/
android/

$ find . -type f | wc -l
62

For more complicated scenarios, the architecture team created a bootstrap.sh script at the root of the repository. This script exists even when the working directory only contains the files at root, and provides the commands for teams that need more detailed sparse-checkout cones.

For example, the team running the identity service needs the code for their microservice, as well as the directory containing common code for all microservices.

$ ./bootstrap.sh identity
Running 'git sparse-checkout init --cone'
Running 'git sparse-checkout set service/common service/identity'

$ ls
bootstrap.sh*  LICENSE.md  README.md  service/

$ ls service/
common/  identity/

$ find . -type f | wc -l
90

The team that builds the web photo browser occasionally needs to deploy their own version of the microservices to test their web features locally. Their bootstrap code requires a larger cone, including all microservice directories and the web/browser directory.

$ ./bootstrap.sh browser
Running 'git sparse-checkout init --cone'
Running 'git sparse-checkout set service web/browser'

$ ls
bootstrap.sh*  LICENSE.md  README.md  service/  web/

$ ls service/
common/  identity/  items/  list/

$ ls web
browser/

$ find . -type f | wc -l
641

At any point, we can check which directories are included in our working directory using the list subcommand.

$ git sparse-checkout list
service
web/browser

Cloning in sparse mode

The examples I’ve provided started with a repository that already had every tracked file populated in the working directory. This is how you’d start using the sparse-checkout feature with your existing repositories. However, if you’re cloning a new repository, you can avoid filling the working directory with the huge list of files by using the --no-checkout option as you clone.

$ git clone --no-checkout https://github.com/derrickstolee/sparse-checkout-example
Cloning into 'sparse-checkout-example'...
remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 1901 (delta 1), reused 11 (delta 1), pack-reused 1887
Receiving objects: 100% (1901/1901), 170.91 MiB | 74.79 MiB/s, done.
Resolving deltas: 100% (181/181), done.
 
$ cd sparse-checkout-example/
 
$ git sparse-checkout init --cone
 
$ ls
bootstrap.sh  LICENSE.md  README.md

To make this model work for your organization, some top-down design needs to be in place. First, the build system needs to allow having only a subset of directories on-disk. Second, your organization should create a helper tool so individuals can enlist in the directories they need according to their role. The bootstrap.sh script in our example is the simplest way to create that tool. Finally, by reducing coupling between components in your system, you can reduce the size of the cone each user needs to do their daily work.
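As a rough idea of what such a helper tool could look like, here is a small Python sketch. The role table and the function name enlist_commands are hypothetical: the identity and browser entries mirror the bootstrap.sh output shown earlier, and the android entry mirrors the manual set client/android example. The sketch only builds the command lines; it does not run Git.

```python
# Hypothetical role-to-directory table for the example repository.
ROLE_DIRS = {
    "android": ["client/android"],
    "identity": ["service/common", "service/identity"],
    "browser": ["service", "web/browser"],
}


def enlist_commands(role):
    """Return the two sparse-checkout command lines for a role (illustrative)."""
    dirs = ROLE_DIRS[role]  # raises KeyError for an unknown role
    return [
        ["git", "sparse-checkout", "init", "--cone"],
        ["git", "sparse-checkout", "set", *dirs],
    ]


if __name__ == "__main__":
    for cmd in enlist_commands("identity"):
        print("Running", " ".join(cmd))
```

To actually apply a role, each command list could be passed to subprocess.run(cmd, check=True) from inside the repository. Keeping the table in one place means a team's cone can change without every developer relearning the commands.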

The resulting large monorepo will seem a lot smaller for the developers working with it day-to-day. Git will need to inspect fewer files when running commands like git status, git add, or git checkout.

Now we’ll investigate what’s special about “cone mode” that allows the feature to scale better than before.

Cone mode: restricted patterns improve performance

The sparse-checkout feature was created by adapting the same pattern-matching logic from .gitignore files. These files list a set of patterns that are evaluated in order. Lines may start with !, which means to not include paths that match those patterns.

For example, the following is the sparse-checkout file after running git sparse-checkout init without a previous sparse-checkout file:

/*
!/*/

The first pattern means “match everything” and the second pattern means “do not match anything inside a folder”. The second pattern is evaluated after the first, so even though the first pattern matches every path in the repository, we still exclude every file that sits inside a folder.

These patterns can be much more complex. For instance, you could restrict your repository to only contain documentation files and C header files using the pattern set:

!/*
**/docs/
**/*.h

In this way, the pattern matching becomes quite expensive as the sparse-checkout file grows and the number of files in your repository grows, too. In fact, it grows quadratically. If there are N patterns and M files in your repository, then to check out your repository, Git has to try each of the N patterns against each of the M files to look for a match. That takes O(N*M) time to determine which paths match your definition.

Some large repositories may need thousands of patterns to describe the working set. Checking each of these patterns with millions of paths leads to billions of pattern checks in every git checkout command. For example, a large repository we used for testing had over 1,000 folders to match in a root tree of three million files, leaving 700,000 files populated in the working directory. The old pattern matching algorithm took five minutes to complete. It might be faster to just populate the entire working directory than to restrict it with these patterns.
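The “billions” figure follows directly from the numbers in that test repository. A quick back-of-the-envelope calculation, assuming (as a lower bound) one pattern per matched folder:

```python
# Numbers taken from the test repository described above.
patterns = 1_000           # "over 1,000 folders to match": at least this many patterns
index_entries = 3_000_000  # files in the root tree

# Worst case for the old algorithm: every pattern is tried against every path.
worst_case_checks = patterns * index_entries
print(f"{worst_case_checks:,} pattern checks")
```

That is three billion pattern match queries for a single checkout, before counting any extra patterns the sparse-checkout file may contain.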

To visualize this quadratic growth, consider our simplest example we just mentioned, where the Android team would run git sparse-checkout set client/android. The index entries to depth two are listed on the left; there are over 1,500 total index entries in the repository. Across the top are the five patterns created by the set subcommand. For each index entry and each pattern, there’s a black box that represents a pattern match query.

As discussed in our previous example, most users will only need to restrict based on folder prefix matches. In those cases, the set of patterns allows us to use a different mechanism for matching that drops the quadratic growth.

In the example, the web developer ran git sparse-checkout set service web/browser, which creates the following patterns in the sparse-checkout file:

/*
!/*/
/service/
/web/
!/web/*/
/web/browser/

By checking the patterns in order, the files match what we want, including every file…

  • at root
  • under service/
  • in web/browser/
  • immediately below web/

Files in client/ are excluded by the second line, and files in web/editor/ are excluded by the fifth.
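The pattern listing above follows a regular recipe, which the following Python sketch reproduces. The function name cone_patterns is hypothetical, and the ordering (sorted by path) is chosen only to match the listing in this post; Git's own writer may order the lines differently.

```python
def cone_patterns(directories):
    """Build cone-mode style patterns for the requested directories.

    Illustrative only: reproduces the listing in this post, not Git's code.
    """
    recursive = {d.strip("/") for d in directories}
    parents = set()
    for d in recursive:
        parts = d.split("/")
        # Every proper ancestor of a requested directory is a "parent".
        for i in range(1, len(parts)):
            parents.add("/".join(parts[:i]))
    parents -= recursive  # a requested directory stays fully recursive

    lines = ["/*", "!/*/"]  # files at root, but nothing inside folders
    for path in sorted(recursive | parents):
        lines.append(f"/{path}/")
        if path in parents:
            # Keep files directly in the parent, drop its subfolders again.
            lines.append(f"!/{path}/*/")
    return lines


if __name__ == "__main__":
    print("\n".join(cone_patterns(["service", "web/browser"])))
```

For service and web/browser this prints the six lines shown above. For client/android it produces five patterns, which matches the count mentioned for the Android team's cone.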

These patterns could be evaluated by the usual method: iterate through the patterns and evaluate the path against each pattern. Instead, we do something much simpler and much faster.
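For contrast, the usual method can be sketched in Python as follows. This is a deliberately simplified matcher, not Git's: it understands only the four pattern shapes that appear in the listing above, and it treats a file as matched when a pattern matches the file or any of its ancestor directories. The file names are made up for illustration. The last matching pattern decides, and every file is tried against every pattern.

```python
def pattern_matches(pattern, path, is_dir):
    """Match one cone-shaped pattern (leading '!' already removed)."""
    if pattern == "/*":               # any entry directly at root
        return "/" not in path
    if pattern == "/*/":              # any directory directly at root
        return is_dir and "/" not in path
    body = pattern.strip("/")
    if body.endswith("/*"):           # /<path>/*/ : directories directly under <path>
        return is_dir and path.rpartition("/")[0] == body[:-2]
    return is_dir and path == body    # /<path>/ : exactly that directory


def kept_files(files, patterns):
    """Return (kept files, number of pattern match queries)."""
    kept, queries = [], 0
    for file_path in files:
        parts = file_path.split("/")
        # Candidates: each ancestor directory, then the file itself.
        candidates = [("/".join(parts[:i]), True) for i in range(1, len(parts))]
        candidates.append((file_path, False))
        included = False
        for raw in patterns:          # in file order; the last match wins
            queries += 1
            negated = raw.startswith("!")
            pattern = raw[1:] if negated else raw
            if any(pattern_matches(pattern, p, d) for p, d in candidates):
                included = not negated
        if included:
            kept.append(file_path)
    return kept, queries


PATTERNS = ["/*", "!/*/", "/service/", "/web/", "!/web/*/", "/web/browser/"]
FILES = [  # hypothetical file names for illustration
    "README.md",
    "client/android/App.java",
    "service/common/util.py",
    "web/index.html",
    "web/browser/app.js",
    "web/editor/edit.js",
]

if __name__ == "__main__":
    kept, queries = kept_files(FILES, PATTERNS)
    print(kept)
    print(queries, "pattern match queries")
```

Six files against six patterns already cost 36 queries; the count is always the number of files times the number of patterns, which is the growth the rest of this section avoids.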

While reading the sparse-checkout file, each pattern is either a recursive or parent pattern:

  • Recursive: /<path>/
  • Parent: !/<path>/*/

As Git parses the patterns, it creates two hashsets for the recursive and parent paths. Given a recursive pattern /<path>/, it adds <path> to the recursive set. Given a parent pattern !/<path>/*/, it removes <path> from the recursive set and adds it to the parent set. You can read the code for more details.

Given a file path such as /A/B/C/D.txt, Git includes the path in the working directory if any of the following are true:

  • A/B/C is in the parent set,
  • A/B/C is in the recursive set,
  • A/B is in the recursive set, or
  • A is in the recursive set.
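The two steps above (building the sets, then testing a path) can be sketched in Python. This is an illustration of the description in this post rather than a port of Git's C code; the function names are hypothetical, and files at root are special-cased as always included, standing in for the leading /* and !/*/ lines.

```python
def parse_cone(patterns):
    """Split cone-mode patterns into (recursive, parent) sets of paths."""
    recursive, parents = set(), set()
    for line in patterns:
        if line in ("/*", "!/*/"):
            continue  # root handling is special-cased in is_included
        if line.startswith("!/") and line.endswith("/*/"):
            path = line[2:-3]          # "!/web/*/" -> "web"
            recursive.discard(path)    # demote from recursive ...
            parents.add(path)          # ... to parent
        elif line.startswith("/") and line.endswith("/"):
            recursive.add(line[1:-1])  # "/service/" -> "service"
    return recursive, parents


def is_included(file_path, recursive, parents):
    """Apply the containment rules listed above to one file path."""
    directory = file_path.rpartition("/")[0]
    if directory == "":
        return True                    # files at root
    if directory in parents or directory in recursive:
        return True
    while "/" in directory:            # walk up: A/B/C -> A/B -> A
        directory = directory.rpartition("/")[0]
        if directory in recursive:
            return True
    return False


if __name__ == "__main__":
    rec, par = parse_cone(
        ["/*", "!/*/", "/service/", "/web/", "!/web/*/", "/web/browser/"]
    )
    print(sorted(rec), sorted(par))
    print(is_included("web/editor/edit.js", rec, par))
```

Each path costs at most one set lookup per directory level, no matter how many patterns the sparse-checkout file contains.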

In this way, Git matches the N patterns across M files in O(N + M*d) time, where d is the maximum folder depth of a file. Of course, this assumes that Git inspects the files in an arbitrary order. When Git evaluates which files match the sparse-checkout patterns, it actually inspects the files in sorted order. This means that when the start of a folder matches a recursive pattern exactly, Git marks everything in that folder as “included” without doing any hashset lookups. Similarly, Git detects the start of a folder that’s outside of our cone and marks everything in that folder as “excluded”. This reduces the typical time to be closer to O(N + M).
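The sorted-order shortcut can be sketched as follows. This is a hypothetical Python illustration of the idea, not Git's implementation: once a folder is known to be recursive or outside the cone, the decision is reused for every later path that starts with that folder, without touching the sets again. It assumes the sets are cone-shaped, meaning every ancestor of a recursive or parent directory is itself in the parent set. The file names are made up.

```python
def classify_sorted(paths, recursive, parents):
    """Return (kept paths, number of directory checks) for sorted traversal."""
    kept, dir_checks = [], 0
    decided_prefix, decided_keep = None, False
    for path in sorted(paths):
        # Reuse the decision made for an enclosing folder, if any.
        if decided_prefix is not None and path.startswith(decided_prefix):
            if decided_keep:
                kept.append(path)
            continue
        decided_prefix = None
        parts = path.split("/")[:-1]   # directory components only
        keep = True                    # file sits directly in root or a parent dir
        for depth in range(1, len(parts) + 1):
            prefix = "/".join(parts[:depth])
            dir_checks += 1
            if prefix in recursive:    # whole folder is inside the cone
                decided_prefix, decided_keep = prefix + "/", True
                break
            if prefix not in parents:  # whole folder is outside the cone
                decided_prefix, decided_keep = prefix + "/", False
                keep = False
                break
        if keep:
            kept.append(path)
    return kept, dir_checks


PATHS = [  # hypothetical file names for illustration
    "README.md",
    "client/android/a.kt",
    "client/iOS/b.swift",
    "service/common/c.py",
    "service/list/d.py",
    "web/browser/e.js",
    "web/editor/f.js",
    "web/index.html",
]

if __name__ == "__main__":
    kept, checks = classify_sorted(PATHS, {"service", "web/browser"}, {"web"})
    print(kept)
    print(checks, "directory checks")
```

In this tiny example the saving is small, but in a real repository a single check on a folder such as client/ settles the fate of every file below it.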

Continuing the process with the Android example, the files at root get included immediately. For each directory, Git checks whether it is a recursive or parent directory. The only recursive match is “client/android”, so that auto-fills all contained paths as “matched”. The “client” directory appears in the parent pattern set, so we test all entries in that directory, but the only other match is the README file. The parent and recursive pattern sets don’t include the “service” and “web” directories, so we automatically skip the entries in those directories.

In the figure, each green check mark is a successful hashset containment query and each red X is an unsuccessful hashset containment query. There are many fewer hashset containment queries than there were pattern match queries. This logic to auto-match all paths within a folder replaced a TODO comment from eight years ago. It’s unlikely that this kind of logic could be applied consistently without restricting the set of patterns the way that cone mode does.

Finally, in our real example that took five minutes before, our pattern matching now takes less than one second.

Sparse-checkout and partial clones

Pairing sparse-checkout with the partial clone feature accelerates these workflows even more. This combination speeds up the data transfer process since you don’t need every reachable Git object, and instead, can download only those you need to populate your cone of the working directory. You can test it right now with the example repository by adding --filter=blob:none to the clone command:

$ git clone --filter=blob:none --no-checkout https://github.com/derrickstolee/sparse-checkout-example
Cloning into 'sparse-checkout-example'...
Receiving objects: 100% (373/373), 75.98 KiB | 2.71 MiB/s, done.
Resolving deltas: 100% (23/23), done.
 
$ cd sparse-checkout-example/
 
$ git sparse-checkout init --cone
Receiving objects: 100% (3/3), 1.41 KiB | 1.41 MiB/s, done.
 
$ git sparse-checkout set client/android
Receiving objects: 100% (26/26), 985.91 KiB | 5.76 MiB/s, done.

Let’s take a look at the number of objects it took to adopt each persona in these examples. To clone the repository, we needed only 373 objects instead of the 1,901 downloaded by the full clone earlier.

Similarly, look at the total object count in the “Receiving objects” line for each git sparse-checkout command. When initializing the sparse-checkout feature, three objects are downloaded for the files at root. Finally, 26 more objects are downloaded when populating the client/android directory.
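Adding up the “Receiving objects” lines gives a rough sense of the savings for the Android persona. The numbers below are copied from the transcripts in this post; the comparison is simple arithmetic, not a benchmark.

```python
# Object counts copied from the "Receiving objects" lines in this post.
full_clone_objects = 1901
partial_clone_steps = {
    "clone --filter=blob:none": 373,
    "sparse-checkout init --cone": 3,
    "sparse-checkout set client/android": 26,
}

total = sum(partial_clone_steps.values())
share = total / full_clone_objects
print(f"{total} objects, about {share:.0%} of the full clone")
```

So the Android developer ends up with 402 objects in total, roughly a fifth of what the full clone downloaded.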

That’s the wonder of partial clone combined with sparse-checkout: you pay for what you need. Keep an eye out for the partial clone feature to become generally available[1].

Future plans

The sparse-checkout command is evolving. Help us improve the feature by reporting your findings to the Git mailing list.

We plan to extend the feature to:

  1. Improve the experience cloning into sparse mode.
  2. Include add and remove subcommands to incrementally modify the cone.
  3. Add a stats subcommand to measure the size of your cone against the total size of a non-sparse working directory.
  4. Allow git sparse-checkout set to run with a non-clean git status, as long as the update won’t remove any of your current changes.

[1]: GitHub is still evaluating this feature internally while it’s enabled on a select few repositories (including the example used in this post). As the feature stabilizes and matures, we’ll keep you updated with its progress.

The post Bring your monorepo down to size with sparse-checkout appeared first on The GitHub Blog.
