Bazel’s remote caching and remote execution explained

Also posted on BuildBuddy Blog

Lately I’ve been optimizing Bazel remote cache and remote execution[1] usage for a large iOS codebase. I’m collecting many of the insights I gained in another blog post I’m writing, but I realized that a post discussing the fundamentals of remote caching and remote execution would also be of value to some of you.

So that’s what this post is: a nuts and bolts (or rather actions and spawns 😄) overview of Bazel’s remote caching and remote execution capabilities.

Actions

Each rule target in the build graph of a requested build produces zero or more actions during the analysis phase. These actions form an action graph, representing all the actions that need to be performed during the execution phase.

An example action graph. Actions are grouped by the targets that created them. Arrows connect the actions, showing dependencies between them.

During the execution phase the action graph is traversed. For each action, Bazel determines if it has to be executed, either because the action result doesn’t exist in the output base’s action cache, or because the output in the output base doesn’t match the output listed in the action result.
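
The check described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Bazel’s implementation: the OutputBase class, the needs_execution function, and the use of SHA-256 are all invented here for illustration.

```python
import hashlib
from dataclasses import dataclass, field


def digest(data: bytes) -> str:
    """Content digest; SHA-256 is used here purely for illustration."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class OutputBase:
    """Hypothetical stand-in for the output base."""
    action_cache: dict = field(default_factory=dict)  # action key -> {path: digest}
    files: dict = field(default_factory=dict)         # path -> file content (bytes)


def needs_execution(action_key: str, output_base: OutputBase) -> bool:
    """Return True if the action has to be executed."""
    result = output_base.action_cache.get(action_key)
    if result is None:
        return True  # No action result is recorded for this action.
    for path, expected in result.items():
        content = output_base.files.get(path)
        if content is None or digest(content) != expected:
            return True  # Output is missing or differs from the recorded one.
    return False
```

With an empty OutputBase every action needs to execute; once an action result and matching outputs are recorded, needs_execution returns False until one of those outputs changes.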

Spawns

If Bazel has to execute an action, it creates a spawn, which encodes all the information needed to be able to execute the action, including the spawn’s “strategy”[2], which, among other things, determines if/how the spawn should utilize remote caching or remote execution.

A spawn’s strategy will dictate if Bazel has to do additional work before, after, or instead of executing an action locally. For example, if a spawn with the remote-cache strategy is executed, Bazel may check if the action result exists in the external cache’s action cache, and if it does, it might download the listed outputs instead of executing the action. I go over this in greater detail later.

After an action’s outputs are available in the output base, either because they were downloaded or created locally, dependent actions in the action graph can spawn.

Parallelism

The number of actions that can be executed locally at once is limited by the --local_cpu_resources flag. The number of spawns that can be executed at once is limited by the --jobs flag. Since a spawn’s execution can include more work, or different work, than an action’s local execution, it can be beneficial to have a --jobs value that is larger than --local_cpu_resources. When that is the case, and a spawn tries to execute an action locally, it might block waiting for CPU resources to free up. This is where remote execution can be beneficial.
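
To make the interaction between the two limits concrete, here is a small asyncio simulation. It is only a sketch: run_build and its parameters are hypothetical, the two semaphores merely model --jobs and --local_cpu_resources, and the sleeps stand in for real work.

```python
import asyncio


async def run_build(num_spawns: int, jobs: int, local_cpus: int) -> dict:
    """Run toy spawns and report the peak concurrency that was observed."""
    jobs_slots = asyncio.Semaphore(jobs)       # models --jobs
    cpu_slots = asyncio.Semaphore(local_cpus)  # models --local_cpu_resources
    active = {"spawns": 0, "local": 0}
    peak = {"spawns": 0, "local": 0}

    def enter(kind: str) -> None:
        active[kind] += 1
        peak[kind] = max(peak[kind], active[kind])

    async def spawn() -> None:
        async with jobs_slots:
            enter("spawns")
            await asyncio.sleep(0.001)      # e.g. a cache lookup; needs no local CPU
            async with cpu_slots:           # blocks while all local CPUs are busy
                enter("local")
                await asyncio.sleep(0.001)  # the local execution itself
                active["local"] -= 1
            active["spawns"] -= 1

    await asyncio.gather(*(spawn() for _ in range(num_spawns)))
    return peak
```

Calling asyncio.run(run_build(20, jobs=8, local_cpus=2)) lets up to eight spawns be in flight at once, while the number executing locally never exceeds two; the remaining spawns sit blocked on the CPU semaphore.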

Remote caching

To prevent confusion over Bazel’s concept of a “remote cache”, which can mean either a disk cache which is local to the machine, or a remote cache which uses networking protocols and is probably not local to the machine, I’m going to instead refer to both of these cache types as an “external cache”, as it’s external to the output base.

When using an external cache, Bazel will augment its output base with the action cache (AC) and content-addressable storage (CAS) of the external cache. This means if an action result for an action that Bazel wants to execute isn’t in the output base’s action cache, Bazel can check if the AC has it. The same is true for the action’s outputs; if the output base doesn’t have the expected output, then Bazel can check if the CAS has it.
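
As a rough illustration of that fallback, the sketch below models an external cache as two dictionaries. The ExternalCache class and resolve_outputs function are hypothetical names, and SHA-256 is an assumed digest function; real caches speak a network or filesystem protocol rather than living in memory.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content digest; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(data).hexdigest()


class ExternalCache:
    """Toy external cache: an action cache (AC) plus a CAS."""

    def __init__(self) -> None:
        self.ac = {}   # action key -> {output path: content digest}
        self.cas = {}  # content digest -> bytes

    def store(self, action_key: str, outputs: dict) -> None:
        result = {}
        for path, content in outputs.items():
            self.cas[digest(content)] = content
            result[path] = digest(content)
        self.ac[action_key] = result


def resolve_outputs(action_key, local_ac, local_files, external):
    """Return {path: bytes} if the outputs can be found, else None."""
    result = local_ac.get(action_key)
    if result is None:
        result = external.ac.get(action_key)  # fall back to the external AC
    if result is None:
        return None  # no action result anywhere: the action must run
    outputs = {}
    for path, expected in result.items():
        content = local_files.get(path)
        if content is None or digest(content) != expected:
            content = external.cas.get(expected)  # fall back to the CAS
        if content is None:
            return None  # the blob is missing everywhere: treat as a miss
        outputs[path] = content
    return outputs
```

A None return means the action has to be executed; any other return value means every output was found either in the local files or in the CAS.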

remote-cache

Bazel achieves this behavior with the remote-cache spawn strategy, which is used alongside a local strategy (e.g. sandbox, worker, local, etc.).

A remote-cache spawn has the following steps:

Flags

Setting the --disk_cache flag causes Bazel to use that directory on the filesystem as an external cache. Setting the --remote_cache flag causes Bazel to connect via HTTP(S), gRPC(S), or UNIX sockets to an external cache. Setting both flags causes Bazel to use both the disk cache and the remote cache at the same time, forming a “combined cache”.

A combined cache reads from and writes to both the disk and remote caches, and is treated like a remote cache overall, unless the --incompatible_remote_results_ignore_disk flag is used. If that flag is used, the disk cache instead continues to behave like a local cache, allowing it to return results even if --noremote_accept_cached is set, store results even if --noremote_upload_local_results is set, and return/store results for no-remote-cache/no-remote actions.[6]
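
A combined cache can be pictured roughly like the Python sketch below. The DictCache and CombinedCache classes are hypothetical, and two details are assumptions of this sketch rather than statements from Bazel’s documentation: that the disk cache is consulted first, and that a remote hit is copied into the disk cache. The --incompatible_remote_results_ignore_disk behavior is not modeled.

```python
class DictCache:
    """Toy cache backed by a dict; stands in for a disk or remote cache."""

    def __init__(self) -> None:
        self.blobs = {}

    def get(self, key):
        return self.blobs.get(key)

    def put(self, key, value) -> None:
        self.blobs[key] = value


class CombinedCache:
    """Reads from and writes to both a disk cache and a remote cache."""

    def __init__(self, disk: DictCache, remote: DictCache) -> None:
        self.disk = disk
        self.remote = remote

    def get(self, key):
        value = self.disk.get(key)
        if value is not None:
            return value  # served locally, no network round trip
        value = self.remote.get(key)
        if value is not None:
            self.disk.put(key, value)  # assumed: remember remote hits on disk
        return value

    def put(self, key, value) -> None:
        self.disk.put(key, value)
        self.remote.put(key, value)
```

The point of the design is that repeated lookups for the same key only pay the network cost once per machine, while writes still reach the shared remote cache.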

Setting the --experimental_guard_against_concurrent_changes flag helps protect the external cache from being poisoned by changes to input files that happen during a build. I highly recommend setting this flag if developers have an external cache enabled, even if it’s only the disk cache.

Most remote cache implementations will separate the AC, and some will separate the CAS, based on the value of the --remote_instance_name flag. This can be used for numerous reasons, such as project separation, or as a bandage for non-hermetic toolchains.

The --remote_cache_header flag causes Bazel to send extra headers in requests to the external cache. Multiple headers can be passed by specifying the flag multiple times. Multiple values for the same name will be converted to a comma-separated list. The --remote_header flag can be used instead of setting both --remote_cache_header and --remote_exec_header to the same value.
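
The merging of repeated header names can be sketched as follows. The merge_headers function and the header names in the usage note are hypothetical, and joining with a bare comma (no space) is an assumption about the exact formatting.

```python
def merge_headers(flag_values):
    """Merge NAME=VALUE strings; repeated names become a comma-separated list."""
    merged = {}
    for item in flag_values:
        name, sep, value = item.partition("=")
        if not sep or not name:
            raise ValueError(f"expected NAME=VALUE, got {item!r}")
        merged.setdefault(name, []).append(value)  # keep values in flag order
    return {name: ",".join(values) for name, values in merged.items()}
```

For example, passing the values from --remote_cache_header=x-team=ios and --remote_cache_header=x-team=mobile would produce a single x-team header carrying both values.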

Remote execution

Bazel is able to execute actions on a remote executor, instead of executing them locally, using a concept called “remote execution”. Since these actions don’t use local resources, the number of actions that can be executed remotely in parallel is limited only by --jobs and the available remote resources, not --local_cpu_resources. If your builds are sufficiently parallel, this can result in them completing faster. The parallelism section goes into more detail.

remote

Bazel achieves the behavior of executing actions remotely with the remote spawn strategy, which includes most of the behavior of the remote-cache strategy, and is used instead of a local strategy (e.g. sandbox, worker, local, etc.).

A remote spawn has the following steps:

Disk cache

Starting in Bazel 5.0, the disk cache (or more specifically, the combined cache) can be used with remote execution. Prior to Bazel 5.0, if you also wanted to cache things locally, you would have to set up a remote cache proxy sidecar.

Dynamic execution

Bazel supports a mode of remote execution called “dynamic execution”, in which local and remote execution of the same action are started in parallel, using the output from the first branch that finishes, and cancelling the other branch.
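
The racing behavior can be illustrated with a short asyncio sketch. This is not how Bazel implements dynamic execution; run_dynamic and fake_execution are hypothetical helpers, and the sleeps stand in for the local and remote branches.

```python
import asyncio


async def run_dynamic(local_branch, remote_branch):
    """Race two coroutines; return (winner_name, result) and cancel the loser."""
    tasks = {
        asyncio.create_task(local_branch): "local",
        asyncio.create_task(remote_branch): "remote",
    }
    done, pending = await asyncio.wait(
        list(tasks), return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # the slower branch is no longer needed
    # Wait for the cancellation to finish so no task is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)
    winner = next(iter(done))
    return tasks[winner], winner.result()


async def fake_execution(seconds: float, outputs: str) -> str:
    """Pretend to execute an action that takes the given number of seconds."""
    await asyncio.sleep(seconds)
    return outputs
```

Running asyncio.run(run_dynamic(fake_execution(0.2, "local-out"), fake_execution(0.01, "remote-out"))) picks the remote branch, because it finishes first, and the local branch is cancelled.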

I wanted to mention it for completeness, because when tuned properly it can result in faster builds than using either local execution or remote execution alone. However, if tuned improperly it can result in slower builds. It also might not play well with Remote Build without the Bytes, as the local execution branches might need to download the outputs of previous remotely completed actions.

Flags

Setting the --remote_executor flag causes Bazel to connect via gRPC(S) or UNIX sockets to a remote executor. If --remote_cache isn’t set, it defaults to the value set for --remote_executor. Most remote execution setups will have the remote cache and remote executor at the same endpoint.

In addition to how it affects the remote cache, the --remote_instance_name flag might determine which remote execution cluster a build runs on. Some actions might need to target a specific subset of executors, possibly because they need certain hardware or software, and they can do that with platform properties.

Platform properties can be set globally with the --remote_default_exec_properties flag, but these global values only apply if properties aren’t set at the platform or target level. The action result that is stored in an action cache includes the platform properties. This is important to note, as it can affect action cache hit rates. If you conditionally use remote execution, and you set platform properties, you might want to set them unconditionally, in order to be able to reuse the cached action results. Some remote execution implementations allow setting global platform properties with --remote_exec_header flags, as a way to prevent these cache misses.
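
To see why platform properties affect hit rates, consider this simplified model. It assumes that the key used to look up an action result is a hash over the command and the platform properties; the action_key function, the JSON encoding, and the property names in the usage note are all invented for illustration and don’t reflect the actual wire format.

```python
import hashlib
import json


def action_key(command, platform_properties):
    """Hash a command together with its platform properties (toy model)."""
    payload = json.dumps(
        {
            "command": list(command),
            # Sort the properties so that dict ordering doesn't change the key.
            "platform": sorted(platform_properties.items()),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

In this model, action_key(["clang", "-c", "a.c"], {}) and action_key(["clang", "-c", "a.c"], {"os": "macos"}) differ, so a result cached by a build that set the property can’t be reused by a build that didn’t.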

The --remote_timeout flag controls how long Bazel will wait for a remote cache operation to complete. While the timeout doesn’t apply to the Execution.Execute call[8], using remote execution might involve uploading or downloading artifacts that a local build doesn’t, and the default value for this flag (60 seconds) might not be long enough.

The --remote_retries flag controls how many times Bazel will retry a remote operation on a transient error, such as a timeout. The flag defaults to 5, and depending on how you plan to use remote execution, you might want to increase it to a much larger value. Bazel uses an exponential backoff for retries, but currently caps the delay at 5 seconds between calls.
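
A capped exponential backoff can be sketched like this. Only the 5 second cap comes from the text above; the retry_delays function, the 0.1 second initial delay, and the doubling multiplier are assumptions of this sketch, and any jitter is ignored.

```python
def retry_delays(retries, initial=0.1, multiplier=2.0, cap=5.0):
    """Return the delay (in seconds) before each retry, capped at `cap`."""
    delays = []
    delay = initial
    for _ in range(retries):
        delays.append(min(delay, cap))  # never wait longer than the cap
        delay *= multiplier
    return delays
```

Under these assumptions the delays grow quickly at first and then flatten out at the cap, so each additional retry beyond that point adds at most 5 seconds of waiting.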

The --remote_exec_header flag causes Bazel to send extra headers in requests to the remote executor. Multiple headers can be passed by specifying the flag multiple times. Multiple values for the same name will be converted to a comma-separated list. The --remote_header flag can be used instead of setting both --remote_cache_header and --remote_exec_header to the same value.

Remote Build without the Bytes

For both remote caching and remote execution, Bazel supports a feature called “Remote Build without the Bytes” (BwtB). If enabled, Bazel will only download the direct outputs of the targets specified (--remote_download_toplevel), or the minimum needed to complete the build (--remote_download_minimal). This can result in greatly reduced network traffic, which can also result in faster builds.

The feature isn’t without its flaws though. Currently, BwtB requires remote caches to never evict outputs, can result in slower builds due to clumping of downloads, doesn’t allow specifying that other outputs should be downloaded, etc. Though, similar to dynamic execution, if used properly BwtB can result in faster builds. Just don’t apply it blindly.

Bonus topic: Build Event Service

Bazel can stream build results, specifically the build event protocol (BEP), to a build event service (BES). Depending on the capabilities of the service, this can have numerous benefits.

Here is a list of some benefits that I know various BES products[9] offer:

Flags

Setting the --bes_backend flag causes Bazel to connect via gRPC(S) to a BES backend and stream build results to it. Setting the --bes_results_url flag causes Bazel to output to the terminal a URL to the BES UI for the build underway.

When using BES, Bazel will upload all files referenced in the BEP, unless --experimental_build_event_upload_strategy=local[10] is set. Alternatively, if you set --incompatible_remote_build_event_upload_respect_no_cache[11], and have actions that are tagged with no-cache/no-remote-cache-upload/no-remote-cache/no-remote, then the outputs of those actions will be excluded from upload.

The --bes_timeout flag controls how long Bazel will wait to finish uploading to BES after the build and tests have finished. By default there is no timeout, which might not be what you want. If you leave the default, you should consider changing the --bes_upload_mode flag, which controls if Bazel should block the build for BES uploads (the default), or if it should finish the uploads in the background.

The --bes_header[12] flag causes Bazel to send extra headers in requests to the BES backend. It behaves the same way as --remote_header.

Optimizing

In this post I covered the fundamentals of using remote caching, remote execution, and build event services with Bazel. To ensure that you actually achieve faster builds, or the fastest builds you can, you’ll have to do more than just the basics that I was able to go over here.

In future posts I’ll go over how to optimize your remote cache and remote execution usage. Subscribe to the RSS feed to be notified when they’re published.


  1. I talk a bit about remote execution on this episode of the Lyft Mobile Podcast. ↩︎

  2. Julio wrote a great summary on spawn strategies that I highly recommend reading. ↩︎

  3. Except possibly if --noremote_accept_cached is set. See the flags section. ↩︎

  4. Except if the action is tagged with no-remote-cache-upload, or possibly if --noremote_upload_local_results is set. See the flags section. ↩︎

  5. Available in Bazel 5.0. ↩︎

  6. Available in Bazel 5.0. ↩︎

  7. Except if --noremote_accept_cached is set. ↩︎

  8. If the --experimental_remote_execution_keepalive flag is set, the Execution.Execute and Execution.WaitExecute calls take into account the values of --remote_timeout and --remote_retries, but in a more complicated way. Even with that flag, the goal is for execution time to be unbounded, as it can vary greatly depending on the action being executed. ↩︎

  9. I’m a fan of BuildBuddy, which provides all of the BES benefits I listed. They also provide great remote cache and remote execution services.

    Disclosure: As of December 13th, 2021, I’m now an employee of BuildBuddy! ↩︎

  10. A warning though: setting --experimental_build_event_upload_strategy=local will prevent the uploading of some nice things, such as the timing profile, or test logs. ↩︎

  11. Available in Bazel 5.0. ↩︎

  12. Available in Bazel 5.0. ↩︎