Hero image credit: Markus Teich
At Google we run large production fleets that serve Google products like YouTube and Gmail. To support all our employees, including engineers, we also run a sizable corporate fleet with hundreds of thousands of devices across multiple platforms, models, and locations. To let each Googler work in the environment they are most productive in, we operate many OS platforms, including Linux. For a long time, our internal-facing Linux distribution, Goobuntu, was based on Ubuntu LTS releases. In 2018 we completed a move to a rolling release model based on Debian.
More than 15 years ago, Ubuntu was chosen as the base for the internal Linux distribution, as it was user-friendly, easy to use, and had lots of fancy extras. The Long Term Support (LTS) releases were picked because we valued that Canonical provided 2+ years of security updates.
However, this two-year release cycle for LTS releases also meant that we had to upgrade every machine in our fleet of over 100,000 devices before the end-of-life date of the OS. The complex nature of workloads run on corporate machines meant that reinstalling and fully customizing machines could be a difficult and time-consuming operation. Having all engineers configure their workspace from scratch every two years would have caused a productivity hit that was not financially responsible.
For each OS cycle, we had a rather large version jump in major packages that could require significant changes to software configuration. To automate this process, we wrote an unattended in-place upgrade tool that took care of a lot of the common-case problems. This automation-focused approach meant that most Google employees didn’t have to manually upgrade their machines by re-installing them and recreating all their configuration. To make this possible, however, we needed to do comprehensive testing of the upgrade process and check that all major packages that had changed kept working (in Ubuntu this could be up to several thousand packages to upgrade between major versions). Sometimes it was hard to provide automation in cases where deprecations happened and engineers had to make decisions on how to move forward.
This effort to upgrade our Goobuntu fleet usually took the better part of a year. With a two-year support window, there was only one year left until we had to go through the same process all over again for the next LTS. This entire process was a huge stress factor for our team, as we got hundreds of bugs with requests for help for corner cases. Once one upgrade was done, there was a general sense of being “close to burnout” in the team that we could barely recover from before the next round of updates came about. Running off an LTS version also meant that some bugs encountered by users of our distribution might already have been fixed upstream, but those fixes might never have been backported to the LTS version.
There was also a long tail of special-case upgrades that could sometimes drag on for several years. Getting engineers to upgrade the machines that the automatic process could not handle was a huge change management challenge. We got creative when motivating our users to upgrade their machines. Measures ranged from nagging messages in their UI and emails to scheduled reboots and even shutting down the machines, all to raise awareness that there were still some machines in dire need of an upgrade. Sometimes this caught machines that people had totally forgotten about, like the one machine under a desk that, as it turned out, was running a critical pipeline for something important.
When we designed gLinux Rodete (Rolling Debian Testing), we aimed to remove the two-year upgrade cycle and instead spread the load on the team out over time. The general move to CI/CD in the industry has shown that smaller incremental changes are easier to control and roll back. Rolling releases are also becoming more common among Linux distributions today (Arch Linux, NixOS).
We considered going with other Linux distributions, but ended up choosing Debian because we again wanted to offer a smooth in-place migration. This included considerations towards the availability of packages in Debian, the large Debian community, and also the existing internal packages and tooling that were using the Debian format. While the Debian Stable track follows a roughly two-year jump between releases, the Debian testing track works as a rolling release, as it’s the pool of all packages ingested and built from upstream, waiting for the next stable release to happen.
The time from upstream release to availability in testing is often just a few days (although during freeze periods before a Debian stable release, it can sometimes lag a few months behind). This means we can get much more granular changes in general and provide the newest software to our engineers at Google without having to wait longer periods.
This frequency of updates required us to redesign a lot of systems and processes. While originally intending more frequent releases, we found that for us, weekly releases were a sweet spot between moving quickly and allowing for proper release qualification, limiting the disruption to developer productivity.
Whenever we start a new release, we take a snapshot of all the packages ingested from Debian at that time. After some acceptance tests, the new hermetic release candidate is then cautiously rolled out to a dedicated testing fleet and a 1% fleet-wide canary. The canary is intentionally held for a couple of days to detect any problems with Debian packages or Google internal packages before the release progresses to the entire fleet.
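One common way to pick a stable 1% canary slice is to hash each machine's name together with the release identifier. The sketch below illustrates that idea only; the function name `in_canary`, the host naming, and the release ID format are hypothetical and not taken from our actual rollout tooling.

```python
import hashlib


def in_canary(hostname: str, release_id: str, percent: float = 1.0) -> bool:
    """Return True if this host falls into the canary slice for a release.

    Hypothetical helper: hashing hostname and release together gives every
    host a stable bucket per release, and a different bucket for the next
    release, so the same machines are not always the first to upgrade.
    """
    digest = hashlib.sha256(f"{release_id}:{hostname}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100  # 1.0 percent -> buckets 0..99


# Example: decide for one (made-up) host whether it takes the candidate now.
takes_candidate = in_canary("workstation-0042", "rodete-week-17")
```

Because the decision is a pure function of the two names, it can be evaluated on the machine itself or centrally and gives the same answer either way.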
To manage all the complex tasks involved in building all upstream packages from source, we have built a workflow system called Sieve. Whenever we see a new version of a Debian package, we start a new build. We build packages in package groups, to account for separate packages that need to be upgraded together. Once the whole group has been built, we run a virtualized test suite to make sure none of our core components and developer workflows are broken. Each group is tested separately with a full system installation, boot, and local test suite run on that version of the operating system. While builds for individual packages usually complete within minutes, these tests can take up to an hour, depending on the complexity of the package group.
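To illustrate the idea of package groups, the following sketch clusters packages that are known to need a joint upgrade, using a small union-find structure. It is an assumption about how such grouping could be done, not a description of Sieve's internals; `package_groups` and the example package names are made up.

```python
from collections import defaultdict


def package_groups(packages, coupled):
    """Cluster packages into groups that should be built and tested together.

    `packages` is an iterable of package names; `coupled` is an iterable of
    (a, b) pairs, both taken from `packages`, meaning "a and b must be
    upgraded together". Returns a sorted list of sorted groups.
    """
    parent = {p: p for p in packages}

    def find(p):
        # Walk up to the representative, halving the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in coupled:
        parent[find(a)] = find(b)  # merge the two clusters

    groups = defaultdict(set)
    for p in parent:
        groups[find(p)].add(p)
    return sorted(sorted(g) for g in groups.values())


groups = package_groups(
    ["libfoo", "foo-tools", "bar", "libbar", "baz"],
    [("libfoo", "foo-tools"), ("bar", "libbar")],
)
```

Packages without any coupling end up in a group of their own, so every package belongs to exactly one group.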
Once the packages are built and all the tests passed, we merge all the new packages with our latest pool of packages. When we cut a new release, we snapshot that pool with each package version locked in for that release. We then proceed to carefully guide this release to the fleet utilizing SRE principles like incremental canarying and monitoring the fleet health.
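The difference between the moving pool and a cut release can be shown in a few lines. This is a simplified sketch with hypothetical names (`merge_group`, `cut_release`) and made-up version strings; a real pool holds far more metadata than a name-to-version mapping.

```python
from types import MappingProxyType


def merge_group(pool: dict, built_group: dict) -> None:
    """Merge a successfully built and tested group into the moving pool."""
    pool.update(built_group)


def cut_release(pool: dict):
    """Freeze the current pool into a read-only snapshot for one release."""
    return MappingProxyType(dict(pool))  # copy first, then expose read-only


pool = {"bash": "5.1-2"}
merge_group(pool, {"bash": "5.1-3", "vim": "8.2-1"})
release = cut_release(pool)

# Later merges change the pool but not the release that was already cut.
merge_group(pool, {"bash": "5.1-4"})
```

Copying the pool before freezing it is what keeps each release hermetic: whatever is merged afterwards only affects the next snapshot.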
But not all builds succeed on the first attempt. If a package fails to build, we usually check for any known bugs with the Debian bug tracker and potentially report it, should it not be known already. Sometimes our release engineers have to become creative and apply local workarounds/patches to get a package to build within our ecosystem and later on drop those workarounds once upstream has released a fix.
One issue that we’ve run into a few times, for example, is that in upstream Debian, packages are usually built in Debian unstable. After a few days, these already built packages migrate to Debian testing. In some cases, however, a build-dependency may be stuck in unstable, and thus building within testing might not (yet) be feasible. We generally try to work upstream first in these cases, which reduces the complexity and maintenance burden of keeping local patches while also giving back to the community.
If any of the steps fail, Sieve has a toolbox of tricks to retry builds. For example, when it starts the initial build of a group of packages, the system makes an educated guess of which dependencies need to be built together. But sometimes the version information provided in Debian source packages can be incomplete and this guess is wrong. For this reason, Sieve periodically retries building groups that failed. As the latest snapshot of our packages is a moving target, it could happen that after a seemingly independent package group gets added to the snapshot, a previously broken group unexpectedly builds and passes tests correctly. All these workflows are mostly automatic and this highlights the importance of thinking as an SRE in this field. When facing a failure, it usually seems easier to just fix a failing build once, but if we need to apply the same workaround over and over, putting the workaround in code will reduce the overall burden put on our engineers.
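The periodic retry can be pictured as a loop that keeps rebuilding failed groups against the current snapshot for as long as at least one of them makes progress. The sketch below assumes a hypothetical `try_build(group, snapshot)` callback that returns the newly built packages on success and `None` on failure; it is not Sieve's real interface.

```python
def retry_failed_groups(failed, snapshot, try_build, max_rounds=10):
    """Retry failed groups until a full round makes no progress.

    A group that builds is merged into `snapshot` right away, which may
    unblock other groups in a later round. Returns the groups that are
    still failing when the loop stops.
    """
    pending = list(failed)
    for _ in range(max_rounds):
        still_failing = []
        progressed = False
        for group in pending:
            result = try_build(group, snapshot)
            if result is None:
                still_failing.append(group)
            else:
                snapshot.update(result)  # snapshot is a moving target
                progressed = True
        pending = still_failing
        if not pending or not progressed:
            break
    return pending


# Toy build function: a group builds once all of its needs are in the snapshot.
NEEDS = {"app": ["lib"], "lib": ["base"], "broken": ["missing"]}


def toy_build(group, snapshot):
    if all(dep in snapshot for dep in NEEDS[group]):
        return {group: "1.0"}
    return None


snapshot = {"base": "1.0"}
remaining = retry_failed_groups(["app", "lib", "broken"], snapshot, toy_build)
```

In the toy run, `app` fails in the first round, builds in the second once `lib` has landed, and only `broken` is left for a human to look at.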
There are also some security benefits to building all of our binaries from source and having additional source code provenance that verifies the origin of the running binary. During a security incident, for example, we are able to rebuild quickly and have confidence that the build works with a temporary patch, because we have already built every package that lands in our distribution. Additionally, we reduce the trust envelope that we have to place in upstream Debian and the binary build artifacts produced by their infrastructure. Instead, once the source code is ingested and the binary built verifiably, we can cryptographically attest that the running binary originated from exactly that source code.
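As a rough illustration of binding a binary to its source, the sketch below records both digests and authenticates the pair. It is deliberately simplified: it uses an HMAC with a shared key from the Python standard library, whereas a production provenance system would typically use asymmetric signatures and richer build metadata. The function names are hypothetical.

```python
import hashlib
import hmac


def digest(data: bytes) -> str:
    """SHA-256 digest of a source archive or binary, as hex."""
    return hashlib.sha256(data).hexdigest()


def attest(source: bytes, binary: bytes, key: bytes) -> dict:
    """Record that `binary` was built from `source`, authenticated with `key`."""
    src, out = digest(source), digest(binary)
    mac = hmac.new(key, f"{src}:{out}".encode(), hashlib.sha256).hexdigest()
    return {"source": src, "binary": out, "mac": mac}


def verify(attestation: dict, binary: bytes, key: bytes) -> bool:
    """Check the attestation is authentic and matches the binary at hand."""
    statement = f"{attestation['source']}:{attestation['binary']}".encode()
    expected = hmac.new(key, statement, hashlib.sha256).hexdigest()
    authentic = hmac.compare_digest(expected, attestation["mac"])
    return authentic and attestation["binary"] == digest(binary)


record = attest(b"source tarball bytes", b"built binary bytes", b"build-key")
ok = verify(record, b"built binary bytes", b"build-key")
```

A binary that differs by even one byte from what the build produced yields a different digest, so verification fails.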
Upgrading to Rodete
The last Goobuntu release was based on Ubuntu 14.04 LTS (Codename Trusty). Development on Rodete started in 2015, and it was quickly clear that we couldn’t just drop support for Trusty and require the entire engineering population to install a fresh new distribution. From our previous in-place upgrades between LTS versions, we already had a good idea of what awaited us with this migration. Because Ubuntu is a derivative of Debian and uses a lot of the same packaging infrastructure and formats (apt), it wasn’t a totally crazy idea to upgrade the fleet from Goobuntu 14.04 to Debian in-place. We reused some parts of our previous in-place upgrade tool and worked to make it more reliable by adding more automation and a lot more testing.
To make it easier to create such a tool, test it, and maintain it for the duration of the migration, we chose to temporarily freeze gLinux Rodete as a snapshot of Debian testing on a specific date, which we call the baseline. We can advance this baseline at a pace of our own choosing, to balance what packages Sieve ingests. To reduce friction, we intentionally set the baseline of Rodete at the current Debian stable release in 2016, which was much closer to the general state of Ubuntu Trusty. That way we could separate the in-place upgrade from Trusty to Debian from the major package version changes that happened in Debian at a later date.
In 2017, we started to migrate the machines to Rodete and completed the last in-place migrations by the end of 2018. However, we still had a baseline of packages that at that point dated almost two years in the past. To catch up with Debian testing, we started a team-wide effort to optimize Sieve's behavior and speed up the time needed to build and test packages. Replaying the upgrades in this incremental fashion and having a moving rolling release target that we control eased the workload for Google engineers and our team.
In early 2019 we started to shut down the last remnants of Goobuntu machines. Our baseline had also advanced to lag behind by only ~250 days, which at the time meant we were using most of the package versions that were part of buster. By mid-2020 we had finally fully caught up, at the same time as Debian bullseye was released. We continue to move our baseline ahead and will probably already be using versions similar to those in the next Debian stable release before its release in mid-2023.
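The baseline can be thought of as a cut-off date applied to each package's version history. The sketch below shows that selection plus a helper for how far a baseline lags behind a given day. The data layout (a list of `(date, version)` pairs per package) and both function names are assumptions made for illustration.

```python
from datetime import date


def versions_at_baseline(history: dict, baseline: date) -> dict:
    """Pick, per package, the newest version seen on or before the baseline.

    `history` maps a package name to a list of (date_seen, version) pairs.
    Packages that first appeared after the baseline are left out.
    """
    selected = {}
    for package, entries in history.items():
        eligible = [(seen, version) for seen, version in entries if seen <= baseline]
        if eligible:
            selected[package] = max(eligible)[1]  # latest date wins
    return selected


def baseline_lag_days(baseline: date, today: date) -> int:
    """Number of days the baseline lags behind `today`."""
    return (today - baseline).days


history = {
    "vim": [(date(2016, 5, 1), "7.4"), (date(2017, 3, 1), "8.0")],
    "newpkg": [(date(2018, 1, 1), "1.0")],
}
frozen = versions_at_baseline(history, date(2016, 12, 31))
```

Advancing the baseline is then just a matter of calling the selection again with a later date, which pulls in a bounded batch of newer versions each time.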
Today, the life of a gLinux team member looks very different. We have reduced the amount of engineering time and energy required for releases to one on-duty release engineer, a role that rotates among team members. We no longer have a big push to upgrade our entire fleet. There is no more need for multi-stage alphas, betas, and GAs for new LTS releases while simultaneously chasing down older machines that are still running Ubuntu Precise or Lucid.
We also dramatically improved our security stance by operating our fleet closer to upstream releases. While Debian provides a good source of security patches for the stable and oldstable tracks, we realized that not every security hole that gets patched necessarily has a Debian Security Advisory (DSA) or CVE number. Our rolling release schedule makes sure we patch security holes on the entire fleet quickly without compromising on stability, whereas previously security engineers had to carefully review each DSA and make sure the fix had made it to our fleet.
Our improved testing suite and integration tests with key partner teams that run critical developer systems also yielded a more stable experience using a Linux distribution that provides the latest versions of the Linux kernel. Our strong drive to automate everything in the pipeline has significantly reduced toil and stress within the team. It is now also possible for us to report bugs and incompatibilities with other library versions while making sure that Google tools work better within the Linux ecosystem.
If you are interested in making rolling releases a success in your company, consider balancing the needs of the company against upgrade agility. Being in control of our own moving target and baseline has helped us slow down whenever we encountered too many problems and broke any of our team SLOs. Our journey has ultimately reinforced our belief that incremental changes are more manageable than big bang releases.
Our experience is that when you are able to control the influx of new work and keep it predictable, engineers stay happier and are less stressed out. For us, this ultimately lowered team churn and made sure that we can build expertise instead of fighting multiple fires at the same time.
In the future, we are planning to work even more closely with upstream Debian and contribute more of our internal patches to maintain the Debian package ecosystem.
By: Kordian Bruck, Margarita Manterola, and Sven Mueller
Source: Google Cloud Blog