Multi-Repo, Multi-Node Gating at Massive Scale
Oct 28, 2017
To understand the subject, let’s first introduce Monty.
Monty is a Red Hatter who has worked on several major open source projects, most notably OpenStack and Ansible. He leads a team that works on the architecture and infrastructure that runs Continuous Integration and developer tooling for OpenStack.
While this does seem quite reasonable, what makes it worth a full talk is the scale at which he operates it.
In fact, OpenStack is:
- roughly 2’000 Git repositories
- 2’000 jobs per hour coming from 14 regions, 5 public clouds and 2 private clouds
- 10’000 changes merged per month
- around 2’000 committers, hundreds of companies of various sizes
- communities that rely on, or are relied on by, OpenStack (e.g. Ansible)
At its current scale, OpenStack had to adopt integration policies a little more restrictive than those of many smaller projects. Two of these were an “egalitarian process”, which prevents anyone from pushing directly to a repository, and a need for performance to handle all change requests.
The first one is already a common sight in most projects that have more than five committers. Tools like Hudson and Jenkins were built to enable this.
In fact, OpenStack used both of them in the past. But as users asked for more and more custom features, and they obviously could not all be given permission to set these up on Hudson and Jenkins, the need arose for a tool whose behaviour users could define for each run.
The second one might not be as straightforward as it seems. When changes are pushed to a repository, they go through all the checks configured for it in the CI framework. If a new change is pushed to the same repository, or to one that depends on it, it has to wait for the first change to pass all its checks before starting its own. A decent CI framework can start the second check job before the first one finishes, then wait for the first to succeed before submitting both changes, or roll back if it fails.
But that is not enough for projects like OpenStack. With its huge number of repositories, most of which depend directly on each other, its 2’000+ unique committers, and the various other projects that depend on it, there is a need for at least some distribution of the workload.
This is why Zuul was built.
Zuul is only one part of OpenStack’s workflow, so let’s begin by describing that workflow.
This is Zuul’s developer workflow:
In his favorite development environment (here in blue), the user pushes his changes to a repository. The changes are intercepted by the review system (in orange), either Gerrit (native) or Github (new alternative), and sent to Zuul for testing (here in green). Zuul results are then pushed back into the review system, and made available to the user who pushed the changes in the first place.
For detailed information about Gerrit Code Review, see their official website.
This architecture gives users access to the feedback on their own and others’ changes (patch upload explanations, test results, discussions around code, etc.) without ever being exposed to Zuul’s complexity.
As explained in their documentation:
Zuul is a program that drives continuous integration, delivery, and deployment systems with a focus on project gating and interrelated projects.
We already explained why they focused on interrelated projects. Project gating (which is actually the main word in the presentation’s title) is, in itself, just “to prevent changes that introduce regressions from being merged” (quote source).
In his presentation, Monty explains three main roles of gating:
- Gating: Every change proposed for a repository is tested before it merges.
- Co-gating: Changes to a set of repositories merge monotonically such that each change is tested with the current state of all other related repositories before it merges.
- Parallel Co-gating: Changes are serialized such that each change is tested with all of the changes ahead of it to satisfy the gating requirement while being able to run tests for multiple changes simultaneously.
Zuul was specifically built to handle all three of them.
Monty illustrated a very simplified version of Zuul’s internal actions through an example featuring four change requests for two repositories. This can obviously not be compared to the complexity of OpenStack’s 2’000 repositories and 10’000 change requests per month, but was enough to understand how Zuul implemented the three roles of gating previously discussed.
The example in this article is heavily inspired by Monty’s, but was adapted to fit the format of a blog post rather than a talk’s slides.
For starters, there are just 3 change requests, and their respective repositories have been omitted as they have no influence on this explanation of Zuul’s workflow. The 3 check jobs generated by Zuul to test each change are as follows:
- A: the first change pushed to Zuul
- A+B: the change A pre-merged with the change B, which is the second change to be pushed to Zuul
- A+B+C: the changes A and B pre-merged with the change C, which is the third change to have reached Zuul
Doing so handles the first and second roles of gating, since each change is tested against the current state of the repositories it relates to. Zuul runs the A+B and A+B+C jobs, rather than separate jobs for B and C, because otherwise it could not handle the third role of gating, parallelism: it would have to run each job after the previous one had finished. By pre-merging A with B, and both of them with C, Zuul can safely run the three jobs in parallel.
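As a rough illustration of the pre-merging idea above, here is a minimal Python sketch that derives the speculative jobs from the order in which changes reached the queue. The function names `speculative_jobs` and `job_name` are made up for this article and are not part of Zuul.

```python
def speculative_jobs(queue):
    """For each queued change, list the changes its job is tested with.

    The job for the change at position i contains every change ahead of it
    in the queue plus the change itself (hypothetical helper, not Zuul code).
    """
    return [queue[: i + 1] for i in range(len(queue))]


def job_name(changes):
    """Label a job the way this article does, e.g. 'A+B+C'."""
    return "+".join(changes)


jobs = speculative_jobs(["A", "B", "C"])
print([job_name(job) for job in jobs])  # ['A', 'A+B', 'A+B+C']
```

Because every job already contains the changes ahead of it, none of them needs to wait for another job's result before starting.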
Below, the three jobs are represented by blue filled squares, and each of them has 15 checks to pass in order to succeed Zuul’s validation process and be merged into the repository it belongs to:
All three jobs run simultaneously, at various speeds depending on their checks and the infrastructure they are running on. In our example, the first job submitted is ahead of the two others, but this behaviour is not guaranteed:
If a check of the A+B job fails, all its other checks still running are cancelled, the job is marked as failed, and the failure is reported through the review system to the user who requested the B change. Additionally, the A+B+C job is cancelled as well: its checks are no longer valid because the failed B change is merged into it. All its running tasks are stopped.
The third job is then regenerated without the B change. By the time it is rescheduled and restarted, the A job has passed all its checks, so Zuul validates the A change and merges it into its repository. The new A+C job has to rerun all the checks, including those that had already succeeded.
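The failure handling described above can be approximated with a small simulation. This is a simplified sketch and not Zuul's actual scheduler: the `run_gate` function and its `passes` callback are hypothetical, and jobs that would really run in parallel are evaluated one after another here.

```python
def run_gate(queue, passes):
    """Simulate a gate queue with speculative pre-merging.

    queue:  change names, in the order they reached the gate.
    passes: hypothetical callback that receives the tuple of changes merged
            into a job and returns True when all checks of that job succeed.
    Returns (merged, rejected, jobs_started).
    """
    merged, rejected, jobs_started = [], [], []
    pending = list(queue)
    while pending:
        # One job per pending change, built on top of everything ahead of it.
        window = [tuple(merged + pending[: i + 1]) for i in range(len(pending))]
        jobs_started.extend(window)
        # Find the first failing job; jobs behind it count as cancelled,
        # so their results are never consulted.
        first_bad = next(
            (i for i, job in enumerate(window) if not passes(job)), None
        )
        if first_bad is None:
            merged.extend(pending)
            pending = []
        else:
            merged.extend(pending[:first_bad])   # changes ahead of the failure merge
            rejected.append(pending[first_bad])  # the failing change is reported back
            pending = pending[first_bad + 1:]    # the rest is rebuilt without it


    return merged, rejected, jobs_started


# Any job that contains change B fails, as in the example above.
merged, rejected, jobs_started = run_gate(["A", "B", "C"], lambda job: "B" not in job)
print(merged)        # ['A', 'C']
print(rejected)      # ['B']
print(jobs_started)  # [('A',), ('A', 'B'), ('A', 'B', 'C'), ('A', 'C')]
```

The last entry of `jobs_started` is the regenerated A+C job: C is tested again from scratch, this time on top of A only.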
One of the previously stated limitations of Hudson and Jenkins was their inability to adapt to the unique use cases and scenarios of OpenStack’s 2’000+ committers. Zuul, on the other hand, has a YAML-based configuration system that enables it to be deployed and run in multiple environments and under multiple conditions specified by the user, in one single batch.
To deploy these various user-defined environments, Zuul uses another tool from the OpenStack Infra team’s framework: Nodepool. Nodepool’s documentation states this:
Nodepool is a service used by the OpenStack CI team to deploy and manage a pool of devstack images on a cloud server for use in OpenStack project testing.
Once per day, for every image type (and provider) configured by nodepool, a new image with cached data is built for use by devstack. Nodepool spins up new instances and tears down old as tests are queued up and completed, always maintaining a consistent number of available instances for tests up to the set limits of the CI infrastructure.
This allows users, for example, to run the same checks for their Python repositories against several versions of the language within multiple pipelines. As Nodepool hosts images, Zuul pipelines can easily be built for multiple systems (RHEL 6/7, Debian X, …), configurations (Java 6/7/8, Python 2.6/2.7/3.2, …), and even user-defined environments (pre-installed software, with/without some dependencies, etc.).
In the example previously discussed, each task of each job (represented by squares that are first white, then green, red, or grey) is executed on an independent image instance provided by Nodepool.
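To make the quoted behaviour more concrete (instances spun up and torn down as tests come and go, with a constant number kept available up to a limit), here is a toy Python model. It is only a sketch under assumptions: the `ToyNodePool` class, the image labels `py27` and `py35`, and the `min_ready` and `max_nodes` parameters are invented for illustration and do not reflect Nodepool's real API or configuration. The daily image rebuild is not modelled.

```python
class ToyNodePool:
    """Keeps `min_ready` idle instances per image, never exceeding `max_nodes`."""

    def __init__(self, images, min_ready, max_nodes):
        self.min_ready = min_ready
        self.max_nodes = max_nodes
        self.ready = {image: 0 for image in images}  # idle instances per image
        self.in_use = 0                              # instances running a task
        self.replenish()

    def total(self):
        return self.in_use + sum(self.ready.values())

    def replenish(self):
        # Boot instances until every image has min_ready idle nodes
        # or the global limit is reached.
        for image in self.ready:
            while self.ready[image] < self.min_ready and self.total() < self.max_nodes:
                self.ready[image] += 1

    def acquire(self, image):
        """Give an idle instance to a test task; False if none is available."""
        if self.ready[image] == 0:
            return False
        self.ready[image] -= 1
        self.in_use += 1
        self.replenish()
        return True

    def release(self):
        """A task finished: its instance is torn down rather than reused."""
        self.in_use -= 1
        self.replenish()


pool = ToyNodePool(["py27", "py35"], min_ready=2, max_nodes=5)
pool.acquire("py27")
pool.acquire("py27")  # the limit of 5 is reached, so py27 drops to 1 idle node
pool.release()        # freed capacity is used to boot a fresh py27 instance
print(pool.ready, pool.in_use)
```

Each task gets a fresh instance and the pool refills itself in the background, which is why tasks in the gating example never share state with one another.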
This is what a Zuul YAML configuration file looks like for unit tests on Python 2.7 and 3.5 environments, with dependencies on other repositories:
description: Run tox python 27 unittests against master of important libs
description: Run tox python 35 unittests against master of important libs
In a modular fashion, these unit tests can then be added to a project template:
This template in turn can be used to create a project:
This is just a subset of what is configurable in Zuul, as there are many more layers that can be added on top of it, like network information, software to install at runtime, environment variables, reporting mechanisms, etc.
In short, this is how Zuul works:
- Jobs run on nodes from nodepool (static or dynamic)
- Metadata is defined in Zuul’s configuration
- Content is executed on nodes with Ansible (and is live streamed here)
- Puppet may be used, but is still in development phase
- Jobs may be defined centrally or in the repository being tested
- Jobs have contextual variants that simplify configuration
- Zuul job repositories can be directly shared between separate Zuul installations
All in all, Monty’s talk was a really interesting one.
Continuous integration is a subject that almost every organisation has to consider at some point. Of course, the load and diversity that OpenStack’s integration processes face are not something the everyday organisation has to deal with. In most cases, organisations will already have tools such as Jenkins and Hudson, and these will work just fine.
Nonetheless, even if Zuul is probably too much for most of the use cases you and I would encounter, the concerns it was built for and the way it handles them are things we can reuse, albeit not at OpenStack’s scale. The three roles of gating presented here are one example, and every integration process should consider implementing them. Another is the user-defined and user-driven test environments. Nodepool’s way of caching most environment scenarios as images is a really interesting one, and building a custom IaaS with other open source tools like Docker/Kubernetes should be within reach of most organisations as well.