Closed Bug 1491373 Opened 6 years ago Closed 6 years ago

Allow hgweb mirrors to replicate a subset of repositories

Categories

(Developer Services :: Mercurial: hg.mozilla.org, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sheehan, Assigned: sheehan)

References

(Blocks 1 open bug)

Details

Attachments

(4 files, 1 obsolete file)

The current implementation of vcsreplicator publishes repo events to 8 Kafka partitions, each mapped to a subset of the available repositories on hg.mozilla.org. There are 3 partitions for the "top level" repos (central, beta, project repos, etc.), 4 partitions for user repos and a dedicated partition for try. For any given partition, if processing even a single message fails, the consumer process raises an exception and fails. This works well for our current setup, where all web heads mirror all repositories, because we fail fast and hard when one of our assumptions is no longer valid.
In the near future we would like to stand up dedicated hgweb mirrors for the private use of Firefox CI. These mirrors would only require a small subset of repositories to serve data to Taskcluster workers. Since several of the assumptions of the current in-datacentre setup are not met (namely, we assume the machines are in close physical proximity and thus data transfer latency is negligible), mirroring only a subset of repos is highly desirable for us. New instances will not need to do the unnecessary work of cloning repos which will never be served, and if we choose to enable auto-scaling at some point we will be able to bootstrap new instances *much* faster. I have no concrete numbers, but I'd guess a minimum of 75% of the ~5.5 hours it takes to bootstrap a new web head is spent cloning user repos and project repos. This is because many of the user repos are essentially full clones of mozilla-central that are used as the equivalent of a GitHub fork.
To avoid mirroring the user repos, we can simply not run consumers for those partitions on the hgweb instances. The 3 partitions that hold the top level repos, however, carry a mix of messages we need to process for repositories we wish to mirror and messages which are irrelevant to these specialized instances.
We need to add some exclude/include filter functionality to vcsreplicator that allows us to declare which repos are relevant and which repos can be ignored. We will define these filters in the "vcsreplicator.ini" config file found on all hgweb mirrors. During the bootstrap procedure, we can simply run the regular bootstrap process on hgssh, send the JSON over to the new hgweb instance and then filter out unnecessary repos from the list of repositories to replicate using the new config section. For the regular consumer daemons, we will need to add logic to "handle_message_main" to ignore repository messages which do not pass the inclusion criteria (or that do pass the exclusion criteria).
Assignee: nobody → sheehan
Status: NEW → ASSIGNED
This commit adds logic to the `config.Config` constructor for parsing a config section with rules for filtering repositories out of the set of repositories to replicate. The section must be named `replicationrules` and each entry takes the form `behaviour.name = syntax:rule`.
"Behaviour" is one of {include, exclude}, indicating whether the rule forces inclusion or exclusion of any matching repos. "Name" is an alias given to the rule, which is used for debugging and logging during filtering (something like "repo X filtered by rule Y"). "Syntax" determines how the "rule" string is interpreted. The currently available "syntax" values are "re" for regular expression and "ex" for explicit (i.e. explicitly include repo X).
An example section would look like:
[replicationrules]
exclude.user = re:\{moz\}/user.*
include.releases = re:\{moz\}/releases/.*
include.myuserrepo = ex:{moz}/users/cosheehan_mozilla.com/vct
exclude.beta = re:{moz}/releases/mozilla-beta
Follow up commits will specify the ordering of rule checking and how the replication processes will use these filters.
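As a rough illustration of the `behaviour.name = syntax:rule` format, here is a minimal Python sketch of how such a section could be parsed. It is not the code from the commit: the `Rule` tuple and the `parse_replication_rules` helper are hypothetical names, and the validation behaviour (raising `ValueError` on malformed entries) is an assumption.

```python
import configparser
from collections import namedtuple

# Hypothetical container for one parsed rule; not taken from vcsreplicator.
Rule = namedtuple("Rule", ["behaviour", "name", "syntax", "pattern"])


def parse_replication_rules(text):
    """Parse the `replicationrules` section of an ini-style config string."""
    # Only "=" is a delimiter, so the ":" inside "syntax:rule" stays in the value.
    parser = configparser.RawConfigParser(delimiters=("=",))
    parser.read_string(text)

    rules = []
    if not parser.has_section("replicationrules"):
        return rules

    for key, value in parser.items("replicationrules"):
        behaviour, _, name = key.partition(".")
        syntax, _, pattern = value.partition(":")
        # Assumed validation: reject anything outside the documented values.
        if behaviour not in ("include", "exclude") or not name:
            raise ValueError("bad rule key: %r" % key)
        if syntax not in ("re", "ex") or not pattern:
            raise ValueError("bad rule value: %r" % value)
        rules.append(Rule(behaviour, name, syntax, pattern))
    return rules


EXAMPLE = r"""
[replicationrules]
exclude.user = re:\{moz\}/user.*
include.myuserrepo = ex:{moz}/users/cosheehan_mozilla.com/vct
"""

rules = parse_replication_rules(EXAMPLE)
for rule in rules:
    print(rule)
```

Restricting the delimiter to `=` is a choice made for this sketch so that the colon in `syntax:rule` is never mistaken for a key/value separator.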
This commit adds a function which applies filter rules to a given repo name (assumed to be in wire path format) and returns a `RepoFilterResult` with a yes/no boolean indicating if the repo should be filtered or not, and a string name value which is the alias of the rule which did/did not filter the repo. The rules are applied in the following order:
1. Explicit includes
2. Explicit excludes
3. Regex includes
4. Regex excludes
Using this order will allow us to set general pattern matching rules to include/exclude repos, but also bypass those rules for specific repositories as we deem necessary.
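The rule ordering can be sketched in Python as follows. This is an illustration under stated assumptions, not the function from the commit: the field names of `RepoFilterResult`, the use of `re.match` for regex rules, and the fallback result when no rule matches (`default_pass` and the `no-matching-rule` alias) are all hypothetical. The rule set is loosely modelled on the example section, with the beta rule written as an explicit exclude so that it takes precedence over the regex include.

```python
import re
from collections import namedtuple

# Field names are assumptions; the commit only says the result carries
# a boolean and the alias of the deciding rule.
RepoFilterResult = namedtuple("RepoFilterResult", ["passes_filter", "rule"])

# Each rule is (behaviour, name, syntax, pattern).
RULES = [
    ("exclude", "user", "re", r"\{moz\}/user.*"),
    ("include", "releases", "re", r"\{moz\}/releases/.*"),
    ("include", "myuserrepo", "ex", "{moz}/users/cosheehan_mozilla.com/vct"),
    ("exclude", "beta", "ex", "{moz}/releases/mozilla-beta"),
]

# Explicit includes, explicit excludes, regex includes, regex excludes.
ORDER = [("include", "ex"), ("exclude", "ex"), ("include", "re"), ("exclude", "re")]


def filter_repo(repo, rules, default_pass=True):
    """Return the result of the first matching rule, checked in ORDER."""
    for behaviour, syntax in ORDER:
        for rule_behaviour, name, rule_syntax, pattern in rules:
            if (rule_behaviour, rule_syntax) != (behaviour, syntax):
                continue
            if syntax == "ex":
                matched = repo == pattern
            else:
                matched = re.match(pattern, repo) is not None
            if matched:
                return RepoFilterResult(behaviour == "include", name)
    # Assumed fallback: the source does not say what happens with no match.
    return RepoFilterResult(default_pass, "no-matching-rule")


for path in (
    "{moz}/users/cosheehan_mozilla.com/vct",
    "{moz}/users/someone/repo",
    "{moz}/releases/mozilla-beta",
    "{moz}/releases/mozilla-release",
):
    print(path, filter_repo(path, RULES))
```

In this sketch the first user repo is let through by its explicit include even though the `user` regex would exclude it, and mozilla-beta is dropped by its explicit exclude even though the `releases` regex would include it.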
This commit adds a `filter_repos` method which essentially aggregates the `filter` function over an iterable collection of repos. The returned value is a dict with separate entries for the filtered and unfiltered repos, each mapped to the rule which allowed/disallowed the repo to make it through the filter.
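A minimal sketch of the aggregation, assuming a per-repo filter callable that returns a `(passes, rule_name)` pair. The dict keys `filtered` and `unfiltered`, the `demo_filter` stand-in and its rule aliases are hypothetical; here `filtered` means "removed from the set to replicate".

```python
def filter_repos(repos, filter_one):
    """Split repos into those removed by the filter and those kept.

    `filter_one` is a hypothetical callable: repo path -> (passes, rule_name).
    """
    result = {"filtered": {}, "unfiltered": {}}
    for repo in repos:
        passes, rule = filter_one(repo)
        bucket = "unfiltered" if passes else "filtered"
        # Record which rule made the decision, for later logging.
        result[bucket][repo] = rule
    return result


def demo_filter(repo):
    """Stand-in for the real rule engine: drop everything under users/."""
    if repo.startswith("{moz}/users/"):
        return False, "user"
    return True, "default"


split = filter_repos(
    ["{moz}/mozilla-central", "{moz}/users/someone/repo"],
    demo_filter,
)
for repo, rule in split["filtered"].items():
    print("repo %s filtered by rule %s" % (repo, rule))
```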
This commit makes the bootstrap procedure respect the repo filtering rules set in the hgweb replicator config file. The hgweb procedure receives the input object from the hgssh procedure and applies the aggregate filter function to the set of repos specified in the hgweb config file. Then the process continues as normal, except that when a message is being processed for a repo which has been removed from the set of repos to replicate, it is ignored and the process continues. Immediately after removing filtered repos from the set of repos to be replicated, we also log which repos were filtered out and specify the offending rule for debugging purposes.
There are a few advantages to this implementation. The hgssh step is unaware of the filtering rules applied to the individual hgweb instances. This means we could set up separate instances to be dedicated to different repositories in the future and bootstrap all of them with the result of a single hgssh bootstrap run. We also keep the hgssh and hgweb roles separate in terms of shared data, since we won't need to pass the hgweb replication config to the hgssh instances for filtering.
A disadvantage to this approach is that we will spend some time collecting and acknowledging Kafka messages for repos we do not wish to replicate. If we did pass hgweb filter rules to the hgssh instance, we could avoid aggregating those messages altogether and only produce bootstrap messages for repos we wish to replicate. That alternative would have the disadvantage of needing to be run multiple times if we wish to bootstrap multiple hgweb heads with different replication rules at the same time.
This commit creates a repository filtering decorator which wraps the message handler function to check if a message corresponds to a repo which is being filtered out on this instance. A decorator is used to separate the filtering logic out from the message handler logic, while allowing the filter logic to be defined in a single place.
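The decorator pattern described here could look roughly like the following Python sketch. It is not the `repofilter` implementation from the commit: the message shape (a dict with a `path` key), the `should_replicate` callable and the choice to pass through messages that carry no path are all assumptions made for illustration.

```python
import functools


def repofilter(should_replicate):
    """Build a decorator that skips messages for filtered-out repos.

    `should_replicate` is a hypothetical callable: repo path -> (passes, rule).
    """
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(message):
            repo = message.get("path")
            # Assumption: messages without a repo path are always handled.
            if repo is not None:
                passes, rule = should_replicate(repo)
                if not passes:
                    print("repo %s filtered by rule %s" % (repo, rule))
                    return None
            return handler(message)
        return wrapper
    return decorator


def demo_filter(repo):
    """Stand-in for the real rule engine: drop everything under users/."""
    if repo.startswith("{moz}/users/"):
        return False, "user"
    return True, "default"


handled = []


@repofilter(demo_filter)
def handle_message(message):
    # The handler itself contains no filtering logic.
    handled.append(message["path"])
    return True


handle_message({"path": "{moz}/mozilla-central"})
handle_message({"path": "{moz}/users/someone/repo"})
print(handled)
```

Keeping the check in the wrapper means the handler body stays unchanged and the same filter can be applied to several handlers.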
Attachment #9012197 - Attachment is obsolete: true
Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/77152e3998ad
vcsreplicator: add logic to parse `replicationrules` config section r=gps
https://hg.mozilla.org/hgcustom/version-control-tools/rev/45897525df27
vcsreplicator: add a `filter` function to the `Config` class r=gps
https://hg.mozilla.org/hgcustom/version-control-tools/rev/779937c9fb31
vcsreplicator: apply repo filtering to hgweb bootstrap procedure r=gps
https://hg.mozilla.org/hgcustom/version-control-tools/rev/d54b06ec9d16
vcsreplicator: create a `repofilter` decorator and apply to message handlers r=gps
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED