Closed Bug 1491373 Opened 6 years ago Closed 6 years ago

Allow hgweb mirrors to replicate a subset of repositories

Categories

(Developer Services :: Mercurial: hg.mozilla.org, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sheehan, Assigned: sheehan)

References

(Blocks 1 open bug)

Details

Attachments

(4 files, 1 obsolete file)

The current implementation of vcsreplicator involves publishing repo events to 8 Kafka partitions, each mapped to a subset of the available repositories on hg.mozilla.org. There are 3 partitions for the "top level" repos (central, beta, project repos, etc), 4 partitions for user repos and a dedicated partition for try. For any given partition, if processing even a single message fails, the consumer process will raise an exception and cause a failure. This works well for our current setup with all web heads mirroring all repositories as we fail fast and hard when one of our assumptions is no longer valid.

In the near future we would like to stand up dedicated hgweb mirrors for the private use of Firefox CI. These mirrors would only require a small subset of repositories to serve data to Taskcluster workers. Since several of the assumptions of the current in-datacentre setup would not hold (namely, we assume the machines are in close physical proximity and thus data transfer latency is negligible), mirroring only a subset of repos is highly desirable for us. New instances will not need to do the unnecessary work of cloning repos which will never be served, and if we choose to enable auto-scaling at some point we will be able to bootstrap new instances *much* faster. I have no concrete numbers, but I'd guess a minimum of 75% of the ~5.5 hours it takes to bootstrap a new web head is spent cloning user repos and project repos. This is because many of the user repos are essentially full clones of mozilla-central that are used as the equivalent of a GitHub fork.

To avoid mirroring the user repos, we can simply not run consumers for those partitions on the hgweb instances. For the 3 partitions that hold the top level repos, we have a mix of messages we need to process for repositories we wish to mirror and messages which are irrelevant to these specialized instances.

We need to add some exclude/include filter functionality to vcsreplicator that allows us to declare which repos are relevant and which repos can be ignored. We will define these filters in the "vcsreplicator.ini" config file found on all hgweb mirrors. During the bootstrap procedure, we can simply run the regular bootstrap process on hgssh, send the JSON over to the new hgweb instance and then filter out unnecessary repos from the list of repositories to replicate using the new config section. For the regular consumer daemons, we will need to add logic to "handle_message_main" to ignore repository messages which do not pass the inclusion criteria (or that do pass the exclusion criteria).
Assignee: nobody → sheehan
Status: NEW → ASSIGNED
This commit adds logic to the `config.Config` constructor for
parsing a config section with rules for filtering repositories
out of the set of repositories to replicate. The section must
be named `replicationrules` and each entry takes the form
`behaviour.name = syntax:rule`. "Behaviour" is one of
{include, exclude}, indicating if the rule will force inclusion
or exclusion of any matching repos. "Name" is an alias given
to the rule, which will be used for debugging and logging
during filtering (something like "repo X filtered by rule Y").
"Syntax" is the type of the rule and determines how the "rule"
string is interpreted. The currently available "syntax" values
are "re" for regular expression and "ex" for explicit (i.e.
explicitly include repo X).
An example section would look like:

    [replicationrules]
    exclude.user = re:\{moz\}/user.*
    include.releases = re:\{moz\}/releases/.*
    include.myuserrepo = ex:{moz}/users/cosheehan_mozilla.com/vct
    exclude.beta = re:{moz}/releases/mozilla-beta

Follow up commits will specify the ordering of rule checking and
how the replication processes will use these filters.
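As a rough illustration of the `behaviour.name = syntax:rule` format described above, here is a minimal Python 3 sketch that parses such a section with the standard library `configparser`. It is not the actual `config.Config` code: the helper name `parse_replication_rules` and the layout of the returned dict are assumptions made for this example.

```python
import configparser
import re

# A trimmed copy of the example section from the commit message.
EXAMPLE = r"""
[replicationrules]
exclude.user = re:\{moz\}/user.*
include.releases = re:\{moz\}/releases/.*
include.myuserrepo = ex:{moz}/users/cosheehan_mozilla.com/vct
"""


def parse_replication_rules(text):
    """Parse a `replicationrules` section into four rule stores.

    Hypothetical helper: the return layout is an assumption of this sketch.
    """
    # RawConfigParser performs no value interpolation, so `{moz}` and
    # backslashes are kept verbatim.
    parser = configparser.RawConfigParser()
    parser.read_string(text)

    rules = {
        ("include", "ex"): {},  # explicit repo path -> rule name
        ("exclude", "ex"): {},
        ("include", "re"): [],  # list of (rule name, compiled pattern)
        ("exclude", "re"): [],
    }
    if not parser.has_section("replicationrules"):
        return rules

    for key, value in parser.items("replicationrules"):
        behaviour, name = key.split(".", 1)
        syntax, rule = value.split(":", 1)
        if behaviour not in ("include", "exclude"):
            raise ValueError("unknown behaviour: %s" % behaviour)
        if syntax == "ex":
            rules[(behaviour, "ex")][rule] = name
        elif syntax == "re":
            rules[(behaviour, "re")].append((name, re.compile(rule)))
        else:
            raise ValueError("unknown syntax: %s" % syntax)
    return rules
```

Keeping the rule name next to each rule is what later makes a log line such as "repo X filtered by rule Y" possible.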
This commit adds a function which applies the filter rules to a
given repo name (assumed to be in wire path format) and returns a
`RepoFilterResult` containing a boolean indicating whether the
repo should be filtered, and a string which is the alias of the
rule that made that decision. The rules are applied in the
following order:

    1. Explicit includes
    2. Explicit excludes
    3. Regex includes
    4. Regex excludes

Using this order will allow us to set general pattern
matching rules to include/exclude repos, but also bypass
those rules for specific repositories as we deem necessary.
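The rule ordering above could be sketched in Python 3 roughly as follows. This is not the committed implementation: the function name `filter_repo`, the field names of `RepoFilterResult`, the way rules are passed in, and the fallback when no rule matches are all assumptions made for illustration.

```python
import re
from collections import namedtuple

# Field names are assumptions; the commit message only says the result
# carries a boolean and the alias of the deciding rule.
RepoFilterResult = namedtuple("RepoFilterResult", ("passes_filter", "rule"))


def filter_repo(repo, ex_includes, ex_excludes, re_includes, re_excludes):
    """Decide whether `repo` (wire path format) should be replicated.

    ex_* are dicts mapping an explicit repo path to a rule name;
    re_* are lists of (rule name, compiled regex) pairs.
    """
    # 1. Explicit includes
    if repo in ex_includes:
        return RepoFilterResult(True, ex_includes[repo])
    # 2. Explicit excludes
    if repo in ex_excludes:
        return RepoFilterResult(False, ex_excludes[repo])
    # 3. Regex includes
    for name, pattern in re_includes:
        if pattern.match(repo):
            return RepoFilterResult(True, name)
    # 4. Regex excludes
    for name, pattern in re_excludes:
        if pattern.match(repo):
            return RepoFilterResult(False, name)
    # No rule matched. Replicating by default is an assumption of this
    # sketch; the commit message does not state the fallback behaviour.
    return RepoFilterResult(True, "default")


ex_includes = {"{moz}/users/cosheehan_mozilla.com/vct": "myuserrepo"}
ex_excludes = {"{moz}/releases/mozilla-beta": "beta"}
re_includes = [("releases", re.compile(r"\{moz\}/releases/.*"))]
re_excludes = [("user", re.compile(r"\{moz\}/user.*"))]
```

With these rules, the explicitly included user repo is replicated even though the `user` regex would exclude it, and `mozilla-beta` is dropped even though the `releases` regex would include it. Note that an exclusion written as a regex would lose to a matching regex include, because regex includes are checked first.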
This commit adds a `filter_repos` method which essentially
aggregates the `filter` function over an iterable collection
of repos. The returned value is a dict with separate entries
for the filtered and unfiltered repos, each mapped to the rule
which allowed/disallowed the repo to make it through the filter.
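A minimal Python 3 sketch of that aggregation is shown below. The dict keys `filtered` and `unfiltered`, the reading of "filtered" as "removed from replication", and the `filter_fn` parameter are assumptions for this example; the real method lives on the `Config` class and uses its own rule stores.

```python
def filter_repos(repos, filter_fn):
    """Apply `filter_fn` to each repo and group the outcomes.

    `filter_fn(repo)` is assumed to return a (passes, rule_name) pair.
    Returns a dict with two entries, each mapping repo -> deciding rule.
    """
    result = {"filtered": {}, "unfiltered": {}}
    for repo in repos:
        passes, rule = filter_fn(repo)
        key = "unfiltered" if passes else "filtered"
        result[key][repo] = rule
    return result


def toy_filter(repo):
    """Stand-in filter for the example: drop everything under users/."""
    if repo.startswith("{moz}/users/"):
        return False, "user"
    return True, "default"


outcome = filter_repos(["{moz}/mozilla-central", "{moz}/users/a/b"], toy_filter)
```

Returning both groups, rather than only the surviving repos, keeps the rule names available for the debugging log described in the following commit.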
This commit makes the bootstrap procedure respect the
repo filtering rules set in the hgweb replicator config
file. The hgweb procedure receives the input object
from the hgssh procedure and applies the aggregate filter
function, using the rules from the hgweb config file, to the
set of repos in that object. The process then continues as
normal, except that any message for a repo which has been
removed from the set of repos to replicate is ignored.
Immediately after removing filtered repos from the set of
repos to be replicated, we also log which repos were filtered
out and the rule responsible, for debugging purposes.

There are a few advantages to this implementation. The hgssh
step is unaware of the filtering rules applied to the individual
hgweb instances. This means we could set up separate instances
to be dedicated to different repositories in the future and
bootstrap all of them with the result of a single hgssh bootstrap
run. We also keep the hgssh and hgweb roles separate in terms of
shared data since we won't need to pass the hgweb replication config
to the hgssh instances for filtering.

A disadvantage of this approach is that we will spend some time
collecting and acknowledging Kafka messages for repos we do not
wish to replicate. If we did pass hgweb filter rules to the hgssh
instance, we could avoid aggregating those messages altogether and
only produce bootstrap messages for repos we wish to replicate.
However, the hgssh step would then need to be run multiple times
if we wish to bootstrap multiple hgweb heads with different
replication rules at the same time.
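The hgweb side of the bootstrap flow described in this commit might look something like the Python 3 sketch below. The shape of the hgssh output (a dict with `repositories` and `messages` keys, and a `path` key on each message) is purely hypothetical, as are the function and logger names; only the overall flow (filter, log, then skip messages for dropped repos) comes from the commit message.

```python
import logging

logger = logging.getLogger("bootstrap-sketch")


def apply_filter_to_bootstrap(hgssh_output, should_replicate):
    """Filter the hgssh bootstrap output on the hgweb side.

    `hgssh_output` is a hypothetical dict of the form
    {"repositories": [...], "messages": [{"path": ...}, ...]}.
    `should_replicate(repo)` is assumed to return (passes, rule_name).
    """
    kept, dropped = [], {}
    for repo in hgssh_output["repositories"]:
        passes, rule = should_replicate(repo)
        if passes:
            kept.append(repo)
        else:
            dropped[repo] = rule

    # Log each removed repo with the rule responsible, for debugging.
    for repo, rule in sorted(dropped.items()):
        logger.info("repo %s filtered by rule %s", repo, rule)

    # Messages for dropped repos are ignored; the rest are processed.
    messages = [m for m in hgssh_output["messages"] if m["path"] not in dropped]
    return kept, messages
```

Because the filtering happens entirely on the hgweb side, the same hgssh output can be handed to several hgweb instances, each passing its own `should_replicate`.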
This commit creates a repository filtering decorator which wraps
the message handler function to check if a message corresponds
to a repo which is being filtered out on this instance. A
decorator is used to separate the filtering logic out from the
message handler logic, while allowing the filter logic to be defined
in a single place.
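One way such a decorator could be written in Python 3 is sketched below. The name `repofilter` comes from the commit summary, but its signature here (taking a predicate), the handler signature, and the choice to return `None` for skipped messages are assumptions made for illustration.

```python
import functools


def repofilter(should_replicate):
    """Build a decorator that skips messages for filtered-out repos.

    `should_replicate(path)` is assumed to return True when the repo at
    `path` should be replicated on this instance.
    """
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(path, payload, *args, **kwargs):
            if not should_replicate(path):
                # Ignore the message entirely; the handler never runs.
                return None
            return handler(path, payload, *args, **kwargs)
        return wrapper
    return decorator


handled = []


# Hypothetical handler: records the repo path instead of replicating.
@repofilter(lambda path: not path.startswith("{moz}/users/"))
def handle_changegroup(path, payload):
    handled.append(path)
    return True
```

Each handler body stays free of filtering code, and changing the filtering policy only means changing the predicate given to the decorator.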
Attachment #9012197 - Attachment is obsolete: true
Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/77152e3998ad
vcsreplicator: add logic to parse `replicationrules` config section r=gps
https://hg.mozilla.org/hgcustom/version-control-tools/rev/45897525df27
vcsreplicator: add a `filter` function to the `Config` class r=gps
https://hg.mozilla.org/hgcustom/version-control-tools/rev/779937c9fb31
vcsreplicator: apply repo filtering to hgweb bootstrap procedure r=gps
https://hg.mozilla.org/hgcustom/version-control-tools/rev/d54b06ec9d16
vcsreplicator: create a `repofilter` decorator and apply to message handlers r=gps
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED