Closed Bug 1470606 Opened 6 years ago Closed 6 years ago

Expose repository data more atomically

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: gps)

References

Details

Attachments

(7 files)

The hg.mozilla.org replication mechanism has a known deficiency: race conditions during replication open a window in which each hgweb machine can expose different repository state.

This is described at https://hg.mozilla.org/hgcustom/version-control-tools/file/536fa31ce353/docs/hgmo/replication.rst#l206.

The mirror inconsistency is known to cause issues like bug 1462323.

Ideally, we want each hgweb machine to expose new data atomically with every other machine: even if one hgweb machine is slow to replicate/receive new data, the fast machines shouldn't expose the new data until the slowest replicator has it.

Since we're using separate filesystems to store repository data, we need another mechanism to keep the fully replicated data in sync.

In terms of ideal atomicity guarantees, the solution to this problem is to use a network service to hold the fully replicated state. This service can be centralized (e.g. SQL) or decentralized (e.g. ZooKeeper), or some kind of hybrid (like Redis).

However, we've lived without a "proper" solution for ~3 years and it hasn't caused many problems. Only after we swapped in slow machines, which increased the replication inconsistency window from tens or hundreds of milliseconds to multiple seconds, did it become a problem. I'm sure failures due to inconsistency did occur. But the failure rate was perceived to be so low that it wasn't a problem.

Given that a homogeneous server environment with a non-atomic window of milliseconds seemed to cause few problems, there is a much simpler solution on the table that doesn't require a new network service and point of failure. After connor and I consulted with bbangert, we decided that simplicity won out and went with that solution.

The proposed solution is to build on top of the existing Kafka-based replication system. We already have a process on the hgssh server that republishes Kafka/replication messages once all consumers have acknowledged / applied that message. This Kafka topic is what the Pulse and SNS notifiers are built on top of.

Our plan is to introduce a new Kafka consumer on each hgweb machine to consume the "fully replicated" Kafka topic. When it receives a message, it will atomically write a local file containing the state of the repo replicated up to that message (the file likely contains the set of heads in the changelog). We will introduce a repoview/filter that takes this file into account and serves a repository accordingly.

Essentially:

1) Push occurs and replication messages written to Kafka
2) Each hgweb machine applies replication messages and acknowledges Kafka offset
3) Once each hgweb consumer has applied a message, that message is republished in a new topic
4) Each hgweb machine pulls republished message in near real time and updates state on local filesystem
5) hgweb process starts exposing new repository state
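The critical invariant in steps 2-3 is that a message gets republished only once every per-machine consumer has acknowledged it. Stripped of the Kafka plumbing, the aggregator's decision reduces to something like the following sketch (function names are illustrative, not the actual vcsreplicator API):

```python
def fully_replicated_offset(consumer_offsets):
    """A message is fully replicated only once *every* consumer has
    acknowledged it, i.e. up to the minimum acknowledged offset."""
    return min(consumer_offsets.values())


def messages_to_republish(log, consumer_offsets, republished_through):
    """Return messages newly safe to copy into the fully-replicated
    topic. `log` is the topic's message list indexed by offset;
    `republished_through` is the last offset already republished."""
    safe = fully_replicated_offset(consumer_offsets)
    return log[republished_through + 1:safe + 1]
```

A single slow consumer holds back the minimum, so no machine starts serving data the slowest replicator hasn't applied yet.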

There are still race conditions on the hgweb servers due to them not processing Kafka messages at the same rate. But from our experience with our low-volume Kafka cluster, messages are delivered across machines within a very tight window. As long as that holds and the work being done by the consumer is simple (e.g. writing a file) and that process is prioritized accordingly so it can't be preempted by other work on the system, the inconsistency window should be very short (single digit milliseconds) and the chances of anyone hitting a race condition should be very low.

We believe the simplicity of this design (building on top of existing infrastructure) - while not resulting in the ideal atomic guarantees - is a better trade-off than the complexity involved with standing up a new service to hold the atomic state. We can likely implement this solution in a day or 2 of work since it builds on top of existing services.

FWIW, once hgweb is moved to AWS or GCP, a hosted service like DynamoDB or Athena may be acceptable. We just don't want to stand up a new service in a Mozilla datacenter, especially since we're in the middle of migrating datacenters and plan to eventually move to a non-physical datacenter. It's not worth the hassle of standing up a new network service in Mozilla's datacenter, nor the latency and reliability concerns of making network requests outside the datacenter, just to provide the atomic replication feature.

glob: I would appreciate an evaluation of my thoughts here. I can hop on a video call to discuss in more detail. Service architecture decision making is hard to do in text!
Flags: needinfo?(glob)
as per our video chat, this looks good to me.
Flags: needinfo?(glob)
Depends on: 1471732
The docs change capture what will be implemented in subsequent commits.
This will allow us to more easily introduce another consumer that
reuses most of the existing code but handles individual messages
differently (e.g. processes a different set of messages).
INCOMPLETE: TESTS CURRENTLY FAIL. DO NOT APPROVE.

Our current replication mechanism is essentially:

1) (hgssh) Write message to pushdata topic
2) (hgweb) Apply message in pushdata, expose data to hgweb
3) (hgssh) Copy message to replicatedpushdata topic
4) (hgssh) React to fully replicated messages in replicatedpushdata

Our new mechanism to reduce the inconsistency window on hgweb machines
will involve a 2-phase replication:

1) (hgssh) Write message to pushdata topic
2) (hgweb) Apply message from pushdata
3) (hgssh) Copy message to replicatedpushdatapending topic
4a) (hgweb) Write updated heads from message in replicatedpushdatapending
4b) (hgweb) Start serving new heads via hgweb
5) (hgssh) Copy message to replicatedpushdata
6) (hgssh) React to fully replicated messages in replicatedpushdata

Essentially, we're inserting a new topic, "replicatedpushdatapending",
between "pushdata" and "replicatedpushdata".

This commit implements that new middle topic.

In message processing order, the following new functionality was
implemented:

* The "replicatedpushdatapending" topic is created in Ansible and in
  our test environment.
* A new "pushdataaggregator-pending" service has been added to hgssh. It
  works just like pushdataaggregator except it copies messages to
  "replicatedpushdatapending" instead of "replicatedpushdata." A nagios
  check for the new service is included.
* A new `vcsreplicator-headsconsumer` process has been added. It calls
  a function that no-ops on all messages (functionality will be
  implemented in a later commit). It shares most code with
  `vcsreplicator-consumer`.
* A new systemd service has been added to run
  `vcsreplicator-headsconsumer` on hgweb. It monitors the
  "replicatedpushdatapending" topic.
* The "pushdataaggregator" service on hgssh has been changed to monitor
  "replicatedpushdatapending" instead of "pushdata."
Comment on attachment 8988639 [details]
docs: document planned replication improvements to reduce inconsistency window (bug 1470606); r?glob, sheehan

Connor Sheehan [:sheehan] has approved the revision.

https://phabricator.services.mozilla.com/D1870
Attachment #8988639 - Flags: review+
Comment on attachment 8988640 [details]
vcsreplicator: allow different message processing functions to be used (bug 1470606); r?glob, sheehan

Connor Sheehan [:sheehan] has approved the revision.

https://phabricator.services.mozilla.com/D1872
Attachment #8988640 - Flags: review+
fubar: I just wanted to give a heads up that we'll be introducing 2 new daemon processes as part of this bug (1 on hgweb and another on hgssh). We'll need to hook up some Nagios monitoring of those. We may also need to tweak existing Nagios monitoring.
Attachment #8988641 - Attachment description: vcsreplicator: replicate messages to replicatedpushdatapending (bug 1470606) → vcsreplicator: replicate messages to replicatedpushdatapending (bug 1470606); r?glob, sheehan
The previous commit established a new aggregated "pending" topic
for all messages that had previously been replicated along with
a new consumer daemon for processing messages in it.

This commit teaches that new consumer daemon to handle the
"hg-heads-1" message. Whenever we receive a message listing
repository heads, we write a file in the repository containing
that list of heads.

We first write out to a temporary file then do an atomic rename.
This ensures that readers see an atomic snapshot of the file content.
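The write-then-rename pattern described above can be sketched as follows (the `.hg/replicated-heads` path is from this bug; the helper name and one-head-per-line format are assumptions):

```python
import os
import tempfile


def write_heads_file(repo_path, heads):
    """Atomically publish .hg/replicated-heads: write a temp file in
    the same directory, fsync it, then rename over the destination.
    rename() is atomic on POSIX when source and target are on the same
    filesystem, so readers see either the old file or the complete new
    one, never a partial write."""
    final = os.path.join(repo_path, '.hg', 'replicated-heads')
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(final),
                               prefix='replicated-heads.')
    try:
        with os.fdopen(fd, 'w') as fh:
            fh.write('\n'.join(heads) + '\n')
            fh.flush()
            os.fsync(fh.fileno())
        os.rename(tmp, final)
    except Exception:
        os.unlink(tmp)
        raise
```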

A test for the new functionality has been added. It foreshadows
hgweb using the heads file to determine which changesets should
be exposed.
INCOMPLETE: there are still some test failures

Now that we write out a file containing the set of fully replicated
heads in a repository, it's time to put it to use.

This commit implements a new Mercurial extension that provides a new
repoview/filter that takes the replicated-heads file into account.

A repoview is simply a function that returns a set of revisions that
should be hidden. By default, hgweb uses the "served" repoview. This
view filters out hidden/obsolete and secret revisions.

We implement a "replicatedserved" repoview. It starts by collecting
the revisions filtered by the "served" repoview. If the
".hg/replicated-heads" file exists, we perform a DAG traversal
starting at all DAG heads and stopping at revisions specified in
this file and revisions hidden by the "served" repoview. These
revisions are unioned with the revisions from the "served" repoview.
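As a rough sketch of that traversal, using a plain parent map instead of Mercurial's revlog API (all names here are illustrative, and this ignores merge-DAG corner cases the real filter must handle):

```python
def replicated_served_hidden(parents, dag_heads, served_hidden,
                             replicated_heads):
    """Revisions hidden by a hypothetical 'replicatedserved' filter:
    everything reachable from the current DAG heads but not yet at or
    below a fully replicated head, unioned with what 'served' hides."""
    hidden = set(served_hidden)
    stack = [head for head in dag_heads if head not in replicated_heads]
    while stack:
        rev = stack.pop()
        if rev in hidden:
            continue
        hidden.add(rev)
        # Stop descending once we reach a fully replicated head.
        for parent in parents.get(rev, ()):
            if parent not in replicated_heads:
                stack.append(parent)
    return hidden
```

For a linear history 0-1-2-3 where only head 2 is fully replicated, revision 3 is hidden until the replicated-heads file advances.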
Comment on attachment 8988639 [details]
docs: document planned replication improvements to reduce inconsistency window (bug 1470606); r?glob, sheehan

Byron Jones ‹:glob› has approved the revision.

https://phabricator.services.mozilla.com/D1870
Attachment #8988639 - Flags: review+
Comment on attachment 8988950 [details]
vcsreplicator: only expose replicated revisions via hgweb (bug 1470606); r?glob, sheehan

Byron Jones ‹:glob› has approved the revision.

https://phabricator.services.mozilla.com/D1894
Attachment #8988950 - Flags: review+
Our intent is to use the heads message as a coordination event to
control what data is exposed on hgweb machines. As part of testing
this eventual mechanism, I discovered that the pushlog needs to
be integrated with this new mechanism because it assumes hidden
changesets are obsolete.

This commit adds the last push ID from the pushlog to the heads
replication message, if available.

Strictly speaking, this change is backwards incompatible and we
should introduce a new message to convey the new data. However,
nobody is using the heads message yet. So we can get away with this
change without introducing a new message type.
Attachment #8988950 - Attachment description: vcsreplicator: only expose replicated revisions via hgweb (bug 1470606) → vcsreplicator: only expose replicated revisions via hgweb (bug 1470606); r?glob, sheehan
Comment on attachment 8989559 [details]
vcsreplicator: add last push ID to heads message (bug 1470606); r?sheehan, glob

Connor Sheehan [:sheehan] has approved the revision.

https://phabricator.services.mozilla.com/D1931
Attachment #8989559 - Flags: review+
Comment on attachment 8988640 [details]
vcsreplicator: allow different message processing functions to be used (bug 1470606); r?glob, sheehan

Byron Jones ‹:glob› has approved the revision.

https://phabricator.services.mozilla.com/D1872
Attachment #8988640 - Flags: review+
Comment on attachment 8988641 [details]
vcsreplicator: replicate messages to replicatedpushdatapending (bug 1470606); r?glob, sheehan

Byron Jones ‹:glob› has approved the revision.

https://phabricator.services.mozilla.com/D1873
Attachment #8988641 - Flags: review+
Comment on attachment 8988949 [details]
vcsreplicator: implement heads message handling (bug 1470606); r?glob, sheehan

Byron Jones ‹:glob› has approved the revision.

https://phabricator.services.mozilla.com/D1893
Attachment #8988949 - Flags: review+
Comment on attachment 8988641 [details]
vcsreplicator: replicate messages to replicatedpushdatapending (bug 1470606); r?glob, sheehan

Connor Sheehan [:sheehan] has approved the revision.

https://phabricator.services.mozilla.com/D1873
Attachment #8988641 - Flags: review+
The replicated-data file contains the set of heads that are fully
replicated along with the most recently replicated pushlog ID.

This commit teaches the pushlog extension to take this data into
account.

Pushlog queries via hgweb now filter out unreplicated pushes by
default. In other words, the pushlog won't expose updated data until
it has been fully replicated. This should significantly narrow the
window of inconsistency that pushlog pollers see.
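Conceptually, the filtering amounts to dropping any push whose ID exceeds the last fully replicated push ID. A sketch, with the real pushlog storage and query layer abstracted away (names are illustrative):

```python
def visible_pushes(pushes, last_replicated_push_id):
    """Filter a pushlog to entries fully replicated everywhere.
    `pushes` maps push ID -> push data. When no replication data is
    available, expose everything (the pre-existing behavior)."""
    if last_replicated_push_id is None:
        return dict(pushes)
    return {push_id: push for push_id, push in pushes.items()
            if push_id <= last_replicated_push_id}
```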

test-served-heads.t changes demonstrate the new, desired behavior.
fubar: the patches in this bug will require some new nagios alerts on hgssh and hgweb. What do you need from me to get those running?

(We're on track to have this deployed as early as Monday and we need monitoring once it is deployed, otherwise the replication system could silently break.)
Flags: needinfo?(klibby)
Comment on attachment 8988949 [details]
vcsreplicator: implement heads message handling (bug 1470606); r?glob, sheehan

Connor Sheehan [:sheehan] has approved the revision.

https://phabricator.services.mozilla.com/D1893
Attachment #8988949 - Flags: review+
Comment on attachment 8988950 [details]
vcsreplicator: only expose replicated revisions via hgweb (bug 1470606); r?glob, sheehan

Connor Sheehan [:sheehan] has approved the revision.

https://phabricator.services.mozilla.com/D1894
Attachment #8988950 - Flags: review+
(In reply to Gregory Szorc [:gps] from comment #19)
> fubar: the patches in this bug will require some new nagios alerts on hgssh
> and hgweb. What do you need from me to get those running?
> 
> (We're on track to have this deployed as early as Monday and we need
> monitoring once it is otherwise the replication system could silently break.)

depends on what you want monitored! :-) tell me what you've added and what needs looking after, and I'll have a go at things on Mon/Tues.
Flags: needinfo?(klibby)
Comment on attachment 8990148 [details]
pushlog: only expose replicated pushlog entries (bug 1470606); r?glob, sheehan

Connor Sheehan [:sheehan] has approved the revision.

https://phabricator.services.mozilla.com/D1990
Attachment #8990148 - Flags: review+
Keywords: leave-open
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/cd5d797eaec8
vcsreplicator: add last push ID to heads message ; r=sheehan
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/ea3ad31197f3
docs: document planned replication improvements to reduce inconsistency window ; r=glob,sheehan
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/bf125ca90e39
vcsreplicator: allow different message processing functions to be used ; r=glob,sheehan
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/25c655d42e6e
vcsreplicator: replicate messages to replicatedpushdatapending ; r=glob,sheehan
We'll need the following checks on hgssh master only:

Process existence for:

/var/hg/venv_tools/bin/python2.7 /var/hg/venv_tools/bin/vcsreplicator-aggregator --max-polls 1800 /etc/mercurial/pushdataaggregator-pending.ini
/var/hg/venv_tools/bin/python2.7 /var/hg/venv_tools/bin/vcsreplicator-aggregator --max-polls 1800 /etc/mercurial/pushdataaggregator.ini

I think checking for exactly 2 processes with "vcsreplicator-aggregator" in them would be fine.

The check_pushdataaggregator_pending_lag check should also be executed. We added this to the Nagios config as part of a recent push. It should be configured identically to check_pushdataaggregator_lag (it is a variation of that check).
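For context on what a lag check like check_pushdataaggregator_pending_lag measures: per-partition lag is the distance between the newest Kafka offset and the consumer's committed offset. A simplified sketch (the thresholds are made up; the real check's options may differ):

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Per-partition lag: how far the consumer's committed offset
    trails the newest message in each partition."""
    return {part: latest_offsets[part] - committed_offsets.get(part, 0)
            for part in latest_offsets}


def lag_exit_code(lag, warn=10, crit=100):
    """Map worst-partition lag to a Nagios-style exit code."""
    worst = max(lag.values())
    if worst >= crit:
        return 2  # CRITICAL
    if worst >= warn:
        return 1  # WARNING
    return 0      # OK
```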
On all hgweb nodes...

A process existence check for:

/var/hg/venv_replication/bin/python2.7 /var/hg/venv_replication/bin/vcsreplicator-headsconsumer /etc/mercurial/vcsreplicator-pending.ini

Looking for a single `vcsreplicator-headsconsumer` process should be fine. We should have existing checks for `vcsreplicator-consumer` processes we can crib off of. But there is only 1 process for `vcsreplicator-headsconsumer`, not 8.

We'll also need the check_vcsreplicator_pending_lag configured similarly to check_vcsreplicator_lag. However, that check isn't yet defined (looks like I forgot to create its Nagios check variation). I'll get that deployed in the next ~15 minutes.
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/33ca0461ccdd
vcsreplicator: implement heads message handling ; r=glob, sheehan
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/7c0a6e43af5f
ansible/hg-web: add check_vcsreplicator_pending_lag check variation
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/e0ddaf0d493d
vcsreplicator: only expose replicated revisions via hgweb ; r=glob, sheehan
https://hg.mozilla.org/hgcustom/version-control-tools/rev/c44fcd524112
pushlog: only expose replicated pushlog entries ; r=glob, sheehan
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/d7c356958e2d
ansible/hg-web: disable vcsreplicator extension
Had to do a partial backout of this because of exceptions being thrown on hgweb machines.

[Mon Jul 09 21:09:35.741213 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694] Traceback (most recent call last):
[Mon Jul 09 21:09:35.741428 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/hgweb/hgwebdir_mod.py", line 229, in run_wsgi
[Mon Jul 09 21:09:35.741489 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]     for r in self._runwsgi(req):
[Mon Jul 09 21:09:35.741499 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/hgweb/hgweb_mod.py", line 318, in run_wsgi
[Mon Jul 09 21:09:35.747761 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]     for r in self._runwsgi(req, repo):
[Mon Jul 09 21:09:35.747795 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]   File "/var/hg/version-control-tools/hgext/serverlog/__init__.py", line 316, in _runwsgi
[Mon Jul 09 21:09:35.747836 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]     for what in super(hgwebwrapped, self)._runwsgi(req, repo):
[Mon Jul 09 21:09:35.747845 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/hgweb/hgweb_mod.py", line 370, in _runwsgi
[Mon Jul 09 21:09:35.747862 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]     return protocol.call(rctx.repo, req, cmd)
[Mon Jul 09 21:09:35.747876 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]   File "/var/hg/version-control-tools/hgext/serverlog/__init__.py", line 233, in protocolcall
[Mon Jul 09 21:09:35.747889 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]     return origcall(repo, req, cmd)
[Mon Jul 09 21:09:35.747896 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/hgweb/protocol.py", line 165, in call
[Mon Jul 09 21:09:35.750272 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]     rsp = wireproto.dispatch(repo, p, cmd)
[Mon Jul 09 21:09:35.750296 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/wireproto.py", line 586, in dispatch
[Mon Jul 09 21:09:35.758011 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]     return func(repo, proto, *args)
[Mon Jul 09 21:09:35.758058 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/wireproto.py", line 908, in getbundle
[Mon Jul 09 21:09:35.758099 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]     **pycompat.strkwargs(opts))
[Mon Jul 09 21:09:35.758108 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/exchange.py", line 1771, in getbundlechunks
[Mon Jul 09 21:09:35.764881 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]     **pycompat.strkwargs(kwargs))
[Mon Jul 09 21:09:35.764937 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/exchange.py", line 1818, in _getbundlechangegrouppart
[Mon Jul 09 21:09:35.764992 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]     if outgoing.missing:
[Mon Jul 09 21:09:35.765008 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/util.py", line 933, in __get__
[Mon Jul 09 21:09:35.773071 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]     result = self.func(obj)
[Mon Jul 09 21:09:35.773109 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/discovery.py", line 127, in missing
[Mon Jul 09 21:09:35.777627 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]     self._computecommonmissing()
[Mon Jul 09 21:09:35.777676 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/discovery.py", line 115, in _computecommonmissing
[Mon Jul 09 21:09:35.777721 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]     self.missingheads)
[Mon Jul 09 21:09:35.777733 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/revlog.py", line 976, in findcommonmissing
[Mon Jul 09 21:09:35.784386 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]     heads = [self.rev(n) for n in heads]
[Mon Jul 09 21:09:35.784426 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/changelog.py", line 358, in rev
[Mon Jul 09 21:09:35.788960 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]     r = super(changelog, self).rev(node)
[Mon Jul 09 21:09:35.788995 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/revlog.py", line 757, in rev
[Mon Jul 09 21:09:35.789038 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694]     raise LookupError(node, self.indexfile, _('no node'))
[Mon Jul 09 21:09:35.789080 2018] [wsgi:error] [pid 24463] [remote 10.22.74.212:35694] LookupError: 00changelog.i@d7c356958e2d: no node

[Mon Jul 09 21:10:17.941616 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222] Traceback (most recent call last):
[Mon Jul 09 21:10:17.941644 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/hgweb/hgwebdir_mod.py", line 229, in run_wsgi
[Mon Jul 09 21:10:17.941684 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]     for r in self._runwsgi(req):
[Mon Jul 09 21:10:17.941696 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/hgweb/hgweb_mod.py", line 318, in run_wsgi
[Mon Jul 09 21:10:17.942221 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]     for r in self._runwsgi(req, repo):
[Mon Jul 09 21:10:17.942241 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]   File "/var/hg/version-control-tools/hgext/serverlog/__init__.py", line 316, in _runwsgi
[Mon Jul 09 21:10:17.942266 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]     for what in super(hgwebwrapped, self)._runwsgi(req, repo):
[Mon Jul 09 21:10:17.942277 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/hgweb/hgweb_mod.py", line 446, in _runwsgi
[Mon Jul 09 21:10:17.942294 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]     content = getattr(webcommands, cmd)(rctx, req, tmpl)
[Mon Jul 09 21:10:17.942304 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/hgweb/webcommands.py", line 453, in changeset
[Mon Jul 09 21:10:17.943042 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]     ctx = webutil.changectx(web.repo, req)
[Mon Jul 09 21:10:17.943058 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/hgweb/webutil.py", line 306, in changectx
[Mon Jul 09 21:10:17.943725 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]     return changeidctx(repo, changeid)
[Mon Jul 09 21:10:17.943743 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/hgweb/webutil.py", line 289, in changeidctx
[Mon Jul 09 21:10:17.943765 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]     ctx = repo[changeid]
[Mon Jul 09 21:10:17.943777 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/localrepo.py", line 745, in __getitem__
[Mon Jul 09 21:10:17.943798 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]     return context.changectx(self, changeid)
[Mon Jul 09 21:10:17.943808 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/context.py", line 533, in __init__
[Mon Jul 09 21:10:17.944962 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]     self._node = repo.names.singlenode(repo, changeid)
[Mon Jul 09 21:10:17.944982 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/namespaces.py", line 104, in singlenode
[Mon Jul 09 21:10:17.945239 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]     n = v.namemap(repo, name)
[Mon Jul 09 21:10:17.945257 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/namespaces.py", line 54, in <lambda>
[Mon Jul 09 21:10:17.945279 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]     bnamemap = lambda repo, name: tolist(repo.branchtip(name, True))
[Mon Jul 09 21:10:17.945290 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/localrepo.py", line 964, in branchtip
[Mon Jul 09 21:10:17.945307 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]     return self.branchmap().branchtip(branch)
[Mon Jul 09 21:10:17.945316 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/localrepo.py", line 946, in branchmap
[Mon Jul 09 21:10:17.945337 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]     branchmap.updatecache(self)
[Mon Jul 09 21:10:17.945348 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]   File "/var/hg/venv_hgweb/lib/python2.7/site-packages/mercurial/branchmap.py", line 114, in updatecache
[Mon Jul 09 21:10:17.945952 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222]     assert partial.validfor(repo), filtername
[Mon Jul 09 21:10:17.945984 2018] [wsgi:error] [pid 24677] [remote 10.22.74.212:39222] AssertionError: replicatedserved

The 2nd one appears to be a bug in our new extension's branchmap cache code. Should be easy to reproduce.

The 1st one occurs during the wire protocol. Essentially `hg pull` failed in the window between when hgweb had replicated the data and when its heads were being exposed. Perhaps the wire protocol isn't taking the repoview into account properly?

I have some work to do...
(In reply to Gregory Szorc [:gps] from comment #28)
> We'll need the following checks on hgssh master only:
> 
> Process existence for:
> 
> /var/hg/venv_tools/bin/python2.7
> /var/hg/venv_tools/bin/vcsreplicator-aggregator --max-polls 1800
> /etc/mercurial/pushdataaggregator-pending.ini
> /var/hg/venv_tools/bin/python2.7
> /var/hg/venv_tools/bin/vcsreplicator-aggregator --max-polls 1800
> /etc/mercurial/pushdataaggregator.ini
> 
> I think checking for exactly 2 processes with "vcsreplicator-aggregator" in
> them would be fine.

Initial add made, but we need to find a better solution. Right now, nagios is configured to run this check on the hg-ssh-master hostgroup, which only includes hgssh4. Need to come up with a clever way to handle failing over to other hosts.


> The check_pushdataaggregator_pending_lag check should also be executed. We
> added this to the Nagios config as part of a recent push. It should be
> configured identically to check_pushdataaggregator_lag (it is a variation of
> that check).

Added, with same caveat as above.


(In reply to Gregory Szorc [:gps] from comment #29)
> On all hgweb nodes...
> 
> A process existence check for:
> 
> /var/hg/venv_replication/bin/python2.7
> /var/hg/venv_replication/bin/vcsreplicator-headsconsumer
> /etc/mercurial/vcsreplicator-pending.ini
> 
> Looking for a single `vcsreplicator-headsconsumer` process should be fine.
> We should have existing checks for `vscreplicator-consumer` processes we can
> curb off of. But there is only 1 process for `vcsreplicator-headsconsumer`,
> not 8.
> 
> We'll also need the check_vcsreplicator_pending_lag configured similarly to
> check_vcsreplicator_lag. 

Both added.
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/91ac12c26323
vcsreplicator: disable branchmap integration 
https://hg.mozilla.org/hgcustom/version-control-tools/rev/e7b7b42b992c
ansible/hg-web: enable vcsreplicator extension
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/7e064ccae112
ansible/hg-web: disable vcsreplicator extension
Two problems with the last attempted deploy.

One was excessive memory consumption. I believe I isolated this to branchmap cache generation on initial repository load, specifically on the Try repo. When the first push to the Try repo came in, CI bulk `hg pull`ing the repo resulted in multiple processes all generating the branchmap cache at the same time. This consumed a lot of CPU and memory and led to swapping.

I mitigated this issue by pregenerating the branchmap cache for all repositories on all hgweb machines. So we hopefully shouldn't encounter this issue again.

The second was that the pushlog was still advertising new changesets before they were fully replicated. I'm unsure why; we have a test for this scenario. I have an upcoming patch for one theory (hasattr() not working properly). My other theory is code caching in persistent processes. Next time I do a deploy, I'll bounce server processes to force code eviction. If the issue still persists, something wonky is going on and I'll likely need to add forensic logging to see where things are failing.
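On the hasattr() theory: under Python 2, hasattr() swallows any exception raised while computing an attribute, not just AttributeError, so a property that fails transiently is misreported as absent. Probing with getattr() and a default avoids that. A distilled sketch with hypothetical names:

```python
class Repo(object):
    """Stand-in for a Mercurial repo; the real attribute is attached
    by the vcsreplicator extension when the heads file is present."""


def replication_data(repo):
    # getattr() with a default only masks a *missing* attribute.
    # Under Python 2, hasattr() additionally swallows any exception
    # raised while computing the attribute (e.g. an IOError inside a
    # property), misreporting a transient failure as "absent".
    return getattr(repo, 'replicated_data', None)
```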
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/56852940ac38
vcsreplicator: enable hgweb extension again 
https://hg.mozilla.org/hgcustom/version-control-tools/rev/c1727538a75d
vcsreplicator: use getattr() for testing for replicated_data attribute
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/d27e00d8832a
ansible/hg-web: disable vcsreplicator extension
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/64633fe1be05
vcsreplicator: use self.repo instead of repo
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/ae9fcd2023d2
vcsreplicator: enable hgweb extension again
So, I nearly tore my hair out the past few days trying to debug things. tl;dr: repo.replicated_data was intermittently available on repositories, causing the replicated state to be only intermittently factored into what was being served. This didn't make any sense and seemingly violated my understanding of how the world should work.

The fix was a one-liner: https://hg.mozilla.org/hgcustom/version-control-tools/rev/64633fe1be05. We were monkeypatching a repo class and a method was referencing a "repo" variable from the outer scope instead of "self." This definitely led to the memory leak. Why it resulted in repo.replicated_data not being reliably available, I'm still not sure.
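The bug class is easy to distill: when monkeypatching a repo class (as Mercurial extensions commonly do from a reposetup() hook), a method that closes over the outer repo variable both pins that object in memory and silently operates on the wrong instance. A minimal reproduction with hypothetical names:

```python
class Repo(object):
    def __init__(self, value):
        self.value = value


def wrap_class(repo):
    """Monkeypatch in the style of a Mercurial reposetup() hook."""
    class wrappedrepo(repo.__class__):
        def data_buggy(self):
            # Closes over the OUTER repo: reads the wrong object and
            # keeps it alive forever (the memory leak).
            return repo.value

        def data_fixed(self):
            # Uses the instance the method was actually called on.
            return self.value
    return wrappedrepo
```

Calling the buggy method on a different instance returns the captured object's data, not the caller's, which matches the "intermittently wrong" symptom described above.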

But, testing of hg.mozilla.org reveals that things are working now. The pushlog and other APIs on https://hg.mozilla.org/ should no longer expose info about a changeset until it has been fully replicated to all servers!

There is still an issue with the "automationrelevance" API. I think it occurs whenever the set of local changesets is not the set of fully replicated changesets. That API is used by decision tasks, and only decision tasks AFAIK, and the decision task retries after failure. I have yet to see a decision task fail as a result of this. So I'm inclined to address it as a follow-up. Will get a bug on file and try to debug it today.

But for all intents and purposes, I think we can close this bug!
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Keywords: leave-open
Resolution: --- → FIXED
Blocks: 1475656
Blocks: 1475666
As of ~16 hours ago, the lingering issue tracked in bug 1475656 is resolved. I'm pretty confident that this feature is now fully working as intended. https://hg.mozilla.org/ should now - for all intents and purposes - yield atomic snapshots of repositories without undesired side-effects.