Closed Bug 1267619 Opened 8 years ago Closed 8 years ago

vcs replication error on partition 6 (users)

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: fubar, Assigned: gps)

Details

replication for the users partition got wedged late last night:

hglib.error.ServerError: server exited with status 255: abort: repository /repo/hg/mozilla/users/gszorc_mozilla.com/mc-treemanifest not found!

repo didn't exist on hgssh3, nor on any web head, so I'm not sure how we got to this point.

fixed it on each web head by running /var/hg/venv_replication/bin/vcsreplicator-consumer to dump the waiting messages and see what was there (only the two broken commits and heartbeats), skipping the bad ones, and then restarting with supervisorctl:

$ /var/hg/venv_replication/bin/vcsreplicator-consumer --partition 6 --dump /etc/mercurial/vcsreplicator.ini | less
$ /var/hg/venv_replication/bin/vcsreplicator-consumer --partition 6 --skip /etc/mercurial/vcsreplicator.ini
$ supervisorctl restart vcsreplicator:6
(fails, but consumes heartbeats until next bad message)
$ /var/hg/venv_replication/bin/vcsreplicator-consumer --partition 6 --skip /etc/mercurial/vcsreplicator.ini
$ supervisorctl restart vcsreplicator:6
(succeeds!)

the two bad messages:

- _created: 1461646912.198106
  heads:
  - 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29
  name: hg-changegroup-1
  nodes:
  - 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29
  path: '{moz}/users/gszorc_mozilla.com/mc-treemanifest'
  source: serve

- _created: 1461647140.221636
  heads:
  - 35e6847637ef893f13828bf8b5b19ab56385011b
  name: hg-changegroup-1
  nodes:
  - 35e6847637ef893f13828bf8b5b19ab56385011b
  path: '{moz}/users/gszorc_mozilla.com/mc-treemanifest'
  source: serve
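
Comparing the `path` field in these messages with the path in the error above suggests that the `{moz}` token stands for `/repo/hg/mozilla`. A minimal sketch of that expansion follows, assuming exactly that mapping; `PATH_PREFIXES` and `resolve_path` are hypothetical names for illustration, not the vcsreplicator API:

```python
# Assumption: "{moz}" expands to /repo/hg/mozilla, inferred from the
# "repository ... not found" error path. Not taken from vcsreplicator itself.
PATH_PREFIXES = {"{moz}": "/repo/hg/mozilla"}


def resolve_path(wire_path, prefixes=PATH_PREFIXES):
    """Expand a leading token such as "{moz}" into a local filesystem path."""
    for token, root in prefixes.items():
        if wire_path == token or wire_path.startswith(token + "/"):
            return root + wire_path[len(token):]
    # No known token: return the path unchanged.
    return wire_path


print(resolve_path("{moz}/users/gszorc_mozilla.com/mc-treemanifest"))
```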


also, there were no other user replication messages queued, so this had no user impact, luckily.

assigning to :gps since he's the replication wizard *and* it was his repo that blew up :-)
and the pulse notifier is wedged, but I can't figure out how to fix that.

2016-04-26T14:01:40.208741+00:00 hgssh3.dmz.scl3.mozilla.com pulsenotifier[21238]: abort: repository /repo/hg/mozilla/users/gszorc_mozilla.com/mc-treemanifest not found!
I was playing around with "treemanifest" repos last night (these are repos that change Mercurial's storage mechanism to per-directory storage, which enables `hg log <dir>` to be insanely fast and makes individual directory clones perform better).

You have to pass a special config option during `hg init` to create repos in this format. So I abused my ssh powers to do so.

A few minutes later, I was experiencing errors pushing to the newly-created repo (treemanifests are a highly experimental feature), so I deleted the repo. This apparently caused the replication system to get wedged somehow. I /think/ it was having trouble applying the initial repo creation because of the unknown "treemanifest" repo capability that must have been in the repo create message. When it saw a message for new data on a repo that didn't exist, it barfed. This is by design: we don't want the replication system ignoring events because if things get in a weird state, ignoring events is no different than losing data.
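
To illustrate the "halt rather than ignore" behavior and the manual skip that unwedges it, here is a rough sketch. It is not the vcsreplicator implementation: `consume`, `skip_one`, and the message values are hypothetical, and a plain list stands in for the partition's queue.

```python
def consume(messages, apply):
    """Apply messages in order and stop at the first failure.

    Returns (offset, error). On success, offset == len(messages) and
    error is None. On failure, offset points at the message that could
    not be applied, and nothing after it has been touched.
    """
    for offset, message in enumerate(messages):
        try:
            apply(message)
        except Exception as exc:
            # Never advance past a failure on our own: silently dropping
            # an event would be indistinguishable from losing data.
            return offset, exc
    return len(messages), None


def skip_one(messages, offset):
    """Operator decision: drop the message at `offset`, resume after it."""
    return messages[offset + 1:]


def apply(message):
    # Stand-in for applying a replication event to a local repo.
    if message == "changegroup-for-deleted-repo":
        raise RuntimeError("repository not found")


queue = ["changegroup-for-deleted-repo", "heartbeat", "heartbeat",
         "changegroup-for-deleted-repo", "heartbeat"]

offset, error = consume(queue, apply)
while error is not None:
    print("stuck at offset", offset, "-", error)
    queue = skip_one(queue, offset)      # a human chose to skip this one
    offset, error = consume(queue, apply)
print("queue drained")
```

The skip is a separate, explicit step, so an event is only discarded after someone has looked at it.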

fubar's approach of identifying "bad" messages and then skipping them was the correct fix and made the replication system happy again.

It's worth noting that creating repos with unknown capabilities shouldn't be possible via the normal mechanism: you need ssh access to pass arguments to `hg init` to cause this failure.

There was also a secondary failure with the notification system. As part of processing notifications, it performs an `hg log` of a repo so it can put metadata in the notification message. When it tried to process my deleted repo, it crashed. Unlike the replication system, the notification queue uses a single partition so we have more consistent ordering of messages. So when my user repo crashed it, all notifications were held up.

The fix for the notification service was simple: automatically skip events belonging to missing repos (https://hg.mozilla.org/hgcustom/version-control-tools/rev/c96ce58178da). Unlike the replication mirrors, the notification service runs on the master and thus has access to the canonical source of truth. If a repository is missing, it has been deleted, so we can skip the notification. And there are valid race conditions under which this can occur (like this bug!). There is a slight chance we may ignore messages if the service isn't properly configured (e.g. the path to local repos is wrong). But you can't prevent human error. That's a risk I'm willing to deal with.
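
The skip-if-missing idea could look roughly like the sketch below. This is not the code from the changeset linked above: the event shape (a dict with a `path` relative to `repo_root`) and the function name are assumptions. The check that `repo_root` itself exists is an addition of this sketch, aimed at the misconfiguration risk mentioned above.

```python
import os


def split_notifiable(events, repo_root):
    """Split events into (kept, skipped) based on whether the repo exists.

    Assumption: each event is a dict whose "path" is relative to repo_root.
    """
    if not os.path.isdir(repo_root):
        # Extra guard (not from the source): if the root is wrong, every
        # repo would look deleted, so refuse to skip anything at all.
        raise RuntimeError("repo root %s does not exist" % repo_root)

    kept, skipped = [], []
    for event in events:
        local = os.path.join(repo_root, event["path"])
        # A Mercurial repository has a .hg directory at its root.
        if os.path.isdir(os.path.join(local, ".hg")):
            kept.append(event)
        else:
            skipped.append(event)
    return kept, skipped
```

Returning the skipped events, rather than dropping them silently, leaves room to log what was ignored.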

Anyway, the mess is all cleaned up now. I learned my lesson to not create special repos when not logged into IRC. Had I been logged into IRC, I would have seen the issue almost immediately and fixed it last night. I also would have seen the alerts for Pulse last night and fixed that as well.

If there's a silver lining here, fubar was able to unwedge the replication system by following the docs, and we found a potentially serious bug in the pulse notifications before any important consumers (like build automation) started relying on it.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED