Open Bug 1386528 Opened 7 years ago Updated 5 years ago

fncache file corrupted due to `hg push` being killed via unifyrepo.service restart

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

defect
Not set
blocker

Tracking

(Not tracked)

People

(Reporter: standard8, Unassigned)

References

Details

(Keywords: leave-open)

The last couple of pushes have been seeing this error on autoland. I've closed the tree for it. 

Doesn't seem to affect other trees at the moment.

https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=70cbfc193b9daf681729fcb395e2ee9fcf4d0129
https://hg.mozilla.org/integration/autoland/rev/469a6a99593ace50cead3a5927142582825f024e

[vcs 2017-08-02T06:40:33.044079Z] executing ['hg', 'robustcheckout', '--sharebase', '/home/worker/checkouts/hg-store', '--purge', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', '469a6a99593ace50cead3a5927142582825f024e', 'https://hg.mozilla.org/integration/autoland', '/home/worker/workspace/build/src']
[vcs 2017-08-02T06:40:33.102617Z] ensuring https://hg.mozilla.org/integration/autoland@469a6a99593ace50cead3a5927142582825f024e is available at /home/worker/workspace/build/src
[vcs 2017-08-02T06:40:33.102757Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2017-08-02T06:40:33.664885Z] (sharing from existing pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
[vcs 2017-08-02T06:40:34.004476Z] searching for changes
[vcs 2017-08-02T06:40:34.047686Z] no changes found
[vcs 2017-08-02T06:40:34.234849Z] adding remote bookmark aurora
[vcs 2017-08-02T06:40:34.234899Z] adding remote bookmark beta
[vcs 2017-08-02T06:40:34.234936Z] adding remote bookmark central
[vcs 2017-08-02T06:40:34.234972Z] adding remote bookmark esr10
[vcs 2017-08-02T06:40:34.235004Z] adding remote bookmark esr17
[vcs 2017-08-02T06:40:34.235035Z] adding remote bookmark esr24
[vcs 2017-08-02T06:40:34.235066Z] adding remote bookmark esr31
[vcs 2017-08-02T06:40:34.235101Z] adding remote bookmark esr38
[vcs 2017-08-02T06:40:34.235134Z] adding remote bookmark esr45
[vcs 2017-08-02T06:40:34.235167Z] adding remote bookmark esr52
[vcs 2017-08-02T06:40:34.235199Z] adding remote bookmark fx-team
[vcs 2017-08-02T06:40:34.235232Z] adding remote bookmark inbound
[vcs 2017-08-02T06:40:34.235265Z] adding remote bookmark release
[vcs 2017-08-02T06:40:38.190676Z] 
[vcs 2017-08-02T06:40:39.191493Z] updating [=====>                                            ]  30900/234106 14s
[vcs 2017-08-02T06:40:39.373207Z] updating [========>                                         ]  46600/234106 13sabort: data/js/src/jit-test/lib/nightly-only.js.i@0e88d5e6231b: no match found!
[vcs 2017-08-02T06:40:40.192922Z] 
[vcs 2017-08-02T06:40:41.194089Z] updating [============>                                     ]  63200/234106 11s
[vcs 2017-08-02T06:40:42.213898Z] updating [===============>                                  ]  79000/234106 10s
[vcs 2017-08-02T06:40:43.240169Z] updating [===================>                              ]  94400/234106 09s
[vcs 2017-08-02T06:40:44.241483Z] updating [======================>                           ] 112000/234106 08s
[vcs 2017-08-02T06:40:45.258166Z] updating [=========================>                        ] 126000/234106 07s
[vcs 2017-08-02T06:40:46.262789Z] updating [===========================>                      ] 135100/234106 07s
[vcs 2017-08-02T06:40:47.264239Z] updating [==============================>                   ] 148200/234106 06s
[vcs 2017-08-02T06:40:48.277970Z] updating [===============================>                  ] 152200/234106 06s
[vcs 2017-08-02T06:40:49.783877Z] updating [==================================>               ] 166900/234106 05s
[vcs 2017-08-02T06:40:50.795515Z] updating [=====================================>            ] 181300/234106 04s
[vcs 2017-08-02T06:40:52.282581Z] updating [========================================>         ] 195300/234106 03s
[vcs 2017-08-02T06:40:53.291791Z] updating [===========================================>      ] 207700/234106 03s
[vcs 2017-08-02T06:40:53.749894Z] updating [=============================================>    ] 219986/234106 02s
[vcs 2017-08-02T06:40:53.749958Z]                                                                                 
[vcs 2017-08-02T06:40:54.245151Z] abort: No such file or directory: '/home/worker/workspace/build/src/js/src/jit-test/lib/nightly-only.js'
Discussing in #releng.

inbound & central are now closed.

Could be to do with mozilla-unified.
Summary: Autoland bustage: abort: No such file or directory: '/home/worker/workspace/build/src/js/src/jit-test/lib/nightly-only.js' when cloning → All central/integration trees bustage: abort: No such file or directory: '/home/worker/workspace/build/src/js/src/jit-test/lib/nightly-only.js' when cloning
Updated the clonebundles.manifest for mozilla-unified (as well as index.html and bundles.json in S3 for all regions) to use yesterday's bundles (dc186bad8bd537919a15722a1a79dd46660f95f1):

$ hg clone https://hg.mozilla.org/mozilla-unified
destination directory: mozilla-unified
applying clone bundle from https://hg.cdn.mozilla.net/mozilla-unified/dc186bad8bd537919a15722a1a79dd46660f95f1.zstd-max.hg
adding changesets
adding manifests
adding file changes
added 413306 changesets with 2301183 changes to 378337 files (+11 heads)
finished applying clone bundle
searching for changes
adding changesets
adding manifests
adding file changes
added 845 changesets with 4211 changes to 2268 files
updating to branch default
234073 files updated, 0 files merged, 0 files removed, 0 files unresolved
$ ls -F mozilla-unified/js/src/jit-test/lib/nightly-only.js
mozilla-unified/js/src/jit-test/lib/nightly-only.js

There are no error messages that I can find from the bundle generation, so leaving the NI on :gps to figure out where this went off the rails.
What we know so far is that the stream bundle[1], for some reason, didn't contain the nightly-only.js file. The other bundles weren't affected (I actually only checked the v2 gzip bundle, I'm only assuming the zstd one was good too).

1. https://s3-us-west-2.amazonaws.com/moz-hg-bundles-us-west-2/mozilla-unified/df3915693fa3bec56bacc17a4f2588f99ab635b4.packed1.hg
Thanks :glandium for suggesting this idea and :pmoore for pointing in the right direction for the tools to use:

We're about to use this[1] etherpad to clear the "level-3-checkouts-v1" caches on Prov/workerType of instances which failed for this reason. Hopefully this will make them repush that clone with the backed-out that :fubar pushed in 
https://bugzilla.mozilla.org/show_bug.cgi?id=1386528#c3

[1]: https://public.etherpad-mozilla.org/p/clearing_caches
level-1-checkouts-v1 needs a cleanup too, for try.
Used the method from https://bugzilla.mozilla.org/show_bug.cgi?id=1386528#c6 to purge cache for the following:

Provisioner ID: aws-provisioner-v1
Worker type:  gecko-decision
Cache Name: level-3-checkouts
Sample: https://tools.taskcluster.net/tasks/Y4U8Y-EIRXyrNWugp8U7Ow

The decision task went green after purging cache + rerun.
Following above pattern, please let me know what else you need to purge.
Whiteboard: [stockwell infra]
This is a really wonky failure.

https://hg.mozilla.org/mozilla-central/rev/3232054e0a93 introduced the missing file in question. The changeset itself appears to be well-formed.

The reason this broke stream clone bundles is because a file added in that commit (js/src/jit-test/lib/nightly-only.js) is missing from the mozilla-unified/.hg/store/fncache file.

This fncache file contains a list of all file paths to exist in the repo. It is a glorified cache that prevents Mercurial from having to walk the filesystem to gather the set of known file paths and to test if a file exists without incurring I/O. The file is managed with Mercurial's transaction semantics, which means it should be in lock-step with the store.

But the really wonky thing is that the fncache for autoland, inbound, and central all have nightly-only.js listed! Only mozilla-unified is missing it.

Digging into the system logs, I can find the log for the "unifyrepo" service/invocation that handled the original push to inbound. This process basically does an `hg pull` of a bunch of repos to a staging repo and then pushes the result to mozilla-unified. This staging repo also has a proper fncache by the way. But, just as that process went to `hg push` to mozilla-unified, my hg.mo Ansible deploy that was occurring at the same time triggered the restart of the service! The `hg` process mid-push declared that it was killed and was rolling back the transaction. It even printed the "rollback completed" message indicating the transaction was cleanly undone.

From here, I'm still sorting things out. But it looks like the fncache file got "corrupted" as part of aborting an `hg push` which was triggered by a process signal triggered by restarting the unifyrepo.service unit, which was triggered by the hg.mo deploy I was doing at the same time. So, this is an extremely unlikely race condition (hg.mo deploy racing with a ~5s window around a Firefox push) combined with either a) a bug in Mercurial transactions b) systemd unit config too aggressively force killing a process c) a custom extension on hg.mo preventing the transaction from being recovered from properly (possibly vcsreplicator).

I have some exploring to do.
I recovered from the corrupted fncache file by running the following command:

  $ sudo -u hg-aggregate /var/hg/venv_tools/bin/hg -R /repo/hg/mozilla/mozilla-unified debugrebuildfncache

After running for a few minutes, it told me it added the missing js/src/jit-test/lib/nightly-only.js file.

I have triggered bundle generation so latest bundles make their way out.
Component: Buildduty → Mercurial: hg.mozilla.org
Flags: needinfo?(gps)
Product: Release Engineering → Developer Services
QA Contact: catlee
Summary: All central/integration trees bustage: abort: No such file or directory: '/home/worker/workspace/build/src/js/src/jit-test/lib/nightly-only.js' when cloning → fncache file corrupted due to `hg push` being killed via unifyrepo.service restart
Keywords: leave-open
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/30003d8511de
docs: document how to recover from a corrupted fncache file 
https://hg.mozilla.org/hgcustom/version-control-tools/rev/dcecd28016ae
hgserver: create backup of clonebundles.manifest 
https://hg.mozilla.org/hgcustom/version-control-tools/rev/8c32d62e7f16
docs: add operational docs for clone bundles
Here is a more detailed series of events from the unified journal:

Aug 01 20:41:33.427411 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: pulling /repo/hg/mozilla/integration/mozilla-inbound into /repo/hg/mozilla/experimental/firefox-unifie
Aug 01 20:41:33.427648 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: searching for changes
Aug 01 20:41:33.427857 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: adding changesets
Aug 01 20:41:33.428073 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: adding manifests
Aug 01 20:41:33.428283 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: adding file changes
Aug 01 20:41:33.428501 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: added 3 changesets with 11 changes to 8 files
<time passes; ansible deploy continues>
Aug 01 20:41:38.451026 hgssh4.dmz.scl3.mozilla.com vcsreplicator[8273]: hg-aggregate {moz}/experimental/firefox-unified-stage PRETXNCLOSE_HEARTBEATSENDING
Aug 01 20:41:38.452926 hgssh4.dmz.scl3.mozilla.com vcsreplicator[8273]: hg-aggregate {moz}/experimental/firefox-unified-stage PRETXNCLOSE_HEARTBEATSENT
Aug 01 20:41:39.832867 hgssh4.dmz.scl3.mozilla.com vcsreplicator[8273]: hg-aggregate {moz}/experimental/firefox-unified-stage CHANGEGROUPHOOK_SENDING
Aug 01 20:41:39.836207 hgssh4.dmz.scl3.mozilla.com vcsreplicator[8273]: hg-aggregate {moz}/experimental/firefox-unified-stage CHANGEGROUPHOOK_SENT
<replication to mirrors starts>
Aug 01 20:41:40.453473 hgssh4.dmz.scl3.mozilla.com hgweb[10591]: d2417a36-76f9-11e7-8ffb-8cdcd4b427a8: BEGIN_SSH_SESSION experimental/firefox-unified-stage vcs-sync@mozill
Aug 01 20:41:40.453640 hgssh4.dmz.scl3.mozilla.com hgweb[10591]: d2417a36-76f9-11e7-8ffb-8cdcd4b427a8:d2418ea4-76f9-11e7-8ffb-8cdcd4b427a8 BEGIN_SSH_COMMAND hello
Aug 01 20:41:40.457310 hgssh4.dmz.scl3.mozilla.com hgweb[10591]: d2417a36-76f9-11e7-8ffb-8cdcd4b427a8:d2418ea4-76f9-11e7-8ffb-8cdcd4b427a8 END_SSH_COMMAND 0.004 0.004
Aug 01 20:41:40.457480 hgssh4.dmz.scl3.mozilla.com hgweb[10591]: d2417a36-76f9-11e7-8ffb-8cdcd4b427a8:d242253a-76f9-11e7-8ffb-8cdcd4b427a8 BEGIN_SSH_COMMAND between
Aug 01 20:41:40.457623 hgssh4.dmz.scl3.mozilla.com hgweb[10591]: d2417a36-76f9-11e7-8ffb-8cdcd4b427a8:d242253a-76f9-11e7-8ffb-8cdcd4b427a8 END_SSH_COMMAND 0.000 0.000
Aug 01 20:41:40.458564 hgssh4.dmz.scl3.mozilla.com hgweb[10609]: d242429a-76f9-11e7-a612-8cdcd4b427a8: BEGIN_SSH_SESSION experimental/firefox-unified-stage vcs-sync@mozill
Aug 01 20:41:43.618722 hgssh4.dmz.scl3.mozilla.com pushdataaggregator[18989]: vcsreplicator.aggregator 1 unacked messages in 1 partition: [1]
Aug 01 20:41:43.622090 hgssh4.dmz.scl3.mozilla.com pushdataaggregator[18989]: vcsreplicator.aggregator copying hg-changegroup-2 from partition 1
Aug 01 20:41:43.624969 hgssh4.dmz.scl3.mozilla.com snsnotifier[1059]: vcsreplicator.pushnotifications querying pushlog data for /repo/hg/mozilla/experimental/firefox-unifi
Aug 01 20:41:43.625246 hgssh4.dmz.scl3.mozilla.com pulsenotifier[1051]: vcsreplicator.pushnotifications querying pushlog data for /repo/hg/mozilla/experimental/firefox-uni
Aug 01 20:41:43.925523 hgssh4.dmz.scl3.mozilla.com pulsenotifier[1051]: vcsreplicator.pulsenotifier publishing message to exchange/hgpushes/v1#experimental/firefox-unified
Aug 01 20:41:43.938827 hgssh4.dmz.scl3.mozilla.com pulsenotifier[1051]: vcsreplicator.pulsenotifier connecting to pulse at pulse.mozilla.org:5671 as hgpushes
Aug 01 20:41:43.990403 hgssh4.dmz.scl3.mozilla.com pulsenotifier[1051]: vcsreplicator.pulsenotifier publishing message to exchange/hgpushes/v2#experimental/firefox-unified
Aug 01 20:41:44.115175 hgssh4.dmz.scl3.mozilla.com snsnotifier[1059]: vcsreplicator.snsnotifier sending SNS notification to 
Aug 01 20:41:44.322585 hgssh4.dmz.scl3.mozilla.com snsnotifier[1059]: vcsreplicator.snsnotifier finished processing message 15170874
<firefox-unified-stage replication is complete>
Aug 01 20:42:43.529573 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: YOU ARE PUSHING WITH AN OUT OF DATE MERCURIAL CLIENT!
Aug 01 20:42:43.529829 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: newer versions are faster and have numerous bug fixes
Aug 01 20:42:43.530048 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: upgrade instructions are at the following URL:
Aug 01 20:42:43.530270 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: https://mozilla-version-control-tools.readthedocs.io/en/latest/hgmozilla/installing.html
Aug 01 20:42:43.530494 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: recorded changegroup in replication log in 0.010s
Aug 01 20:42:43.539584 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: aggregating 414106/428607 nodes from 196629 original pushes
Aug 01 20:42:43.539810 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: 3/414106 nodes will be pulled
Aug 01 20:42:43.540022 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: consolidated into 1 pulls from 179250 unique pushes
Aug 01 20:42:43.540236 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: searching for changes
Aug 01 20:42:43.540456 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: adding changesets
Aug 01 20:42:43.540680 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: adding manifests
Aug 01 20:42:43.540911 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: adding file changes
Aug 01 20:42:43.541136 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: added 3 changesets with 11 changes to 8 files
Aug 01 20:42:43.672446 hgssh4.dmz.scl3.mozilla.com vcsreplicator[8273]: hg-aggregate {moz}/mozilla-unified PRETXNCLOSE_HEARTBEATSENDING
Aug 01 20:42:43.673144 hgssh4.dmz.scl3.mozilla.com pushdataaggregator[18989]: vcsreplicator.aggregator 1 unacked messages in 1 partition: [0]
Aug 01 20:42:43.674436 hgssh4.dmz.scl3.mozilla.com vcsreplicator[8273]: hg-aggregate {moz}/mozilla-unified PRETXNCLOSE_HEARTBEATSENT
Aug 01 20:42:43.675736 hgssh4.dmz.scl3.mozilla.com pushdataaggregator[18989]: vcsreplicator.aggregator copying heartbeat-1 from partition 0
Aug 01 20:42:43.677602 hgssh4.dmz.scl3.mozilla.com pulsenotifier[1051]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:43.677771 hgssh4.dmz.scl3.mozilla.com snsnotifier[1059]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:43.707472 hgssh4.dmz.scl3.mozilla.com pushdataaggregator[18989]: vcsreplicator.aggregator 1 unacked messages in 1 partition: [0]
Aug 01 20:42:43.710342 hgssh4.dmz.scl3.mozilla.com pushdataaggregator[18989]: vcsreplicator.aggregator copying heartbeat-1 from partition 0
Aug 01 20:42:43.712356 hgssh4.dmz.scl3.mozilla.com snsnotifier[1059]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:43.712641 hgssh4.dmz.scl3.mozilla.com pulsenotifier[1051]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
<request to restart all systemd units associated with master>
Aug 01 20:42:44.594852 hgssh4.dmz.scl3.mozilla.com ansible-service[21285]: Invoked with name=hg-master.target pattern=None enabled=None state=restarted sleep=None argument
Aug 01 20:42:44.612658 hgssh4.dmz.scl3.mozilla.com systemd[1]: Stopped target Target holding all services that should run on the hg master server.
Aug 01 20:42:44.612702 hgssh4.dmz.scl3.mozilla.com systemd[1]: Stopping Target holding all services that should run on the hg master server.
Aug 01 20:42:44.612719 hgssh4.dmz.scl3.mozilla.com systemd[1]: Stopped Schedules periodic generation of hg bundles.
Aug 01 20:42:44.612748 hgssh4.dmz.scl3.mozilla.com systemd[1]: Stopping Schedules periodic generation of hg bundles.
Aug 01 20:42:44.613028 hgssh4.dmz.scl3.mozilla.com systemd[1]: Stopping Aggregate Mercurial repositories...
Aug 01 20:42:44.627824 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: killed!
Aug 01 20:42:44.627919 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: transaction abort!
Aug 01 20:42:44.613157 hgssh4.dmz.scl3.mozilla.com systemd[1]: Stopping Notify Pulse about Mercurial pushes...
Aug 01 20:42:44.613281 hgssh4.dmz.scl3.mozilla.com systemd[1]: Stopping Aggregate replicated Mercurial pushes to a new Kafka topic...
Aug 01 20:42:44.613414 hgssh4.dmz.scl3.mozilla.com systemd[1]: Stopping Notify SNS about Mercurial events...
Aug 01 20:42:44.626087 hgssh4.dmz.scl3.mozilla.com systemd[1]: Started Schedules periodic generation of hg bundles.
Aug 01 20:42:44.626140 hgssh4.dmz.scl3.mozilla.com systemd[1]: Starting Schedules periodic generation of hg bundles.
Aug 01 20:42:44.717558 hgssh4.dmz.scl3.mozilla.com pulsenotifier[1051]: vcsreplicator.pushnotifications executing loop exiting gracefully
Aug 01 20:42:44.717942 hgssh4.dmz.scl3.mozilla.com pulsenotifier[1051]: vcsreplicator.pushnotifications process exiting code 0
Aug 01 20:42:44.718449 hgssh4.dmz.scl3.mozilla.com snsnotifier[1059]: vcsreplicator.pushnotifications executing loop exiting gracefully
Aug 01 20:42:44.718841 hgssh4.dmz.scl3.mozilla.com snsnotifier[1059]: vcsreplicator.pushnotifications process exiting code 0
Aug 01 20:42:45.073030 hgssh4.dmz.scl3.mozilla.com unifyrepo[22776]: rollback completed
Aug 01 20:42:45.877077 hgssh4.dmz.scl3.mozilla.com systemd[1]: Starting Ensure hg-master.target is in appropriate state...
Aug 01 20:42:45.878396 hgssh4.dmz.scl3.mozilla.com systemd[1]: Started Aggregate Mercurial repositories.
Aug 01 20:42:45.878574 hgssh4.dmz.scl3.mozilla.com systemd[1]: Starting Aggregate Mercurial repositories...
Aug 01 20:42:45.883951 hgssh4.dmz.scl3.mozilla.com hg-master-start-stop[21317]: hg-master.target not active; starting after 30s sleep...
Aug 01 20:42:45.972717 hgssh4.dmz.scl3.mozilla.com java[1054]: [2017-08-01 20:42:45,972] INFO Closing socket connection to /10.22.74.45. (kafka.network.Processor)
Aug 01 20:42:45.985867 hgssh4.dmz.scl3.mozilla.com systemd[1]: Started Aggregate replicated Mercurial pushes to a new Kafka topic.
Aug 01 20:42:45.986005 hgssh4.dmz.scl3.mozilla.com systemd[1]: Starting Aggregate replicated Mercurial pushes to a new Kafka topic...
Aug 01 20:42:45.986044 hgssh4.dmz.scl3.mozilla.com systemd[1]: Reached target Target holding all services that should run on the hg master server.
Aug 01 20:42:45.986084 hgssh4.dmz.scl3.mozilla.com systemd[1]: Starting Target holding all services that should run on the hg master server.
Aug 01 20:42:52.385389 hgssh4.dmz.scl3.mozilla.com pushdataaggregator[21325]: vcsreplicator.aggregator 8 unacked messages in 8 partition: [0, 1, 2, 3, 4, 5, 6, 7]
Aug 01 20:42:52.395715 hgssh4.dmz.scl3.mozilla.com pushdataaggregator[21325]: vcsreplicator.aggregator copying heartbeat-1 from partition 0
Aug 01 20:42:52.398032 hgssh4.dmz.scl3.mozilla.com snsnotifier[21307]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:52.398375 hgssh4.dmz.scl3.mozilla.com pulsenotifier[21302]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:52.399087 hgssh4.dmz.scl3.mozilla.com pushdataaggregator[21325]: vcsreplicator.aggregator copying heartbeat-1 from partition 1
Aug 01 20:42:52.400623 hgssh4.dmz.scl3.mozilla.com pulsenotifier[21302]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:52.400992 hgssh4.dmz.scl3.mozilla.com snsnotifier[21307]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:52.402009 hgssh4.dmz.scl3.mozilla.com pushdataaggregator[21325]: vcsreplicator.aggregator copying heartbeat-1 from partition 2
Aug 01 20:42:52.403738 hgssh4.dmz.scl3.mozilla.com pulsenotifier[21302]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:52.404129 hgssh4.dmz.scl3.mozilla.com snsnotifier[21307]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:52.405126 hgssh4.dmz.scl3.mozilla.com pushdataaggregator[21325]: vcsreplicator.aggregator copying heartbeat-1 from partition 3
Aug 01 20:42:52.406703 hgssh4.dmz.scl3.mozilla.com pulsenotifier[21302]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:52.407086 hgssh4.dmz.scl3.mozilla.com snsnotifier[21307]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:52.407971 hgssh4.dmz.scl3.mozilla.com pushdataaggregator[21325]: vcsreplicator.aggregator copying heartbeat-1 from partition 4
Aug 01 20:42:52.409342 hgssh4.dmz.scl3.mozilla.com pulsenotifier[21302]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:52.409657 hgssh4.dmz.scl3.mozilla.com snsnotifier[21307]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:52.410302 hgssh4.dmz.scl3.mozilla.com pushdataaggregator[21325]: vcsreplicator.aggregator copying heartbeat-1 from partition 5
Aug 01 20:42:52.411568 hgssh4.dmz.scl3.mozilla.com snsnotifier[21307]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:52.411699 hgssh4.dmz.scl3.mozilla.com pulsenotifier[21302]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:52.412491 hgssh4.dmz.scl3.mozilla.com pushdataaggregator[21325]: vcsreplicator.aggregator copying heartbeat-1 from partition 6
Aug 01 20:42:52.414271 hgssh4.dmz.scl3.mozilla.com snsnotifier[21307]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:52.414608 hgssh4.dmz.scl3.mozilla.com pulsenotifier[21302]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:52.415559 hgssh4.dmz.scl3.mozilla.com pushdataaggregator[21325]: vcsreplicator.aggregator copying heartbeat-1 from partition 7
Aug 01 20:42:52.417400 hgssh4.dmz.scl3.mozilla.com snsnotifier[21307]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring
Aug 01 20:42:52.417538 hgssh4.dmz.scl3.mozilla.com pulsenotifier[21302]: vcsreplicator.pushnotifications heartbeat-1 message not relevant; ignoring

I'm still trying to wrap my head around the timeline observed by journald here. The condensed timeline is:

20:42:43.540236 - push to mozilla-unified starts
20:42:43.674436 - pretxn "heartbeat" message acked by all mirrors
20:42:44.612658 - systemd starts shutdown of hg-master.target units
20:42:44.613028 - request to shut down unifyrepo.service/unifyrepo.timer
20:42:44.627824 - hg reports kill by signal
20:42:45.073030 - hg says transaction rollback complete
20:42:45.877077 - systemd message implying unifyrepo.service is stopped
20:42:45.878574 - unifyrepo.timer starts back up

Where exactly Mercurial died, I'm not sure. We know that vcsreplicator successfully delivered its "heartbeat" check to Kafka as part of closing the transaction. (It verifies the replication log is writable before finalizing the transaction.) But the delivery of the "changegroup" replication message was never delivered. Moreover, Mercurial reports a transaction rollback in its output. So the transaction never finalized.

I'm pretty confident saying this is a race condition around closing the Mercurial transaction. I'm just not sure if the problem resides with vcsreplicator or with Mercurial.

I want to emphasize that the window for this failure feels like it should be really small. Like something we should only hit once every few years. In fact, the replication code has been deployed since essentially January 2016 and the repo unification code since April 2016 and we haven't had a problem. Furthermore, since we never kill ssh-initiated hg processes as part of doing an hg.mo deploy, this failure mode should be unique to the unifyrepo service, which runs on hgssh itself.

What an esoteric bug.
:gps, this is marked as leave-open 9 months ago, we haven't seen issues with this for 9 months- is there followup work to do here?
Flags: needinfo?(gps)
Yes, there is follow-up work to mitigate the root cause. And the context in comments 9 and 12 is really useful. So I'd prefer to leave this bug open.
Flags: needinfo?(gps)
Whiteboard: [stockwell infra]
See Also: → 1541202
You need to log in before you can comment on or make changes to this bug.