hg.mozilla.org replication out of sync, no Try pushes created
Categories
(Developer Services :: Mercurial: hg.mozilla.org, defect, P1)
Tracking
(Not tracked)
People
(Reporter: aryx, Assigned: sheehan)
Details
Attachments
(2 files)
Wed 09:42:11 UTC [184240] [devsvcslist] hgweb4.dmz.mdc1.mozilla.com:procs - hg vcsreplicator consumer is CRITICAL PROCS CRITICAL: 7 processes with regex args 'vcsreplicator-consumer'
Wed 10:34:45 UTC [184252] [devsvcslist] hgweb3.dmz.mdc1.mozilla.com:hg vcsreplicator lag is CRITICAL CRITICAL - 1/8 partitions out of sync
Try pushes are not created because hg.mozilla.org replication is out of sync. Unknown if other repositories are affected, autoland has a push once minute after the first Nagios message.
Try pushes will be created by Lando once the issue has been resolved.
Looks like I may confirm, I had multiple more attempts at trial runs in try-comm-central that just didn't surface on hg-edge and/or Treeherder, but I got my access yesterday so there may be something wrong on my end.
| Assignee | ||
Comment 2•1 month ago
|
||
I rebooted the replication daemon for the try repos, and I can now see some try pushes going through.
It appears we simply hit the max number of retries after an SSH connection issue, and systemd put the process into a perma-fail state. I'll see if we can increase that max retry limit, as all I had to do was restart the daemon.
Unfortunately, it seems since the recovery, not a single try-comm-central job actually ran.
| Assignee | ||
Comment 5•1 month ago
|
||
vcsreplicator daemons were being left in systemd failed state
after bursts of transient hg pull errors, since some units inherited
the default rate limiter (5 restarts in 10s) and others had
inconsistent tuning. A new retry_on_failure decorator absorbs
transient exceptions with exponential backoff (up to 5 retries,
~62s total), so most hg pull flaps no longer cause a process exit.
Non-retryable errors bypass retry so poison messages exit fast.
All six vcsreplicator systemd units now share a uniform policy
so truly-broken services trip failed state (and OnFailure=
email on hg-ssh) while transient flaps no longer do. Tests set
VCSREPLICATOR_RETRY_MAX_RETRIES=0 to skip the production retry
sleep.
Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/6e0dbf95ef0a
replication: make retry policy more robust r=zeid
Comment 7•1 month ago
|
||
Description
•