Closed Bug 2034049 Opened 1 month ago Closed 1 month ago

hg.mozilla.org replication out of sync, no Try pushes created

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Assigned: sheehan)

Details

Attachments

(2 files)

Wed 09:42:11 UTC [184240] [devsvcslist] hgweb4.dmz.mdc1.mozilla.com:procs - hg vcsreplicator consumer is CRITICAL PROCS CRITICAL: 7 processes with regex args 'vcsreplicator-consumer'
Wed 10:34:45 UTC [184252] [devsvcslist] hgweb3.dmz.mdc1.mozilla.com:hg vcsreplicator lag is CRITICAL CRITICAL - 1/8 partitions out of sync

Try pushes are not created because hg.mozilla.org replication is out of sync. Unknown if other repositories are affected, autoland has a push once minute after the first Nagios message.

Try pushes will be created by Lando once the issue has been resolved.

Flags: needinfo?(zeid)
Flags: needinfo?(sheehan)
Flags: needinfo?(dkl)

Looks like I may confirm, I had multiple more attempts at trial runs in try-comm-central that just didn't surface on hg-edge and/or Treeherder, but I got my access yesterday so there may be something wrong on my end.

I rebooted the replication daemon for the try repos, and I can now see some try pushes going through.

It appears we simply hit the max number of retries after an SSH connection issue, and systemd put the process into a perma-fail state. I'll see if we can increase that max retry limit, as all I had to do was restart the daemon.

Assignee: nobody → sheehan
Flags: needinfo?(zeid)
Flags: needinfo?(sheehan)
Flags: needinfo?(dkl)

I just got two submissions notification e-mails.👍

Unfortunately, it seems since the recovery, not a single try-comm-central job actually ran.

vcsreplicator daemons were being left in systemd failed state
after bursts of transient hg pull errors, since some units inherited
the default rate limiter (5 restarts in 10s) and others had
inconsistent tuning. A new retry_on_failure decorator absorbs
transient exceptions with exponential backoff (up to 5 retries,
~62s total), so most hg pull flaps no longer cause a process exit.
Non-retryable errors bypass retry so poison messages exit fast.
All six vcsreplicator systemd units now share a uniform policy
so truly-broken services trip failed state (and OnFailure=
email on hg-ssh) while transient flaps no longer do. Tests set
VCSREPLICATOR_RETRY_MAX_RETRIES=0 to skip the production retry
sleep.

Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/6e0dbf95ef0a
replication: make retry policy more robust r=zeid

Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED
Pushed by jcristau@mozilla.com: https://hg.mozilla.org/hgcustom/version-control-tools/rev/a6872b7b949a update test after changeset 6e0dbf95ef0a "replication: make retry policy more robust" r=sheehan
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: