Closed Bug 2034049 Opened 1 month ago Closed 1 month ago

hg.mozilla.org replication out of sync, no Try pushes created

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: aryx, Assigned: sheehan)

Details

Attachments

(2 files)

replication: make retry policy more robust (Bug 2034049) r?jcristau,shtrom,zeid 1 month ago Connor Sheehan [:sheehan] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 2034049 - update test after changeset 6e0dbf95ef0a "replication: make retry policy more robust" r=sheehan 1 month ago Julien Cristau [:jcristau] 48 bytes, text/x-phabricator-request		Details \| Review

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Description

•

1 month ago

Wed 09:42:11 UTC [184240] [devsvcslist] hgweb4.dmz.mdc1.mozilla.com:procs - hg vcsreplicator consumer is CRITICAL PROCS CRITICAL: 7 processes with regex args 'vcsreplicator-consumer'
Wed 10:34:45 UTC [184252] [devsvcslist] hgweb3.dmz.mdc1.mozilla.com:hg vcsreplicator lag is CRITICAL CRITICAL - 1/8 partitions out of sync

Try pushes are not created because hg.mozilla.org replication is out of sync. Unknown if other repositories are affected, autoland has a push once minute after the first Nagios message.

Try pushes will be created by Lando once the issue has been resolved.

tracking-firefox150: ? → ---

tracking-firefox152: ? → ---

Flags: needinfo?(zeid)

Flags: needinfo?(sheehan)

Flags: needinfo?(dkl)

Max Emig

Comment 1

•

1 month ago

Looks like I may confirm, I had multiple more attempts at trial runs in try-comm-central that just didn't surface on hg-edge and/or Treeherder, but I got my access yesterday so there may be something wrong on my end.

Connor Sheehan [:sheehan]

Assignee

Comment 2

•

1 month ago

I rebooted the replication daemon for the try repos, and I can now see some try pushes going through.

It appears we simply hit the max number of retries after an SSH connection issue, and systemd put the process into a perma-fail state. I'll see if we can increase that max retry limit, as all I had to do was restart the daemon.

Assignee: nobody → sheehan

Flags: needinfo?(zeid)

Flags: needinfo?(sheehan)

Flags: needinfo?(dkl)

Max Emig

Comment 3

•

1 month ago

I just got two submissions notification e-mails.👍

Max Emig

Comment 4

•

1 month ago

Unfortunately, it seems since the recovery, not a single try-comm-central job actually ran.

Connor Sheehan [:sheehan]

Assignee

Comment 5

•

1 month ago

Attached file replication: make retry policy more robust (Bug 2034049) r?jcristau,shtrom,zeid — Details

vcsreplicator daemons were being left in systemd failed state
after bursts of transient hg pull errors, since some units inherited
the default rate limiter (5 restarts in 10s) and others had
inconsistent tuning. A new retry_on_failure decorator absorbs
transient exceptions with exponential backoff (up to 5 retries,
~62s total), so most hg pull flaps no longer cause a process exit.
Non-retryable errors bypass retry so poison messages exit fast.
All six vcsreplicator systemd units now share a uniform policy
so truly-broken services trip failed state (and OnFailure=
email on hg-ssh) while transient flaps no longer do. Tests set
VCSREPLICATOR_RETRY_MAX_RETRIES=0 to skip the production retry
sleep.

Pulsebot

Comment 6

•

1 month ago

Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/6e0dbf95ef0a
replication: make retry policy more robust r=zeid

Status: NEW → RESOLVED

Closed: 1 month ago

Resolution: --- → FIXED

Julien Cristau [:jcristau]

Comment 7

•

1 month ago

Attached file Bug 2034049 - update test after changeset 6e0dbf95ef0a "replication: make retry policy more robust" r=sheehan — Details

Pulsebot

Comment 8

•

1 month ago

Pushed by jcristau@mozilla.com: https://hg.mozilla.org/hgcustom/version-control-tools/rev/a6872b7b949a update test after changeset 6e0dbf95ef0a "replication: make retry policy more robust" r=sheehan

You need to log in before you can comment on or make changes to this bug.

Bugzilla

hg.mozilla.org replication out of sync, no Try pushes created

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect, P1)

Tracking

(Not tracked)

People

(Reporter: aryx, Assigned: sheehan)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(2 files)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Attachment

General

Description

File Name

Content Type