Bug 1332520 (Closed): vcsreplicator lag check - flapping the past couple of days
Opened 8 years ago; closed 6 years ago
Categories: Developer Services :: Mercurial: hg.mozilla.org (defect)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: sal; Assignee: Unassigned
Description
This has been alerting quite a bit this week, but the alerts self-recover right away. We might need to tune the check?
@nagios-scl3> Thu 17:02:08 PST [5588] nagios2.private.scl3.mozilla.com:hg vcsreplicator lag is WARNING: CLUSTER WARNING: hg vcsreplicator lag: 1 ok, 2 warning, 0 unknown, 1 critical (http://m.mozilla.org/hg+vcsreplicator+lag)
Reporter
Comment 1•8 years ago
:fubar, did anything change here? It alerted a few times today:
@nagios-scl3> Thu 17:06:39 PST [5311] nagios2.private.scl3.mozilla.com:hg vcsreplicator lag is WARNING: CLUSTER WARNING: hg vcsreplicator lag: 0 ok, 2 warning, 0 unknown, 2 critical (http://m.mozilla.org/hg+vcsreplicator+lag)
17:09:40 <@nagios-scl3> Thu 17:09:39 PST [5318] nagios2.private.scl3.mozilla.com:hg vcsreplicator lag is OK: CLUSTER OK: hg vcsreplicator lag: 4 ok, 0 warning, 0 unknown, 0 critical (http://m.mozilla.org/hg+vcsreplicator+lag)
Flags: needinfo?(klibby)
Comment 2•8 years ago
I'm surprised at how noisy it's been after the last round of changes; :gps, have we changed anything on the replication side that might be causing it to alert more often?
We can change the thresholds, but we have to move in small steps. If replication is actually broken, we need to know quickly to prevent automation from burning, but we shouldn't be harassing the MOC unnecessarily. I'll look at the thresholds and make a tweak or two.
Flags: needinfo?(klibby) → needinfo?(gps)
Comment 3•8 years ago
Recent deployments shouldn't have changed replication any.
I did notice the other day that replicating the Try repo is taking a while. The check is currently implemented to alert if a message goes N seconds without being acked/applied, so Try replication may be hitting this threshold.
FWIW, I've wanted to refactor the check so it doesn't alert (or applies a higher threshold) when the replication client is actively processing a message, as sketched below. I've also wanted to improve the check to report which repo(s) are causing the lag. For example, sometimes we do a reset (which is known to trip this alert), and it would be nice to have confirmation of things like that.
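For what that refactor might look like, here's a minimal Python sketch. The names and threshold values are hypothetical; this is not the actual check implementation, just the per-consumer logic described above:

  import time

  # Hypothetical values, not the deployed thresholds.
  WARNING_SECS = 30
  CRITICAL_SECS = 60
  PROCESSING_GRACE_SECS = 120

  def check_consumer(oldest_unacked_ts, actively_processing, now=None):
      # oldest_unacked_ts: timestamp of the oldest message this consumer
      # has not yet acked/applied; None means it is fully caught up.
      if oldest_unacked_ts is None:
          return 'OK'
      lag = (now or time.time()) - oldest_unacked_ts
      # Apply a higher threshold while a message is actively being
      # processed, so a long-running (but healthy) replication of a
      # big push doesn't page anyone.
      critical = PROCESSING_GRACE_SECS if actively_processing else CRITICAL_SECS
      if lag >= critical:
          return 'CRITICAL'
      if lag >= WARNING_SECS and not actively_processing:
          return 'WARNING'
      return 'OK'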
Keeping needinfo for me so I can look at Try performance.
Comment 4•8 years ago
10:18 AM <•nagios-scl3> Tue 10:18:09 PST [5738] nagios2.private.scl3.mozilla.com:hg vcsreplicator lag is WARNING: CLUSTER WARNING: hg vcsreplicator lag: 0 ok, 3 warning, 0 unknown, 1 critical (http://m.mozilla.org/hg+vcsreplicator+lag)
10:21 AM <•nagios-scl3> Tue 10:21:09 PST [5740] nagios2.private.scl3.mozilla.com:hg vcsreplicator lag is OK: CLUSTER OK: hg vcsreplicator lag: 4 ok, 0 warning, 0 unknown, 0 critical (http://m.mozilla.org/hg+vcsreplicator+lag)
Reporter
Comment 5•8 years ago
This alerted a few times today:
@nagios-scl3> Wed 16:16:50 PST [5137] nagios2.private.scl3.mozilla.com:hg vcsreplicator lag is WARNING: CLUSTER WARNING: hg vcsreplicator lag: 0 ok, 4 warning, 0 unknown, 0 critical (http://m.mozilla.org/hg+vcsreplicator+lag)
16:19:50 <@nagios-scl3> Wed 16:19:50 PST [5138] nagios2.private.scl3.mozilla.com:hg vcsreplicator lag is OK: CLUSTER OK: hg vcsreplicator lag: 3 ok, 1 warning, 0 unknown, 0 critical (http://m.mozilla.org/hg+vcsreplicator+lag)
Comment 6•8 years ago
Again:
18:16:58 <@nagios-scl3> Wed 18:16:57 PST [5283] nagios2.private.scl3.mozilla.com:hg vcsreplicator lag is WARNING: CLUSTER WARNING: hg vcsreplicator lag: 0 ok, 4 warning, 0 unknown, 0 critical (http://m.mozilla.org/hg+vcsreplicator+lag)
Comment 7•8 years ago
commit d5d1a24c54
Author: kendall libby <klibby@mozilla.com>
Date: Thu Feb 2 09:46:33 2017 -0500
Tweak hg vcsreplication check timers (#1332520)
Comment 8•8 years ago
The journal logs for vcsreplicator@4.service (the process dedicated to replicating changes to Try) indicate that Try pushes are taking 60+ seconds to replicate. The thresholds for this check are 30s to warn and 60s to critical. So, if this check runs soon after a Try push is performed, it will alert.
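To make that race concrete, a small sketch of the threshold mapping (the 30s/60s values come from this comment; the function name is illustrative, and the 0/1/2 exit codes are just the standard Nagios convention):

  def nagios_state(lag_secs):
      # Thresholds from this comment: 30s to warn, 60s to critical.
      if lag_secs >= 60:
          return 2, 'CRITICAL'
      if lag_secs >= 30:
          return 1, 'WARNING'
      return 0, 'OK'

  # A Try push taking 60+ seconds to replicate trips CRITICAL if the
  # check happens to run while that message is still being applied:
  print(nagios_state(65))  # (2, 'CRITICAL')
  print(nagios_state(10))  # (0, 'OK')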
I will close old heads on the Try repo to see if that helps with perf. Will file a new bug shortly...
Flags: needinfo?(gps)
Comment 9•8 years ago
I mass merged old heads on the Try repo in bug 1336123. Replication for the Try repo is now taking 5-15s instead of 60+s.
So, the replication lag alert should fire less often. This means that the alerting threshold can be restored to its previous value.
Flags: needinfo?(klibby)
Comment 10•8 years ago
(In reply to Gregory Szorc [:gps] (disappearing for a month after 2017-02-10) from comment #9)
> So, the replication lag alert should fire less often. This means that the
> alerting threshold can be restored to its previous value.
Which can't happen until at least Monday, because the MOC is migrating to Nagios 4 and the configs are locked down. But I'll look at it next week.
Comment 11•6 years ago
We aren't seeing this any more.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(klibby)
Resolution: --- → FIXED