Closed Bug 1332520 Opened 8 years ago Closed 6 years ago

vcsreplicator lag check - flapping the past couple of days

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sal, Unassigned)

This has been alerting quite a bit this week, but the alerts self-recover right away. We might need to tune the check?
<@nagios-scl3> Thu 17:02:08 PST [5588] nagios2.private.scl3.mozilla.com:hg vcsreplicator lag is WARNING: CLUSTER WARNING: hg vcsreplicator lag: 1 ok, 2 warning, 0 unknown, 1 critical (http://m.mozilla.org/hg+vcsreplicator+lag)
:fubar, did anything change here? It alerted a few times today:
<@nagios-scl3> Thu 17:06:39 PST [5311] nagios2.private.scl3.mozilla.com:hg vcsreplicator lag is WARNING: CLUSTER WARNING: hg vcsreplicator lag: 0 ok, 2 warning, 0 unknown, 2 critical (http://m.mozilla.org/hg+vcsreplicator+lag)
17:09:40 <@nagios-scl3> Thu 17:09:39 PST [5318] nagios2.private.scl3.mozilla.com:hg vcsreplicator lag is OK: CLUSTER OK: hg vcsreplicator lag: 4 ok, 0 warning, 0 unknown, 0 critical (http://m.mozilla.org/hg+vcsreplicator+lag)
Flags: needinfo?(klibby)
I'm surprised at how noisy it's been after the last round of changes; :gps, have we changed anything on the replication side that might be causing it to alert more often? We can change the thresholds, but we have to move in small steps. If replication is actually broken, we need to know quickly to prevent automation from burning, but we shouldn't be harassing the MOC unnecessarily. I'll look at the thresholds and make a tweak or two.
Flags: needinfo?(klibby) → needinfo?(gps)
Recent deployments shouldn't have changed replication any. I did notice the other day that replicating the Try repo is taking a while. The way the check is currently implemented is that it will alert if a message goes N seconds without being acked/applied. Try replication may be hitting this threshold. FWIW, I've wanted to refactor the check so it doesn't alert (or applies a higher threshold) when the replication client is actively processing a message. I've also wanted to improve the check to report which repo(s) are causing the lag. For example, sometimes we do a reset (which is known to trip this alert). It would be nice to have confirmation of things like that. Keeping needinfo for me so I can look at Try performance.
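A minimal sketch of the refactored check described in this comment, not the actual vcsreplicator code. The `PendingMessage` type, its fields, the `check_lag` function, and both threshold defaults are hypothetical. The only behavior taken from the comment is: alert on the age of an unacked message, apply a higher threshold while the client is actively processing it, and name the repo responsible.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PendingMessage:
    """Hypothetical view of the oldest unacked message on one consumer."""
    repo: str               # repository the message belongs to
    age_seconds: float      # time since the message was produced
    being_processed: bool   # True if the client is actively applying it


def check_lag(pending: List[Optional[PendingMessage]],
              idle_threshold: float = 30.0,
              busy_threshold: float = 120.0) -> List[str]:
    """Return one alert line per consumer whose oldest message is too old.

    Both thresholds are placeholders, not the production settings.
    """
    alerts = []
    for index, msg in enumerate(pending):
        if msg is None:
            continue  # nothing outstanding on this consumer
        # A message that is being applied gets the more lenient limit.
        limit = busy_threshold if msg.being_processed else idle_threshold
        if msg.age_seconds > limit:
            state = "processing" if msg.being_processed else "not yet started"
            alerts.append(
                f"consumer {index}: {msg.repo} lagging "
                f"{msg.age_seconds:.0f}s ({state}, limit {limit:.0f}s)"
            )
    return alerts
```

With this shape, a slow Try push that is still being applied would stay quiet until it exceeds the busy limit, while a message nobody has picked up would alert at the lower limit and say which repo it belongs to.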
10:18 AM <•nagios-scl3> Tue 10:18:09 PST [5738] nagios2.private.scl3.mozilla.com:hg vcsreplicator lag is WARNING: CLUSTER WARNING: hg vcsreplicator lag: 0 ok, 3 warning, 0 unknown, 1 critical (http://m.mozilla.org/hg+vcsreplicator+lag)
10:21 AM <•nagios-scl3> Tue 10:21:09 PST [5740] nagios2.private.scl3.mozilla.com:hg vcsreplicator lag is OK: CLUSTER OK: hg vcsreplicator lag: 4 ok, 0 warning, 0 unknown, 0 critical (http://m.mozilla.org/hg+vcsreplicator+lag)
This alerted a few times today:
<@nagios-scl3> Wed 16:16:50 PST [5137] nagios2.private.scl3.mozilla.com:hg vcsreplicator lag is WARNING: CLUSTER WARNING: hg vcsreplicator lag: 0 ok, 4 warning, 0 unknown, 0 critical (http://m.mozilla.org/hg+vcsreplicator+lag)
16:19:50 <@nagios-scl3> Wed 16:19:50 PST [5138] nagios2.private.scl3.mozilla.com:hg vcsreplicator lag is OK: CLUSTER OK: hg vcsreplicator lag: 3 ok, 1 warning, 0 unknown, 0 critical (http://m.mozilla.org/hg+vcsreplicator+lag)
Again:
18:16:58 <@nagios-scl3> Wed 18:16:57 PST [5283] nagios2.private.scl3.mozilla.com:hg vcsreplicator lag is WARNING: CLUSTER WARNING: hg vcsreplicator lag: 0 ok, 4 warning, 0 unknown, 0 critical (http://m.mozilla.org/hg+vcsreplicator+lag)
commit d5d1a24c54
Author: kendall libby <klibby@mozilla.com>
Date: Thu Feb 2 09:46:33 2017 -0500
    Tweak hg vcsreplication check timers (#1332520)
The journal logs for vcsreplicator@4.service (the process dedicated to replicating changes to Try) indicate that Try pushes are taking 60+ seconds to replicate. The thresholds for this check are 30s for warning and 60s for critical. So, if this check runs soon after a Try push is performed, it will alert. I will close old heads on the Try repo to see if that helps with performance. Will file a new bug shortly...
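To make the arithmetic concrete, here is a small sketch of how the 30s/60s thresholds quoted above map a replication time to an alert state. The `classify` function and its constant names are illustrative only, and treating a lag exactly at a threshold as already alerting is an assumption; the real check may compare differently.

```python
WARN_SECONDS = 30      # warning threshold quoted in this comment
CRITICAL_SECONDS = 60  # critical threshold quoted in this comment


def classify(lag_seconds: float) -> str:
    """Map the age of an unacked message to a Nagios-style state.

    Assumption: a lag equal to a threshold already counts as alerting.
    """
    if lag_seconds >= CRITICAL_SECONDS:
        return "CRITICAL"
    if lag_seconds >= WARN_SECONDS:
        return "WARNING"
    return "OK"


# A Try push that takes 60+ seconds to replicate lands in CRITICAL
# if the check happens to run while it is still outstanding.
for lag in (10, 45, 65):
    print(f"{lag:>3}s -> {classify(lag)}")
```

Under these assumptions, any push that takes longer than 30 seconds to replicate can trip at least a warning, which matches the short-lived alerts seen in the channel.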
Flags: needinfo?(gps)
Depends on: 1336123
I mass merged old heads on the Try repo in bug 1336123. Replication for the Try repo is now taking 5-15s instead of 60+s. So, the replication lag alert should fire less often. This means that the alerting threshold can be restored to its previous value.
Flags: needinfo?(klibby)
(In reply to Gregory Szorc [:gps] (disappearing for a month after 2017-02-10) from comment #9)
> So, the replication lag alert should fire less often. This means that the
> alerting threshold can be restored to its previous value.
Which can't happen until at least Monday, because the MOC is migrating to Nagios 4 and the configs are locked down. But I'll look at it next week.

We aren't seeing this any more.

Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(klibby)
Resolution: --- → FIXED