Closed Bug 1379653 Opened 8 years ago Closed 8 years ago

clean up scriptworker nagios alerts

Categories

(Release Engineering :: General, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: mozilla)

Attachments

(6 files, 1 obsolete file)

We've been getting more nagios alerts like this since working on the windows build migration:
Wed 18:58:11 UTC [7842] [] signing-linux-4.srv.releng.usw2.mozilla.com:Pending Scriptworker Tasks is CRITICAL: PENDING_TASKS CRITICAL - 483/100 pending tasks for scriptworker-prov-v1:signing-linux-v1 (http://m.mozilla.org/Pending+Scriptworker+Tasks)
We should figure out:
a) what appropriate alert levels really are here
b) how many workers we really need
This is largely bug 1377147 aiui; last I saw, we were clearing the queue out within an hour, so alerting after the hour would give more useful info.
Other things that can help:
- using the depsigning workerType for depsigning, though we'll need to ramp up that pool (currently 4)
- more signing scriptworkers.
I think this isn't blocking atm for nightly and release; we should probably get more data and expand the pool if wanted/needed.
Depends on: 1377147
No longer depends on: 1377147
Assignee: nobody → aki
Spun up signing-linux-{5..8}; let's monitor them.
Attachment #8885048 - Flags: review?(arich)
Attachment #8885048 - Flags: review?(arich) → review+
Attachment #8885063 - Flags: review?(rail) → review+
Landed the nagios (1f3f7134113047b2fdbb354ede1eede38add621e) + cloud-tools patches.
Spinning up depsigning-worker{5..12}. I need to add monitoring to them.
We were timing out signing xul.dll. Catlee found it took ~1.5 GB of memory to detect the signature; the t2.micro instances only have 1 GB of RAM. Switched all depsigning-workers and signing-linux-* workers to t2.medium; we appear to be good.
Catlee found a python module that detects signatures with significantly less RAM:
13:56 <•catlee> yeah, I found out that pefile is creating millions of copies of data in memory
13:56 <•catlee> which is why it eats up 1.5G to validate a 60MB file
<snip>
14:42 <@catlee> aki, hwine: different python module can check for the signature in 0.02s and 9MB memory
14:42 <@catlee> let's use that one :)
14:44 <hwine> meh - won't change the lambda cost for me ;)
14:44 <hwine> but which one?
14:44 <@catlee> it's also a somewhat active project, although the pefile support is py2 only for now
14:44 <@catlee> construct
14:44 <@catlee> https://github.com/construct/construct/blob/stable25/construct/formats/executable/pe32.py
14:46 <aki> hm, maybe py3 signtool can shell out for py2?
14:46 <hwine> what about pe64?
14:47 <aki> or we can port it
14:47 <@catlee> yeah, I'm going to submit a PR to fix up the py3 support
14:47 <@catlee> pe64 is just an extension - it looks like this finds the signature on a 64-bit binary just fine
14:48 <@catlee> peplus
We can point py3 signtool at that when it supports py3.
Depsigning nagios may involve some it-puppet and releng-puppet refactoring.
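For illustration only, here is a minimal sketch of why a signature presence check can be cheap: the PE optional header carries a certificate table entry (data directory index 4), so it is enough to read the headers rather than load the whole file. This is not the construct-based code catlee refers to; the function name `has_authenticode_signature` is hypothetical, and the check only reports whether a certificate table is present, not whether the signature is valid.

```python
import struct

# Index of the certificate table (IMAGE_DIRECTORY_ENTRY_SECURITY) in the
# PE data directory array.
CERT_TABLE_INDEX = 4


def has_authenticode_signature(fileobj):
    """Return True if the PE certificate table entry is non-empty.

    Hypothetical helper: reads only the headers, so memory use stays small
    regardless of file size. It does NOT validate the signature.
    """
    fileobj.seek(0)
    dos_header = fileobj.read(64)
    if len(dos_header) < 64 or dos_header[:2] != b"MZ":
        return False
    # e_lfanew: file offset of the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", dos_header, 0x3C)
    fileobj.seek(pe_offset)
    if fileobj.read(4) != b"PE\0\0":
        return False
    coff = fileobj.read(20)
    if len(coff) < 20:
        return False
    # SizeOfOptionalHeader sits at offset 16 of the COFF header.
    (opt_size,) = struct.unpack_from("<H", coff, 16)
    optional = fileobj.read(opt_size)
    if len(optional) < 2:
        return False
    (magic,) = struct.unpack_from("<H", optional, 0)
    if magic == 0x10B:      # PE32
        dirs_offset = 96
    elif magic == 0x20B:    # PE32+ (64-bit)
        dirs_offset = 112
    else:
        return False
    entry_offset = dirs_offset + 8 * CERT_TABLE_INDEX
    if len(optional) < entry_offset + 8:
        return False
    # NumberOfRvaAndSizes is the 4-byte field just before the directories.
    (num_dirs,) = struct.unpack_from("<I", optional, dirs_offset - 4)
    if num_dirs <= CERT_TABLE_INDEX:
        return False
    cert_offset, cert_size = struct.unpack_from("<II", optional, entry_offset)
    return cert_offset != 0 and cert_size != 0
```

Called as `has_authenticode_signature(open(path, "rb"))`, this reads a few hundred bytes per file. The only difference between 32-bit and 64-bit binaries is where the data directories start, which matches the "pe64 is just an extension" remark above.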
Amy: This patch adds depsigning scriptworker monitoring. I want to add balrog, beetmover, and pushapk monitoring as well, but I need to refactor the releng-puppet files first. I wasn't quite sure how to do the clustering, which I'm not sure is working atm (no alerts at all in the past 24hrs). The depsigning and signing queues are different queues, so I don't think they can be clustered together. We'll need to handle the balrog, beetmover, and pushapk queues as well. I wonder if these checks belong with the taskcluster queue monitoring that's happening for other queues instead? That's really all this is. If we need to refactor the cluster stuff, I'm happy to wait to land to avoid bitrotting those patches.
Attachment #8885544 - Flags: review?(arich)
Comment on attachment 8885544 [it-puppet] depsigning-nagios.diff
Review of attachment 8885544:
Ah, I missed the servicegroup on my first go around, that's why the alerts are missing. I've modified the puppet code to generate the config now. If you pull, you'll see the differences. You'll need to make similar modifications to your code. Also, in the check, you want to set your contact_group to nobody so you don't get alerts for the individual hosts and the cluster check, too.
Attachment #8885544 - Flags: review?(arich) → review-
Attachment #8885544 - Attachment is obsolete: true
Attachment #8885803 - Flags: review?(arich)
Comment on attachment 8885803 [it-puppet] depsigning nagios 2
Review of attachment 8885803:
I think this looks good. At some point we should probably rename the cluster check and maybe service group to something shorter so that the display in the GUI and on IRC is more readable and clearer.
Attachment #8885803 - Flags: review?(arich) → review+
Summary: Resolve signing-linux-v1 alerts → clean up scriptworker nagios alerts
Attached patch one-nagios.diff
Hey Simon,
The main intent here is to allow beetmover, balrog, and pushapk scriptworkers to also use these nagios checks.
- rename the checks to be generic scriptworker checks, rather than signing specific
- move the templates and config into scriptworker::nagios because it has access to scriptworker variables (@basedir works for templates; $username is passed in)
- replace the hardcoded /builds/scriptworker/ and cltsign with @basedir and $username
- remove the signing scriptworker pending tasks references, since I believe this is driven by it-puppet now.
I did have to change one nrpe::plugin reference to a standard File, since nrpe::plugin pulls the template from the nrpe templates.
How does this look? I tested against both beetmover scriptworker (got cltbld in the /etc/nagios/nrpe.d files) and signing scriptworker (got cltsign in the /etc/nagios/nrpe.d files)
Attachment #8885960 - Flags: review?(sfraser)
The followup patch, once it-puppet switches over to the new check names.
Attachment #8885963 - Flags: review?(sfraser)
Amy: This is a large patch. I have it in 3 different commits in git atm; I can split up the review if you prefer.
This patch:
- renames check_signing_file_age* to check_scriptworker_file_age*, because it's not signing specific
- adds balrog and beetmover scriptworkers
- adds balrog, beetmover, and pushapk to various checks. depsigning-scriptworker is intentionally skipped in the gpg checks, because it doesn't have the gpg functionality enabled.
- renames the cluster checks
- sets the cluster checks at 50 warning/100 critical, except for pushapk, which is set to 1/2. We only expect to see a handful of pushapk tasks per day at most, so 50/100 would be too high.
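As a rough sketch of the threshold scheme described in this patch (50 warning/100 critical by default, 1/2 for pushapk), the logic could look like the following. The names `check_pending`, `THRESHOLDS`, and `DEFAULT_THRESHOLDS` are hypothetical and not taken from the it-puppet patch. The use of `>=` for comparisons is an assumption, and the message format simply imitates the alert quoted at the top of this bug.

```python
# Standard nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}

# (warning, critical) pending-task thresholds. Only the numbers come from
# the comment above; the dictionary layout is an illustrative assumption.
DEFAULT_THRESHOLDS = (50, 100)
THRESHOLDS = {"pushapk": (1, 2)}


def check_pending(worker_kind, queue_name, pending):
    """Map a pending-task count to a nagios status and message."""
    warn, crit = THRESHOLDS.get(worker_kind, DEFAULT_THRESHOLDS)
    if pending >= crit:
        status, limit = CRITICAL, crit
    elif pending >= warn:
        status, limit = WARNING, warn
    else:
        # Assumption: report the warning threshold when everything is fine.
        status, limit = OK, warn
    message = (
        f"PENDING_TASKS {LABELS[status]} - {pending}/{limit} "
        f"pending tasks for {queue_name}"
    )
    return status, message
```

With these numbers, a single pending pushapk task already warns, while the signing, balrog, and beetmover queues stay quiet until 50 tasks are pending.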
Attachment #8886012 - Flags: review?(arich)
Comment on attachment 8885960 one-nagios.diff
Review of attachment 8885960:
Looks good to me. One thing I've noticed with the alerts, though, is that it's being checked against all instances in a class, which means that I think the signing alerts are duplicates - signing-linux-* will all alert at the same time. The redundancy is useful, but the spamminess is not. I'm unsure if there's a fallback option nagios can use.
Attachment #8885960 - Flags: review?(sfraser) → review+
Comment on attachment 8886012 [it-puppet] balrog, beetmover, pushapk checks
Review of attachment 8886012:
Remove all of the files for the nagios module, that's not being used anymore. We've cut over to the nagios4 servers. Other than that, I think all these changes look good.
Attachment #8886012 - Flags: review?(arich) → review+
We've been seeing lots of 'Scriptworker log age' alerts in #buildduty. From what I could see, there are 12 depsigning-worker instances and the thresholds are the following:
- 2700 seconds -> 45 minutes for WARNING
- 3600 seconds -> 60 minutes for CRITICAL
The alerts recover at one point, but in many cases the log file is not updated too often and thus we get lots of alerts.
Could we adjust these thresholds please?
https://hg.mozilla.org/build/puppet/rev/12864f34b9c25b6b8e047900f43d00e1ccf30bf5 bug 1379653 - allow for beetmover, balrog, pushapk nagios monitoring. r=sfraser
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #16)
> We've been seeing lots of 'Scriptworker log age' alerts in #buildduty. From
> what I could see, there are 12 depsigning-worker instances and the
> thresholds are the following:
> - 2700 seconds -> 45 minutes for WARNING
> - 3600 seconds -> 60 minutes for CRITICAL
>
> The alerts recover at one point, but in many cases the log file is not
> updated too often and thus we get lots of alerts.
> Could we adjust these thresholds please?
This is due to scriptworker 4.1.2 only logging when it claims a task. During idle times nagios alerts. Scriptworker 4.1.3 (coming today) logs whenever it polls for a task, every n seconds. The alerts should go away unless the machine is stuck. Sorry for the noise!
Landed the it-puppet patch to enable balrog, beetmover, pushapk monitoring as well: b9aa1571b68b2ec4bc2fd6271723b25c65dc837a
I still need to:
- make sure that works ok
- fix anything that doesn't work ok
- update the mana links with info
- get r? and land https://bugzilla.mozilla.org/attachment.cgi?id=8885963&action=edit for cleanup
- resolve this bug!
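To make the log age check concrete, here is a minimal sketch of a file age check using the thresholds quoted above (2700 seconds warning, 3600 seconds critical). It is an assumption about how such a check could work, not the actual check_scriptworker_file_age plugin; the function name `check_file_age` and the message text are hypothetical.

```python
import os
import time

# Standard nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_file_age(path, warn_seconds=2700, crit_seconds=3600, now=None):
    """Return (status, message) based on how long ago `path` was modified.

    `now` can be passed in for testing; it defaults to the current time.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError as exc:
        return UNKNOWN, f"FILE_AGE UNKNOWN - cannot stat {path}: {exc}"
    if age >= crit_seconds:
        return CRITICAL, f"FILE_AGE CRITICAL - {path} is {age:.0f}s old"
    if age >= warn_seconds:
        return WARNING, f"FILE_AGE WARNING - {path} is {age:.0f}s old"
    return OK, f"FILE_AGE OK - {path} is {age:.0f}s old"
```

A check like this only means "the worker is stuck" if the worker writes to its log regularly. With a worker that logs only when it claims a task, an idle hour looks identical to a hang, which is why the fix is in the logging behaviour rather than in the thresholds.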
Comment on attachment 8885963 nagios-followup.diff
irc r+ from sfraser
Attachment #8885963 - Flags: review?(sfraser) → review+
We look good. I have https://mana.mozilla.org/wiki/display/NAGIOS/Scriptworker+Log+Age and https://mana.mozilla.org/wiki/display/NAGIOS/Pending+Scriptworker+Tasks ; we can add more later. Once I merge the above build-puppet patch, I'm going to resolve this bug.
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #14)
> One thing I've noticed with the alerts, though, is that it's being checked
> against all instances in a class, which means that I think the signing
> alerts are duplicates - signing-linux-* will all alert at the same time. The
> redundancy is useful, but the spamminess is not. I'm unsure if there's a
> fallback option nagios can use.
Amy fixed this by clustering the queue checks. The other checks are host-specific. Those were spammy because scriptworker 4.1.2 only logged when starting a new task, so during idle times the log would go stale. Scriptworker 4.1.3 spams the log with a claimWork attempt every ~30s, so if we hit an alert about the worker.log going stale, it's probably hung.
I believe I've fixed the spamminess problem.
Merged. Resolving.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Blocks: 1387191