Closed Bug 1379653 Opened 8 years ago Closed 8 years ago

clean up scriptworker nagios alerts

Categories

Product: Release Engineering

Component: General

Type: enhancement

Priority: Not set

Severity: normal

Tracking: Not tracked

Status: RESOLVED FIXED

People

(Reporter: catlee, Assigned: mozilla)

Attachments

(6 files, 1 obsolete file)

[it-puppet] signing-linux-{5..8} | patch, 3.93 KB | Aki Sasaki, 8 years ago | arich: review+
automate depsigning worker creation in cloud tools | text/x-github-pull-request, 60 bytes | Aki Sasaki, 8 years ago | rail: review+
[it-puppet] depsigning-nagios.diff (obsolete) | patch, 23.57 KB | Aki Sasaki, 8 years ago | arich: review-
[it-puppet] depsigning nagios 2 | patch, 25.69 KB | Aki Sasaki, 8 years ago | arich: review+
one-nagios.diff | patch, 9.53 KB | Aki Sasaki, 8 years ago | sfraser: review+
nagios-followup.diff | patch, 1.45 KB | Aki Sasaki, 8 years ago | mozilla: review+
[it-puppet] balrog, beetmover, pushapk checks | patch, 41.58 KB | Aki Sasaki, 8 years ago | arich: review+

Chris AtLee [:catlee]

Reporter

Description

•

8 years ago

We've been getting more nagios alerts like this since working on the Windows build migration:

Wed 18:58:11 UTC [7842] [] signing-linux-4.srv.releng.usw2.mozilla.com:Pending Scriptworker Tasks is CRITICAL: PENDING_TASKS CRITICAL - 483/100 pending tasks for scriptworker-prov-v1:signing-linux-v1 (http://m.mozilla.org/Pending+Scriptworker+Tasks)

We should figure out:
a) what appropriate alert levels really are here
b) how many workers we really need

Aki Sasaki (not active)

Assignee

Comment 1

•

8 years ago

This is largely bug 1377147 aiui; last I saw, we were clearing the queue out within an hour, so alerting after the hour would give more useful info.

Other things that can help:
- using the depsigning workerType for depsigning, though we'll need to ramp up that pool (currently 4)
- more signing scriptworkers

I think this isn't blocking atm for nightly and release; we should probably get more data and expand the pool if wanted/needed.

Aki Sasaki (not active)

Assignee

Updated

•

8 years ago

Depends on: 1377147

Amy Rich [:arr] [:arich]

Updated

•

8 years ago

No longer depends on: 1377147

Aki Sasaki (not active)

Assignee

Updated

•

8 years ago

Assignee: nobody → aki

Aki Sasaki (not active)

Assignee

Comment 2

•

8 years ago

Attached patch [it-puppet] signing-linux-{5..8} — Details — Splinter Review

Spun up signing-linux-{5..8}; let's monitor them.

Attachment #8885048 - Flags: review?(arich)

Aki Sasaki (not active)

Assignee

Comment 3

•

8 years ago

Attached file automate depsigning worker creation in cloud tools — Details

Attachment #8885063 - Flags: review?(rail)

Amy Rich [:arr] [:arich]

Updated

•

8 years ago

Attachment #8885048 - Flags: review?(arich) → review+

Rail Aliiev [:rail]

Updated

•

8 years ago

Attachment #8885063 - Flags: review?(rail) → review+

Aki Sasaki (not active)

Assignee

Comment 4

•

8 years ago

Landed the nagios (1f3f7134113047b2fdbb354ede1eede38add621e) + cloud-tools patches.

Aki Sasaki (not active)

Assignee

Comment 5

•

8 years ago

Spinning up depsigning-worker{5..12}. I need to add monitoring to them.

Aki Sasaki (not active)

Assignee

Comment 6

•

8 years ago

We were timing out signing xul.dll. Catlee found it took ~1.5 GB of memory to check for the signature; the t2.micro instances only have 1 GB of RAM. Switched all depsigning-workers and signing-linux-* workers to t2.medium; we appear to be good.

Catlee found a python module that detects signatures with significantly less RAM:

13:56 <•catlee> yeah, I found out that pefile is creating millions of copies of data in memory
13:56 <•catlee> which is why it eats up 1.5G to validate a 60MB file
<snip>
14:42 <@catlee> aki, hwine: different python module can check for the signature in 0.02s and 9MB memory
14:42 <@catlee> let's use that one :)
14:44 <hwine> meh - won't change the lambda cost for me ;)
14:44 <hwine> but which one?
14:44 <@catlee> it's also a somewhat active project, although the pefile support is py2 only for now
14:44 <@catlee> construct
14:44 <@catlee> https://github.com/construct/construct/blob/stable25/construct/formats/executable/pe32.py
14:46 <aki> hm, maybe py3 signtool can shell out for py2?
14:46 <hwine> what about pe64?
14:47 <aki> or we can port it
14:47 <@catlee> yeah, I'm going to submit a PR to fix up the py3 support
14:47 <@catlee> pe64 is just an extension - it looks like this finds the signature on a 64-bit binary just fine
14:48 <@catlee> peplus

We can point py3 signtool at that when it supports py3.

Depsigning nagios may involve some it-puppet and releng-puppet refactoring.
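For illustration, here is a minimal standard-library sketch of why a signature presence check can be cheap: the PE optional header carries a data directory whose entry 4 (the certificate table) records the file offset and size of any embedded Authenticode data, so only the headers need to be read. This is not the construct-based code catlee refers to, and not what signtool does; the function names and the 4096-byte header read are assumptions made for this example. It only detects that a certificate table is present; it does not validate the signature.

```python
import struct

# Index of the certificate table (IMAGE_DIRECTORY_ENTRY_SECURITY) in the
# optional header's data directory array.
SECURITY_DIRECTORY_INDEX = 4


def has_certificate_table(data: bytes) -> bool:
    """Return True if the PE headers in `data` reference a non-empty certificate table."""
    try:
        if data[:2] != b"MZ":
            return False
        # e_lfanew at 0x3C holds the file offset of the "PE\0\0" signature.
        (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
        if data[pe_offset:pe_offset + 4] != b"PE\0\0":
            return False
        # Skip the 4-byte signature and the 20-byte COFF file header.
        optional = pe_offset + 4 + 20
        (magic,) = struct.unpack_from("<H", data, optional)
        if magic == 0x10B:      # PE32
            directories = optional + 96
        elif magic == 0x20B:    # PE32+ (64-bit)
            directories = optional + 112
        else:
            return False
        # NumberOfRvaAndSizes sits immediately before the directory array.
        (count,) = struct.unpack_from("<I", data, directories - 4)
        if count <= SECURITY_DIRECTORY_INDEX:
            return False
        offset, size = struct.unpack_from(
            "<II", data, directories + 8 * SECURITY_DIRECTORY_INDEX
        )
    except struct.error:
        # Truncated or malformed headers: treat as "no signature found".
        return False
    return offset != 0 and size != 0


def file_has_certificate_table(path: str, header_bytes: int = 4096) -> bool:
    """Hypothetical wrapper: read only the first `header_bytes` of the file."""
    with open(path, "rb") as handle:
        return has_certificate_table(handle.read(header_bytes))
```

Because only a few kilobytes are read, memory use does not depend on the size of the binary. A file whose headers extend past the assumed 4096 bytes would be reported as unsigned by this sketch.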

Aki Sasaki (not active)

Assignee

Comment 7

•

8 years ago

Attached patch [it-puppet] depsigning-nagios.diff (obsolete) — Details — Splinter Review

Amy: This patch adds depsigning scriptworker monitoring. I want to add balrog, beetmover, and pushapk monitoring as well, but I need to refactor the releng-puppet files first.

I wasn't quite sure how to do the clustering, which I'm not sure is working atm (no alerts at all in the past 24hrs). The depsigning and signing queues are different queues, so I don't think they can be clustered together. We'll need to handle the balrog, beetmover, and pushapk queues as well.

I wonder if these checks belong with the taskcluster queue monitoring that's happening for other queues instead? That's really all this is.

If we need to refactor the cluster stuff, I'm happy to wait to land to avoid bitrotting those patches.

Attachment #8885544 - Flags: review?(arich)

Amy Rich [:arr] [:arich]

Comment 8

•

8 years ago

Comment on attachment 8885544: [it-puppet] depsigning-nagios.diff

Ah, I missed the servicegroup on my first go-around; that's why the alerts are missing. I've modified the puppet code to generate the config now. If you pull, you'll see the differences. You'll need to make similar modifications to your code.

Also, in the check, you want to set your contact_group to nobody so you don't get alerts for both the individual hosts and the cluster check.

Attachment #8885544 - Flags: review?(arich) → review-

Aki Sasaki (not active)

Assignee

Comment 9

•

8 years ago

Attached patch [it-puppet] depsigning nagios 2 — Details — Splinter Review

Attachment #8885544 - Attachment is obsolete: true

Attachment #8885803 - Flags: review?(arich)

Amy Rich [:arr] [:arich]

Comment 10

•

8 years ago

Comment on attachment 8885803: [it-puppet] depsigning nagios 2

I think this looks good. At some point we should probably rename the cluster check and maybe service group to something shorter so that the display in the GUI and on IRC is more readable and clearer.

Attachment #8885803 - Flags: review?(arich) → review+

Aki Sasaki (not active)

Assignee

Updated

•

8 years ago

Summary: Resolve signing-linux-v1 alerts → clean up scriptworker nagios alerts

Aki Sasaki (not active)

Assignee

Comment 11

•

8 years ago

Attached patch one-nagios.diff — Details — Splinter Review

Hey Simon,

The main intent here is to allow beetmover, balrog, and pushapk scriptworkers to also use these nagios checks.

- rename the checks to be generic scriptworker checks, rather than signing specific
- move the templates and config into scriptworker::nagios because it has access to scriptworker variables (@basedir works for templates; $username is passed in)
- replace the hardcoded /builds/scriptworker/ and cltsign with @basedir and $username
- remove the signing scriptworker pending tasks references, since I believe this is driven by it-puppet now.

I did have to change one nrpe::plugin reference to a standard File, since nrpe::plugin pulls the template from the nrpe templates.

How does this look? I tested against both beetmover scriptworker (got cltbld in the /etc/nagios/nrpe.d files) and signing scriptworker (got cltsign in the /etc/nagios/nrpe.d files).

Attachment #8885960 - Flags: review?(sfraser)

Aki Sasaki (not active)

Assignee

Comment 12

•

8 years ago

Attached patch nagios-followup.diff — Details — Splinter Review

The followup patch, once it-puppet switches over to the new check names.

Attachment #8885963 - Flags: review?(sfraser)

Aki Sasaki (not active)

Assignee

Comment 13

•

8 years ago

Attached patch [it-puppet] balrog, beetmover, pushapk checks — Details — Splinter Review

Amy: This is a large patch. I have it in 3 different commits in git atm; I can split up the review if you prefer.

This patch:
- renames check_signing_file_age* to check_scriptworker_file_age*, because it's not signing specific
- adds balrog and beetmover scriptworkers
- adds balrog, beetmover, and pushapk to various checks. depsigning-scriptworker is intentionally skipped in the gpg checks, because it doesn't have the gpg functionality enabled.
- renames the cluster checks
- sets the cluster checks at 50 warning/100 critical, except for pushapk, which is set to 1/2. We only expect to see a handful of pushapk tasks per day at most, so 50/100 would be too high.
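The thresholds described above can be sketched as a small lookup, assuming the usual Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL). The names `THRESHOLDS`, `DEFAULT_THRESHOLDS`, and `pending_state` are hypothetical, the `>=` comparison is an assumption, and the message format only imitates the alert quoted in the description; the real check lives in it-puppet and is not shown in this bug.

```python
# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}

# (warning, critical) pending-task thresholds from the comment above:
# 50/100 for every queue except pushapk, which sees only a handful of tasks per day.
DEFAULT_THRESHOLDS = (50, 100)
THRESHOLDS = {"pushapk": (1, 2)}


def pending_state(queue, pending):
    """Map a pending-task count for `queue` to (exit_code, message)."""
    warning, critical = THRESHOLDS.get(queue, DEFAULT_THRESHOLDS)
    if pending >= critical:
        code, limit = CRITICAL, critical
    elif pending >= warning:
        code, limit = WARNING, warning
    else:
        code, limit = OK, warning
    message = "PENDING_TASKS {} - {}/{} pending tasks for {}".format(
        STATE_NAMES[code], pending, limit, queue
    )
    return code, message


if __name__ == "__main__":
    for queue, pending in [("signing-linux-v1", 483), ("pushapk", 1), ("beetmover", 10)]:
        print(pending_state(queue, pending)[1])
```

Keeping the pushapk override in one table makes it easy to see why a shared 50/100 limit would never fire for a queue that rarely holds more than a task or two.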

Attachment #8886012 - Flags: review?(arich)

Simon Fraser [:sfraser] ⌚️GMT

Comment 14

•

8 years ago

Comment on attachment 8885960: one-nagios.diff

Looks good to me.

One thing I've noticed with the alerts, though, is that it's being checked against all instances in a class, which means that I think the signing alerts are duplicates - signing-linux-* will all alert at the same time. The redundancy is useful, but the spamminess is not. I'm unsure if there's a fallback option nagios can use.

Attachment #8885960 - Flags: review?(sfraser) → review+

Amy Rich [:arr] [:arich]

Comment 15

•

8 years ago

Comment on attachment 8886012: [it-puppet] balrog, beetmover, pushapk checks

Remove all of the files for the nagios module; that's not being used anymore. We've cut over to the nagios4 servers. Other than that, I think all these changes look good.

Attachment #8886012 - Flags: review?(arich) → review+

Alin Selagea [:aselagea]

Comment 16

•

8 years ago

We've been seeing lots of 'Scriptworker log age' alerts in #buildduty. From what I could see, there are 12 depsigning-worker instances and the thresholds are the following:
- 2700 seconds -> 45 minutes for WARNING
- 3600 seconds -> 60 minutes for CRITICAL

The alerts recover at one point, but in many cases the log file is not updated too often and thus we get lots of alerts.
Could we adjust these thresholds please?
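To make the thresholds above concrete, here is a rough sketch of a file-age check using the same 2700/3600 second limits. It is not the actual check_scriptworker_file_age plugin; the function names, the `>=` comparison, and the UNKNOWN handling for a missing file are assumptions for illustration.

```python
import os
import time

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def age_state(age_seconds, warning=2700, critical=3600):
    """Classify a file age in seconds against the warning/critical limits."""
    if age_seconds >= critical:
        return CRITICAL
    if age_seconds >= warning:
        return WARNING
    return OK


def file_age_state(path, warning=2700, critical=3600, now=None):
    """Return (exit_code, message) based on how long ago `path` was modified."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError as error:
        # A missing or unreadable log is reported as UNKNOWN rather than OK.
        return UNKNOWN, "FILE_AGE UNKNOWN - {}".format(error)
    state = age_state(age, warning, critical)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[state]
    return state, "FILE_AGE {} - {} is {:.0f} seconds old".format(label, path, age)
```

With a check like this, a worker that writes to its log only when it claims a task will cross the 45-minute warning limit whenever the queue is quiet for that long, even though nothing is wrong with the worker.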

Aki Sasaki (not active)

Assignee

Comment 17

•

8 years ago

https://hg.mozilla.org/build/puppet/rev/12864f34b9c25b6b8e047900f43d00e1ccf30bf5 bug 1379653 - allow for beetmover, balrog, pushapk nagios monitoring. r=sfraser

Aki Sasaki (not active)

Assignee

Comment 18

•

8 years ago

(In reply to Alin Selagea [:aselagea][:buildduty] from comment #16)
> We've been seeing lots of 'Scriptworker log age' alerts in #buildduty. From
> what I could see, there are 12 depsigning-worker instances and the
> thresholds are the following:
> - 2700 seconds -> 45 minutes for WARNING
> - 3600 seconds -> 60 minutes for CRITICAL
>
> The alerts recover at one point, but in many cases the log file is not
> updated too often and thus we get lots of alerts.
> Could we adjust these thresholds please?

This is due to scriptworker 4.1.2 only logging when it claims a task. During idle times nagios alerts. Scriptworker 4.1.3 (coming today) logs whenever it polls for a task, every n seconds. The alerts should go away unless the machine is stuck.

Sorry for the noise!

Aki Sasaki (not active)

Assignee

Comment 19

•

8 years ago

https://hg.mozilla.org/build/puppet/rev/6f6cacabb672c07ebe8e82b3c48b1cca3b160a88 bug 1379653 - bump to scriptworker 4.1.3. r=callek

Aki Sasaki (not active)

Assignee

Comment 20

•

8 years ago

Landed the it-puppet patch to enable balrog, beetmover, pushapk monitoring as well: b9aa1571b68b2ec4bc2fd6271723b25c65dc837a

I still need to:
- make sure that works ok
- fix anything that doesn't work ok
- update the mana links with info
- get r? and land https://bugzilla.mozilla.org/attachment.cgi?id=8885963&action=edit for cleanup
- resolve this bug!

Aki Sasaki (not active)

Assignee

Comment 21

•

8 years ago

Comment on attachment 8885963: nagios-followup.diff

irc r+ from sfraser

Attachment #8885963 - Flags: review?(sfraser) → review+

Aki Sasaki (not active)

Assignee

Comment 22

•

8 years ago

https://hg.mozilla.org/build/puppet/rev/a815d9b002cb3bbdf4da727c064ec594a9ebf2ff bug 1379653 - scriptworker nagios cleanup. r=sfraser

Aki Sasaki (not active)

Assignee

Comment 23

•

8 years ago

We look good. I have https://mana.mozilla.org/wiki/display/NAGIOS/Scriptworker+Log+Age and https://mana.mozilla.org/wiki/display/NAGIOS/Pending+Scriptworker+Tasks ; we can add more later. Once I merge the above build-puppet patch, I'm going to resolve this bug.

Aki Sasaki (not active)

Assignee

Comment 24

•

8 years ago

(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #14)
> One thing I've noticed with the alerts, though, is that it's being checked
> against all instances in a class, which means that I think the signing
> alerts are duplicates - signing-linux-* will all alert at the same time. The
> redundancy is useful, but the spamminess is not. I'm unsure if there's a
> fallback option nagios can use.

Amy fixed this by clustering the queue checks.

The other checks are host-specific. Those were spammy because scriptworker 4.1.2 only logged when starting a new task, so during idle times the log would go stale. Scriptworker 4.1.3 spams the log with a claimWork attempt every ~30s, so if we hit an alert about the worker.log going stale, it's probably hung.

I believe I've fixed the spamminess problem.
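How the it-puppet cluster check aggregates results is not shown in this bug. As a loose illustration of the idea only, the sketch below collapses per-host results for the same queue into a single state, so that one notification is sent per queue instead of one per host. The function name, the input shape, and the severity ordering are all assumptions.

```python
# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed ordering for picking the "worst" state: OK < WARNING < UNKNOWN < CRITICAL.
SEVERITY = {OK: 0, WARNING: 1, UNKNOWN: 2, CRITICAL: 3}


def collapse_by_queue(results):
    """Reduce (host, queue, state) tuples to one worst-case state per queue."""
    worst = {}
    for _host, queue, state in results:
        if queue not in worst or SEVERITY[state] > SEVERITY[worst[queue]]:
            worst[queue] = state
    return worst


if __name__ == "__main__":
    # Every signing host sees the same queue, so all four report CRITICAL together.
    per_host = [
        ("signing-linux-1", "signing-linux-v1", CRITICAL),
        ("signing-linux-2", "signing-linux-v1", CRITICAL),
        ("signing-linux-3", "signing-linux-v1", CRITICAL),
        ("signing-linux-4", "signing-linux-v1", CRITICAL),
        ("depsigning-worker1", "depsigning", OK),
    ]
    print(collapse_by_queue(per_host))
```

The per-host checks still run, which keeps the redundancy sfraser mentions, while only the aggregated result would need to notify anyone.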

Aki Sasaki (not active)

Assignee

Comment 25

•

8 years ago

Merged. Resolving.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

Justin Wood (:Callek)

Updated

•

7 years ago

Blocks: 1387191
