Closed
Bug 1379653
Opened 8 years ago
Closed 8 years ago
clean up scriptworker nagios alerts
Categories
(Release Engineering :: General, enhancement)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: catlee, Assigned: mozilla)
References
Details
Attachments
(6 files, 1 obsolete file)
3.93 KB, patch | arich: review+
60 bytes, text/x-github-pull-request | rail: review+
25.69 KB, patch | arich: review+
9.53 KB, patch | sfraser: review+
1.45 KB, patch | mozilla: review+
41.58 KB, patch | arich: review+
We've been getting more nagios alerts like this since we started working on the Windows build migration:
Wed 18:58:11 UTC [7842] [] signing-linux-4.srv.releng.usw2.mozilla.com:Pending Scriptworker Tasks is CRITICAL: PENDING_TASKS CRITICAL - 483/100 pending tasks for scriptworker-prov-v1:signing-linux-v1 (http://m.mozilla.org/Pending+Scriptworker+Tasks)
We should figure out:
a) what appropriate alert levels really are here
b) how many workers we really need
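To answer (a) with data, the pending counts can be pulled out of the alert lines themselves. A rough sketch in Python, assuming other alerts follow the same format as the one pasted above; the regex and the function name are illustrative, not part of the nagios check:

```python
import re

# Pattern modelled on the single alert line quoted in this bug; other states
# (WARNING, OK) are assumed to use the same layout.
ALERT_RE = re.compile(
    r"PENDING_TASKS (?P<state>\w+) - (?P<pending>\d+)/(?P<limit>\d+) "
    r"pending tasks for (?P<provisioner>[\w-]+):(?P<worker_type>[\w-]+)"
)


def parse_pending_alert(line):
    """Return the fields of a pending-tasks alert, or None if it doesn't match."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pending"] = int(fields["pending"])
    fields["limit"] = int(fields["limit"])
    return fields
```

Feeding a day of alert lines through this gives a list of (worker type, pending count) pairs to compare against the current thresholds.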
Comment 1 • 8 years ago (Assignee)
This is largely bug 1377147 aiui; last I saw, we were clearing the queue out within an hour, so alerting after the hour would give more useful info.
Other things that can help:
- using the depsigning workerType for depsigning, though we'll need to ramp up that pool (currently 4)
- more signing scriptworkers. I think this isn't blocking atm for nightly and release; we should probably get more data and expand the pool if wanted/needed.
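A back-of-the-envelope way to size the pool against the "clear the queue within an hour" target is a ceiling division. A minimal sketch in Python; the per-task duration is a hypothetical figure rather than a measurement from this bug, and it assumes each worker runs one task at a time with no new tasks arriving:

```python
import math


def workers_needed(pending_tasks, avg_task_seconds, drain_seconds=3600):
    """Smallest pool size that clears the backlog within drain_seconds."""
    if pending_tasks <= 0:
        return 0
    total_work_seconds = pending_tasks * avg_task_seconds
    return math.ceil(total_work_seconds / drain_seconds)


# 483 pending tasks comes from the alert in the description;
# 60 seconds per task is an assumption for illustration.
print(workers_needed(483, 60))
```

Real task durations from the queue would replace the 60 second guess before drawing any conclusions about pool size.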
Updated • 8 years ago (Assignee)
Assignee: nobody → aki
Comment 2 • 8 years ago (Assignee)
Spun up signing-linux-{5..8}; let's monitor them.
Attachment #8885048 - Flags: review?(arich)
Comment 3 • 8 years ago (Assignee)
Attachment #8885063 - Flags: review?(rail)
Updated • 8 years ago
Attachment #8885048 - Flags: review?(arich) → review+
Updated • 8 years ago
Attachment #8885063 - Flags: review?(rail) → review+
Comment 4 • 8 years ago (Assignee)
Landed the nagios (1f3f7134113047b2fdbb354ede1eede38add621e) + cloud-tools patches.
Comment 5 • 8 years ago (Assignee)
Spinning up depsigning-worker{5..12}. I need to add monitoring to them.
Comment 6 • 8 years ago (Assignee)
We were timing out signing xul.dll. Catlee found it took ~1.5 GB of memory to detect the signature; the t2.micro instances only have 1 GB of RAM. Switched all depsigning-workers and signing-linux-* workers to t2.medium; we appear to be good.
Catlee found a Python module that detects signatures with significantly less RAM:
13:56 <•catlee> yeah, I found out that pefile is creating millions of copies of data in memory
13:56 <•catlee> which is why it eats up 1.5G to validate a 60MB file
<snip>
14:42 <@catlee> aki, hwine: different python module can check for the signature in 0.02s and 9MB memory
14:42 <@catlee> let's use that one :)
14:44 <hwine> meh - won't change the lambda cost for me ;)
14:44 <hwine> but which one?
14:44 <@catlee> it's also a somewhat active project, although the pefile support is py2 only for now
14:44 <@catlee> construct
14:44 <@catlee> https://github.com/construct/construct/blob/stable25/construct/formats/executable/pe32.py
14:46 <aki> hm, maybe py3 signtool can shell out for py2?
14:46 <hwine> what about pe64?
14:47 <aki> or we can port it
14:47 ⇐ mixedpuppy quit (mixedpuppy@moz-c6ssrl.sub-70-199-154.myvzw.com) Client exited
14:47 <@catlee> yeah, I'm going to submit a PR to fix up the py3 support
14:47 <@catlee> pe64 is just an extension - it looks like this finds the signature on a 64-bit binary just fine
14:48 <@catlee> peplus
We can point py3 signtool at that when it supports py3.
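For a sense of why a low-memory check is possible at all: whether a PE file carries a certificate table can be read from a few header fields without loading the whole file. The sketch below is not the construct-based approach catlee found; it is a stdlib-only illustration using `struct`, the function name is made up, and it only reports that a certificate table is present, not that the signature is valid.

```python
import struct

SECURITY_DIR_INDEX = 4  # IMAGE_DIRECTORY_ENTRY_SECURITY, the certificate table


def has_certificate_table(fileobj):
    """Return True if the PE headers point at a non-empty certificate table."""
    try:
        fileobj.seek(0)
        if fileobj.read(2) != b"MZ":
            return False
        fileobj.seek(0x3C)  # e_lfanew: file offset of the PE signature
        (pe_offset,) = struct.unpack("<I", fileobj.read(4))
        fileobj.seek(pe_offset)
        if fileobj.read(4) != b"PE\0\0":
            return False
        opt_start = pe_offset + 4 + 20  # PE signature + 20-byte COFF header
        fileobj.seek(opt_start)
        (magic,) = struct.unpack("<H", fileobj.read(2))
        if magic == 0x10B:  # PE32
            dirs_offset = 96
        elif magic == 0x20B:  # PE32+ (64-bit)
            dirs_offset = 112
        else:
            return False
        # NumberOfRvaAndSizes sits just before the data directories.
        fileobj.seek(opt_start + dirs_offset - 4)
        (num_dirs,) = struct.unpack("<I", fileobj.read(4))
        if num_dirs <= SECURITY_DIR_INDEX:
            return False
        fileobj.seek(opt_start + dirs_offset + 8 * SECURITY_DIR_INDEX)
        offset, size = struct.unpack("<II", fileobj.read(8))
    except struct.error:  # truncated file: a read came back short
        return False
    return offset != 0 and size != 0
```

It takes any seekable binary file object, e.g. one opened with `open(path, "rb")`, and reads only a few dozen bytes regardless of file size.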
Depsigning nagios may involve some it-puppet and releng-puppet refactoring.
Comment 7 • 8 years ago (Assignee)
Amy:
This patch adds depsigning scriptworker monitoring.
I want to add balrog, beetmover, and pushapk monitoring as well, but I need to refactor the releng-puppet files first.
I wasn't quite sure how to do the clustering, which I'm not sure is working atm (no alerts at all in the past 24hrs). The depsigning and signing queues are different queues, so I don't think they can be clustered together. We'll need to handle the balrog, beetmover, and pushapk queues as well. I wonder if these checks belong with the taskcluster queue monitoring that's happening for other queues instead? That's really all this is.
If we need to refactor the cluster stuff, I'm happy to wait to land to avoid bitrotting those patches.
Attachment #8885544 - Flags: review?(arich)
Comment 8 • 8 years ago
Comment on attachment 8885544 [details] [diff] [review]
[it-puppet] depsigning-nagios.diff
Review of attachment 8885544 [details] [diff] [review]:
-----------------------------------------------------------------
Ah, I missed the servicegroup on my first go around, that's why the alerts are missing. I've modified the puppet code to generate the config now. If you pull, you'll see the differences. You'll need to make similar modifications to your code. Also, in the check, you want to set your contact_group to nobody so you don't get alerts for the individual hosts and the cluster check, too.
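The idea behind alerting on the cluster rather than on each host can be sketched as collapsing the per-host results into one state. This is an illustration in Python of the aggregation only, not the nagios cluster plugin itself; the thresholds and the function name are hypothetical:

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard nagios plugin exit codes


def cluster_state(host_states, warn, crit):
    """Collapse per-host check results into a single cluster-level state.

    host_states: iterable of nagios states, one per host in the service group.
    warn / crit: number of non-OK hosts that triggers WARNING / CRITICAL.
    """
    non_ok = sum(1 for state in host_states if state != OK)
    if non_ok >= crit:
        return CRITICAL
    if non_ok >= warn:
        return WARNING
    return OK
```

With the host-level checks notifying nobody, only this single combined result reaches the channel, however many hosts report the same full queue.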
Attachment #8885544 - Flags: review?(arich) → review-
Comment 9 • 8 years ago (Assignee)
Attachment #8885544 - Attachment is obsolete: true
Attachment #8885803 - Flags: review?(arich)
Comment 10 • 8 years ago
Comment on attachment 8885803 [details] [diff] [review]
[it-puppet] depsigning nagios 2
Review of attachment 8885803 [details] [diff] [review]:
-----------------------------------------------------------------
I think this looks good. At some point we should probably rename the cluster check and maybe service group to something shorter so that the display in the GUI and on IRC is more readable and clearer.
Attachment #8885803 - Flags: review?(arich) → review+
Updated • 8 years ago (Assignee)
Summary: Resolve signing-linux-v1 alerts → clean up scriptworker nagios alerts
Comment 11 • 8 years ago (Assignee)
Hey Simon,
The main intent here is to allow beetmover, balrog, and pushapk scriptworkers to also use these nagios checks.
- rename the checks to be generic scriptworker checks, rather than signing specific
- move the templates and config into scriptworker::nagios because it has access to scriptworker variables (@basedir works for templates; $username is passed in)
- replace the hardcoded /builds/scriptworker/ and cltsign with @basedir and $username
- remove the signing scriptworker pending tasks references, since I believe this is driven by it-puppet now.
I did have to change one nrpe::plugin reference to a standard File, since nrpe::plugin pulls the template from the nrpe templates.
How does this look? I tested against both a beetmover scriptworker (got cltbld in the /etc/nagios/nrpe.d files) and a signing scriptworker (got cltsign in the /etc/nagios/nrpe.d files).
Attachment #8885960 - Flags: review?(sfraser)
Comment 12 • 8 years ago (Assignee)
The followup patch, once it-puppet switches over to the new check names.
Attachment #8885963 - Flags: review?(sfraser)
Comment 13 • 8 years ago (Assignee)
Amy:
This is a large patch. I have it in 3 different commits in git atm; I can split up the review if you prefer.
This patch:
- renames check_signing_file_age* to check_scriptworker_file_age*, because it's not signing specific
- adds balrog and beetmover scriptworkers
- adds balrog, beetmover, and pushapk to various checks. depsigning-scriptworker is intentionally skipped in the gpg checks, because it doesn't have the gpg functionality enabled.
- renames the cluster checks
- sets the cluster checks at 50 warning/100 critical, except for pushapk, which is set to 1/2. We only expect to see a handful of pushapk tasks per day at most, so 50/100 would be too high.
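The per-queue thresholds in the last item amount to a small lookup table with a default. A sketch in Python, assuming the queue keys shown here (the real check names live in it-puppet) and assuming the comparison is "greater than or equal", which the patch description does not spell out:

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard nagios plugin exit codes

DEFAULT_THRESHOLDS = (50, 100)    # (warning, critical) pending tasks
THRESHOLDS = {"pushapk": (1, 2)}  # only a handful of pushapk tasks expected per day


def pending_state(queue, pending):
    """Classify a pending-task count for a queue against its thresholds."""
    warn, crit = THRESHOLDS.get(queue, DEFAULT_THRESHOLDS)
    if pending >= crit:
        return CRITICAL
    if pending >= warn:
        return WARNING
    return OK
```

Keeping pushapk as an explicit override makes it clear that every other queue shares the 50/100 default.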
Attachment #8886012 - Flags: review?(arich)
Comment 14 • 8 years ago
Comment on attachment 8885960 [details] [diff] [review]
one-nagios.diff
Review of attachment 8885960 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good to me.
One thing I've noticed with the alerts, though, is that the check runs against all instances in a class, which means that I think the signing alerts are duplicates - signing-linux-* will all alert at the same time. The redundancy is useful, but the spamminess is not. I'm unsure if there's a fallback option nagios can use.
Attachment #8885960 - Flags: review?(sfraser) → review+
Comment 15 • 8 years ago
Comment on attachment 8886012 [details] [diff] [review]
[it-puppet] balrog, beetmover, pushapk checks
Review of attachment 8886012 [details] [diff] [review]:
-----------------------------------------------------------------
Remove all of the files for the nagios module; it's not being used anymore. We've cut over to the nagios4 servers. Other than that, I think all these changes look good.
Attachment #8886012 - Flags: review?(arich) → review+
Comment 16 • 8 years ago
We've been seeing lots of 'Scriptworker log age' alerts in #buildduty. From what I could see, there are 12 depsigning-worker instances and the thresholds are the following:
- 2700 seconds -> 45 minutes for WARNING
- 3600 seconds -> 60 minutes for CRITICAL
The alerts recover at some point, but in many cases the log file is not updated very often, and thus we get lots of alerts.
Could we adjust these thresholds please?
Comment 17 • 8 years ago (Assignee)
https://hg.mozilla.org/build/puppet/rev/12864f34b9c25b6b8e047900f43d00e1ccf30bf5
bug 1379653 - allow for beetmover, balrog, pushapk nagios monitoring. r=sfraser
Comment 18 • 8 years ago (Assignee)
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #16)
> We've been seeing lots of 'Scriptworker log age' alerts in #buildduty. From
> what I could see, there are 12 depsigning-worker instances and the
> thresholds are the following:
> - 2700 seconds -> 45 minutes for WARNING
> - 3600 seconds -> 60 minutes for CRITICAL
>
> The alerts recover at one point, but in many cases the log file is not
> updated too often and thus we get lots of alerts.
> Could we adjust these thresholds please?
This is due to scriptworker 4.1.2 only logging when it claims a task. During idle times nagios alerts.
Scriptworker 4.1.3 (coming today) logs whenever it polls for a task, every n seconds. The alerts should go away unless the machine is stuck.
Sorry for the noise!
Comment 19 • 8 years ago (Assignee)
https://hg.mozilla.org/build/puppet/rev/6f6cacabb672c07ebe8e82b3c48b1cca3b160a88
bug 1379653 - bump to scriptworker 4.1.3. r=callek
Comment 20 • 8 years ago (Assignee)
Landed the it-puppet patch to enable balrog, beetmover, pushapk monitoring as well: b9aa1571b68b2ec4bc2fd6271723b25c65dc837a
I still need to:
- make sure that works ok
- fix anything that doesn't work ok
- update the mana links with info
- get r? and land https://bugzilla.mozilla.org/attachment.cgi?id=8885963&action=edit for cleanup
- resolve this bug!
Comment 21 • 8 years ago (Assignee)
Comment on attachment 8885963 [details] [diff] [review]
nagios-followup.diff
irc r+ from sfraser
Attachment #8885963 - Flags: review?(sfraser) → review+
Comment 22 • 8 years ago (Assignee)
https://hg.mozilla.org/build/puppet/rev/a815d9b002cb3bbdf4da727c064ec594a9ebf2ff
bug 1379653 - scriptworker nagios cleanup. r=sfraser
Comment 23 • 8 years ago (Assignee)
We look good.
I have https://mana.mozilla.org/wiki/display/NAGIOS/Scriptworker+Log+Age and https://mana.mozilla.org/wiki/display/NAGIOS/Pending+Scriptworker+Tasks ; we can add more later.
Once I merge the above build-puppet patch, I'm going to resolve this bug.
Comment 24 • 8 years ago (Assignee)
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #14)
> One thing I've noticed with the alerts, though, is that it's being checked
> against all instances in a class, which means that I think the signing
> alerts are duplicates - signing-linux-* will all alert at the same time. The
> redundancy is useful, but the spamminess is not. I'm unsure if there's a
> fallback option nagios can use.
Amy fixed this by clustering the queue checks.
The other checks are host-specific. Those were spammy because scriptworker 4.1.2 only logged when starting a new task, so during idle times the log would go stale. Scriptworker 4.1.3 spams the log with a claimWork attempt every ~30s, so if we hit an alert about the worker.log going stale, it's probably hung. I believe I've fixed the spamminess problem.
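The host-specific log age check boils down to comparing the log's modification time against the two thresholds from comment 16 (2700s warning, 3600s critical). A minimal sketch in Python, not the actual nagios plugin; the function name and the `now` parameter are illustrative:

```python
import os
import time

OK, WARNING, CRITICAL = 0, 1, 2  # standard nagios plugin exit codes


def log_age_state(path, warn=2700, crit=3600, now=None):
    """Return (state, age_seconds) for the file at path.

    warn / crit default to the 45 and 60 minute thresholds quoted in this bug.
    now can be passed in to make the check testable; it defaults to the clock.
    """
    now = time.time() if now is None else now
    age = now - os.path.getmtime(path)
    if age >= crit:
        return CRITICAL, age
    if age >= warn:
        return WARNING, age
    return OK, age
```

With a log line every ~30s from scriptworker 4.1.3, a healthy worker stays far below the warning threshold, so reaching it suggests the process has stopped polling.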
Comment 25 • 8 years ago (Assignee)
Merged. Resolving.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED