Closed Bug 1377147 Opened 8 years ago Closed 7 years ago

update nagios warning/critical thresholds for check_pending_scriptworker_tasks

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sfraser, Assigned: arich)

Details

Could we adjust the thresholds for the nagios checks against the signing scriptworkers, so that check_pending_scriptworker_tasks warns at 50 and goes critical at 100? If it's possible to adjust it so that it only raises an alert once the condition has been met for 1 hour (or for a certain number of checks), that would be great. Thank you!
Assignee: nobody → jlaz
Changed thresholds in git commit 9b6580f04723daee01394ed833e1bda667b12bb2:

@@ -969,7 +969,7 @@ define service{
     use generic-service
     service_description Pending Scriptworker Tasks
-    check_command check_pending_scriptworker_tasks!45!50
+    check_command check_pending_scriptworker_tasks!50!100
     contact_groups platformops
     notification_period 24x7
     notification_options w,u,c,r,s

I'll need a bit of time to look into the 'alert if condition lasts for X' situation.
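For readers unfamiliar with the `!50!100` arguments, here is a minimal sketch of how a warning/critical pair typically maps a pending-task count to a Nagios state. The function name `classify_pending` and the inclusive (`>=`) comparison are assumptions for illustration; the actual check_pending_scriptworker_tasks plugin is not shown in this bug and may compare differently. Only the exit codes (0, 1, 2, 3) are the standard Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_pending(pending, warn=50, crit=100):
    """Map a pending-task count to a Nagios state.

    Hypothetical helper: assumes inclusive thresholds (>=), which the
    real check may or may not use.
    """
    if pending is None or pending < 0:
        # No usable measurement: report UNKNOWN rather than guess.
        return UNKNOWN
    if pending >= crit:
        return CRITICAL
    if pending >= warn:
        return WARNING
    return OK


if __name__ == "__main__":
    for count in (10, 50, 99, 100):
        print(count, classify_pending(count))
```

Under the old `!45!50` thresholds the same sketch would be called as `classify_pending(count, warn=45, crit=50)`.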
Alerting only when a condition lasts for more than X time can only be done in nagios if the logic is in the check itself, or with hacks where you recheck at interval Y and Z checks have to fail before it alerts, so the alert comes in at Y*Z minutes. This is part of the reason we're pushing prometheus out.
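The "recheck at interval Y, alert after Z failures" idea can be sketched roughly as follows. This is an illustration only, not how nagios is implemented: `first_alert_index` and `approx_minutes_until_alert` are hypothetical names, and the sketch assumes the failures must be consecutive, with any passing check resetting the count.

```python
def first_alert_index(results, failures_required):
    """Return the index of the check at which an alert would fire, or None.

    results: iterable of booleans, True meaning that check failed.
    Assumes failures must be consecutive; a passing check resets the streak.
    """
    streak = 0
    for i, failed in enumerate(results):
        streak = streak + 1 if failed else 0
        if streak >= failures_required:
            return i
    return None


def approx_minutes_until_alert(recheck_interval_min, failures_required):
    """The Y*Z approximation from the comment above."""
    return recheck_interval_min * failures_required


if __name__ == "__main__":
    history = [True, True, False, True, True, True]
    print(first_alert_index(history, 3))
    print(approx_minutes_until_alert(5, 12))
```

With Y = 5 minutes and Z = 12 failures, the approximation gives the one-hour delay requested in the original report.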
These alerts are noisy enough that they were downtimed (bug 1379653). Do we know when we can add the "recheck and alert after N failures"?
Blocks: 1379653
Flags: needinfo?(jlaz)
Should we move the check_pending_scriptworker_tasks check to modules/nagios/manifests/releng/services.pp ? That file defines normal_check_interval, retry_check_interval, max_check_attempts, notification_interval, etc.
This is pretty simple to do in nagios. commit 9721fd689d56b6a7e66e7fc4f7c6d58d0ed06d11:

@@ -1516,16 +1516,19 @@ class nagios4::prod::releng::services {
             default => [ ]
         }
     },
     "service_queue_age" => {
         service_description => "Pending Scriptworker Tasks",
         contact_groups => 'build',
         check_command => 'check_pending_scriptworker_tasks!50!100',
+        normal_check_interval => 10,
+        retry_check_interval => 5,
+        max_check_attempts => 12,
         hostgroups => $nagiosbot ? {
             'nagios-releng' => [ 'signing-scriptworkers' ],
             default => [ ]
         }
     },
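As a rough check of how these three settings relate to the requested one-hour delay, here is a small sketch of the arithmetic. It assumes the values are in minutes (the default nagios interval_length of 60 seconds) and that the first failing check counts as attempt 1, so max_check_attempts - 1 retries follow before the state becomes hard and notifications go out. The function names are hypothetical, and real timing also depends on scheduling and check latency.

```python
def minutes_to_hard_state(retry_check_interval, max_check_attempts):
    """Minutes from the first failing check to the check that makes the state hard.

    Assumes the first failure is attempt 1, so (max_check_attempts - 1)
    retries happen at retry_check_interval spacing.
    """
    return (max_check_attempts - 1) * retry_check_interval


def worst_case_minutes(normal_check_interval, retry_check_interval, max_check_attempts):
    """Upper bound counted from when the problem starts.

    Adds one full normal interval, since the problem may begin just
    after a passing check.
    """
    return normal_check_interval + minutes_to_hard_state(
        retry_check_interval, max_check_attempts
    )


if __name__ == "__main__":
    # Values from the diff above.
    print(minutes_to_hard_state(5, 12))
    print(worst_case_minutes(10, 5, 12))
```

Under these assumptions the settings give roughly 55 minutes from the first failing check, and up to about 65 minutes from the moment the queue first crosses a threshold.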
Assignee: jlaz → arich
No longer blocks: 1379653
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
(In reply to Amy Rich [:arr] [:arich] from comment #5)
> This is pretty simple to do in nagios. commit
> 9721fd689d56b6a7e66e7fc4f7c6d58d0ed06d11:

9721fd689d56b6a7e66e7fc4f7c6d58d0ed06d11 appears to be an ldap change. Did this change get committed?
Flags: needinfo?(jlaz) → needinfo?(arich)
Copied the wrong commit id: commit 06c2795babfaafcc1410c639f725b7f5662d2575
Flags: needinfo?(arich)
I modified this to be a cluster check in commit b6fa1d231d9f205a84746322d64157d5c696b87c. The only thing I'm not certain about is the interaction between the cluster check notification and the max check attempts for the non-cluster check. aki: can you verify that this is working as expected the next time we drive the load up?
Status: RESOLVED → REOPENED
Component: MOC: Service Requests → RelOps
Flags: needinfo?(aki)
QA Contact: lypulong → arich
Resolution: FIXED → ---
I haven't seen any scriptworker queue alerts since yesterday, after this landed and I doubled the signing scriptworker pool from 4 to 8. I'm not sure if that's something wrong with the notifications, or if we're able to handle the current load without warnings.
It looks like we have cluster checks! It goes straight to critical, even if the load is normal, and we're not sure it waits the hour, but this is still an improvement. In the future we may want to move this check to live with the other taskcluster queue monitoring checks. Thanks Amy!
Flags: needinfo?(aki)
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Resolution: --- → FIXED