1377147 - update nagios warning/critical thresholds for check_pending_scriptworker_tasks

Reporter

Description

8 years ago

Could we adjust the thresholds for the nagios checks against the signing scriptworkers, so that check_pending_scriptworker_tasks warns at 50 and goes critical at 100? If it's possible to adjust it so that it only raises an alert once the condition has held for 1 hour (or for a certain number of checks), that would be great. Thank you!
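For illustration, the requested thresholds can be sketched as a small classifier. This is a minimal sketch, not the actual check_pending_scriptworker_tasks plugin, whose source is not shown in this bug. The function name `classify_pending` is hypothetical, and treating a count equal to the threshold as a breach is an assumption. The return values follow the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_pending(pending, warn=50, crit=100):
    """Map a pending-task count to a Nagios state.

    Hypothetical helper: the inclusive (>=) comparison is an assumption,
    the real plugin may treat the boundary differently.
    """
    if pending >= crit:
        return CRITICAL
    if pending >= warn:
        return WARNING
    return OK


if __name__ == "__main__":
    for count in (10, 50, 99, 100):
        print(count, classify_pending(count))
```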

Justin Lazaro [:jlaz] (use needinfo)

Updated

8 years ago

Assignee: nobody → jlaz

Justin Lazaro [:jlaz] (use needinfo)

Comment 1

8 years ago

Changed thresholds in git commit 9b6580f04723daee01394ed833e1bda667b12bb2:

@@ -969,7 +969,7 @@ define service{
     use generic-service
     service_description Pending Scriptworker Tasks
-    check_command check_pending_scriptworker_tasks!45!50
+    check_command check_pending_scriptworker_tasks!50!100
     contact_groups platformops
     notification_period 24x7
     notification_options w,u,c,r,s

I'll need a bit of time to look into the 'alert if condition lasts for X' situation.

Peter Radcliffe [:pir]

Comment 2

8 years ago

Alerting only when a condition has lasted for more than X time can be done in nagios only if the logic is in the check itself, or with hacks where you recheck at interval Y and Z checks have to fail before it alerts, so the alert comes in at roughly Y*Z minutes. This is part of the reason we're pushing prometheus out.
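The "Z checks have to fail before it alerts" approach can be sketched as a counter of consecutive failures. This is a simplified model for illustration only, not Nagios code: the function name `first_alert_index` and the boolean result list are hypothetical, and real Nagios also distinguishes soft and hard states and schedules rechecks on its own.

```python
def first_alert_index(results, max_attempts):
    """Return the index of the check at which an alert would fire, or None.

    results: sequence of booleans, True meaning the check failed.
    An alert fires only after max_attempts consecutive failures;
    any passing check resets the counter.
    """
    consecutive = 0
    for index, failed in enumerate(results):
        if failed:
            consecutive += 1
            if consecutive == max_attempts:
                return index
        else:
            consecutive = 0
    return None


if __name__ == "__main__":
    # A brief spike that recovers does not alert; a sustained one does.
    print(first_alert_index([True, True, False, True, True, True], 3))
```

With checks spaced Y minutes apart, the alert index times Y gives the approximate delay between the first failure of the streak and the notification.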

Aki Sasaki (not active)

Comment 3

8 years ago

These alerts are noisy enough that they were downtimed (bug 1379653). Do we know when we can add the "recheck and alert after N failures"?

Blocks: 1379653

Flags: needinfo?(jlaz)

Aki Sasaki (not active)

Comment 4

8 years ago

Should we move the check_pending_scriptworker_tasks check to modules/nagios/manifests/releng/services.pp ? That file defines normal_check_interval, retry_check_interval, max_check_attempts, notification_interval, etc.

Amy Rich [:arr] [:arich]

Assignee

Comment 5

8 years ago

This is pretty simple to do in nagios. commit 9721fd689d56b6a7e66e7fc4f7c6d58d0ed06d11:

@@ -1516,16 +1516,19 @@ class nagios4::prod::releng::services {
             default => [ ]
         }
     },
     "service_queue_age" => {
         service_description => "Pending Scriptworker Tasks",
         contact_groups => 'build',
         check_command => 'check_pending_scriptworker_tasks!50!100',
+        normal_check_interval => 10,
+        retry_check_interval => 5,
+        max_check_attempts => 12,
         hostgroups => $nagiosbot ? {
             'nagios-releng' => [ 'signing-scriptworkers' ],
             default => [ ]
         }
     },
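As a rough check of how these three settings approximate the requested one-hour delay, the arithmetic can be sketched as follows. It assumes the Nagios default interval_length of 60 seconds (so intervals are in minutes) and that max_check_attempts counts the first failing check, leaving max_check_attempts - 1 retries. The function names are hypothetical, and the actual timing depends on scheduler behaviour not covered in this bug.

```python
def minutes_until_hard_state(retry_interval, max_check_attempts):
    """Minutes from the first failing check to the last retry.

    Assumes the first failure counts as attempt 1, so only
    (max_check_attempts - 1) retries follow it.
    """
    return (max_check_attempts - 1) * retry_interval


def worst_case_minutes(normal_interval, retry_interval, max_check_attempts):
    """Upper bound measured from when the problem starts.

    Adds one full normal interval, since the problem may begin
    just after a passing check.
    """
    return normal_interval + minutes_until_hard_state(
        retry_interval, max_check_attempts
    )


if __name__ == "__main__":
    # Values from the diff above: 10 / 5 / 12.
    print(minutes_until_hard_state(5, 12))
    print(worst_case_minutes(10, 5, 12))
```

Under these assumptions the notification would arrive between 55 and 65 minutes after the queue first crosses a threshold, which is close to the one hour asked for in the description.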

Assignee: jlaz → arich

No longer blocks: 1379653

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

Aki Sasaki (not active)

Comment 6

8 years ago

(In reply to Amy Rich [:arr] [:arich] from comment #5)
> This is pretty simple to do in nagios. commit
> 9721fd689d56b6a7e66e7fc4f7c6d58d0ed06d11:

9721fd689d56b6a7e66e7fc4f7c6d58d0ed06d11 appears to be an ldap change. Did this change get committed?

Flags: needinfo?(jlaz) → needinfo?(arich)

Amy Rich [:arr] [:arich]

Assignee

Comment 7

8 years ago

Copied the wrong commit id: commit 06c2795babfaafcc1410c639f725b7f5662d2575

Flags: needinfo?(arich)

Amy Rich [:arr] [:arich]

Assignee

Comment 8

8 years ago

I modified this to be a cluster check in commit b6fa1d231d9f205a84746322d64157d5c696b87c.

The only thing I'm not certain about is the interaction between the cluster check notification and the max check attempts for the non-cluster check.

aki: can you verify that this is working as expected the next time we drive the load up?

Status: RESOLVED → REOPENED

Component: MOC: Service Requests → RelOps

Flags: needinfo?(aki)

QA Contact: lypulong → arich

Resolution: FIXED → ---

Aki Sasaki (not active)

Comment 9

8 years ago

I haven't seen any scriptworker queue alerts since yesterday, when this landed and I doubled the signing scriptworker pool from 4 to 8. I'm not sure whether that means something is wrong with the notifications, or whether we're able to handle the current load without warnings.

Aki Sasaki (not active)

Comment 10

8 years ago

It looks like we have cluster checks! It goes straight to critical, even if the load is normal, and we're not sure it waits the hour, but this is still an improvement. In the future we may want to move this check to live with the other taskcluster queue monitoring checks. Thanks Amy!

Flags: needinfo?(aki)

Amy Rich [:arr] [:arich]

Assignee

Updated

7 years ago

Status: REOPENED → RESOLVED

Closed: 8 years ago → 7 years ago

Resolution: --- → FIXED

Categories

(Infrastructure & Operations :: RelOps: General, task)

Tracking

(Not tracked)

People

(Reporter: sfraser, Assigned: arich)

Security

(public)
