Bug 1377147
Opened 8 years ago
Closed 7 years ago
update nagios warning/critical thresholds for check_pending_scriptworker_tasks
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: sfraser, Assigned: arich)
Details
Could we adjust the thresholds for the nagios checks against the signing scriptworkers, so that check_pending_scriptworker_tasks warns at 50 and goes critical at 100? If possible, it would be great if it only raised an alert once the condition has held for 1 hour (or for a certain number of consecutive checks).
Thank you!
Updated•8 years ago
Assignee: nobody → jlaz
Comment 1•8 years ago
Changed thresholds in git commit 9b6580f04723daee01394ed833e1bda667b12bb2:
@@ -969,7 +969,7 @@
 define service{
     use                     generic-service
     service_description     Pending Scriptworker Tasks
-    check_command           check_pending_scriptworker_tasks!45!50
+    check_command           check_pending_scriptworker_tasks!50!100
     contact_groups          platformops
     notification_period     24x7
     notification_options    w,u,c,r,s
I'll need a bit of time to look into the 'alert if condition lasts for X' situation.
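For illustration, the new thresholds can be read as the following state mapping. This is a minimal sketch, not the actual check_pending_scriptworker_tasks plugin: the function name `classify_pending` is hypothetical, and the use of `>=` at the boundaries is an assumption, since the thread does not say whether the real check is inclusive. The exit codes are the standard Nagios plugin return codes.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_pending(pending: int, warn: int = 50, crit: int = 100) -> int:
    """Map a pending-task count to a Nagios state.

    Hypothetical helper: assumes the thresholds are inclusive (>=),
    which the bug does not confirm.
    """
    if pending >= crit:
        return CRITICAL
    if pending >= warn:
        return WARNING
    return OK
```

With the previous 45/50 thresholds the warning band was only five tasks wide; with 50/100 a queue has to double after the first warning before it goes critical.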
Comment 2•8 years ago
Alerting only when a condition persists for more than X time can be done in nagios only if the logic lives in the check itself, or with a hack where you recheck at interval Y and require Z failures before alerting, so the alert comes in after roughly Y*Z mins.
This is part of the reason we're pushing prometheus out.
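The "recheck and require Z failures" approach described in this comment can be sketched as a consecutive-failure counter. This is only an illustration of the idea, not Nagios code: `first_alert_index` is a hypothetical name, and the sketch assumes that any passing check resets the count.

```python
from typing import Iterable, Optional


def first_alert_index(failures: Iterable[bool], max_attempts: int) -> Optional[int]:
    """Return the index of the check at which an alert would fire, or None.

    `failures` yields True for each check that breached the threshold.
    An alert fires only once `max_attempts` checks in a row have failed;
    a single passing check resets the counter (an assumption of this sketch).
    """
    consecutive = 0
    for index, failed in enumerate(failures):
        consecutive = consecutive + 1 if failed else 0
        if consecutive >= max_attempts:
            return index
    return None
```

A queue that flaps above and below the threshold never accumulates enough consecutive failures, which is the noise reduction the reporter asked for.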
Comment 3•8 years ago
These alerts are noisy enough that they were downtimed (bug 1379653). Do we know when we can add the "recheck and alert after N failures" behavior?
Blocks: 1379653
Flags: needinfo?(jlaz)
Comment 4•8 years ago
Should we move the check_pending_scriptworker_tasks check to modules/nagios/manifests/releng/services.pp ? That file defines normal_check_interval, retry_check_interval, max_check_attempts, notification_interval, etc.
Comment 5•8 years ago
This is pretty simple to do in nagios. commit 9721fd689d56b6a7e66e7fc4f7c6d58d0ed06d11:
@@ -1516,16 +1516,19 @@ class nagios4::prod::releng::services {
                 default => [
                 ]
             }
         },
         "service_queue_age" => {
             service_description => "Pending Scriptworker Tasks",
             contact_groups => 'build',
             check_command => 'check_pending_scriptworker_tasks!50!100',
+            normal_check_interval => 10,
+            retry_check_interval => 5,
+            max_check_attempts => 12,
             hostgroups => $nagiosbot ? {
                 'nagios-releng' => [
                     'signing-scriptworkers'
                 ],
                 default => [
                 ]
             }
         },
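As a rough check of how these three settings relate to the requested one-hour delay, the arithmetic can be sketched as below. The function names are hypothetical, and the sketch assumes the intervals are in minutes (the Nagios default interval_length of 60 seconds) and that the first failing check counts as attempt 1, with each further attempt one retry interval later.

```python
def minutes_to_hard_state(retry_check_interval: int, max_check_attempts: int) -> int:
    """Minutes from the first failing check until the last retry fails.

    The first failure is attempt 1, so only (max_check_attempts - 1)
    retries remain, each spaced retry_check_interval minutes apart.
    """
    return (max_check_attempts - 1) * retry_check_interval


def worst_case_minutes(normal_check_interval: int,
                       retry_check_interval: int,
                       max_check_attempts: int) -> int:
    """Upper bound from the threshold being crossed to the alert.

    Adds one full normal interval, since the breach may begin just
    after a passing check.
    """
    return normal_check_interval + minutes_to_hard_state(
        retry_check_interval, max_check_attempts
    )


delay = minutes_to_hard_state(retry_check_interval=5, max_check_attempts=12)
upper = worst_case_minutes(10, 5, 12)
print(f"alert {delay} to {upper} minutes after the queue crosses the threshold")
```

Under those assumptions the values in this commit give an alert somewhere between 55 and 65 minutes after the threshold is crossed, which is close to the one hour requested in the description.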
Assignee: jlaz → arich
No longer blocks: 1379653
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment 6•8 years ago
(In reply to Amy Rich [:arr] [:arich] from comment #5)
> This is pretty simple to do in nagios. commit
> 9721fd689d56b6a7e66e7fc4f7c6d58d0ed06d11:
9721fd689d56b6a7e66e7fc4f7c6d58d0ed06d11 appears to be an ldap change. Did this change get committed?
Flags: needinfo?(jlaz) → needinfo?(arich)
Comment 7•8 years ago
Copied the wrong commit id: commit 06c2795babfaafcc1410c639f725b7f5662d2575
Flags: needinfo?(arich)
Comment 8•8 years ago
I modified this to be a cluster check in commit b6fa1d231d9f205a84746322d64157d5c696b87c
The only thing I'm not certain about is the interaction between the cluster check notification and the max check attempts for the non-cluster check. aki: can you verify that this is working as expected the next time we drive the load up?
Status: RESOLVED → REOPENED
Component: MOC: Service Requests → RelOps
Flags: needinfo?(aki)
QA Contact: lypulong → arich
Resolution: FIXED → ---
Comment 9•8 years ago
I haven't seen any scriptworker queue alerts since yesterday, after this landed and I doubled the signing scriptworker pool from 4 to 8. I'm not sure whether something is wrong with the notifications, or whether we're simply able to handle the current load without warnings.
Comment 10•8 years ago
It looks like we have cluster checks!
It goes straight to critical, even if the load is normal, and we're not sure it waits the hour, but this is still an improvement.
In the future we may want to move this check to live with the other taskcluster queue monitoring checks.
Thanks Amy!
Flags: needinfo?(aki)
Updated•7 years ago
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Resolution: --- → FIXED