Closed
Bug 985767
Opened 11 years ago
Closed 9 years ago
Nagios alerting about load on buildbot-master66
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Assigned: coop)
Attachments
(1 file)
3.66 KB, patch | arich: review+
eg:
Wed 17:02:46 PDT [4500] buildbot-master66.srv.releng.usw2.mozilla.com:load is WARNING: WARNING - load average: 20.25, 16.29, 11.45
It's coming from the two b2g bumper scripts doing a lot of 'git ls-remote' in parallel, so it is noisy but harmless. There are two scripts because one was added for b2g28-1.3t in bug 956631.
Options
* relax the load check on bm66, either the load value or the time it's in that state
* less parallelism
* <your cool idea here>
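As a rough illustration of the "less parallelism" option, here is a minimal Python sketch that caps how many 'git ls-remote' calls run at once. It is not the bumper scripts' actual code: the function names, the max_workers default and the list of repository URLs are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess


def ls_remote(repo_url):
    """Run `git ls-remote` for one repository and return its stdout."""
    result = subprocess.run(
        ["git", "ls-remote", repo_url],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def query_all(repo_urls, max_workers=4, query=ls_remote):
    """Query every repo, never running more than `max_workers` at once.

    `max_workers=4` is an arbitrary illustrative value, not a tuned one.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in input order and caps concurrency.
        return list(pool.map(query, repo_urls))
```

Lowering max_workers trades a longer run for a lower load peak; the right value would have to be found by watching the load on the host.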
Reporter
Comment 1•11 years ago
Correction: there are now four for b2g (master, 1.4, 1.3, 1.3t), as well as gaia.
Comment 2•11 years ago
Hi ashish,
Can we please bump the time and load value for this check on this host?
What is the current value?
Let me know what the procedure is for these types of requests, and whether you prefer that I file a bug or poke someone else.
Thanks!
<nagios-releng> Fri 09:17:44 PDT [4134] buildbot-master66.srv.releng.usw2.mozilla.com:load is CRITICAL: CRITICAL - load average: 27.99, 20.75, 15.87 (http://m.mozilla.org/load)
* pmoore is now known as pmoore|dinner
<nagios-releng> Fri 09:22:43 PDT [4135] buildbot-master66.srv.releng.usw2.mozilla.com:load is WARNING: WARNING - load average: 17.43, 17.83, 15.54 (http://m.mozilla.org/load)
Flags: needinfo?(ashish)
Comment 3•11 years ago
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #2)
> Hi ashish,
> Can we please bump the time and load value for this check on this host?
> What is the current value?
>
Current thresholds are 10,10,10 for WARNING and 25,25,25 for CRITICAL. These thresholds are common across all releng hosts that are monitored. It's preferred to change thresholds across the whole hostgroup rather than for a single host. A per-host change is not impossible, but a hostgroup-wide change keeps configs and alerts more straightforward :)
> Let me know what is the procedure for these type of requests; if you prefer
> to file a bug or poke someone else.
>
Bugs are fine.
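To make the thresholds quoted in this comment concrete, here is a small Python sketch that classifies a (1 min, 5 min, 15 min) load triple. It assumes that a state is raised as soon as any one of the three averages exceeds its threshold, which is consistent with the alerts pasted in this bug but is not taken from the actual Nagios configuration. The function name is made up for the example.

```python
WARNING = (10.0, 10.0, 10.0)   # thresholds quoted in this bug
CRITICAL = (25.0, 25.0, 25.0)


def classify_load(load, warn=WARNING, crit=CRITICAL):
    """Return 'CRITICAL', 'WARNING' or 'OK' for a (1m, 5m, 15m) load tuple.

    Assumption: a state is raised when ANY of the three averages
    exceeds its corresponding threshold.
    """
    if any(value > limit for value, limit in zip(load, crit)):
        return "CRITICAL"
    if any(value > limit for value, limit in zip(load, warn)):
        return "WARNING"
    return "OK"


# Load averages taken from the alerts pasted in this bug.
print(classify_load((20.25, 16.29, 11.45)))  # WARNING
print(classify_load((27.99, 20.75, 15.87)))  # CRITICAL
print(classify_load((7.30, 14.57, 15.21)))   # WARNING
```

The last case shows why the alert lingers: the 1-minute average has already dropped below 10, but the 5- and 15-minute averages are still above it.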
Flags: needinfo?(ashish)
Comment 5•11 years ago
I've filed two related bugs:
Bug 990173 - Move b2g bumper to a dedicated host
Bug 990172 - Deprecate buildbot-master66
Note that bm66 is no longer running a buildbot master, so this high load is not impacting another service.
Comment 6•11 years ago
I've quietened nagios for 3 days...
From #buildduty IRC channel:
Wed 16 Apr 2014 14:38:54 CEST nagios-releng: Wed 05:38:50 PDT [4216] buildbot-master66.srv.releng.usw2.mozilla.com:load is WARNING: WARNING - load average: 7.30, 14.57, 15.21 (http://m.mozilla.org/load)
Wed 16 Apr 2014 14:40:20 CEST pmoore: nagios-releng: downtime 4216 3d bug 985767
Wed 16 Apr 2014 14:40:20 CEST nagios-releng: pmoore: Downtime for service buildbot-master66.srv.releng.usw2.mozilla.com:load scheduled for 3 days, 0:00:00
Comment 7•11 years ago
So it will start alerting again around 6am PDT on Saturday morning (Sat 19 Apr 2014)...
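The expiry time can be checked with a little date arithmetic. The Python sketch below takes the scheduling timestamp from the IRC log in comment 6 (14:40:20 CEST), adds the "3d" duration and converts the result to PDT. The parse_duration helper and its unit table are hypothetical and only illustrate the idea; they are not the bot's real duration grammar. Fixed UTC offsets are used for CEST and PDT to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical unit table for strings such as "3d"; not the bot's real syntax.
UNITS = {"m": "minutes", "h": "hours", "d": "days", "w": "weeks"}

CEST = timezone(timedelta(hours=2), "CEST")
PDT = timezone(timedelta(hours=-7), "PDT")


def parse_duration(text):
    """Turn a string such as '3d' into a timedelta."""
    amount, unit = int(text[:-1]), text[-1]
    return timedelta(**{UNITS[unit]: amount})


# Time the downtime was scheduled, from the IRC log in comment 6.
scheduled = datetime(2014, 4, 16, 14, 40, 20, tzinfo=CEST)
expires = (scheduled + parse_duration("3d")).astimezone(PDT)
print(expires.isoformat())  # 2014-04-19T05:40:20-07:00
```

That is 05:40 PDT on Saturday 19 Apr 2014, in line with the "around 6am" estimate.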
Comment 8•11 years ago
(In reply to Pete Moore [:pete][:pmoore] from comment #7)
> so will start alerting again around 6am on Saturday morning PDT (Sat 19 Apr
> 2014)...
And it started nagging Monday morning. I have downtimed it for today.
I am not sure of the timeline on:
Bug 990173 - Move b2g bumper to a dedicated host
Bug 990172 - Deprecate buildbot-master66
but if it is not soon, I'd suggest we increase the warning and critical numbers for high load on this host alone. Granted, it makes the configs messy, but this host does not have a role similar to the rest of our releng group.
Comment 9•11 years ago
Due to inactivity on this, I am extending the downtime for load on this machine by 6 months.
Comment 10•11 years ago
Waiting on the bugs mentioned in comment 8.
Nothing actionable for buildduty.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → INVALID
Assignee
Updated•9 years ago
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Assignee
Comment 11•9 years ago
This is way too noisy ATM. I'm going to create a special hostgroup for b2g-bumper and change the alerting thresholds.
Assignee: nobody → coop
Status: REOPENED → ASSIGNED
Assignee
Comment 12•9 years ago
Attachment #8667750 -
Flags: review?(arich)
Updated•9 years ago
Attachment #8667750 -
Flags: review?(arich) → review+
Comment 13•9 years ago
In hostgroups.pp, you have 'b2g-bumper' => => {
You need to delete the extra =>
Assignee
Comment 15•9 years ago
Much less noisy.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago → 9 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard