Closed Bug 985767 Opened 10 years ago Closed 9 years ago

Nagios alerting about load on buildbot-master66

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
All
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: coop)

Attachments

(1 file)

eg:
Wed 17:02:46 PDT [4500] buildbot-master66.srv.releng.usw2.mozilla.com:load is WARNING: WARNING - load average: 20.25, 16.29, 11.45

It's coming from the two b2g bumper scripts doing a lot of 'git ls-remote' in parallel, so it is noisy but harmless. There are two scripts because one was added for b2g28-1.3t in bug 956631.

Options
* relax the load check on bm66, either the load threshold or the time it has to stay in that state before alerting
* less parallelism
* <your cool idea here>
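
For the "less parallelism" option, here is a minimal sketch of capping how many 'git ls-remote' calls run at once. It is not the bumper scripts' actual code: the function names, the `max_workers` default, and the `runner` parameter are all hypothetical and only illustrate the idea of a bounded worker pool.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def git_ls_remote(url):
    """Run `git ls-remote <url>` and return its stdout as text."""
    result = subprocess.run(
        ["git", "ls-remote", url],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def ls_remote_all(urls, max_workers=4, runner=git_ls_remote):
    """Query every URL, running at most `max_workers` commands at once.

    `runner` is a hypothetical hook so the pool logic can be exercised
    without network access. Returns a dict of URL -> runner output.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        return dict(zip(urls, pool.map(runner, urls)))
```

A lower `max_workers` trades a longer run for a lower peak load, which is the point of this option.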
Correction - there are now four b2g (master, 1.4, 1.3, 1.3t), as well as gaia.
No longer blocks: 956631
Hi ashish,
Can we please bump the time and load value for this check on this host?
What is the current value?
Let me know what the procedure is for this type of request; whether you prefer that I file a bug or poke someone else.

Thanks!

<nagios-releng> Fri 09:17:44 PDT [4134] buildbot-master66.srv.releng.usw2.mozilla.com:load is CRITICAL: CRITICAL - load average: 27.99, 20.75, 15.87 (http://m.mozilla.org/load)
* pmoore is now known as pmoore|dinner
<nagios-releng> Fri 09:22:43 PDT [4135] buildbot-master66.srv.releng.usw2.mozilla.com:load is WARNING: WARNING - load average: 17.43, 17.83, 15.54 (http://m.mozilla.org/load)
Flags: needinfo?(ashish)
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #2)
> Hi ashish,
> Can we please bump the time and load value for this check on this host?
> What is the current value?
>
Current thresholds are 10,10,10 for WARNING and 25,25,25 for CRITICAL. These thresholds are common across all releng hosts that are monitored. It's preferred to change thresholds across the hostgroup rather than for a single host, for better management. A per-host change is not impossible, though; hostgroup-wide thresholds just keep configs and alerts more straightforward :)

> Let me know what the procedure is for this type of request; whether you
> prefer that I file a bug or poke someone else.
> 
Bugs are fine.
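
To make the thresholds concrete, here is a rough sketch of how the 1/5/15-minute load averages map to a state. It assumes the check raises a state when any one of the three averages is strictly above its threshold; that assumption is consistent with the alerts pasted in this bug, but it is not taken from the actual check configuration. The function and constant names are made up for illustration.

```python
# Thresholds quoted in this bug (1, 5 and 15 minute load averages).
WARNING_THRESHOLDS = (10.0, 10.0, 10.0)
CRITICAL_THRESHOLDS = (25.0, 25.0, 25.0)


def classify_load(load_averages,
                  warning=WARNING_THRESHOLDS,
                  critical=CRITICAL_THRESHOLDS):
    """Return "CRITICAL", "WARNING" or "OK" for a (1m, 5m, 15m) tuple.

    Assumption: a state is raised when ANY of the three averages is
    strictly above its corresponding threshold.
    """
    if any(load > limit for load, limit in zip(load_averages, critical)):
        return "CRITICAL"
    if any(load > limit for load, limit in zip(load_averages, warning)):
        return "WARNING"
    return "OK"


# Load averages copied from the alerts in this bug.
print(classify_load((20.25, 16.29, 11.45)))
print(classify_load((27.99, 20.75, 15.87)))
```

Under that assumption, raising the thresholds for one hostgroup would only mean passing different `warning` and `critical` tuples for those hosts.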
Flags: needinfo?(ashish)
I've filed two related bugs:

Bug 990173 - Move b2g bumper to a dedicated host
Bug 990172 - Deprecate buildbot-master66

Note that bm66 is no longer running a buildbot master, so this high load is not impacting another service.
I've quietened nagios for 3 days...

From #buildduty IRC channel:

Wed 16 Apr 2014 14:38:54 CEST nagios-releng: Wed 05:38:50 PDT [4216] buildbot-master66.srv.releng.usw2.mozilla.com:load is WARNING: WARNING - load average: 7.30, 14.57, 15.21 (http://m.mozilla.org/load)
Wed 16 Apr 2014 14:40:20 CEST pmoore: nagios-releng: downtime 4216 3d bug 985767
Wed 16 Apr 2014 14:40:20 CEST nagios-releng: pmoore: Downtime for service buildbot-master66.srv.releng.usw2.mozilla.com:load scheduled for 3 days, 0:00:00
so will start alerting again around 6am on Saturday morning PDT (Sat 19 Apr 2014)...
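
For reference, the expiry estimate can be checked with a little date arithmetic. This sketch assumes the downtime starts at the moment the bot confirmed it (14:40:20 CEST in the log above) and uses fixed UTC offsets for CEST and PDT rather than a timezone database.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets; an assumption that is valid for April 2014.
CEST = timezone(timedelta(hours=2), "CEST")
PDT = timezone(timedelta(hours=-7), "PDT")


def downtime_end(start, duration):
    """Return the moment a downtime of `duration` starting at `start` expires."""
    return start + duration


# Timestamp of the bot's confirmation in the IRC log.
scheduled_at = datetime(2014, 4, 16, 14, 40, 20, tzinfo=CEST)
expires_at = downtime_end(scheduled_at, timedelta(days=3)).astimezone(PDT)
print(expires_at.isoformat())
```

That puts the expiry at 05:40:20 PDT on Saturday 19 April 2014, which matches the "around 6am" estimate.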
(In reply to Pete Moore [:pete][:pmoore] from comment #7)
> so will start alerting again around 6am on Saturday morning PDT (Sat 19 Apr
> 2014)...

and it started nagging on Monday morning. I have downtimed it for today.

I am not sure of the timeline on:

Bug 990173 - Move b2g bumper to a dedicated host
Bug 990172 - Deprecate buildbot-master66

but if it is not soon, I'd suggest we increase the warning and critical thresholds for high load on this host alone. Granted, it makes configs messy, but this host does not have a role similar to the rest of our releng group.
Due to inactivity on this, I am extending the downtime for the load check on this machine by 6 months.
Waiting on the bugs mentioned in comment 8.
Nothing actionable for buildduty.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INVALID
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
This is way too noisy ATM. I'm going to create a special hostgroup for b2g-bumper and change the alerting thresholds.
Assignee: nobody → coop
Status: REOPENED → ASSIGNED
Attachment #8667750 - Flags: review?(arich) → review+
In hostgroups.pp, you have 'b2g-bumper' => => {

You need to delete the extra =>
Much less noisy.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago → 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard