Bug 672851
Opened 14 years ago
Closed 14 years ago
Test masters are hot hot hot
Categories
(Release Engineering :: General, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Assigned: armenzg)
[1:18pm] nagios-sjc1: [75] buildbot-master04.build.scl1:Ganglia IO is CRITICAL: CHECKGANGLIA CRITICAL: cpu_wio is 29.80
[1:19pm] nagios-sjc1: [76] buildbot-master06.build.scl1:Ganglia IO is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[1:20pm] nagios-sjc1: [77] buildbot-master11.build.scl1:Ganglia IO is CRITICAL: CHECKGANGLIA CRITICAL: cpu_wio is 37.30
[1:22pm] nagios-sjc1: buildbot-master06.build.scl1:Ganglia IO is OK: CHECKGANGLIA OK: cpu_wio is 13.60
[1:23pm] nagios-sjc1: [79] buildbot-master11.build.scl1:Ganglia IO is WARNING: CHECKGANGLIA WARNING: cpu_wio is 17.00
[1:24pm] nagios-sjc1: [81] buildbot-master04.build.scl1:Ganglia IO is WARNING: CHECKGANGLIA WARNING: cpu_wio is 17.80
[1:26pm] nagios-sjc1: buildbot-master11.build.scl1:Ganglia IO is OK: CHECKGANGLIA OK: cpu_wio is 1.50
[1:27pm] nagios-sjc1: buildbot-master04.build.scl1:Ganglia IO is OK: CHECKGANGLIA OK: cpu_wio is 2.50
bm{04,06,11} were complaining
bm5 was very cool about it.
At the moment of the spike we had 234 pending jobs & 486 running jobs.
Updated•14 years ago
Assignee: nobody → armenzg
Severity: normal → major
Status: NEW → ASSIGNED
Priority: -- → P1
Comment 1•14 years ago
The initial comment's timestamps are in the EDT timezone.
I have also seen these messages, which may or may not be related (timestamps in PDT):
bm06-tests1:
############
2011-07-20 09:41:47-0700 [-] Unhandled Error
...
_mysql_exceptions.OperationalError: (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 110")
bm05-builder:
#############
2011-07-20 09:34:57-0700 [-] Unhandled Error
...
_mysql_exceptions.OperationalError: (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 110")
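For context, error 2013 wraps OS errno 110 (ETIMEDOUT): the master's connection to the database timed out during the initial handshake. A minimal sketch (not the buildbot code; host and db names are placeholders) of retrying around that failure:

import time
import MySQLdb  # the MySQL-python package behind _mysql_exceptions

def connect_with_retry(retries=3, delay=5):
    for attempt in range(retries):
        try:
            return MySQLdb.connect(host="db.placeholder.mozilla.com",
                                   db="buildbot")
        except MySQLdb.OperationalError as e:
            # 2013 = "Lost connection to MySQL server"; retry only that.
            if e.args[0] != 2013 or attempt == retries - 1:
                raise
            time.sleep(delay)  # back off before trying again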
Comment 2•14 years ago
bm5 has 4 CPUs instead of 2.
We saw a great deal of CPU wait on bm{04,06,11} between 10:15-10:25 PDT, which matches the initial comment.
More than 150 test jobs were queued from 10:00 AM to 10:44 AM.
I will go and set up buildbot-master02 to split the load in bug 672844.
Comment 3•14 years ago
coop, dustin, catlee and zandr: I want to keep you in the loop on where things are.
Two things:
#1) we are having kvm load alerts [1]
#2) bm15 & bm16 have no cpu_wio [2] even though they have ~57 slaves like the other 4 test masters
From previous experience, high cpu_wio keeps masters from handing jobs to the slaves in a timely manner.
This would explain the bad wait times we have recently seen.
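For reference, a rough sketch of what the cpu_wio number above reflects: the share of CPU time spent blocked waiting on I/O, sampled from /proc/stat (the field layout is standard Linux; the sampling interval below is arbitrary, not necessarily what Ganglia uses):

import time

def read_cpu_times():
    # First line of /proc/stat: "cpu user nice system idle iowait irq softirq ..."
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

def cpu_wio_percent(interval=5):
    before = read_cpu_times()
    time.sleep(interval)
    after = read_cpu_times()
    deltas = [a - b for a, b in zip(after, before)]
    # iowait is the 5th field; a high share here means the CPU sat idle
    # waiting on disk, which is what starves the master of time to
    # dispatch jobs.
    return 100.0 * deltas[4] / sum(deltas)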
To address #1 we are thinking of running an automated load rebalance, since we are not even using kvm3 [3] (it was probably down for repairs when the VMs were being created). zandr thinks "the first thing we should do is try an automated balance, but do it in a maint. window so we can pause/reboot masters if need be" and that "the point of a rebalance now would be to make use of kvm3".
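As an illustration only (assuming libvirt's Python bindings; the connection URIs are hypothetical, and the domain/host names follow the mapping in [3]), one rebalance step could live-migrate a hot master onto the idle kvm3:

import libvirt

src = libvirt.open("qemu+ssh://kvm1.infra.scl1/system")
dst = libvirt.open("qemu+ssh://kvm3.infra.scl1/system")
dom = src.lookupByName("buildbot-master11")
# VIR_MIGRATE_LIVE keeps the guest running during the copy; we would
# still do this in a maint. window in case a master needs a pause/reboot.
dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)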
IT also thinks that more disks will help.
For #2, catlee mentioned that perhaps we should revisit the idea of having masters dedicated to a certain subset of OSes, like bm15 and bm16, which are dedicated to xp and w7 slaves. Their memory usage is half that of the other masters, their cpu_wio is unnoticeable, and kvm4 (where they are hosted) has not complained [4].
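Roughly, dedicating a master to one OS subset just means attaching only that subset's slaves in its master.cfg. A hypothetical excerpt, assuming buildbot 0.8-era APIs (slave names, password and factory are placeholders, not our actual config):

from buildbot.buildslave import BuildSlave
from buildbot.config import BuilderConfig
from buildbot.process.factory import BuildFactory

c = BuildmasterConfig = {}
# Only Windows test slaves attach to this master, as on bm15/bm16.
win_slaves = ["w7-slave-%02d" % i for i in range(1, 21)]
c['slaves'] = [BuildSlave(name, "placeholder-pass") for name in win_slaves]
c['builders'] = [
    BuilderConfig(name="win-tests",
                  slavenames=win_slaves,
                  factory=BuildFactory()),  # real build steps omitted
]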
I am also bringing buildbot-master2 back into action in bug 672844, which would bring the number of slaves per master down to ~46 from ~57. I will also evaluate enabling bm2-tests2 on Friday, which would bring the count down to ~38 slaves.
My measuring stick for success will be cpu_wio.
[1]
10:25 < nagios-sjc1> [82] kvm1.infra.scl1:avg load is CRITICAL: CRITICAL - load average: 7.77, 29.29, 22.34
[2]
http://ganglia1.build.scl1.mozilla.com/ganglia/graph.php?c=RelEngSCL1&h=buildbot-master04.build.scl1.mozilla.com&v=3.2&m=cpu_wio&r=week&z=medium&jr=&js=&st=1311186207&vl=%25&ti=CPU%20wio&z=large
http://ganglia1.build.scl1.mozilla.com/ganglia/graph.php?c=RelEngSCL1&h=buildbot-master15.build.scl1.mozilla.com&v=0.1&m=cpu_wio&r=week&z=medium&jr=&js=&st=1311186192&vl=%25&ti=CPU%20wio&z=large
[3] (among other non-relevant VMs)
kvm1 - bm11
kvm2 - bm{4,6}
kvm4 - bm[13-17]
[4]
[2:18pm] armenzg_buildduty: bkero: has kvm4 complained?
[2:18pm] bkero: armenzg_buildduty: not to my knowledge, no
Comment 4•14 years ago
We have addressed issue #2 by enabling another test master (bm2-tests1). This reduced the # of slaves that a test master needs to talk to (from ~57 to ~45).
I have looked at the load of bm{04,06,11} and it looks good.
I have added #1 to https://intranet.mozilla.org/Build:InfrastructurePriorities#Nominating
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 5•14 years ago
bm02-tests2 also got enabled this morning.
The bm2 host is not even sweating with 2 test masters running on it:
http://ganglia1.build.scl1.mozilla.com/ganglia/?c=RelEngSCL1&h=buildbot-master2.build.scl1.mozilla.com&m=load_fifteen&r=day&s=descending&hc=4&mc=2
We now have 6 test masters for non-Windows slaves (2 on bm2 & 1 on bm5, plus one each on bm{04,06,11}).
This means that there are now ~37 slaves per master.
Currently, our worst enemy is the masters on the kvm cluster, which still see a lot of wait on IO.
Comment 6•14 years ago
FTR, this captures the decrease in CPU wio and shows that it still happens, but now below 5%:
http://ganglia1.build.scl1.mozilla.com/ganglia/graph.php?c=RelEngSCL1&h=buildbot-master04.build.scl1.mozilla.com&v=1.6&m=cpu_wio&r=week&z=medium&jr=&js=&st=1311619893&vl=%25&ti=CPU%20wio&z=large
Buildbot-master2's cpu wio is below 0.6%:
http://ganglia1.build.scl1.mozilla.com/ganglia/graph.php?c=RelEngSCL1&h=buildbot-master2.build.scl1.mozilla.com&v=1.6&m=cpu_wio&r=week&z=medium&jr=&js=&st=1311619893&vl=%25&ti=CPU%20wio&z=large
Comment 7•14 years ago
(In reply to comment #4)
> I have added #1 to
> https://intranet.mozilla.org/Build:InfrastructurePriorities#Nominating
Is there a new bug for this (kvm load alerts) if it's being nominated, since this bug is now marked FIXED?
Comment 8•14 years ago
(In reply to comment #7)
> (In reply to comment #4)
> > I have added #1 to
> > https://intranet.mozilla.org/Build:InfrastructurePriorities#Nominating
>
> Is there a new bug for this (kvm load alerts) if it's being nominated, since
> this bug is now marked FIXED?
The high load is still there, but I used this bug to deal with the peak we saw from Monday to Wednesday. In fact, bkero mentioned there were no more kvm nagios alerts after I enabled bm02-tests1.
Also:
zandr has requested to buy more disks (to improve wio).
We have reduced the load on the kvm cluster by turning bm2-tests1 and bm2-tests2 back on.
We spoke of rebalancing the cluster so that kvm3 also gets used.
Comment 9•14 years ago
Filed bug 674144 to install disks.
Updated•12 years ago
Product: mozilla.org → Release Engineering