Closed Bug 672851 Opened 14 years ago Closed 14 years ago

Test masters are hot hot hot

Categories

(Release Engineering :: General, defect, P1)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: armenzg)

Details

[1:18pm] nagios-sjc1: [75] buildbot-master04.build.scl1:Ganglia IO is CRITICAL: CHECKGANGLIA CRITICAL: cpu_wio is 29.80
[1:19pm] nagios-sjc1: [76] buildbot-master06.build.scl1:Ganglia IO is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[1:20pm] nagios-sjc1: [77] buildbot-master11.build.scl1:Ganglia IO is CRITICAL: CHECKGANGLIA CRITICAL: cpu_wio is 37.30
[1:22pm] nagios-sjc1: buildbot-master06.build.scl1:Ganglia IO is OK: CHECKGANGLIA OK: cpu_wio is 13.60
[1:23pm] nagios-sjc1: [79] buildbot-master11.build.scl1:Ganglia IO is WARNING: CHECKGANGLIA WARNING: cpu_wio is 17.00
[1:24pm] nagios-sjc1: [81] buildbot-master04.build.scl1:Ganglia IO is WARNING: CHECKGANGLIA WARNING: cpu_wio is 17.80
[1:26pm] nagios-sjc1: buildbot-master11.build.scl1:Ganglia IO is OK: CHECKGANGLIA OK: cpu_wio is 1.50
[1:27pm] nagios-sjc1: buildbot-master04.build.scl1:Ganglia IO is OK: CHECKGANGLIA OK: cpu_wio is 2.50

bm{04,06,11} were complaining; bm5 was very cool about it.

At the moment of the spike we had 234 pending jobs & 486 running jobs.
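For readers unfamiliar with the alert format above: CHECKGANGLIA compares a Ganglia metric against warning/critical thresholds. Below is a minimal sketch of such a check over cpu_wio, assuming gmond's default XML port and made-up 15/25 cutoffs; the real plugin and its thresholds may differ.

#!/usr/bin/env python
# Sketch of a CHECKGANGLIA-style cpu_wio check. The thresholds and port
# are illustrative assumptions, not the production nagios config.
import socket
import sys
import xml.etree.ElementTree as ET

GMOND_PORT = 8649        # gmond's default XML dump port
WARN, CRIT = 15.0, 25.0  # assumed cutoffs for illustration

def fetch_metric(host, metric):
    """Read the gmond XML dump and return the first matching metric value."""
    sock = socket.create_connection((host, GMOND_PORT), timeout=10)
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
    sock.close()
    root = ET.fromstring(b"".join(chunks))
    for m in root.iter("METRIC"):
        if m.get("NAME") == metric:
            return float(m.get("VAL"))
    raise ValueError("metric %s not found" % metric)

if __name__ == "__main__":
    wio = fetch_metric(sys.argv[1], "cpu_wio")
    if wio >= CRIT:
        print("CHECKGANGLIA CRITICAL: cpu_wio is %.2f" % wio)
        sys.exit(2)  # nagios CRITICAL
    elif wio >= WARN:
        print("CHECKGANGLIA WARNING: cpu_wio is %.2f" % wio)
        sys.exit(1)  # nagios WARNING
    print("CHECKGANGLIA OK: cpu_wio is %.2f" % wio)
    sys.exit(0)      # nagios OK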
Assignee: nobody → armenzg
Severity: normal → major
Status: NEW → ASSIGNED
Priority: -- → P1
Initial comment's timestamps are in the EDT timezone.

I have also seen these messages, which may or may not be related (PDT timezone):

bm06-tests1:
############
2011-07-20 09:41:47-0700 [-] Unhandled Error
...
_mysql_exceptions.OperationalError: (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 110")

bm05-builder:
#############
2011-07-20 09:34:57-0700 [-] Unhandled Error
...
_mysql_exceptions.OperationalError: (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 110")
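Error 2013 is typically a transient network/handshake failure rather than a query error. For illustration only, here is a minimal sketch of the kind of reconnect-with-retry wrapper that would mask it, assuming the MySQLdb driver; the retry count, delay, and connection parameters are assumptions, and this is not what buildbot actually does internally.

# Sketch: retry a MySQLdb connection on transient error 2013
# ("Lost connection to MySQL server"). Retry count and backoff
# are assumed values, not buildbot's real behavior.
import time
import MySQLdb

def connect_with_retry(retries=3, delay=5, **kwargs):
    """Open a MySQL connection, retrying on 'lost connection' errors."""
    for attempt in range(retries):
        try:
            return MySQLdb.connect(**kwargs)
        except MySQLdb.OperationalError as e:
            if e.args[0] != 2013 or attempt == retries - 1:
                raise          # not transient, or out of retries
            time.sleep(delay)  # back off before reconnecting

# Hypothetical usage; host/db/user names are placeholders:
# conn = connect_with_retry(host="dbhost", db="buildbot", user="bb")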
bm5 has 4 CPUs instead of 2. We saw a great deal of CPU wait on bm{04,06,11} between 10:15 and 10:25 PDT, which matches the initial comment. More than 150 test jobs were queued from 10:00 AM to 10:44 AM. I will go and set up buildbot-master02 to split the load in bug 672844.
coop, dustin, catlee and zandr: I want to keep you in the loop on where things are. Two things:

#1) We are having kvm load alerts [1].
#2) bm15 & bm16 have no cpu_wio [2], even though they have ~57 slaves like the other 4 test masters.

From previous experience, cpu_wio prevents masters from handing jobs to the slaves in a timely manner. This would explain the bad wait times we have recently seen.

To address #1, we are thinking of running automatic load balancing, since we are not even using kvm3 [3] (it was probably down for repairs when the VMs were being created). zandr thinks "the first thing we should do is try an automated balance, but do it in a maint. window so we can pause/reboot masters if need be" and "the point of a rebalance now would be to make use of kvm3". IT also thinks that more disks will help.

For #2, catlee mentioned that perhaps we should revisit the idea of having masters dedicated to a certain subset of OSes, like bm15 and bm16, which are dedicated to xp and w7 slaves. Their memory usage is half, there is no noticeable cpu_wio, and kvm4 (where they are hosted) has not complained [4].

I am also bringing buildbot-master2 back into action in bug 672844, which would bring the # of slaves per master down from ~57 to ~46. I will also evaluate enabling bm2-tests2 on Friday, which would bring the count down to ~38 slaves. My measuring stick for success will be cpu_wio.

[1] 10:25 < nagios-sjc1> [82] kvm1.infra.scl1:avg load is CRITICAL: CRITICAL - load average: 7.77, 29.29, 22.34
[2] http://ganglia1.build.scl1.mozilla.com/ganglia/graph.php?c=RelEngSCL1&h=buildbot-master04.build.scl1.mozilla.com&v=3.2&m=cpu_wio&r=week&z=medium&jr=&js=&st=1311186207&vl=%25&ti=CPU%20wio&z=large
http://ganglia1.build.scl1.mozilla.com/ganglia/graph.php?c=RelEngSCL1&h=buildbot-master15.build.scl1.mozilla.com&v=0.1&m=cpu_wio&r=week&z=medium&jr=&js=&st=1311186192&vl=%25&ti=CPU%20wio&z=large
[3] (among other non-relevant VMs)
kvm1 - bm11
kvm2 - bm{4,6}
kvm4 - bm[13-17]
[4] [2:18pm] armenzg_buildduty: bkero: has kvm4 complained?
[2:18pm] bkero: armenzg_buildduty: not to my knowledge, no
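To make the "automated balance" idea above concrete, here is a minimal sketch of what a one-VM rebalance toward kvm3 could look like, assuming the kvm hosts are managed with libvirt (an assumption on my part; the host URIs and the naive victim-selection heuristic are illustrative only, not how the cluster is actually managed).

# Sketch: live-migrate one VM from the busiest kvm host to idle kvm3.
# Assumes libvirt access over ssh; all names here are illustrative.
import libvirt

SOURCES = ["kvm1.infra.scl1", "kvm2.infra.scl1", "kvm4.infra.scl1"]
TARGET = "kvm3.infra.scl1"  # currently unused host, per [3] above

def running_domains(conn):
    # Running-VM count as a crude proxy for host load; a real balance
    # would weigh cpu_wio, memory pressure, etc.
    return [d for d in conn.listAllDomains() if d.isActive()]

if __name__ == "__main__":
    conns = {h: libvirt.open("qemu+ssh://%s/system" % h) for h in SOURCES}
    busiest = max(SOURCES, key=lambda h: len(running_domains(conns[h])))
    victim = running_domains(conns[busiest])[0]  # naive pick
    dest = libvirt.open("qemu+ssh://%s/system" % TARGET)
    # Live migration, no rename, no bandwidth cap.
    victim.migrate(dest, libvirt.VIR_MIGRATE_LIVE, None, None, 0)
    print("migrated %s from %s to %s" % (victim.name(), busiest, TARGET))

As zandr notes above, anything like this should run in a maintenance window so masters can be paused/rebooted if a migration goes sideways.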
We have addressed issue #2 by enabling another test master (bm2-tests1). This reduced the # of slaves that each test master needs to talk to (from ~57 to ~45). I have looked at the load of bm{04,06,11} and it looks good.

I have added #1 to https://intranet.mozilla.org/Build:InfrastructurePriorities#Nominating
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
bm02-tests2 also got enabled this morning. bm2 is not even breaking a sweat with 2 test masters running on it:
http://ganglia1.build.scl1.mozilla.com/ganglia/?c=RelEngSCL1&h=buildbot-master2.build.scl1.mozilla.com&m=load_fifteen&r=day&s=descending&hc=4&mc=2

We now have 6 test masters for non-Windows slaves (2 on bm2 & 1 on bm5). This means there are now ~37 slaves per master.

Currently, our worst enemy is the masters on the kvm cluster, which still have a lot of IO wait.
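As a sanity check on the slaves-per-master figures quoted in this bug, a quick back-of-envelope calculation, assuming a fixed pool of roughly 4 × 57 ≈ 228 non-Windows test slaves:

# Back-of-envelope check of the numbers quoted in the comments above.
slaves = 4 * 57  # the original 4 test masters at ~57 slaves each
for masters in (4, 5, 6):
    print("%d masters -> ~%d slaves per master" % (masters, slaves // masters))
# -> 57, 45, 38: matching the ~57, ~45, and ~37-38 cited as each
#    additional test master came online.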
(In reply to comment #4)
> I have added #1 to
> https://intranet.mozilla.org/Build:InfrastructurePriorities#Nominating

Is there a new bug for this (kvm load alerts) if it's being nominated, since this bug is now marked FIXED?
(In reply to comment #7)
> Is there a new bug for this (kvm load alerts) if it's being nominated, since
> this bug is now marked FIXED?

The high load is still there, but I used this bug to deal with the peak we saw from Monday to Wednesday. In fact, bkero mentioned there were no more kvm nagios alerts after I enabled bm02-tests1.

Also:
* zandr has requested to buy more disks (to improve wio).
* We have reduced the load on the kvm cluster by turning bm2-tests1 and bm2-tests2 back on.
* We also spoke of rebalancing the cluster so kvm3 gets used as well.
Filed bug 674144 to install disks.
Product: mozilla.org → Release Engineering