Closed Bug 672851 Opened 14 years ago Closed 14 years ago

Test masters are hot hot hot

Categories

(Release Engineering :: General, defect, P1)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: armenzg)

Details

[1:18pm] nagios-sjc1: [75] buildbot-master04.build.scl1:Ganglia IO is CRITICAL: CHECKGANGLIA CRITICAL: cpu_wio is 29.80
[1:19pm] nagios-sjc1: [76] buildbot-master06.build.scl1:Ganglia IO is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[1:20pm] nagios-sjc1: [77] buildbot-master11.build.scl1:Ganglia IO is CRITICAL: CHECKGANGLIA CRITICAL: cpu_wio is 37.30
[1:22pm] nagios-sjc1: buildbot-master06.build.scl1:Ganglia IO is OK: CHECKGANGLIA OK: cpu_wio is 13.60
[1:23pm] nagios-sjc1: [79] buildbot-master11.build.scl1:Ganglia IO is WARNING: CHECKGANGLIA WARNING: cpu_wio is 17.00
[1:24pm] nagios-sjc1: [81] buildbot-master04.build.scl1:Ganglia IO is WARNING: CHECKGANGLIA WARNING: cpu_wio is 17.80
[1:26pm] nagios-sjc1: buildbot-master11.build.scl1:Ganglia IO is OK: CHECKGANGLIA OK: cpu_wio is 1.50
[1:27pm] nagios-sjc1: buildbot-master04.build.scl1:Ganglia IO is OK: CHECKGANGLIA OK: cpu_wio is 2.50

bm{04,06,11} were complaining; bm5 was very cool about it.

At the moment of the spike we had 234 pending jobs & 486 running jobs.
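For readers unfamiliar with the alert format above: CHECKGANGLIA compares a Ganglia metric against warning/critical thresholds. Below is a minimal sketch of such a check over cpu_wio, assuming gmond's default XML port and made-up 15/25 cutoffs; the real plugin and its thresholds may differ.

#!/usr/bin/env python
# Sketch of a CHECKGANGLIA-style cpu_wio check. The thresholds and port
# are illustrative assumptions, not the production nagios config.
import socket
import sys
import xml.etree.ElementTree as ET

GMOND_PORT = 8649        # gmond's default XML dump port
WARN, CRIT = 15.0, 25.0  # assumed cutoffs for illustration

def fetch_metric(host, metric):
    """Read the gmond XML dump and return the first matching metric value."""
    sock = socket.create_connection((host, GMOND_PORT), timeout=10)
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
    sock.close()
    root = ET.fromstring(b"".join(chunks))
    for m in root.iter("METRIC"):
        if m.get("NAME") == metric:
            return float(m.get("VAL"))
    raise ValueError("metric %s not found" % metric)

if __name__ == "__main__":
    wio = fetch_metric(sys.argv[1], "cpu_wio")
    if wio >= CRIT:
        print("CHECKGANGLIA CRITICAL: cpu_wio is %.2f" % wio)
        sys.exit(2)  # nagios CRITICAL
    elif wio >= WARN:
        print("CHECKGANGLIA WARNING: cpu_wio is %.2f" % wio)
        sys.exit(1)  # nagios WARNING
    print("CHECKGANGLIA OK: cpu_wio is %.2f" % wio)
    sys.exit(0)      # nagios OK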
Assignee: nobody → armenzg
Severity: normal → major
Status: NEW → ASSIGNED
Priority: -- → P1
Initial comment's timestamps are in the EDT timezone.

I have also seen these messages, which may or may not be related (PDT timezone):

bm06-tests1:
############
2011-07-20 09:41:47-0700 [-] Unhandled Error
...
_mysql_exceptions.OperationalError: (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 110")

bm05-builder:
#############
2011-07-20 09:34:57-0700 [-] Unhandled Error
...
_mysql_exceptions.OperationalError: (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 110")
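Error 2013 is typically a transient network/handshake failure rather than a query error. For illustration only, here is a minimal sketch of the kind of reconnect-with-retry wrapper that would mask it, assuming the MySQLdb driver; the retry count, delay, and connection parameters are assumptions, and this is not what buildbot actually does internally.

# Sketch: retry a MySQLdb connection on transient error 2013
# ("Lost connection to MySQL server"). Retry count and backoff
# are assumed values, not buildbot's real behavior.
import time
import MySQLdb

def connect_with_retry(retries=3, delay=5, **kwargs):
    """Open a MySQL connection, retrying on 'lost connection' errors."""
    for attempt in range(retries):
        try:
            return MySQLdb.connect(**kwargs)
        except MySQLdb.OperationalError as e:
            if e.args[0] != 2013 or attempt == retries - 1:
                raise          # not transient, or out of retries
            time.sleep(delay)  # back off before reconnecting

# Hypothetical usage; host/db/user names are placeholders:
# conn = connect_with_retry(host="dbhost", db="buildbot", user="bb")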
bm5 has 4 CPUs instead of 2. We saw a great deal of CPU wait on bm{04,06,11} between 10:15 and 10:25 PDT, which matches the initial comment. More than 150 test jobs were queued from 10:00 AM to 10:44 AM. I will go and set up buildbot-master02 to split the load in bug 672844.
coop, dustin, catlee and zandr: I want to keep you in the loop on where things are. Two things:

#1) We are having kvm load alerts [1].
#2) bm15 & bm16 have no cpu_wio [2], even though they have ~57 slaves like the other 4 test masters.

From previous experience, cpu_wio prevents masters from handing jobs to the slaves in a timely manner. This would explain the bad wait times we have recently seen.

To address #1, we are thinking of running automatic load balancing, since we are not even using kvm3 [3] (it was probably down for repairs when the VMs were being created). zandr thinks "the first thing we should do is try an automated balance, but do it in a maint. window so we can pause/reboot masters if need be" and "the point of a rebalance now would be to make use of kvm3". IT also thinks that more disks will help.

For #2, catlee mentioned that perhaps we should revisit the idea of having masters dedicated to a certain subset of OSes, like bm15 and bm16, which are dedicated to xp and w7 slaves. Their memory usage is half, there is no noticeable cpu_wio, and kvm4 (where they are hosted) has not complained [4].

I am also bringing buildbot-master2 back into action in bug 672844, which would bring the # of slaves per master down from ~57 to ~46. I will also evaluate enabling bm2-tests2 on Friday, which would bring the count down to ~38 slaves. My measuring stick for success will be cpu_wio.

[1] 10:25 < nagios-sjc1> [82] kvm1.infra.scl1:avg load is CRITICAL: CRITICAL - load average: 7.77, 29.29, 22.34
[2] http://ganglia1.build.scl1.mozilla.com/ganglia/graph.php?c=RelEngSCL1&h=buildbot-master04.build.scl1.mozilla.com&v=3.2&m=cpu_wio&r=week&z=medium&jr=&js=&st=1311186207&vl=%25&ti=CPU%20wio&z=large
http://ganglia1.build.scl1.mozilla.com/ganglia/graph.php?c=RelEngSCL1&h=buildbot-master15.build.scl1.mozilla.com&v=0.1&m=cpu_wio&r=week&z=medium&jr=&js=&st=1311186192&vl=%25&ti=CPU%20wio&z=large
[3] (among other non-relevant VMs)
kvm1 - bm11
kvm2 - bm{4,6}
kvm4 - bm[13-17]
[4] [2:18pm] armenzg_buildduty: bkero: has kvm4 complained?
[2:18pm] bkero: armenzg_buildduty: not to my knowledge, no
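To make the "automated balance" idea above concrete, here is a minimal sketch of what a one-VM rebalance toward kvm3 could look like, assuming the kvm hosts are managed with libvirt (an assumption on my part; the host URIs and the naive victim-selection heuristic are illustrative only, not how the cluster is actually managed).

# Sketch: live-migrate one VM from the busiest kvm host to idle kvm3.
# Assumes libvirt access over ssh; all names here are illustrative.
import libvirt

SOURCES = ["kvm1.infra.scl1", "kvm2.infra.scl1", "kvm4.infra.scl1"]
TARGET = "kvm3.infra.scl1"  # currently unused host, per [3] above

def running_domains(conn):
    # Running-VM count as a crude proxy for host load; a real balance
    # would weigh cpu_wio, memory pressure, etc.
    return [d for d in conn.listAllDomains() if d.isActive()]

if __name__ == "__main__":
    conns = {h: libvirt.open("qemu+ssh://%s/system" % h) for h in SOURCES}
    busiest = max(SOURCES, key=lambda h: len(running_domains(conns[h])))
    victim = running_domains(conns[busiest])[0]  # naive pick
    dest = libvirt.open("qemu+ssh://%s/system" % TARGET)
    # Live migration, no rename, no bandwidth cap.
    victim.migrate(dest, libvirt.VIR_MIGRATE_LIVE, None, None, 0)
    print("migrated %s from %s to %s" % (victim.name(), busiest, TARGET))

As zandr notes above, anything like this should run in a maintenance window so masters can be paused/rebooted if a migration goes sideways.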
We have addressed issue #2 by enabling another test master (bm2-tests1). This reduced the # of slaves that each test master needs to talk to (from ~57 to ~45). I have looked at the load of bm{04,06,11} and it looks good.

I have added #1 to https://intranet.mozilla.org/Build:InfrastructurePriorities#Nominating
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
bm02-tests2 also got enabled this morning. bm2 is not even breaking a sweat with 2 test masters running on it:
http://ganglia1.build.scl1.mozilla.com/ganglia/?c=RelEngSCL1&h=buildbot-master2.build.scl1.mozilla.com&m=load_fifteen&r=day&s=descending&hc=4&mc=2

We now have 6 test masters for non-Windows slaves (2 on bm2 & 1 on bm5). This means there are now ~37 slaves per master.

Currently, our worst enemy is the masters on the kvm cluster, which still have a lot of IO wait.
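As a sanity check on the slaves-per-master figures quoted in this bug, a quick back-of-envelope calculation, assuming a fixed pool of roughly 4 × 57 ≈ 228 non-Windows test slaves:

# Back-of-envelope check of the numbers quoted in the comments above.
slaves = 4 * 57  # the original 4 test masters at ~57 slaves each
for masters in (4, 5, 6):
    print("%d masters -> ~%d slaves per master" % (masters, slaves // masters))
# -> 57, 45, 38: matching the ~57, ~45, and ~37-38 cited as each
#    additional test master came online.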
(In reply to comment #4)
> I have added #1 to
> https://intranet.mozilla.org/Build:InfrastructurePriorities#Nominating

Is there a new bug for this (kvm load alerts) if it's being nominated, since this bug is now marked FIXED?
(In reply to comment #7)
> Is there a new bug for this (kvm load alerts) if it's being nominated, since
> this bug is now marked FIXED?

The high load is still there, but I used this bug to deal with the peak we saw from Monday to Wednesday. In fact, bkero mentioned there were no more kvm nagios alerts after I enabled bm02-tests1.

Also:
* zandr has requested to buy more disks (to improve wio).
* We have reduced the load on the kvm cluster by turning bm2-tests1 and bm2-tests2 back on.
* We also spoke of rebalancing the cluster so kvm3 gets used as well.
Filed bug 674144 to install disks.
Product: mozilla.org → Release Engineering