Closed Bug 472411 Opened 16 years ago Closed 16 years ago

Need more build machine computing power

Categories

(Release Engineering :: General, defect, P3)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sayrer, Assigned: joduinn)

Details

tracemonkey and mozilla-central are starved for build machines today; tracemonkey's windows build machine has fallen off the tinderbox. Does Nagios report this?
<nagios> [29] surf:Tinderbox - TraceMonkey is CRITICAL: =-=-= REMOVED: WINNT 5.2 tracemonkey build
To point to a specific example, there aren't any builds in the "Linux mozilla-central build" column on http://tinderbox.mozilla.org/Firefox/ for the pushes at 15:21, 15:32, or 15:55; the next build after the one for 7f7c7c7a4afe started at 16:37:39. See the pushlog snippet at http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=cd085064b5d1&tochange=ccba5b556949
suggestion: buy way more capacity than you think you need, and avoid these problems moving forward.
The slaves moz2-win32-slave1 -> 18 are all shared between mozilla-central, mozilla-191 and tracemonkey. A quick glance shows that the win32 slaves are all busy doing other builds. I also see an m-c win32 build and an m-c win32 unittest produced around 4pm for mozilla-central, as well as tracemonkey builds completed in the last hour, and lots of m-c builds in progress since then. I'm investigating; it's odd that we'd get overrun like this. Was there a sudden spike of checkins? Or are a bunch of win32 slaves down/missing?
Assignee: nobody → joduinn
OS: Mac OS X → All
Priority: -- → P1
Hardware: x86 → All
Summary: Need more build machine computing power → win32 slaves ignoring tracemonkey requests
This is likely worsened by the fact that slaves like to stick to the builder they're on, so as long as that builder has pending builds, the slaves don't give other builders a chance to run. I thought bhearsum had a buildbot ticket on that, but I can't find it right now.
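For illustration only, here is a rough Python sketch of that stickiness (hypothetical code, not buildbot's actual slave-allocation logic): a slave that always drains its current builder's queue first can starve other builders, whereas picking the builder with the oldest pending request spreads the slaves around.

```python
# Hypothetical sketch, not buildbot code: contrast a "sticky" slave that
# drains its current builder first with a fairer oldest-request policy.
from collections import deque

class Builder:
    def __init__(self, name):
        self.name = name
        self.pending = deque()   # submit timestamps of queued requests, oldest first

def sticky_choice(current, builders):
    """Sticky behaviour: keep serving the same builder while it has work."""
    if current.pending:
        return current
    return next((b for b in builders if b.pending), None)

def oldest_request_choice(builders):
    """Fairer behaviour: serve whichever builder has the oldest pending request."""
    waiting = [b for b in builders if b.pending]
    return min(waiting, key=lambda b: b.pending[0]) if waiting else None
```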
The periodic scheduler was triggering a build on mozilla-central and on mozilla-191 every two hours, regardless of whether there were any checkins. This is now reduced to every 10 hours, like we do on tracemonkey, which should help reduce the queue of jobs coming into the pool. All win32 slaves are busy, so I've killed and starred some in-progress periodic win32 builds to let "real" checkins get through.
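As a hedged sketch of what that change might look like in a buildbot master.cfg (the import path, scheduler name, and builder name here are assumptions for illustration, not the actual buildbot-configs change):

```python
# Sketch only: stretching a Periodic scheduler's interval from 2h to 10h.
from buildbot.scheduler import Periodic   # 0.7.x-era import path (assumed)

c = BuildmasterConfig = {}
c['schedulers'] = []

c['schedulers'].append(Periodic(
    name="mozilla-central periodic",                # assumed scheduler name
    builderNames=["Linux mozilla-central build"],   # assumed builder name
    periodicBuildTimer=10 * 60 * 60,                # seconds; was 2 * 60 * 60
))
```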
Doesn't that mean (given that talos never retests the same build) that we'll have much less talos data?
Turns out that there were two problems here:

1) Before the holidays, we finished the changeover so that both builds and unittests are now handled by the same pool-of-slaves; the old unittest master and unittest pool-of-slaves were idled. However, we hadn't finished bringing all of those idle slaves over to the one big "build+unittest" pool. This meant that the pool which had been processing builds for moz-191, m-c and tracemonkey just fine was now also processing unittests for moz-191, m-c and tracemonkey. That worked fine for a while before and during the holidays because of overcapacity in the pool and the low number of checkins. However, a spike in checkins on Tuesday quickly overran the capacity of the pool. While this was first noticed on win32 tracemonkey, it was actually a problem for all OSes on all moz2 branches (moz-191, m-c and tracemonkey). :-( No jobs were ever lost; during the backlog mid-Tuesday, jobs were just queued waiting for an available idle slave. Later Tuesday night, as checkins slowed, the slaves caught back up and processed all pending jobs. We didn't hit the backlog again Wednesday because of fewer checkins, and later Wednesday we finished adding 9 slaves to the pool (4 linux, 2 win, 3 mac); details in bug#465868 #c35. We're still working on verifying a few other slaves, which will be added to the pool today/Monday.

2) Adding to the fun, we've been running "to keep tinderbox and talos busy, kick off a build every 2 hours" on both mozilla-central and mozilla-191. This is now reduced to "every 10 hours", like we do for tracemonkey.

So far, all continues to look ok since late Tuesday, but we'll leave this bug open for a little longer just to watch. If you see any more occurrences of backlogs, please update this bug.

(ps: I've put back rsayre's original summary, because it's turned out to be quite accurate; the problem was not specific to win32 or tracemonkey after all)
Priority: P1 → P3
Summary: win32 slaves ignoring tracemonkey requests → Need more build machine computing power
(In reply to comment #7)
> Doesn't that mean (given that talos never retests the same build) that we'll
> have much less talos data?

Actually, Talos was only able to keep up with the flow of incoming builds by skipping over all queued builds (those with changes and those triggered by periodic) and dealing only with the last in the queue. No shortage of builds, or talos runs! We've reduced mozilla-central and mozilla-191 down to the same 10 hour interval as tracemonkey; that's worked fine there so far, and we believe that, given the volume of incoming checkins, it will work fine on m-c/moz-191 also. How about we watch this for a while and, if for any reason this is not ok on m-c/moz-191, please file a separate bug to change it?
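In other words (illustrative sketch only, not Talos code), the queue handling under load amounts to something like:

```python
# Sketch of "skip to the newest queued build to keep up" (not Talos code).
def next_build_to_test(queued_builds):
    """queued_builds: build ids in arrival order, oldest first."""
    if not queued_builds:
        return None
    latest = queued_builds[-1]   # only test the most recent build
    queued_builds.clear()        # drop the older backlog entirely
    return latest

# e.g. next_build_to_test(["7f7c7c7a4afe", "cd085064b5d1", "ccba5b556949"])
# returns "ccba5b556949" and discards the older two.
```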
The tracemonkey tree has no performance boxes.
I think we really do still want builds every 2 hours so we get talos data. However, we can still make the timed builds interfere much less by making the 2-hour timer reset every time there's a push-triggered build (which is what we've wanted since they started). In other words, all we need is a timer that ensures there's a build if there hasn't been one for two hours.
(In reply to comment #11)
> I think we really do still want builds every 2 hours so we get talos data.
> However, we can still make the timed builds interfere much less by making the
> 2-hour timer reset every time there's a push-triggered build (which is what
> we've wanted since they started). In other words, all we need is a timer that
> ensures there's a build if there hasn't been one for two hours.

This should be addressed by bug 472930
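For reference, the idea in comment 11 amounts to an idle timer rather than a fixed-period timer. A minimal sketch (hypothetical; not the actual bug 472930 patch):

```python
# Hypothetical sketch: trigger a timed build only if nothing has built for
# two hours, and reset the countdown whenever a push triggers a build.
import time

IDLE_INTERVAL = 2 * 60 * 60   # seconds

class IdleTimer:
    def __init__(self, trigger_build):
        self.trigger_build = trigger_build
        self.last_build = time.time()

    def on_push_build(self):
        # A checkin already produced a build; restart the two-hour countdown.
        self.last_build = time.time()

    def poll(self):
        # Called periodically (e.g. once a minute) by the scheduler loop.
        if time.time() - self.last_build >= IDLE_INTERVAL:
            self.trigger_build()
            self.last_build = time.time()
```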
Are there more computers coming?
The tracemonkey tree has only one windows machine running performance tests, so it's skipping revisions. What is the plan to resolve these issues?
Flags: blocking1.9.1?
The ticket I talked about in comment 5 is http://buildbot.net/trac/ticket/334. catlee seems to be on it.
(In reply to comment #14)
> The tracemonkey tree has only one windows machine running performance tests,
> so it's skipping revisions.

All branches, not just tracemonkey, have only one triplicate set of talos performance machines running on them. Under load, Talos has always intentionally skipped to the last queued job, just to keep up.

> What is the plan to resolve these issues?

This is already being worked on in bug#457885.
(In reply to comment #13)
> Are there more computers coming?

Sorry for the delay updating this bug. Here's a quick list of the machines added to the production build+unittest pool as part of fixing this bug:

08jan2009: moz2-darwin9-slave05, moz2-darwin9-slave07, bm-xserve22, moz2-linux-slave13, moz2-win32-slave09, moz2-win32-slave10
09jan2009: moz2-darwin9-slave06, moz2-win32-slave07, moz2-win32-slave08, moz2-win32-slave14
14jan2009: moz2-win32-slave19, moz2-win32-slave20
19jan2009: moz2-linux-slave07, moz2-linux-slave08, moz2-linux-slave09, moz2-linux-slave10

We've been seeing idle slaves for over a week now, and have added even more since then, so we believe we now have enough slaves to keep up with demand. Closing as FIXED.

If you see situations where patches that were not landed at the same time are being bundled together in the same build, or where unittests are not started within minutes of a patch landing, please file a new bug. We'll investigate and can add even more slaves if needed.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
(In reply to comment #16)
> (In reply to comment #14)
> > The tracemonkey tree has only one windows machine running performance tests,
> > so it's skipping revisions.
>
> All branches, not just tracemonkey, have only one triplicate set of talos
> performance machines running on them. Under load, Talos has always
> intentionally skipped to the last queued job, just to keep up.

Right, and we're saying that that's not acceptable, since we lose the ability to properly determine a regression range for performance problems.

> > What is the plan to resolve these issues?
> This is already being worked on in bug#457885.

That bug sounds like it's about something different, and when it's resolved Talos still won't be able to keep up and will still skip revisions on us (per bug 457885 comment 4). It's also not being actively worked, and this has been an active problem for us since September, per that bug. I'll open a new ticket to get more hardware on the case here, as requested.
Flags: blocking1.9.1?
Product: mozilla.org → Release Engineering