Need more build machine computing power

Status: RESOLVED FIXED
Product: Release Engineering
Component: General
Priority: P3
Severity: normal
Reported: 9 years ago
Last modified: 5 years ago

People: (Reporter: Robert Sayre, Assigned: joduinn)

Firefox Tracking Flags: (Not tracked)

(Reporter)

Description

9 years ago
The tracemonkey and mozilla-central trees are starved for build machines today; tracemonkey's Windows build machine has fallen off the tinderbox. Does Nagios report this?

Comment 1

9 years ago
<nagios> [29] surf:Tinderbox - TraceMonkey is CRITICAL: =-=-= REMOVED: WINNT 5.2 tracemonkey build
Comment 2

9 years ago
To point to a specific example, there aren't any builds in the "Linux mozilla-central build" column on http://tinderbox.mozilla.org/Firefox/ for the pushes at 15:21, 15:32, or 15:55; the next build after the one for 7f7c7c7a4afe started at 16:37:39. See the pushlog snippet at http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=cd085064b5d1&tochange=ccba5b556949
(Reporter)

Comment 3

9 years ago
Suggestion: buy way more capacity than you think you need, and avoid these problems going forward.
Comment 4

9 years ago
The slaves moz2-win32-slave1 through 18 are all shared between mozilla-central, mozilla-191 and tracemonkey. A quick glance shows that the win32 slaves are all busy doing other builds. I also see an m-c win32 build and an m-c win32 unittest produced at approximately 4pm for mozilla-central, as well as tracemonkey builds completed in the last hour. There are also lots of m-c builds in progress since then.

I'm investigating; it's odd that we'd get overrun like this. Was there a sudden spike of checkins? Or are a bunch of win32 slaves down/missing?
Assignee: nobody → joduinn
OS: Mac OS X → All
Priority: -- → P1
Hardware: x86 → All
Summary: Need more build machine computing power → win32 slaves ignoring tracemonkey requests

Comment 5

9 years ago
This is likely worsened by the fact that slaves like to stick to the builder they're on, so as long as the builder has pending builds, the slaves don't give other builders a chance to run.

I thought that bhearsum had a buildbot ticket on that, but I can't find it right now.
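
For illustration only, here is a toy model of that stickiness. This is not buildbot's actual slave-selection code; the function and example data are made up.

# Toy model of the stickiness described in comment 5 -- NOT buildbot's actual
# slave-selection code; the function and example data are invented.
def pick_next_builder(slave, pending_by_builder):
    """slave: dict with a 'current_builder' key.
    pending_by_builder: dict mapping builder name -> number of pending builds."""
    current = slave.get("current_builder")
    if current and pending_by_builder.get(current, 0) > 0:
        # Sticky: stay on the current builder while it still has pending work,
        # even if other builders' queues are growing.
        return current
    # Otherwise fall back to whichever builder has the longest queue.
    candidates = [b for b, n in pending_by_builder.items() if n > 0]
    return max(candidates, key=pending_by_builder.get) if candidates else None

slave = {"name": "moz2-win32-slave01", "current_builder": "WINNT 5.2 mozilla-central build"}
queues = {"WINNT 5.2 mozilla-central build": 3, "WINNT 5.2 tracemonkey build": 2}
print(pick_next_builder(slave, queues))  # stays on mozilla-central; tracemonkey waits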
Comment 6

9 years ago
The periodic scheduler was triggering a build on mozilla-central and on mozilla-191 every two hours, regardless of whether there were any checkins. This is now reduced to every 10 hours, like we do on tracemonkey. This should help reduce the queue of jobs coming into the pool.

All win32 slaves are busy, so I've killed and star'd some periodic win32 builds that were in progress to let "real" checkins get through.
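
For context, the change described above would look roughly like the following fragment of a buildbot 0.7-era master.cfg. This is only a sketch under that assumption; the scheduler and builder names are placeholders, not the real RelEng configuration.

# Sketch of a buildbot 0.7-era master.cfg fragment -- placeholder names,
# not the actual RelEng config.
from buildbot.scheduler import Periodic

c = BuildmasterConfig = {}
c['schedulers'] = []

# Before: an unconditional build every 2 hours, regardless of checkins.
#   Periodic(name="mozilla-central periodic",
#            builderNames=["Linux mozilla-central build"],
#            periodicBuildTimer=2 * 60 * 60)

# After: reduced to every 10 hours, matching what tracemonkey already uses.
c['schedulers'].append(Periodic(
    name="mozilla-central periodic",
    builderNames=["Linux mozilla-central build"],  # placeholder builder name
    periodicBuildTimer=10 * 60 * 60,               # seconds
))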
Comment 7

9 years ago
Doesn't that mean (given that talos never retests the same build) that we'll have much less talos data?
Comment 8

9 years ago
Turns out that there were two problems here:

1) Before the holidays, we finished the changeover, so now both builds and unittests are handled by the same pool-of-slaves. The old unittest master and unittest pool-of-slaves were idled.

However, we hadn't finished bringing all the idle slaves over to the one big "build+unittest" pool-of-slaves. This meant that the pool-of-slaves which had been processing builds on moz-191, m-c and tracemonkey just fine was now also processing unittests for moz-191, m-c and tracemonkey. This worked fine for a while before and during the holidays because of overcapacity in the pool and the low number of checkins. However, a spike in the number of checkins on Tuesday quickly overran the capacity of the pool. While this was first noticed on win32 tracemonkey, it was actually a problem for all OSes on all moz2 branches (moz-191, m-c and tracemonkey). :-(

No jobs were ever lost; during the backlog mid-Tuesday, jobs were simply queued waiting for an available idle slave. Later on Tuesday night, as checkins slowed, the slaves were able to catch back up and processed all pending jobs.

We didn't hit the backlog problem again on Wednesday because of fewer checkins, and later on Wednesday we finished adding 9 slaves to the pool (4 Linux, 2 Windows, 3 Mac); details in bug#465868 #c35. We're still working on verifying a few other slaves, which will be added to the pool today/Monday.


2) Adding to the fun, we've been running a "to keep tinderbox and talos busy, kick off a build every 2 hours" policy on both mozilla-central and mozilla-191. This is now reduced to "every 10 hours", like we do for tracemonkey.


So far, everything continues to look OK since late Tuesday, but we'll leave this bug open for a little longer just to watch. If you see any more occurrences of backlogs, please update this bug.

(ps: I've put back rsayre's original summary, because it's turned out to be quite accurate; the problem was not specific to win32 or tracemonkey after all)
Priority: P1 → P3
Summary: win32 slaves ignoring tracemonkey requests → Need more build machine computing power
Comment 9

9 years ago
(In reply to comment #7)
> Doesn't that mean (given that talos never retests the same build) that we'll
> have much less talos data?

Actually, Talos was only able to keep up with the flow of incoming builds by skipping over all queued builds (those with changes and those triggered by the periodic scheduler) and dealing with the last one in the queue. No shortage of builds, or Talos runs!

We've reduced mozilla-central and mozilla-191 down to the same 10 hour interval as tracemonkey; that's worked fine there so far, and we believe that, with the volume of incoming checkins, it will work fine on m-c/moz-191 also. How about we watch this for a while, and if for any reason this is not OK on m-c/moz-191, file a separate bug to change it?
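
As a rough illustration of that "skip to the last queued job" behaviour (this is not Talos's actual code; the queue and revision names are invented):

# Rough illustration of "skip to the last queued build" -- not actual Talos code.
from collections import deque

pending = deque()  # builds that queued up while the test machine was busy

def next_build_to_test():
    """Return the newest pending build and the older builds that get skipped."""
    if not pending:
        return None, []
    newest = pending.pop()     # rightmost item = most recently queued build
    skipped = list(pending)    # everything older is intentionally dropped
    pending.clear()
    return newest, skipped

# Three builds pile up during one long Talos run; only the newest gets tested.
pending.extend(["rev-a", "rev-b", "rev-c"])
print(next_build_to_test())    # ('rev-c', ['rev-a', 'rev-b'])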
(Reporter)

Comment 10

9 years ago
The tracemonkey tree has no performance boxes.
Comment 11

9 years ago
I think we really do still want builds every 2 hours so we get talos data. However, we can still make the timed builds interfere much less by making the 2-hour timer reset every time there's a push-triggered build (which is what we've wanted since they started). In other words, all we need is a timer that ensures there's a build if there hasn't been one for two hours.
Comment 12

9 years ago
(In reply to comment #11)
> I think we really do still want builds every 2 hours so we get talos data.
> However, we can still make the timed builds interfere much less by making the
> 2-hour timer reset every time there's a push-triggered build (which is what
> we've wanted since they started).  In other words, all we need is a timer that
> ensures there's a build if there hasn't been one for two hours.

This should be addressed by bug 472930.
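
For illustration, the timer-reset behaviour described in comment 11 could be sketched like this. It is a toy example in plain Python, not buildbot's scheduler API and not the patch in bug 472930.

# Toy sketch of the idea in comment 11 -- not buildbot's API and not bug 472930's patch.
# Kick off a timed build only when nothing has built for two hours; every
# push-triggered build resets the two-hour idle timer.
import threading

IDLE_PERIOD = 2 * 60 * 60  # seconds

class IdleTimerScheduler:
    def __init__(self, trigger_build):
        self.trigger_build = trigger_build  # callable that queues a build
        self._timer = None
        self._reset_timer()

    def _reset_timer(self):
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(IDLE_PERIOD, self._idle_build)
        self._timer.daemon = True
        self._timer.start()

    def _idle_build(self):
        # Two hours with no builds: queue one, then start waiting again.
        self.trigger_build()
        self._reset_timer()

    def on_push(self, revision):
        # A push triggers a real build and resets the idle timer, so timed
        # builds never pile up behind push-triggered ones.
        self.trigger_build()
        self._reset_timer()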
(Reporter)

Comment 13

9 years ago
Are there more computers coming?
(Reporter)

Comment 14

9 years ago
The tracemonkey tree has only one Windows machine running performance tests, so it's skipping revisions.

What is the plan to resolve these issues?
Flags: blocking1.9.1?

Comment 15

9 years ago
The ticket I talked about in comment 5 is http://buildbot.net/trac/ticket/334. catlee seems to be on it.
Comment 16

9 years ago
(In reply to comment #14)
> The tracemonkey has only one windows machine running performance tests, so it's
> skipping revisions.

All branches, not just tracemonkey, have only one triplicate set of talos performance machines running on them. Under load, Talos has always intentionally skipped to the last queued job, just to keep up.

> What is the plan to resolve these issues?
This is already being worked on in bug#457885.
Comment 17

9 years ago
(In reply to comment #13)
> Are there more computers coming?

Sorry for the delay updating this bug. Here's a quick list of the machines added to the production build+unittest pool as part of fixing this bug:


08jan2009
moz2-darwin9-slave05
moz2-darwin9-slave07
bm-xserve22
moz2-linux-slave13
moz2-win32-slave09
moz2-win32-slave10

09jan2009 
moz2-darwin9-slave06
moz2-win32-slave07
moz2-win32-slave08
moz2-win32-slave14

14jan2009
moz2-win32-slave19
moz2-win32-slave20

19jan2009
moz2-linux-slave07
moz2-linux-slave08
moz2-linux-slave09
moz2-linux-slave10

At this point, we've been seeing idle slaves for over a week, and have added even more slaves since then, so we believe we now have enough to keep up with demand. Closing as FIXED.

If you see situations where patches that did not land at the same time are being bundled together in the same build, or where unittests are not started within minutes of a patch landing, please file a new bug. We'll investigate and can add even more slaves if needed.
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
Comment 18

9 years ago
(In reply to comment #16)
> (In reply to comment #14)
> > The tracemonkey has only one windows machine running performance tests, so it's
> > skipping revisions.
> 
> All branches, not just tracemonkey, have only one triplicate set of talos
> performance machines running on them. Under load, Talos has always
> intentionally skipped to the last queued job, just to keep up.

Right, and we're saying that that's not acceptable, since we lose the ability to properly determine a regression range for performance problems.

> > What is the plan to resolve these issues?
> This is already being worked on in bug#457885.

That bug sounds like it's about something different, and like even when it's resolved Talos still won't be able to keep up and will still skip revisions on us (per bug 457885 comment 4). It's also not being actively worked on, and this has been an active problem for us since September, per that bug.

I'll open a new ticket to get more hardware on the case here, as requested.
Flags: blocking1.9.1?
Product: mozilla.org → Release Engineering