Need additional computing power for tracemonkey talos

RESOLVED FIXED

Status

Priority: --
Severity: major
Status: RESOLVED FIXED
Opened: 10 years ago
Last modified: 6 years ago

People

(Reporter: shaver, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

+++ This bug was initially created as a clone of Bug #472411 +++

tracemonkey talos is skipping revisions, making it impossible to use the talos reports to effectively determine when performance regressions occurred, and dramatically slowing down work on some parts of 3.1.  The data about incoming job rate and test completion time isn't available to me that I can see, but I presume releng has it.  I'm requesting that additional Talos hardware be deployed to triple the bandwidth available to tracemonkey talos on at least Windows and Mac, as soon as it can be acquired.  (Idle talos machines are an insignificant waste compared to idle engineers or having to manually throttle submissions to ensure that we don't skip data points.  The difficulty of setting up an accurate Talos environment on an individual developer's machine compounds the problem in an unfortunate way.)

Can someone please update this bug with an ETA on hardware arrival and machine deployment?  We need to incorporate the impact of this on beta3 and rc1 schedules, as soon as possible.
Alice, do we have any machines set up that can be deployed to this task?
Component: Release Engineering → Release Engineering: Talos
Can I get an idea of when this setup was under load?

When I went to check to see how far behind things had fallen I found several idle talos machines and no queue in sight.  There were, however, three talos boxes that had been marked as hidden on the waterfall and were not being displayed.  I've unhidden the boxes and ensured that they are being scraped.
(In reply to comment #2)
> Can I get an idea of when this setup was under load?

Over the weekend, mostly, I believe, when we were trying to track down talos crashes and so forth by bisecting our changeset history.

> When I went to check to see how far behind things had fallen I found several
> idle talos machines and no queue in sight.  There were, however, three talos
> boxes that had been marked as hidden on the waterfall and were not being
> displayed.  I've unhidden the boxes and ensured that they are being scraped.

How can we check this, so we don't have to guess if they're behind/hidden/crashed/etc.?
If they've been hidden it can be resolved through the tinderbox admin page - unfortunately, there's a bug in tinderbox wherein if machines have fallen off the waterfall due to idle time they are sometimes re-added as hidden when they do their new report.  Considering that tracemonkey is (was?) a pretty low activity tree, those machines probably fell off months ago and never got re-added to the tinderbox correctly.  Whoever is sheriff for tracemonkey should be able to check the admin page if they no longer see all 9 talos boxes being displayed.

Having hidden boxes does not mean that the talos boxes themselves were idle.  They were up and testing, their results were just not viewable on the waterfall.
(In reply to comment #4)
> If they've been hidden it can be resolved through the tinderbox admin page -
> unfortunately, there's a bug in tinderbox wherein if machines have fallen off
> the waterfall due to idle time they are sometimes re-added as hidden when they
> do their new report.

Er, huh? Bug in tinderbox server? I haven't seen any bug reporting such a thing. If there's indeed a problem with tinderbox server, please file a bug under Webtools :: Tinderbox.
We did, bug 390349, but had a credibility problem.
But no matter how many talos boxes the tree has, it'll still skip builds by design, won't it?

This afternoon, there were Linux builds that ran from 13:37-14:29, 14:02-14:19, and 14:07-14:24. What seems to have happened is that the one which started at 14:07 got a talos run and the other two didn't. That fits pretty well with how it's been explained to me: the poller runs every ten minutes and takes only the newest of the builds which it hasn't seen before. With lucky polling those three might have produced two (but not three) talos runs, but they weren't that lucky. Faster polling would decrease the odds of that happening, but as I understand it the poller's designed not to even try to test every build.
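
To make the "two but not three" arithmetic concrete, here is a rough sketch of that polling behavior. It is not the real poller code. It assumes a model in which each poll sees every finished build that started after the newest build already picked, and picks only the newest of those. The function names, the poll phases (14:05 and 14:00) and the number of polls are made up for illustration; only the three build times come from the tree.

```python
def minutes(hhmm):
    """Convert an "HH:MM" string to minutes since midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def simulate_poller(builds, first_poll, interval=10, polls=6):
    """Return the start times of the builds that would get a talos run.

    builds: list of (start, finish) "HH:MM" pairs.
    Assumed model (not taken from the real poller): at each poll, consider
    every finished build that started after the newest build already
    picked, and pick only the newest of those.
    """
    picked = []
    newest_picked = -1
    poll_time = minutes(first_poll)
    for _ in range(polls):
        candidates = [
            start for start, finish in builds
            if minutes(finish) <= poll_time and minutes(start) > newest_picked
        ]
        if candidates:
            newest = max(candidates, key=minutes)
            newest_picked = minutes(newest)
            picked.append(newest)
        poll_time += interval
    return picked


# The three Linux builds from this afternoon, as (start, finish).
BUILDS = [("13:37", "14:29"), ("14:02", "14:19"), ("14:07", "14:24")]

unlucky = simulate_poller(BUILDS, first_poll="14:05")  # polls at :05, :15, :25, ...
lucky = simulate_poller(BUILDS, first_poll="14:00")    # polls at :00, :10, :20, ...
print(unlucky, lucky)
```

Under this model no poll phase yields three runs: the 13:37 build finishes last but started first, so by the time it shows up a newer build has already been picked.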

The times I've looked at the tracemonkey pushlog, that's fairly typical: average of maybe six or seven pushes a day, but tightly grouped in one or two bursts with an outlier or two, so _probably_ it wouldn't overload a poller that queued up more than one build per poll. Probably.

(Then there's the question of whether cross-machine numbers are really good enough to say that if Tss was 173.74 before your patch, on qm-plinux-trunk04, and 177.38 with your patch, on qm-plinux-trunk05, then that means you regressed it as opposed to meaning that trunk05 is a hair slower or had one bad run, but luckily that's not my problem.)
Talos does skip builds, but at the moment not by design - see bug 457885.  That bug should be resolved this quarter.

Once we've fixed the random skipping of builds based upon build start times/polling interval issues, we'll test a lot more stuff.  We'll skip less, but I would still expect buildbot's queue compression to drop builds under heavy load.

Our goal is to test as many builds as we possibly can and to only skip builds if we are unable to get ahead.

If this bug is now about inappropriate skipping of builds it should be duped to the active ftppoller failures bug.
(In reply to comment #7)
> The times I've looked at the tracemonkey pushlog, that's fairly typical:
> average of maybe six or seven pushes a day, but tightly grouped in one or two
> bursts with an outlier or two, so _probably_ it wouldn't overload a poller that
> queued up more than one build per poll. Probably.
Interesting usage data, good to know. 

Alice's point about ftppoller in bug#457885 is one bug which is complicating the picture. Phil, when you said "by design" I think you were referring to how Talos uses queue-compression intentionally. After we fix bug#457885, we could try changing how queues are processed, turning off queue-compression for a while, and see if the talos machines do eventually catch up in the lulls between the bursts. Getting this wrong would be bad, and ironically, would fail worst under peak load times, so let's deal with that investigation separately.
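
A back-of-the-envelope way to think about "catching up in the lulls" is to track the backlog when nothing is ever compressed out. The sketch below is only an illustration: the time slots, the arrival pattern (seven pushes in two bursts) and the capacity of one build per slot are invented numbers, not measurements from the talos machines.

```python
def backlog_over_time(arrivals, capacity):
    """Return the number of untested builds left at the end of each slot.

    arrivals: builds arriving in each time slot.
    capacity: builds the talos machines can test per slot (assumed constant).
    No build is ever skipped, i.e. queue compression is turned off.
    """
    backlog = 0
    history = []
    for arriving in arrivals:
        backlog = max(0, backlog + arriving - capacity)
        history.append(backlog)
    return history


# Made-up day: a burst of 4 pushes, a lull, then a burst of 3 pushes.
ARRIVALS = [4, 0, 0, 0, 3, 0, 0]
history = backlog_over_time(ARRIVALS, capacity=1)
print(history)  # [3, 2, 1, 0, 2, 1, 0]
```

With these numbers the queue drains before the next burst. If the lulls were shorter than the bursts need, the backlog would keep growing, which is the peak-load failure mode mentioned above.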
(In reply to comment #3)
> (In reply to comment #2)
> How can we check this, so we don't have to guess if they're
> behind/hidden/crashed/etc.?

I've added text to the top of each of the tinderbox waterfalls for mozilla-central, mozilla-1.9.1 and tracemonkey giving a count of machines for each project branch. (There are *way* too many to list by name.)
* If the count does not match what you can see on the waterfall, then ask the sheriff to see if any machines are hidden. 
* If the count still does not match up, file a bug in mozilla.org:ServerOperations. Note that if you are missing *all* talos machines for a given o.s. on a given branch, then the bug should be marked as blocker. 

We currently are able to run 3 Talos runs concurrently, and have them all report to the tinderbox, so I'm going to close this as FIXED.

(To deal with peak-demand scenarios like this in future, arising on different project branches at different times, the work in bug#476099 is probably the way to go, but it's too early to tell yet.)
Status: NEW → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → FIXED
Fwiw, I just updated the branch for buildbot ticket 415, where we could make the merging of buildrequests load-dependent. Say, we could start merging if the pendingBuilds go beyond a threshold.
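
Roughly, the idea of load-dependent merging could look like the sketch below. This is not the code on the branch and not buildbot's API: `take_next`, `merge_threshold` and the `rev` names are hypothetical, and the queue is modelled as a plain oldest-first deque.

```python
from collections import deque


def take_next(pending, merge_threshold):
    """Pop the build requests that the next test run should cover.

    pending: deque of requests, oldest first (modified in place).
    While the queue is short, every request gets its own run. Once more
    than merge_threshold requests are waiting, they are all merged into
    a single run.
    """
    if not pending:
        return []
    if len(pending) > merge_threshold:
        merged = list(pending)
        pending.clear()
        return merged
    return [pending.popleft()]


quiet = deque(["rev1", "rev2"])
busy = deque(["rev1", "rev2", "rev3", "rev4", "rev5"])

quiet_run = take_next(quiet, merge_threshold=3)  # only the oldest request
busy_run = take_next(busy, merge_threshold=3)    # everything merged into one run
print(quiet_run, busy_run)
```

The point of the threshold is that data points are only lost when the machines are actually behind, instead of on every poll.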
I meant http://mxr.mozilla.org/mozilla/source/tools/buildbot-configs/testing/talos/perfmaster/ftppoller.py#141 which if I read it right means that if a poll finds multiple new directories, it will only look in the newest one ignoring all the others (or possibly means it may or may not depending on the sort, but "only in the newest" is consistent with the mozilla-central Linux behavior, where if the Linux 64 bit build starts and/or finishes at the wrong time, then nothing gets run because they're going in the same directory as the 32 bit builds and a "newest" directory with the 64 bit build is the only place the poller looks).
Oh, which is bug 457885. But what I really meant was that if shaver and sayrer want every single tracemonkey build to get a talos test run, they don't need to throw money at hardware because that *will not work*, they need to throw time at alice.

Updated

10 years ago
Component: Release Engineering: Talos → Release Engineering
Product: mozilla.org → Release Engineering