Closed Bug 443333 Opened 17 years ago Closed 17 years ago

Clarify tier level for fast Talos machines

Categories

(Release Engineering :: General, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: reed, Assigned: joduinn)

Details

In bug 414456, three fast Talos machines were brought up to replace the aging tinderbox perf machines on the Firefox (1.9.0 branch) tree. These new machines have been doing a great job as their replacement, and my prior worries about their performance have been completely lifted. As a sheriff who has had to deal with tracking down performance regressions before, I count a lot on having these fast performance machines to give me better knowledge about which patch caused a regression.
I was under the impression (from the comments in bug 414456) that the new fast Talos machines would be tier 1 for software and tier 2 for hardware (since they are minis), which would match exactly the old tinderbox perf machines they were replacing. However, recent discussions with alice and joduinn online have made it clear that RE does not consider these machines tier 1.
Fast performance machines are a real necessity for tracking down performance regressions, especially when lots of people commit at the same time. This is still quite noticeable with the use of test-on-commit in the new 1.9.1 / 2.0 trees, as the slow Talos machines test a range of commits while the fast Talos machines seem mostly to test each commit (with some exceptions). Sheriffs really need these fast Talos machines to be supported by IT as tier 1 so we can get any problems with the machines fixed as soon as possible. If these machines had redundant backups, then tier 1 wouldn't be as needed, but there are no backup machines for these, so we really need them to be IT supported as tier 1.
Also, I'm only asking for tier 1 support for software, just like the old tinderbox perf machines. 90-99% of the time, any problems with a tinderbox are software-related, so if a mac mini (as these machines are) can't be supported as tier 1 for hardware reasons, I at least want to make sure that they are tier 1 supported for software.
Tier 1 also means sheriffs can close the tree when these machines have problems, which is something that should be done, as these machines are our first line against performance regressions, and without them, it becomes very difficult to see which check-in caused a performance problem, as the main Talos machines take hours to complete their runs. Hope we can all come to an agreement about these machines that will allow sheriffs to count on them in the battle against performance regressions. Thanks.
We consider talos machines to be tier 2. The old perf machines were tier 1 because they provided the _only_ perf coverage per platform. While having a fast cycle machine go down is not ideal, there are still slower cycling talos machines providing coverage. Release engineering considers talos machines to be tier 1 blocking when we lose all coverage for a given platform on a given waterfall.
Status: NEW → ASSIGNED
Priority: -- → P1
Assignee: nobody → joduinn
Status: ASSIGNED → NEW
Release Engineering is organizing build, unittest, and talos machines into common pools. This means we are less vulnerable to tree closure if a single machine fails, because we have another machine available to cover for the failed machine and continue - an important concern when dealing with 250-270 machines in a 24x7 situation. See my past blogs for further details.
For support, this pooling strategy means:
- losing all machines for a given platform on a given branch closes the tree and is Tier1.
- losing an individual talos/unittest machine on a given branch does not close the tree and is Tier2, so long as there is at least one working machine still on that platform on that branch on that tree. Losing multiple machines in a pool is still Tier2, so long as there is still at least one remaining machine doing the work.
There are two other orthogonal issues here:
1) From a support point of view, we consider all talos machines to be part of the same pool, and governed by the same Tier1/Tier2 rules above. We do not distinguish between fast-cycle or full-cycle machines in this pool. To be precise, checkins are determined to be safe by looking at full cycle results - the fast-cycle machines run such a tiny, tiny subset that they are not even close to conclusive.
2) We make no distinction about the cause of the problem when deciding the support level for a machine. After all, when a machine goes offline, someone watching the tree won't know if it's a hardware/software/other problem - all they know is that it stopped working! Therefore, it's not possible to have software=tier2, hardware=tier1 without having already treated it as a tier1 situation and paged everyone.
In those situations, we need a clear "can I page IT in the middle of the night for this or not". Tier1 == close the tree, ok to page. Tier2 == don't close the tree, not ok to page.
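For illustration only, the pooling rule described in this comment could be sketched in Python roughly as follows. The names Machine, pool_tier and may_page_it are hypothetical and are not part of any Release Engineering tooling; the sketch assumes a pool is simply the list of machines covering one platform on one branch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    """Hypothetical record for one talos/unittest machine in a pool."""
    name: str
    platform: str
    branch: str
    working: bool


def pool_tier(machines):
    """Classify one pool (same platform, same branch) by the rule above.

    Returns "tier1" when no machine in the pool is working, "tier2" when
    some but not all machines are down, and "ok" when every machine works.
    Fast-cycle and full-cycle machines are deliberately not distinguished.
    """
    if not machines:
        raise ValueError("pool must contain at least one machine")
    working = sum(1 for m in machines if m.working)
    if working == 0:
        return "tier1"   # all coverage lost: close the tree
    if working < len(machines):
        return "tier2"   # degraded but covered: tree stays open
    return "ok"


def may_page_it(tier):
    """Tier1 == ok to page; anything else waits for business hours."""
    return tier == "tier1"


# Assumed example pool: one fast-cycle machine down, one full-cycle machine up.
pool = [
    Machine("fast-talos-1", "linux", "1.9.0", working=False),
    Machine("full-talos-1", "linux", "1.9.0", working=True),
]
print(pool_tier(pool), may_page_it(pool_tier(pool)))
```

Note that the cause of a failure (hardware or software) is not an input at all, which matches point 2) above: the decision has to be made before anyone knows why the machine stopped.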
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
CC'ing some sheriffs that have had to track down performance regressions before to get their opinions. It's not a very fun task, and without the fast Talos machines, it is very difficult.
I don't know what the criteria are for Tier1 vs Tier2, but not having fast Talos machines available causes development to go substantially more slowly. (Not having any Talos machines also causes development to go substantially more slowly, in that nothing happens until they return.) If we don't want fast-Talos machines to be Tier 1, I can only assume that it's because we believe they will cause a large support burden due to breaking frequently -- otherwise the value of having the fast cycle restored ASAP would outweigh the IT/releng cost of the escalation. But if that's the case, then making those machines more reliable would seem to be the right investment! Tier2 tinderboxes were originally ones for platforms that weren't important enough to back out a patch for if they signalled failure, and sometimes weren't ones that we were in a position to make reliable. It doesn't seem like either of those cases applies here, but it's been a long decade, so the criteria for distinguishing might have changed.
Tier1 vs Tier2 indicates the turnaround time involved in the repair of a failing machine. If a tier 1 machine goes red we will attempt to fix it immediately, including paging people on evenings/weekends. Tier 2 boxes are allowed to wait until regular business hours to be dealt with. Also, we aren't talking about leaving a given waterfall without coverage. As I stated in comment #2, if a full set of talos machines on a given platform goes red then we consider it to be a tier 1 issue to be handled immediately. The most likely situation would be a single fast talos machine behaving oddly on a Saturday or Sunday night, resulting in slower reporting of performance results until the issue is investigated on Monday. This is also the exact reason that the machines are in sets: if a single talos machine on a platform has gone AWOL (be it fast cycle, nochrome, etc.) we maintain coverage on that platform through the other boxes. And yes, we are constantly putting effort into keeping talos machines up and running - including patches to buildbot and talos along with machine configuration tweaks. We've been operating internally to Release Eng with this understanding of talos tier status for some time now (at least since the first sets started to appear on the waterfall) and I don't believe that the community has been underserved by it. For the most part, failing machines are resurrected quickly. We are mostly wrangling over an edge case ("OMG - I need to check in during this long weekend right now and a single fast cycle machine is orange!").
Is either of the following statements incorrect? 1) If it's not important to fix a tinderbox's machine problems on the weekend, then it shouldn't hold the tree closed if it's orange/red due to machine problems. 2) If a machine shouldn't hold the tree closed due to orange, then it shouldn't be on the main tinderbox. In other words, I think what you're saying is incompatible with saying that these machines should be on the main tinderbox display. And I think they *should* be on the main tinderbox display.
The design of the talos testing platform is to consider machines not as atomic units but as sets - thus having a single machine in a set go orange/red is not enough to close the tree. It was designed in this way to decrease tree closures based upon single point of failure problems. If the waterfall is going to be held to only machine granularity and not sets then talos shouldn't be on it.
Can we show the set-status in a single column? That would seem to make it easier for everyone to tell when something is wrong that needs dealing with, and when we're just running on one engine so to speak.
We initially had sets report to a single column, but it was hard to read, to say the least. They were split up by request so that people could make sense of the output - especially when some talos machines test the exact same build and thus report the same start time to the waterfall.
Ah, right, I remember now. I guess putting the talos data inside the waterfall-box for the build they're testing, and just having a "health column" might work, but that's hardly going to just be a config tweak. :/
(In reply to comment #7)
> The design of the talos testing platform is to consider machines not as atomic
> units but as sets - thus having a single machine in a set go orange/red is not
> enough to close the tree. It was designed in this way to decrease tree
> closures based upon single point of failure problems.
I didn't know this, and it contradicts what is at the top of http://tinderbox.mozilla.org/Firefox/ . Could we revise the text at the top of http://tinderbox.mozilla.org/Firefox/ to reflect this? (Currently it says "Do not check in when the tree is broken (red or orange).")
Product: mozilla.org → Release Engineering