Bug 1215766 (Closed) - Opened 9 years ago, Closed 9 years ago

Unhide tc cpptests and crashtests when they have sufficient infra support to run when they are scheduled

Component: Tree Management Graveyard :: Visibility Requests
Type: defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: philor
Assignee: Unassigned
The tc cpptests and the freshly-added tc crashtests on b2g emulator builds run somewhere between 10 and 18 hours after they are scheduled, except when they get even more backlogged and just get cancelled after 24 hours. That's not acceptable for a tier-1, visible job, so they are hidden on m-c and integration branches until someone figures out who to persuade to provide adequate infra support to start them running in a reasonable amount of time.

I don't know who that someone is, the person who will figure out who to persuade and then do the persuading, but I know it was Edgar who turned on the crashtests, so he's my best guess.
(In reply to Phil Ringnalda (:philor) from comment #0)
> The tc cpptests and the freshly-added tc crashtests on b2g emulator builds
> run somewhere between 10 and 18 hours after they are scheduled, except when
> they get even more backlogged and just get cancelled after 24 hours. That's

I also noticed this last week, and now it seems to be back to normal? (At least, I didn't see a lot of pending tasks.)
ni'ing Greg, who might know what was happening on the TC side.

> not acceptable for a tier-1, visible job, so they are hidden on m-c and
> integration branches until someone figures out who to persuade to provide
> adequate infra support to start them running in a reasonable amount of time.
> 
> I don't know who that someone who will do the persuading and the finding out
> who to persuade is, but I know it was Edgar who turned on the crashtests, so
> that's my best guess.
Flags: needinfo?(garndt)
Mozilla time is a bit fuzzy, since it's already Monday morning in NZ and Japan and China, but it's still basically Sunday night because the bulk of the load comes from North America.
(In reply to Edgar Chen [:edgar][:echen] from comment #1)
> (In reply to Phil Ringnalda (:philor) from comment #0)
> > The tc cpptests and the freshly-added tc crashtests on b2g emulator builds
> > run somewhere between 10 and 18 hours after they are scheduled, except when
> > they get even more backlogged and just get cancelled after 24 hours. That's
> 
> I also noticed this last week. And now it seems back to normal? (at least, I
> didn't see a lot of pending task)
> ni Greg who might know what was happening on TC side.
> 
> > not acceptable for a tier-1, visible job, so they are hidden on m-c and
> > integration branches until someone figures out who to persuade to provide
> > adequate infra support to start them running in a reasonable amount of time.
> > 
> > I don't know who that someone who will do the persuading and the finding out
> > who to persuade is, but I know it was Edgar who turned on the crashtests, so
> > that's my best guess.

This was looked into last week.  There were a few periods where a very large number of jobs was created (going from an average of a few dozen per hour to 6k+), which caused a nasty backlog.  That, along with the fact that jobs take an hour on average to run, makes clearing out the backlog a long process.  On Thursday I increased the max capacity for the worker type handling these tests, and as of this morning there is no backlog.
Flags: needinfo?(garndt)
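A back-of-envelope sketch of the backlog arithmetic described in the comment above. The 6k+/hour spike rate and the 1-hour average runtime come from the comment; the worker capacity and backlog size below are assumed, illustrative numbers only, since the bug does not state the real size of the pool.

# Back-of-envelope sketch of the backlog arithmetic described above.
AVG_RUN_HOURS = 1.0         # average suite runtime (from the comment)
SPIKE_JOBS_PER_HOUR = 6000  # peak submission rate (from the comment)
WORKERS = 200               # assumed concurrent worker capacity (hypothetical)

# With 1-hour jobs, throughput is capped at roughly one job per worker per hour.
throughput_per_hour = WORKERS / AVG_RUN_HOURS

# During a spike the backlog grows by the difference every hour.
growth_per_hour = SPIKE_JOBS_PER_HOUR - throughput_per_hour
print(f"backlog grows by ~{growth_per_hour:.0f} jobs/hour during a spike")

# Once submissions fall back to a few dozen per hour, draining a backlog of
# B jobs takes roughly B / throughput hours.
backlog = 10000  # assumed backlog size (hypothetical)
print(f"draining {backlog} jobs takes ~{backlog / throughput_per_hour:.0f} hours")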
Right now on b2g-inbound, in B2G KK Emulator x86 debug (because it was the first one I loaded), the tip-most 30 jobs are pending, going back to one scheduled last night, 1,385 minutes ago; then there is one green job; then two which died after 5 and 7 minutes with no explanation (from the times, though, I'd guess they started 5 and 7 minutes before the 24-hour expiration of the job); then a scattering of green/purple/green/purple.

Sorry, that won't work.
Hi Greg, right now all the emulator tests run on the workerType b2gtest-emulator. Do you think distributing those tests across more workerTypes could reduce the terrible backlog of pending jobs? Thanks!
Flags: needinfo?(garndt)
Distributing the tests out to more worker types might cause some tests to return more quickly than others (like those only for x86-kk vs. ICS), but the load will not go down.  The larger issue is that the average suite run is 1 hour and there are hundreds of tests triggered per push per branch.  We also see periods with large spikes of jobs being submitted, which really causes the backlog to climb.

We could very well keep b2gtest-emulator for the non-x86-kk tests and create a b2gtest-emulator-x86 worker type for those specific tests, if that's the major focus right now.

I think the solution might be:
1. Evaluate what tests are running.
2. Disable everything that should not be running.
3. Evaluate the capacity of b2gtest-emulator (or any workers related to testing emulators) and the cost associated with increasing capacity vs. waiting for the backlog to complete.
4. Get execution times lower.

Admittedly, #4 is the extremely hard one, possibly even impossible.  Let me know how I can help out with this.  If we want a testing worker type specific to x86, I can make that happen.
Flags: needinfo?(garndt)
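A hypothetical sketch of the worker-type split described above: keep the existing b2gtest-emulator pool for most emulator suites and route only the x86-kk suites to a dedicated b2gtest-emulator-x86 pool. The suite names and the choose_worker_type helper are illustrative placeholders, not actual in-tree task-graph code; only the two workerType names come from this bug.

# Illustrative only: the suite names and this helper are made up for the sketch.
X86_KK_SUITES = {"crashtest-x86-kk", "cppunit-x86-kk"}  # hypothetical suite names

def choose_worker_type(suite_name):
    """Return the workerType a given emulator test suite should run on."""
    if suite_name in X86_KK_SUITES:
        return "b2gtest-emulator-x86"  # dedicated pool for the x86-kk suites
    return "b2gtest-emulator"          # existing shared pool for everything else

for suite in ("crashtest-x86-kk", "reftest-ics", "cppunit-x86-kk"):
    print(suite, "->", choose_worker_type(suite))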
5. Implement branch priorities so that non-try jobs are done before try jobs.
6. Implement coalescing so that during busy times on non-try branches, rather than getting stuck doing every single job from 20 hours in the past that nobody at all will ever, under any circumstance, look at, we drop down to only doing one set of jobs for every four pushes (or whatever we are coalescing at now for buildbot jobs) while things are overloaded (a rough sketch of both ideas follows below).

Buildbot's scheduling isn't a *great* thing to hold up as an example right now, with the 19-hour backlogs for Windows tests on try, but for the most part it manages not to get backlogged in ways that cause tree closures or force us to hide jobs because of how they are scheduled, so it's got that going for it.
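A minimal sketch of the two ideas above, assuming a heavily simplified model of a pending-job queue. The branch names, the priority table, the coalescing window of four pushes, and the overload threshold are all illustrative assumptions; real buildbot/TaskCluster scheduling is considerably more involved.

from collections import defaultdict

PRIORITY = {"mozilla-central": 0, "b2g-inbound": 0, "try": 1}  # lower value runs first
COALESCE_EVERY = 4        # during overload, keep one job set per 4 pushes (assumed)
OVERLOAD_THRESHOLD = 100  # pending-job count above which coalescing kicks in (assumed)

def schedule(pending_jobs):
    """pending_jobs: list of (branch, push_id) tuples, oldest push first."""
    # Idea 5: branch priorities -- non-try work always sorts ahead of try work.
    ordered = sorted(pending_jobs, key=lambda job: PRIORITY.get(job[0], 1))

    if len(pending_jobs) <= OVERLOAD_THRESHOLD:
        return ordered

    # Idea 6: coalescing -- while overloaded, keep only the newest push out of
    # every COALESCE_EVERY pushes per non-try branch, always including the tip,
    # so the backlog collapses toward the tip instead of replaying every old push.
    by_branch = defaultdict(list)
    for branch, push_id in ordered:
        by_branch[branch].append(push_id)

    coalesced = []
    for branch, pushes in by_branch.items():
        if PRIORITY.get(branch, 1) == 1:  # try pushes are not coalesced away here
            coalesced.extend((branch, p) for p in pushes)
            continue
        pushes.sort()
        kept = pushes[COALESCE_EVERY - 1::COALESCE_EVERY]
        if not kept or kept[-1] != pushes[-1]:
            kept.append(pushes[-1])       # always keep the tip-most push
        coalesced.extend((branch, p) for p in kept)

    return sorted(coalesced, key=lambda job: PRIORITY.get(job[0], 1))

# Example: 10 stale b2g-inbound pushes plus a pile of try pushes collapse to a
# handful of spaced-out b2g-inbound job sets ahead of all the try work.
backlog = [("b2g-inbound", i) for i in range(10)] + [("try", i) for i in range(95)]
print(schedule(backlog))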
(In reply to Greg Arndt [:garndt] from comment #6)
> Distributing the tests out to more worker types might cause some tests to
> return quicker than others (like those only for x86-kk vs ICS), but the load
> will not go down.  The larger issue is that the average suite run is 1hr and
> there are hundreds of tests that get triggered per push per branch.  We also
> see periods where there are large spikes of jobs being submitted which
> really causes the backlog to climb.
> 
> We could very well keep b2gtest-emulator for tests for non x86-kk tests, and
> create a b2gtest-emulator-x86 worker for those specific tests if that's the
> major focus right now.
> 
> I think the solution might be:
> 1. evaluate what tests are running
> 2. disable everything that should not be running

Bug 1219118 is the one I am thinking of.

> 3. Evaluate the capacity of b2gtest-emulator (or any workers related to
> testing emulators) and the cost associated with increasing capacity vs
> waiting for the backlog to complete.
> 4. get execution times lower
> 
> Admittedly #4 being the extremely hard one...possibly even impossible.  Let
> me know how I can help out with this.  If we want a testing worker type
> specific to x86 I can make that happen.

About #4: nested virtualization on AWS would help emu-x86-kk execution time, but I have to admit I know too little about how to make that happen.
Depends on: 1219118
After bug 1219118 we shut down the emu-ics tests on TC, which should have saved resources. Do you still see the pending symptoms?
Flags: needinfo?(philringnalda)
Do I see them, or am I likely to see them? I do see them, because that solution completely ignored comment 7: when I reproduced someone's cat stepping on the r key of their keyboard while they had one of these jobs selected in Treeherder, my try jobs blocked running jobs on integration branches. Had I done that with a buildbot job, a) it wouldn't have blocked integration branch jobs, only other try jobs, because of branch priorities; b) it would have triggered a Nagios alert when the number of pending jobs skyrocketed; c) anyone with an LDAP account would have been able to fix the problem by using the buildapi web page to cancel the jobs; and d) even if it had been on an integration branch rather than on try, so that it did block other integration branch jobs, recovery would have been infinitely quicker thanks to coalescing, which means the backlogged jobs would have run once on the tip-most push when it was again possible to run them, rather than starting with the oldest push and running on every single push from oldest to newest.

But, luckily for you, my main interest in these particular jobs is getting out from under being the single person in charge of deciding whether or not they are acceptable, so I'll fight those battles over other taskcluster jobs. Unhidden.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(philringnalda)
Resolution: --- → FIXED
Depends on: 1232575
Product: Tree Management → Tree Management Graveyard