Closed Bug 873433 Opened 11 years ago Closed 11 months ago

Schedule "smoketest" suite to be run after build completes and only if successful, then scheduling usual test runs

Categories

(Release Engineering :: General, defect, P3)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: ted, Unassigned)

References

(Depends on 1 open bug)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2665] )

We're trying to develop a smoketest suite (bug 863838) that we could run to quickly determine whether it's worthwhile running the full suite of tests. joduinn points out that there are a number of reasons why we should run this as a separate test job and not as part of the build. However, implementing this complicates scheduling: the smoketest job should be scheduled on completion of the build job, but further test jobs should not be scheduled until the smoketest job completes successfully.

Our current plan is to get an estimate of work from RelEng here, and if it's tractable (a matter of weeks) we'll proceed with building it as a separate test job. If it looks like it's going to take months then we'll shoehorn it into the build job as a quick-and-dirty solution.
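
Purely to make the desired scheduling concrete, here's a rough sketch of the gating behavior (Python-flavored pseudocode; run_smoketest and schedule_full_tests are hypothetical placeholders, not existing buildbot hooks):

def on_build_complete(build):
    # Hypothetical scheduler callback: fires whenever a build job finishes.
    if build.status != "success":
        return                          # broken build: schedule nothing
    smoketest = run_smoketest(build)    # separate job, run on a test machine
    if smoketest.status == "success":
        schedule_full_tests(build)      # usual unittest/talos suites
    # On smoketest failure we stop here: no full test jobs are scheduled.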
See Also: → 863838
For now, could we just do this at the end of a build step? I know it would add 5-10 minutes to a build, but it could save a lot in the future.
We discussed this, and joduinn had a number of reasons to prefer doing it this way, so as a compromise position I said that as long as RelEng could get this work done in a reasonable amount of time we'd do it this way.

Some of his reasons:
* Ability to retrigger jobs
* Run on test machines, not build machines, so more closely matches real test runs

There may have been others that I'm leaving out.
Component: Release Engineering → Release Engineering: Automation (General)
QA Contact: catlee
Product: mozilla.org → Release Engineering
Found in triage.

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #2)
> We discussed this, and joduinn had a number of reasons to prefer doing it
> this way, so as a compromise position I said that as long as RelEng could
> get this work done in a reasonable amount of time we'd do it this way.
> 
> Some of his reasons:
> * Ability to retrigger jobs
** Retriggering a test job without having to wait for a complete re-build is important, and avoids tons of delay for sheriffs chasing intermittent oranges. We made this mistake with the initial setup of unittests; let's not repeat it.
 
> * Run on test machines, not build machines, so more closely matches real
> test runs
** The OS we build on does not match the OS we test on. To be really useful, these smoketests should be run on the same OS that we would later run the usual full test suites on. Bug 863838 tracks creating that smoketest suite, so adding it as a dependent bug.

> There may have been others that I'm leaving out.
It was a long time ago, maybe?! :-)


Unassigning myself, so that others can add this schedule change to the configs once bug 863838 is closed.
Assignee: joduinn → nobody
Depends on: 863838
Summary: Implement cascading test runs to support running smoketests before full test runs → Schedule "smoketest" suite to be run after build completes and only if successful, then scheduling usual test runs
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2658]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2658] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2665]
Priority: -- → P3
Component: General Automation → General

This general idea came up again recently in All Hands discussions, and I think it has a lot of potential. A well-crafted Firefox/Gecko smoke test should be able to identify some class of test failures much more quickly than our normal procedures, providing faster backouts and reduced test costs.

Is taskcluster in good shape now to allow for multi-tiered task dependencies, like Decision Task -> Build -> Smoke Test -> Normal Tests?

Or would it be better to add such smoke tests to reviewbot? (I think this might be ideal from a work-flow perspective, but I don't know anything about reviewbot implementation details or limitations.)

If we can pick the 50 most relevant individual test files (i.e. ML test selection), would that be a smoketest? I ask that as a small challenge: I find that many of the failures we have are harness-specific or platform-specific; running a single test from each harness on each platform/config combo would probably yield great results. In the end, how different is that from a well-picked set of tests based on the files changed?

(In reply to Geoff Brown [:gbrown] from comment #4)

> This general idea came up again recently in All Hands discussions, and I think it has a lot of potential. A well-crafted Firefox/Gecko smoke test should be able to identify some class of test failures much more quickly than our normal procedures, providing faster backouts and reduced test costs.

I think this could have tremendous value on Try, where I expect we see many more fundamental failures that would be caught by smoketests.

CC :aki, who I know was doing some thinking about this last year.

> Is taskcluster in good shape now to allow for multi-tiered task dependencies, like Decision Task -> Build -> Smoke Test -> Normal Tests?

Yes, absolutely. It's pretty simple to add. The big question mark for us has been what to put into the smoketest.
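
Roughly, the shape of it in task-definition terms would be the below; the taskIds are made up and the field names are from memory, so treat this as an illustration rather than a copy of our configs:

# Build <- Smoke Test <- Normal Tests, expressed as Python dicts that mirror
# (approximately) the "dependencies"/"requires" fields of a task definition.
smoketest_task = {
    "taskId": "SMOKE123",            # made up
    "dependencies": ["BUILD456"],    # made-up build taskId
    "requires": "all-completed",     # only run if the build succeeded
}
full_test_task = {
    "taskId": "MOCHI789",            # made up
    "dependencies": ["SMOKE123"],
    "requires": "all-completed",     # only run if the smoketest succeeded
}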

> Or would it be better to add such smoke tests to reviewbot? (I think this might be ideal from a work-flow perspective, but I don't know anything about reviewbot implementation details or limitations.)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #5)

> If we can pick the 50 most relevant individual test files (i.e. ML test selection), would that be a smoketest? I ask that as a small challenge: I find that many of the failures we have are harness-specific or platform-specific; running a single test from each harness on each platform/config combo would probably yield great results. In the end, how different is that from a well-picked set of tests based on the files changed?

I don't have a specific idea of what to put into the smoketest yet, but I think we could determine an initial set of tests with a little bit of developer consultation, and then let that set be refined over time as we get experience with it. I think of a smoketest as "a non-exhaustive set of tests that aim at ensuring that the most important functions work. The result of this testing is used to decide if a build is stable enough to proceed with further testing." For an initial Firefox/Gecko smoketest, I think of: does app startup complete, can we load a page, can we shut down? Surely that's at most tens of seconds of test time, and already there's value in that test. I wouldn't start from an assumption of running every harness on every platform -- remember, non-exhaustive. And I'd really want to keep the smoketest short, completing normally in just a few minutes.
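
To make that concrete, something as small as this rough sketch is the scale I have in mind; the binary path and test page are placeholders, and a real version would presumably drive the browser through one of our existing harnesses rather than a raw headless-screenshot invocation:

import subprocess, sys

FIREFOX = "obj-dir/dist/bin/firefox"   # placeholder path to the tested build
TEST_PAGE = "https://example.com/"     # placeholder "can we load a page" check

def smoketest():
    # Start the browser headless, load a simple page, and make sure it
    # exits cleanly within a short timeout (startup, load, shutdown).
    try:
        proc = subprocess.run(
            [FIREFOX, "--headless", "--screenshot", "/tmp/smoke.png", TEST_PAGE],
            timeout=120,
        )
    except subprocess.TimeoutExpired:
        return False                   # hung during startup/load/shutdown
    return proc.returncode == 0

if __name__ == "__main__":
    sys.exit(0 if smoketest() else 1)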

I think this smoketest idea would still be relevant when SETA task optimization is replaced by smarter task/test optimization (based on files changed, code coverage data, ML, etc), but of course that depends on how efficient that optimization turns out to be. If a typical push of the future with optimized tests is only going to run 30 minutes of tests, then maybe a 5 minute smoketest isn't worth the effort; if it is going to run even tens of hours of tests, then a 5 minute smoketest seems like a good investment.

How many of our pushes result in builds that are successful but all tests broken? I had trouble finding evidence of that, and while looking I found more evidence of all jobs of a specific harness broken, or a few harnesses on a given platform broken. Keep in mind that 5 minutes of tests will result in ~8 minutes of runtime, on at least 4 platforms, so I want to make sure that we have good data to say "out of the 2400 test regressions found on integration in 2019, XXX regressions would have been caught faster and saved YYY time".

Once we have data, I view the decision matrix like this:
if XXX < 50: continue    # not worth pursuing
elif XXX < 150: P3       # revisit discussion after test path + smart scheduling
elif XXX < 250: P2       # let's experiment in the near future
else: P1                 # let's get this done in the next 2 weeks; >10% of regressions, a no-brainer

The hiccup here is try server. I think anything that makes it to autoland is already at a point of passing a smoketest, yet for try server we don't have a lot of data about historical failures.

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #8)

> How many of our pushes result in builds that are successful but all tests broken?

We are not looking for all tests to be broken, just the ones in the (not yet designed) smoketest. What if we asked, "Of all the known test regressions in 2019, which tests failed most frequently?" and used that to inform our selection of tests for a smoketest? If it doesn't look like there is any such subset of tests that failed frequently in the past, I suppose we are, again, gated on someone designing a smart smoketest.
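
Assuming we can export those 2019 regression annotations to something like a CSV with one row per regression and a "test" column (the file name and column here are made up), the counting itself is trivial:

import csv
from collections import Counter

# Hypothetical export: one row per annotated regression, with the failing test path.
with open("regressions-2019.csv") as f:
    failures = Counter(row["test"] for row in csv.DictReader(f))

# Candidate smoketest contents: the tests that regressed most often.
for test, count in failures.most_common(50):
    print(count, test)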

At least we know TC can handle this now - thanks :catlee! - and reviewbot deserves more thought.

Switching hats for a moment, I have some concerns...

  • If going the reviewbot route, the smoketest will need builds to run against, and while we have some hope of restraining smoketest run-time to a few minutes, builds will take much longer: automatic builds per phab push seem like a non-starter.

  • If going the tiered test route on integration (and/or try), I worry about complications to work-flow. How would a smoketest affect backfills? What do we do if/when smoketests fail intermittently?

I think a smoketest specifically for try server is the best chance. On try we don't worry about backfills, but intermittent failures do worry me: every job can fail intermittently, so we should have a clear agreement on how to handle that and report it.

(In reply to Geoff Brown [:gbrown] from comment #10)

> Switching hats for a moment, I have some concerns...

> • If going the reviewbot route, the smoketest will need builds to run against, and while we have some hope of restraining smoketest run-time to a few minutes, builds will take much longer: automatic builds per phab push seem like a non-starter.

Would https://bugzilla.mozilla.org/show_bug.cgi?id=1561423#c10 help? (making it easy to run tests against arbitrary builds). That way the tests could run against, say, the latest nightly.

> • If going the tiered test route on integration (and/or try), I worry about complications to work-flow. How would a smoketest affect backfills? What do we do if/when smoketests fail intermittently?

Smoketests should be held to a higher standard: they should be among our least flaky tests.

QA Contact: catlee → jlorenzo

The smoketest suite hasn't been implemented yet, and I don't see any plans to finish it. Some of the gains here (lowered cost of running the full suite) are negated by our ML test scheduling on autoland.

Given all of this, I'm closing this bug, as I doubt it's ever going to become important enough to work on.

Status: NEW → RESOLVED
Closed: 11 months ago
Resolution: --- → WONTFIX