Closed Bug 1630982 Opened 5 years ago Closed 5 years ago

Terminate mochitests using exit(0) as soon as we have test results

Categories

(Testing :: Mochitest, task)

Version 3
task

Tracking

(firefox78 fixed)

RESOLVED FIXED
mozilla78
Tracking Status
firefox78 --- fixed

People

(Reporter: florian, Assigned: florian)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [ci-costs-2020:done])

Attachments

(3 files, 1 obsolete file)

Shutdown is slow because we write things to disk and do cleanups (eg. garbage collection). There's an ongoing effort (tracked in bug 1606879) to make shutdown faster in general by doing the writes during the beginning of shutdown, and then terminating the process with an exit(0) for optimized builds.

For browser chrome mochitests, we can terminate without worrying about what we write to disk, so we can be much more agressive.

See Also: → 1608454

Would this skip leak checking?

I see this is skipping debug, so leak checking would still be valid. we should validate ccov, tsan, asan needs; assuming this is successful, we should do the rest of mochitest, xpcshell, reftest, wpt as well.

Whiteboard: [ci-costs-2020:todo]

For ccov, we need to take the long shutdown road, as coverage counters are written during shutdown.

So far I’ve tested on try with the browser/base/content/test/webrtc folder of tests that I know well (as I wrote most of them years ago). Initial results seem encouraging: 53s instead of 75s on Win7, 63 instead of 73 on Win10, 55 instead of 66 on OS X. On linux 299s without the patch and 246s with it (but both runs lost 119s during startup on the initialization of dbus that seems to be terribly slow when startup profiling is enabled, as we discovered in bug 1624868).

How different is this to just sending SIGKILL to the browser? For wpt control has to return to the Python and we currently do a dance where we try to do a clean shutdown using marionette, then send a SIGTERM and then a SIGKILL. So for an implementation there there are two options: either allow marionette to perform a fast shutdown, or just kill the process agressively.

One behavior that I don't like with the current patch is that terminating abruptly the parent process without saying goodbye to child processes means they output noise in the terminal before also terminating themselves. Probably not much of a problem for runs on infra, but I would also like this patch to improve the experience for developers running mochitests locally. Sending an IPC messages to tell child processes to quietly exit could be a follow-up.

(In reply to James Graham [:jgraham] from comment #6)

How different is this to just sending SIGKILL to the browser?

If profiling is enabled, this patch dumps the profile to the disk before calling exit. I would like profiling to be available for all tests, so that we can see where they spend time, and find more opportunities for optimization.

I think we should allow marionette to perform a fast shutdown.

Overall I think this is a great initiative and I am happy to see the implementation in the proposed patch. However, I feel we should keep in mind -- especially if we are considering expanding the scope of changes -- that we want to test the entire browser life-cycle, and various tests have, on occasion, demonstrated shutdown hangs.

(In reply to Geoff Brown [:gbrown] from comment #10)

various tests have, on occasion, demonstrated shutdown hangs.

Were these hangs not happening on debug builds?

I'm not sure, but probably. As I said, I support the proposed change; I just don't want us to try to short cut shutdown everywhere.

Assignee: nobody → florian
Status: NEW → ASSIGNED

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #3)

I see this is skipping debug, so leak checking would still be valid. we should validate ccov, tsan, asan needs; assuming this is successful, we should do the rest of mochitest, xpcshell, reftest, wpt as well.

The patch already skips debug builds, and also skips the this._coverageCollector case, so I'm hoping it's good for ccov. Anything specific that needs to be done to validate that tsan and asan are not going to be negatively affected?

Flags: needinfo?(jmaher)

(In reply to Florian Quèze [:florian] from comment #14)

The patch already skips debug builds, and also skips the this._coverageCollector case, so I'm hoping it's good for ccov.

No, _coverageCollector is not used in normal coverage builds (it was used for coverage collected via the debugger). There is a MOZ_CODE_COVERAGE definition that you can use instead.

xpcshell profile of the same test without the exit(0) https://perfht.ml/2VxnujY and with it https://perfht.ml/2x9hHYE

let me check with :decoder about tsan (and he might know about asan).

:decoder, this patch is doing a quick browser shutdown when we can to save time. Is this ok for TSAN? we are not doing this for debug as we look for leaks. Are there similar things for TSAN? If so how can we look for them?

Flags: needinfo?(jmaher) → needinfo?(choller)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #17)

let me check with :decoder about tsan (and he might know about asan).

:decoder, this patch is doing a quick browser shutdown when we can to save time. Is this ok for TSAN? we are not doing this for debug as we look for leaks. Are there similar things for TSAN? If so how can we look for them?

Is this just for tests, or is this the same we do in reality in the release builds?

Flags: needinfo?(choller) → needinfo?(jmaher)

just for tests, hacked into the harnesses

Flags: needinfo?(jmaher)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #19)

just for tests, hacked into the harnesses

I don't think it's a good idea to do this (not just in TSan, but for all builds), in particular because we had and have problems with shutdown. Shutdown hangs or races have been a constant problem in Firefox and skipping this part just in tests greatly increases the risk that we ship a shutdown regression of some sort.

we could keep the normal exit process enabled for the m-c job (and disable it for try and autoland)
or some variance of this :)

and, anyway, if we regress too much, we can always come back to the previous behavior

(In reply to Christian Holler (:decoder) from comment #20)

I don't think it's a good idea to do this (not just in TSan, but for all builds), in particular because we had and have problems with shutdown. Shutdown hangs or races have been a constant problem in Firefox and skipping this part just in tests greatly increases the risk that we ship a shutdown regression of some sort.

Which kind of shutdown hangs would happen in opt builds but not in debug builds? I'm under the impression that debug builds are more likely to hang than optimized builds, so I don't think we would be losing significant test coverage.

Adding the needinfo back for the initial question:

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #17)

let me check with :decoder about tsan (and he might know about asan).

:decoder, this patch is doing a quick browser shutdown when we can to save time. Is this ok for TSAN? we are not doing this for debug as we look for leaks. Are there similar things for TSAN? If so how can we look for them?

Flags: needinfo?(choller)

(In reply to Florian Quèze [:florian] from comment #22)

Which kind of shutdown hangs would happen in opt builds but not in debug builds? I'm under the impression that debug builds are more likely to hang than optimized builds, so I don't think we would be losing significant test coverage.

That is correct because in opt builds, the content processes have a fast exit path, while in debug they don't (hence in total, you will see more hangs with debug builds, if the hang is caused by a child).

But there is no such thing for the parent process in debug vs. opt and since opt has a vastly different timing compared to debug, it is not viable to assume that opt has less hangs than debug. It could well be exactly the other way around.

Adding the needinfo back for the initial question:

With TSan we should not exit because we would not be able to catch shutdown races anymore. There is no debug TSan configuration running because that would be way too slow.

Flags: needinfo?(choller)
Attachment #9141272 - Attachment description: Bug 1630982 - Terminate mochitests with an exit(0), → Bug 1630982 - Terminate mochitests with an exit(0), r=dthayer,gbrown.

(In reply to Sylvestre Ledru [:Sylvestre] from comment #21)

we could keep the normal exit process enabled for the m-c job (and disable it for try and autoland)
or some variance of this :)

After a discussion on matrix with decoder, I think we should avoid the exit(0) shutdown path for RELEASE_OR_BETA. This is a good compromise between ensuring we test what we ship to users, and reducing CI costs.

Attachment #9141272 - Attachment description: Bug 1630982 - Terminate mochitests with an exit(0), r=dthayer,gbrown. → Bug 1630982 - Terminate browser-chrome mochitests with an exit(0), r=dthayer,gbrown.
Pushed by fqueze@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b77388326b1b Terminate browser-chrome mochitests with an exit(0), r=gbrown,dthayer.

I reduced the patch to only contain the part related to browser-chrome tests so that it can land sooner.

For the mochitest-plain tests, I have test failures on try when a profiler folder is re-used for the edge case of a test manifest specifying tests using both the http and https schemes.

For the xpcshell tests, I have failures for tests that run in a child process, I need to figure out how to handle this case.

I'll handle finishing both of these in follow-up bugs.

Summary: Terminate the browser process using exit(0) as soon as we have test results → Terminate browser-chrome mochitests using exit(0) as soon as we have test results
Pushed by fqueze@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ae9720f97abb Terminate browser-chrome mochitests with an exit(0), r=gbrown,dthayer. https://hg.mozilla.org/integration/autoland/rev/a8cd211f5393 terminate the process quickly for mochitest-plain tests, r=gbrown.
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla78

(In reply to Florian Quèze [:florian] from comment #26)

I reduced the patch to only contain the part related to browser-chrome tests so that it can land sooner.

For the mochitest-plain tests, I have test failures on try when a profiler folder is re-used for the edge case of a test manifest specifying tests using both the http and https schemes.

This part actually landed in this bug.

For the xpcshell tests, I have failures for tests that run in a child process, I need to figure out how to handle this case.

I fixed this case where xpcshell tests using a child process failed, but I still have 2 xpcshell tests that crash at shutdown and I haven't figured out why yet; I'll file a follow-up for the xpcshell part of my patches.

Flags: needinfo?(florian)
Summary: Terminate browser-chrome mochitests using exit(0) as soon as we have test results → Terminate mochitests using exit(0) as soon as we have test results
Blocks: 1637637

Comment on attachment 9143127 [details]
Bug 1630982 - Exit(0) at the end of xpcshell tests, r=gbrown.

Revision D72407 was moved to bug 1637637. Setting attachment 9143127 [details] to obsolete.

Attachment #9143127 - Attachment is obsolete: true
Regressions: 1637715
Whiteboard: [ci-costs-2020:todo] → [ci-costs-2020:done]

Did we ever consider doing this for other test suites, e.g. reftest?

Flags: needinfo?(florian)

(In reply to Marco Castelluccio [:marco] from comment #33)

Did we ever consider doing this for other test suites, e.g. reftest?

I did it for xpcshell tests too in bug 1637637. I haven't attempted to do it for reftests. I just grabbed a profile of running locally a random folder of reftests: https://share.firefox.dev/3NIFAHM On my machine we could save about 200ms with an exit(0) after the last test finishes. That seems small compared to the 1.6s it took to start running tests (we do a full Firefox startup, initializing a full browser window before we close it to open the reftests window instead).

Feel free to open bugs for reftests (or other test suites) if you think there are interesting potential wins to be had. I'm probably not going to take them myself, but happy to help.

Flags: needinfo?(florian)

Given the large number of manifests/groups that we run, 200 ms per manifest/group is still a pretty large saving.
Joel, other than reftest, where could we do this?

Flags: needinfo?(jmaher)

reftest and web-platform-tests might be interesting. I assume we have all mochitests (webgl, media, a11y, browser, chrome, devtools, plain, etc.) with this exit(0). I don't believe there are other harnesses where we would see the scale of large volumes of tests and browser restarts.

This bug would be a good bug for a contributor or new hire.

Flags: needinfo?(jmaher)

(In reply to Joel Maher ( :jmaher ) (UTC -0800) (back July 11th) from comment #36)

reftest and web-platform-tests might be interesting.

I filed bug 1778458 for reftest.
wpt has bug 1608454 implemented according to bug 1608454 comment 4, so IDK if it would benefit from this as well.

I assume we have all mochitests (webgl, media, a11y, browser, chrome, devtools, plain, etc.) with this exit(0). I don't believe there are other harnesses where we would see the scale of large volumes of tests and browser restarts.

The two patches in this bug mention browser-chrome and plain, so I don't know if the other mochitest are covered as well by them. Are they? If not, we should file something like bug 1778458 for the rest of mochitest.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: