Terminate mochitests using exit(0) as soon as we have test results
Categories
(Testing :: Mochitest, task)
Tracking
(firefox78 fixed)
| Tracking | Status | |
|---|---|---|
| firefox78 | --- | fixed |
People
(Reporter: florian, Assigned: florian)
References
(Blocks 1 open bug, Regressed 1 open bug)
Details
(Whiteboard: [ci-costs-2020:done])
Attachments
(3 files, 1 obsolete file)
Shutdown is slow because we write things to disk and do cleanups (eg. garbage collection). There's an ongoing effort (tracked in bug 1606879) to make shutdown faster in general by doing the writes during the beginning of shutdown, and then terminating the process with an exit(0) for optimized builds.
For browser chrome mochitests, we can terminate without worrying about what we write to disk, so we can be much more agressive.
| Assignee | ||
Comment 1•1 year ago
|
||
Comment 2•1 year ago
|
||
Would this skip leak checking?
Comment 3•1 year ago
|
||
I see this is skipping debug, so leak checking would still be valid. we should validate ccov, tsan, asan needs; assuming this is successful, we should do the rest of mochitest, xpcshell, reftest, wpt as well.
Updated•1 year ago
|
Comment 4•1 year ago
|
||
For ccov, we need to take the long shutdown road, as coverage counters are written during shutdown.
| Assignee | ||
Comment 5•1 year ago
|
||
So far I’ve tested on try with the browser/base/content/test/webrtc folder of tests that I know well (as I wrote most of them years ago). Initial results seem encouraging: 53s instead of 75s on Win7, 63 instead of 73 on Win10, 55 instead of 66 on OS X. On linux 299s without the patch and 246s with it (but both runs lost 119s during startup on the initialization of dbus that seems to be terribly slow when startup profiling is enabled, as we discovered in bug 1624868).
Comment 6•1 year ago
|
||
How different is this to just sending SIGKILL to the browser? For wpt control has to return to the Python and we currently do a dance where we try to do a clean shutdown using marionette, then send a SIGTERM and then a SIGKILL. So for an implementation there there are two options: either allow marionette to perform a fast shutdown, or just kill the process agressively.
| Assignee | ||
Comment 7•1 year ago
|
||
| Assignee | ||
Comment 8•1 year ago
|
||
One behavior that I don't like with the current patch is that terminating abruptly the parent process without saying goodbye to child processes means they output noise in the terminal before also terminating themselves. Probably not much of a problem for runs on infra, but I would also like this patch to improve the experience for developers running mochitests locally. Sending an IPC messages to tell child processes to quietly exit could be a follow-up.
| Assignee | ||
Comment 9•1 year ago
|
||
(In reply to James Graham [:jgraham] from comment #6)
How different is this to just sending SIGKILL to the browser?
If profiling is enabled, this patch dumps the profile to the disk before calling exit. I would like profiling to be available for all tests, so that we can see where they spend time, and find more opportunities for optimization.
I think we should allow marionette to perform a fast shutdown.
Comment 10•1 year ago
|
||
Overall I think this is a great initiative and I am happy to see the implementation in the proposed patch. However, I feel we should keep in mind -- especially if we are considering expanding the scope of changes -- that we want to test the entire browser life-cycle, and various tests have, on occasion, demonstrated shutdown hangs.
| Assignee | ||
Comment 11•1 year ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #10)
various tests have, on occasion, demonstrated shutdown hangs.
Were these hangs not happening on debug builds?
Comment 12•1 year ago
|
||
I'm not sure, but probably. As I said, I support the proposed change; I just don't want us to try to short cut shutdown everywhere.
Updated•1 year ago
|
| Assignee | ||
Comment 13•1 year ago
|
||
| Assignee | ||
Comment 14•1 year ago
|
||
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #3)
I see this is skipping debug, so leak checking would still be valid. we should validate ccov, tsan, asan needs; assuming this is successful, we should do the rest of mochitest, xpcshell, reftest, wpt as well.
The patch already skips debug builds, and also skips the this._coverageCollector case, so I'm hoping it's good for ccov. Anything specific that needs to be done to validate that tsan and asan are not going to be negatively affected?
Comment 15•1 year ago
|
||
(In reply to Florian Quèze [:florian] from comment #14)
The patch already skips debug builds, and also skips the
this._coverageCollectorcase, so I'm hoping it's good for ccov.
No, _coverageCollector is not used in normal coverage builds (it was used for coverage collected via the debugger). There is a MOZ_CODE_COVERAGE definition that you can use instead.
| Assignee | ||
Comment 16•1 year ago
|
||
xpcshell profile of the same test without the exit(0) https://perfht.ml/2VxnujY and with it https://perfht.ml/2x9hHYE
Comment 17•1 year ago
|
||
let me check with :decoder about tsan (and he might know about asan).
:decoder, this patch is doing a quick browser shutdown when we can to save time. Is this ok for TSAN? we are not doing this for debug as we look for leaks. Are there similar things for TSAN? If so how can we look for them?
Comment 18•1 year ago
|
||
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #17)
let me check with :decoder about tsan (and he might know about asan).
:decoder, this patch is doing a quick browser shutdown when we can to save time. Is this ok for TSAN? we are not doing this for debug as we look for leaks. Are there similar things for TSAN? If so how can we look for them?
Is this just for tests, or is this the same we do in reality in the release builds?
Comment 20•1 year ago
|
||
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #19)
just for tests, hacked into the harnesses
I don't think it's a good idea to do this (not just in TSan, but for all builds), in particular because we had and have problems with shutdown. Shutdown hangs or races have been a constant problem in Firefox and skipping this part just in tests greatly increases the risk that we ship a shutdown regression of some sort.
Comment 21•1 year ago
|
||
we could keep the normal exit process enabled for the m-c job (and disable it for try and autoland)
or some variance of this :)
and, anyway, if we regress too much, we can always come back to the previous behavior
| Assignee | ||
Comment 22•1 year ago
|
||
(In reply to Christian Holler (:decoder) from comment #20)
I don't think it's a good idea to do this (not just in TSan, but for all builds), in particular because we had and have problems with shutdown. Shutdown hangs or races have been a constant problem in Firefox and skipping this part just in tests greatly increases the risk that we ship a shutdown regression of some sort.
Which kind of shutdown hangs would happen in opt builds but not in debug builds? I'm under the impression that debug builds are more likely to hang than optimized builds, so I don't think we would be losing significant test coverage.
Adding the needinfo back for the initial question:
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #17)
let me check with :decoder about tsan (and he might know about asan).
:decoder, this patch is doing a quick browser shutdown when we can to save time. Is this ok for TSAN? we are not doing this for debug as we look for leaks. Are there similar things for TSAN? If so how can we look for them?
Comment 23•1 year ago
|
||
(In reply to Florian Quèze [:florian] from comment #22)
Which kind of shutdown hangs would happen in opt builds but not in debug builds? I'm under the impression that debug builds are more likely to hang than optimized builds, so I don't think we would be losing significant test coverage.
That is correct because in opt builds, the content processes have a fast exit path, while in debug they don't (hence in total, you will see more hangs with debug builds, if the hang is caused by a child).
But there is no such thing for the parent process in debug vs. opt and since opt has a vastly different timing compared to debug, it is not viable to assume that opt has less hangs than debug. It could well be exactly the other way around.
Adding the needinfo back for the initial question:
With TSan we should not exit because we would not be able to catch shutdown races anymore. There is no debug TSan configuration running because that would be way too slow.
Updated•1 year ago
|
| Assignee | ||
Comment 24•1 year ago
|
||
(In reply to Sylvestre Ledru [:Sylvestre] from comment #21)
we could keep the normal exit process enabled for the m-c job (and disable it for try and autoland)
or some variance of this :)
After a discussion on matrix with decoder, I think we should avoid the exit(0) shutdown path for RELEASE_OR_BETA. This is a good compromise between ensuring we test what we ship to users, and reducing CI costs.
Updated•1 year ago
|
Comment 25•1 year ago
|
||
Pushed by fqueze@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b77388326b1b Terminate browser-chrome mochitests with an exit(0), r=gbrown,dthayer.
| Assignee | ||
Comment 26•1 year ago
|
||
I reduced the patch to only contain the part related to browser-chrome tests so that it can land sooner.
For the mochitest-plain tests, I have test failures on try when a profiler folder is re-used for the edge case of a test manifest specifying tests using both the http and https schemes.
For the xpcshell tests, I have failures for tests that run in a child process, I need to figure out how to handle this case.
I'll handle finishing both of these in follow-up bugs.
Comment 27•1 year ago
|
||
Backed out changeset b77388326b1b (bug 1630982) for causing leaks on mochitest jobs. CLOSED TREE
Log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=301178919&repo=autoland&lineNumber=1666
Push with failures:
https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&revision=b77388326b1bf23441fad1c3aa22ad04443550f9\
Backout:
https://hg.mozilla.org/integration/autoland/rev/e2f1a6b3f0a865cbdbf6810d2278de290ce17773
| Assignee | ||
Comment 28•1 year ago
|
||
Depends on D71336
Comment 29•1 year ago
|
||
Pushed by fqueze@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ae9720f97abb Terminate browser-chrome mochitests with an exit(0), r=gbrown,dthayer. https://hg.mozilla.org/integration/autoland/rev/a8cd211f5393 terminate the process quickly for mochitest-plain tests, r=gbrown.
Comment 30•1 year ago
|
||
| bugherder | ||
https://hg.mozilla.org/mozilla-central/rev/ae9720f97abb
https://hg.mozilla.org/mozilla-central/rev/a8cd211f5393
| Assignee | ||
Comment 31•1 year ago
|
||
(In reply to Florian Quèze [:florian] from comment #26)
I reduced the patch to only contain the part related to browser-chrome tests so that it can land sooner.
For the mochitest-plain tests, I have test failures on try when a profiler folder is re-used for the edge case of a test manifest specifying tests using both the http and https schemes.
This part actually landed in this bug.
For the xpcshell tests, I have failures for tests that run in a child process, I need to figure out how to handle this case.
I fixed this case where xpcshell tests using a child process failed, but I still have 2 xpcshell tests that crash at shutdown and I haven't figured out why yet; I'll file a follow-up for the xpcshell part of my patches.
Comment 32•1 year ago
|
||
Comment on attachment 9143127 [details]
Bug 1630982 - Exit(0) at the end of xpcshell tests, r=gbrown.
Revision D72407 was moved to bug 1637637. Setting attachment 9143127 [details] to obsolete.
Updated•11 months ago
|
Description
•