Permanent Windows web platform crashtest Hit MOZ_CRASH(Shutdown hanging after all known phases and workers finished.) at /builds/worker/checkouts/gecko/toolkit/components/terminator/nsTerminator.cpp:256 - DO NOT USE FOR CLASSIFICATION
Categories
(Core :: DOM: Core & HTML, defect, P5)
Tracking
 | Tracking | Status
---|---|---
firefox-esr91 | --- | unaffected
firefox95 | --- | unaffected
firefox96 | --- | unaffected
firefox97 | --- | wontfix
firefox98 | --- | wontfix
People
(Reporter: intermittent-bug-filer, Unassigned)
References
(Regression)
Details
(Keywords: regression)
Filed by: archaeopteryx [at] coole-files.de
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=363342503&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Fy76AmC6S_aVGU-3H5x1tg/runs/0/artifacts/public/logs/live_backing.log
Since bug 1746186 landed, the Windows web platform crashtests log this shutdown crash.
Comment 1•2 years ago
Tom, could you take a look at this?
Comment 2•2 years ago
I don't know how to debug this. Bug 1741182 might actually be related.
Comment 3•2 years ago
Fwiw I doubt bug 1746186 actually introduced this issue; it's much more likely that the issue comes from the initial landing of reportError. Or maybe the new test somehow triggers it.
Comment 4•2 years ago
web-locks/crashtests/after-worker-termination.https.html sounds very suspicious!
Comment hidden (Intermittent Failures Robot)
Comment 6•2 years ago
Kagami, could you take a look at this permanent Windows failure line, which doesn't cause the task to fail?
Comment 7•2 years ago
Set release status flags based on info from the regressing bug 1746186
Comment 8•2 years ago
Is this still an issue? I can't repro the crash on my local environment (Windows debug build).
> ./mach wpt testing/web-platform/tests/web-locks/crashtests/after-worker-termination.https.html
(...)
web-platform-test
~~~~~~~~~~~~~~~~~
Ran 1 checks (1 tests)
Expected results: 1
Unexpected results: 0
OK
Comment 9•2 years ago
Yes, this is still permanent in CI for Windows 64-bit debug runs.
Comment 10•2 years ago
In that case maybe either the test is not the culprit or it's caused by some interaction of multiple tests...
Comment 12•2 years ago
Kagami, could it be that self.reportError(new Int16Array(2147483648)) behaves so differently with JS::ErrorReportBuilder::NoSideEffects that the test starts to behave strangely? Without knowing any details, from reading this I could imagine that passing a big array now causes a printout of the entire array, where before we maybe had some more intelligent way to handle this? (Honestly I do not understand the purpose of that test at all, but I assume that is due to my lack of context.)
Comment 13•2 years ago
When I wrote that test, worker shutdown also shut reportError down (so that locks.request() can run after shutdown in a consistent way, just as it did in the corresponding fuzzer testcase). I guess worker shutdown cannot abort reportError anymore after bug 1746186.
I wonder whether mgaudet has a better understanding, since this is related to the JS API.
Comment 14•2 years ago
I actually don't entirely understand what's going on here either (I don't entirely follow the intent of the test case), and this also seems to be related to worker shutdown, which I'm particularly ill-versed in.
I'd love to chime in with more, but alas, this is a little too far outside my current understanding.
Comment 15•2 years ago
Okay, as this indeed is related to worker shutdown, I wonder whether Andrew has some idea here... If not, I guess I'll have to dig into both APIs 👀
Comment 16•2 years ago
> I don't entirely follow the intent of the test case

The reportError() is there only to wait for the worker shutdown, assuming the shutdown will abort the long-running reportError call. And it did when I wrote that test; that behavior has seemingly now regressed.
(Yeah, that's weird, but the fuzzer wrote that, not me 😉)
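A rough sketch of that pattern, pieced together from comments 13 and 16 (a hypothetical reconstruction, not the actual WPT source; the lock name and timing are made up):

```js
// Worker script: wedge on a long-running reportError() call, expecting
// worker shutdown to abort it, so that the subsequent lock request runs
// in a consistent post-shutdown state.
const src = `
  self.reportError(new Int16Array(2147483648)); // expected to be aborted by shutdown
  navigator.locks.request("after-termination", () => {}); // post-shutdown lock request
`;
const worker = new Worker(
  URL.createObjectURL(new Blob([src], { type: "text/javascript" }))
);
// Terminate while the worker is (expected to be) stuck inside reportError.
setTimeout(() => worker.terminate(), 0);
```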
Comment 17•2 years ago
It looks like Jens' observation in comment 12 and :saschanaz's supposition in comment 16 are likely correct. Based on my brief investigation of the pernosco trace from bug 1741186, which I link right here for simplicity, I'm seeing that in that trace there are 2 calls to reportError: one never completes (and is still busy doing an apparently long-running loop of allocating strings and GC-ing), and the other, on the thread where we see the error, seems to return with "uncaught exception: unknown (can't convert to string)", which is a result of WorkerPrivate::InterruptCallback being called and returning false.
Presumably, if reportError was previously interruptible, it no longer is with the changes in bug 1746186. :evilpie, can you confirm whether the changes in bug 1746186 would have changed this? It does seem like those changes were desirable, so maybe we can add interruptibility back in? If not, maybe reportError can be made to enforce some kind of time and/or size bounds on the error it reports, so that self.reportError(new Int16Array(2147483648)) can't cause major problems for the browser?
Comment 18•2 years ago
Note that if reportError gets more efficient and we still want the coverage of blocking a worker, you can alternatively wedge the worker by having it do a sync XHR to a helper server that just hangs the request until told to release it.
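A minimal sketch of that alternative, assuming a hypothetical /hang helper endpoint on the test server that holds the response open until told to release it:

```js
// Inside the worker: a synchronous XHR blocks the worker thread until the
// server responds, wedging the worker without busy work.
const xhr = new XMLHttpRequest();
xhr.open("GET", "/hang", false); // third argument false = synchronous
xhr.send(); // the worker stays parked here until /hang is released
```

Unlike the reportError() approach, this holds the worker without burning CPU on string allocation and GC.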
Comment 19•2 years ago
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #18)
> Note that if reportError gets more efficient and we still want the coverage of blocking a worker, you can alternatively wedge the worker by having it do a sync XHR to a helper server that just hangs the request until told to release it.

BTW, would this also reduce the computing power needed for this test? If I read "busy doing an apparently long-running loop of allocating strings and GC-ing", it sounds as efficient as a while (true) { do some random alloc and free } and not like a desirable pattern to use?
Edit: Sorry, "(Yeah, that's weird, but the fuzzer wrote that, not me 😉)" from comment 16 shows that there is no pattern here, but copying fuzzer testcases into regular tests probably needs some caution.
Comment 20•2 years ago
Weirdly enough, I can't reproduce this behavior? I tried putting reportError(new Int16Array(2147483648)) into a worker and it seemed to run quickly. It sounds like we are trying to stringify the Int16Array (which would allocate a string of "0," x 2147483648), but that really shouldn't be happening when running with NoSideEffects. We should just produce the string "Object".
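A rough version of that repro attempt (a hypothetical harness, not the one actually used):

```js
// Time reportError() on a huge typed array inside a worker. If
// NoSideEffects short-circuits stringification to "Object", this should
// report a small number; a multi-second result would indicate the full
// element-by-element stringification.
const src = `
  const t0 = performance.now();
  self.reportError(new Int16Array(2147483648));
  postMessage(performance.now() - t0);
`;
const w = new Worker(
  URL.createObjectURL(new Blob([src], { type: "text/javascript" }))
);
w.onmessage = (e) => console.log("reportError took", e.data, "ms");
```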
Comment 21•2 years ago
Yeah, reportError(new Int16Array(2147483648)) on the main thread also runs blazingly fast, but then everything got stuck when I tried new Error(new Int16Array(2147483648)). I wonder whether we are somehow seeing that behavior here, although I'm not sure why.
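A small-scale illustration of why the Error constructor path can hang (my sketch, not from the bug): new Error(x) coerces a non-string x to a string, and a typed array stringifies by joining every element with commas:

```js
// With a tiny array the coercion is harmless and easy to see:
const small = new Int16Array(4);
console.log(new Error(small).message); // "0,0,0,0"
// With 2147483648 elements the same join would have to build a string of
// roughly 4 GiB ("0," per element), allocating and GC-ing the whole way.
```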
Comment 22•2 years ago
Maybe because WPT needs to stringify things to stdout?
Comment 23•2 years ago
BTW, Chrome throws Uncaught RangeError: Array buffer allocation failed with new Int16Array(2147483648), and crashes early with new Error(new Int16Array(214748364)) (without the last digit).
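A hedged sketch of what that allocation difference looks like (engine behavior varies; the catch branch is what Chrome reportedly does per the comment above):

```js
// 2147483648 Int16 elements * 2 bytes = a 4 GiB backing buffer, which an
// engine may refuse outright with a RangeError.
try {
  const big = new Int16Array(2147483648);
  console.log("allocation succeeded, length:", big.length);
} catch (e) {
  console.log("allocation failed:", e.name); // "RangeError" in Chrome
}
```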
Comment 24•2 years ago
Has anyone seen an instance of the failure where we get a stack trace for the worker thread in question? In all the logs I went through manually, it seemed like the test infrastructure was throwing errors and not providing such stacks, and a (cursory) look at the other artifacts suggested none of them had the stacks either.
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 30•2 years ago
IIUC, the test after-worker-termination.https.html is active on all platforms, and we have not seen any instances of this failure for 6 months now. Am I overlooking something that would prevent us from closing this bug?
Comment 31•2 years ago
I think nothing blocks us, but I wonder why the release management bot is not autoclosing this. Passing NI...
Comment 32•2 years ago
(In reply to Kagami :saschanaz from comment #31)
> I think nothing blocks us, but I wonder why the release management bot is not autoclosing this. Passing NI...
The bot did not act on this bug because it does not have the intermittent-failure keyword. It was dropped 10 months ago.
Comment 33•2 years ago
Thanks! Then I'll close this manually.