Closed Bug 1749002 Opened 2 years ago Closed 2 years ago

Permanent Windows web platform crashtest Hit MOZ_CRASH(Shutdown hanging after all known phases and workers finished.) at /builds/worker/checkouts/gecko/toolkit/components/terminator/nsTerminator.cpp:256 - DO NOT USE FOR CLASSIFICATION

Categories

(Core :: DOM: Core & HTML, defect, P5)

Tracking

RESOLVED INCOMPLETE
Tracking Status
firefox-esr91 --- unaffected
firefox95 --- unaffected
firefox96 --- unaffected
firefox97 --- wontfix
firefox98 --- wontfix

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Regression)

Details

(Keywords: regression)

Filed by: archaeopteryx [at] coole-files.de
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=363342503&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Fy76AmC6S_aVGU-3H5x1tg/runs/0/artifacts/public/logs/live_backing.log


Since bug 1746186 landed, the Windows web platform crashtests log this shutdown crash.
Summary: Intermittent Hit MOZ_CRASH(Shutdown hanging after all known phases and workers finished.) at /builds/worker/checkouts/gecko/toolkit/components/terminator/nsTerminator.cpp:256 → Permanent Windows web platform crashtest Hit MOZ_CRASH(Shutdown hanging after all known phases and workers finished.) at /builds/worker/checkouts/gecko/toolkit/components/terminator/nsTerminator.cpp:256

I don't know how to debug this. Bug 1741182 might actually be related.

Flags: needinfo?(evilpies)
See Also: → 1741182

Fwiw I doubt bug 1746186 actually introduced this issue; it's much more likely that it comes from the initial landing of reportError. Or maybe the new test somehow triggers it.

web-locks/crashtests/after-worker-termination.https.html sounds very suspicious!

Summary: Permanent Windows web platform crashtest Hit MOZ_CRASH(Shutdown hanging after all known phases and workers finished.) at /builds/worker/checkouts/gecko/toolkit/components/terminator/nsTerminator.cpp:256 → Permanent Windows web platform crashtest Hit MOZ_CRASH(Shutdown hanging after all known phases and workers finished.) at /builds/worker/checkouts/gecko/toolkit/components/terminator/nsTerminator.cpp:256 - DO NOT USE FOR CLASSIFICATION

Kagami, could you take a look at this permanent Windows failure line, which doesn't cause the task to fail?

Flags: needinfo?(krosylight)

Set release status flags based on info from the regressing bug 1746186

Is this still an issue? I can't repro the crash in my local environment (Windows debug build).

> ./mach wpt testing/web-platform/tests/web-locks/crashtests/after-worker-termination.https.html
(...)
web-platform-test
~~~~~~~~~~~~~~~~~
Ran 1 checks (1 tests)
Expected results: 1
Unexpected results: 0
OK
Flags: needinfo?(krosylight)

In that case, either the test is not the culprit or the crash is caused by some interaction between multiple tests...

Has Regression Range: --- → yes

Jens, could bug 1741182 be related here?

Flags: needinfo?(jstutte)

Kagami, could it be that self.reportError(new Int16Array(2147483648)) behaves so differently with JS::ErrorReportBuilder::NoSideEffects that the test starts to behave strangely? Without knowing any details, just from reading this, I could imagine that passing a big array now causes a printout of the entire array, where before we maybe had a more intelligent way of handling it? (Honestly, I do not understand the purpose of that test at all, but I assume that is due to my lack of context.)

Flags: needinfo?(jstutte) → needinfo?(krosylight)

When I wrote that test, worker shutdown also shut reportError down (so that locks.request() could run after shutdown in a consistent way, just as it did in the corresponding fuzzer testcase). I guess worker shutdown cannot abort reportError anymore after bug 1746186.

I wonder whether mgaudet has a better understanding, since this is related to the JS API.

Flags: needinfo?(krosylight) → needinfo?(mgaudet)

I actually don't entirely understand what's going on here (I don't entirely follow the intent of the test case), and this also seems to be related to worker shutdown, which I'm particularly ill-versed in.

I'd love to chime in with more, but alas, this is a little too far outside my current understanding.

Flags: needinfo?(mgaudet)

Okay, as this indeed is related to worker shutdown, I wonder whether Andrew has some idea here... If not, I guess I'll have to dig into both APIs 👀

Flags: needinfo?(bugmail)

> I don't entirely follow the intent of the test case

The reportError() is there only to wait for the worker shutdown, assuming the shutdown will abort the long-running reportError call. And it did when I wrote that test, which has seemingly now regressed.

(Yeah that's weird, but fuzzer wrote that, not me 😉)
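
For anyone without the tree handy, here is a rough sketch of the shape I mean (illustrative only, not the actual WPT file; the structure and the Blob-worker setup are simplified for this comment):

    // Illustrative sketch, not the real after-worker-termination test.
    // The worker queues a Web Locks request and then blocks in a long-running
    // reportError() call; terminating it from the main thread is what used to
    // abort that call and exercise the "lock request after shutdown" path.
    const workerSource = `
      navigator.locks.request("l", async () => {});   // queue a lock request
      self.reportError(new Int16Array(2147483648));   // wedge the worker
    `;
    const worker = new Worker(
      URL.createObjectURL(new Blob([workerSource], { type: "text/javascript" }))
    );
    // Tear the worker down while it is (expected to be) stuck in reportError().
    setTimeout(() => worker.terminate(), 100);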

It looks like Jens' observation in comment 12 and :saschanaz's supposition in comment 16 are likely correct. Based on my brief investigation of the pernosco trace from bug 1741186 (which I link right here for simplicity), I'm seeing two calls to reportError in that trace. One never completes and is still busy in an apparently long-running loop of allocating strings and GC-ing. The other, on the thread where we see the error, appears to return with "uncaught exception: unknown (can't convert to string)", which is a result of WorkerPrivate::InterruptCallback being called and returning false.

Presumably reportError was previously interruptible and no longer is with the changes in bug 1746186. :evilpie, can you confirm whether the changes in bug 1746186 would have changed this? It does seem like those changes were desirable, so maybe we can add the interruptibility back in? If not, maybe reportError can be made to apply some kind of time and/or size bound on the error it reports, so that self.reportError(new Int16Array(2147483648)) can't cause major problems for the browser?

Flags: needinfo?(bugmail) → needinfo?(evilpies)

Note that if reportError gets more efficient and we still want the coverage of blocking a worker, you can alternatively wedge the worker by having it do a sync XHR to a helper server that just hangs the request until told to release it.
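
A minimal sketch of that approach, assuming a hypothetical helper endpoint (called /hang.py here; the name and query parameter are made up) that simply holds the response open until the harness tells it to release:

    // Inside the worker (sketch only).
    const xhr = new XMLHttpRequest();
    xhr.open("GET", "/hang.py?key=wedge-1", /* async = */ false); // sync XHR blocks the worker thread
    xhr.send();   // the worker stays parked here until the server finally responds
    // The page can then terminate() the worker (or let the test end) while the
    // worker is reliably wedged, without a busy loop of allocation and GC.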

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #18)

> Note that if reportError gets more efficient and we still want the coverage of blocking a worker, you can alternatively wedge the worker by having it do a sync XHR to a helper server that just hangs the request until told to release it.

BTW, would this also reduce the computing power needed for this test? When I read "busy doing an apparently long-running loop of allocating strings and GC-ing", it sounds about as efficient as a while (true) { do some random alloc and free } loop, and not like a desirable pattern to use.

Edit: Sorry, "(Yeah that's weird, but fuzzer wrote that, not me 😉)" from comment 16 shows that there is no intended pattern here, but copying fuzzer testcases into regular tests probably needs some caution.

Weirdly enough, I can't reproduce this behavior? I tried putting reportError(new Int16Array(2147483648)) into a worker and it seemed to run quickly. It sounds like we are trying to stringify the Int16Array (which would allocate a string of "0," x 2147483648), but that really shouldn't be happening when running with NoSideEffects. We should just produce the string "Object".

Flags: needinfo?(evilpies)

Yeah, reportError(new Int16Array(2147483648)) on the main thread also runs blazingly fast, but everything got stuck when I tried new Error(new Int16Array(2147483648)). I wonder whether we are somehow seeing that behavior here, although I'm not sure why.

Maybe because WPT needs to stringify things to stdout?

BTW, Chrome throws Uncaught RangeError: Array buffer allocation failed with new Int16Array(2147483648), and crashes early with new Error(new Int16Array(214748364)) (the same number without its last digit).
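
My understanding of why those two behave so differently (take this with a grain of salt, I have not verified it against the engine internals): the Error constructor coerces its message argument with ToString, so new Error(hugeTypedArray) has to join every element into one giant "0,0,0,..." string, while reportError itself, as :evilpie says, should just produce "Object" under NoSideEffects. A reduced illustration:

    // Sizes reduced so this actually runs; with 2147483648 elements the joined
    // string would be multiple gigabytes.
    const big = new Int16Array(1 << 20);  // stand-in for the huge array

    self.reportError(big);                // fires an error event for the value;
                                          // no huge up-front stringification needed

    // new Error(message) does ToString(message), which for a typed array means
    // joining all elements with commas:
    const msg = String(big);
    console.log(msg.length);              // 2 * (1 << 20) - 1 for the all-zero array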

Has anyone seen an instance of the failure where we get a stack trace for the worker thread in question? In all the logs I manually looked at, it seemed like the test infrastructure was throwing errors and not providing such stacks. And a (cursory) look through the other artifacts seemed to indicate that none of them contained the stacks either.

IIUC, the test after-worker-termination.https.html is active on all platforms and we have not seen any instances of this for 6 months now. Am I overlooking something that would prevent us from closing this bug?

Flags: needinfo?(krosylight)

I think nothing blocks us, but I wonder why the release management bot is not autoclosing this. Passing NI...

Flags: needinfo?(smujahid)
Flags: needinfo?(krosylight)
Flags: needinfo?(cdenizet)

(In reply to Kagami :saschanaz from comment #31)

> I think nothing blocks us, but I wonder why the release management bot is not autoclosing this. Passing NI...

The bot did not act on this bug because it no longer has the intermittent-failure keyword; it was dropped 10 months ago.

Flags: needinfo?(smujahid)
Flags: needinfo?(cdenizet)

Thanks! Then I'll close this manually.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INCOMPLETE