1749002 - Permanent Windows web platform crashtest Hit MOZ_CRASH(Shutdown hanging after all known phases and workers finished.) at /builds/worker/checkouts/gecko/toolkit/components/terminator/nsTerminator.cpp:256 - DO NOT USE FOR CLASSIFICATION

Reporter

Description

•

4 years ago

treeherder

Filed by: archaeopteryx [at] coole-files.de
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=363342503&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Fy76AmC6S_aVGU-3H5x1tg/runs/0/artifacts/public/logs/live_backing.log

Since bug 1746186 landed, the Windows web platform crashtests log this shutdown crash.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 1

•

4 years ago

Tom, could you take a look at this?

status-firefox95: --- → unaffected

status-firefox96: --- → unaffected

status-firefox97: --- → affected

status-firefox-esr91: --- → unaffected

Flags: needinfo?(evilpies)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

4 years ago

Summary: Intermittent Hit MOZ_CRASH(Shutdown hanging after all known phases and workers finished.) at /builds/worker/checkouts/gecko/toolkit/components/terminator/nsTerminator.cpp:256 → Permanent Windows web platform crashtest Hit MOZ_CRASH(Shutdown hanging after all known phases and workers finished.) at /builds/worker/checkouts/gecko/toolkit/components/terminator/nsTerminator.cpp:256

Tom S. (please needinfo tschuster)

Comment 2

•

4 years ago

I don't know how to debug this. Bug 1741182 might actually be related.

Flags: needinfo?(evilpies)

Tom S. (please needinfo tschuster)

Updated

•

4 years ago

Comment 3

•

4 years ago

Fwiw I doubt bug 1746186 actually introduced this issue, it's much more likely that the issue comes from the initial landing of reportError. Or maybe the new test somehow triggers it.

Tom S. (please needinfo tschuster)

Comment 4

•

4 years ago

web-locks/crashtests/after-worker-termination.https.html sounds very suspicious!

Comment hidden (Intermittent Failures Robot)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

4 years ago

Keywords: intermittent-failure

Summary: Permanent Windows web platform crashtest Hit MOZ_CRASH(Shutdown hanging after all known phases and workers finished.) at /builds/worker/checkouts/gecko/toolkit/components/terminator/nsTerminator.cpp:256 → Permanent Windows web platform crashtest Hit MOZ_CRASH(Shutdown hanging after all known phases and workers finished.) at /builds/worker/checkouts/gecko/toolkit/components/terminator/nsTerminator.cpp:256 - DO NOT USE FOR CLASSIFICATION

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 6

•

4 years ago

Kagami, could you take a look at this permanent Windows failure line which doesn't let the task fail?

Flags: needinfo?(krosylight)

BugBot [:suhaib / :marco/ :calixte]

Comment 7

•

4 years ago

Set release status flags based on info from the regressing bug 1746186

status-firefox98: --- → affected

Kagami Rosylight [:saschanaz] (they/them)

Comment 8

•

4 years ago

•

Edited

Is this still an issue? I can't repro the crash on my local environment (Windows debug build).

> ./mach wpt testing/web-platform/tests/web-locks/crashtests/after-worker-termination.https.html
(...)
web-platform-test
~~~~~~~~~~~~~~~~~
Ran 1 checks (1 tests)
Expected results: 1
Unexpected results: 0
OK

Flags: needinfo?(krosylight)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 9

•

4 years ago

Yes, this is still permanent in CI for Windows 64-bit debug runs.

Kagami Rosylight [:saschanaz] (they/them)

Comment 10

•

4 years ago

In that case maybe either the test is not the culprit or it's caused by some interaction of multiple tests...

BMO Automation

Updated

•

4 years ago

Has Regression Range: --- → yes

Christian Holler (:decoder)

Comment 11

•

4 years ago

Jens, could bug 1741182 be related here?

Flags: needinfo?(jstutte)

Jens Stutte [:jstutte]

Comment 12

•

4 years ago

Kagami, could it be that self.reportError(new Int16Array(2147483648)) is behaving such differently with JS::ErrorReportBuilder::NoSideEffects that the test starts to behave strange? Without knowing any details but reading this I could imagine that passing a big array now causes a printout of the entire array where before we had maybe some more intelligent way to handle this? (Honestly I do not understand the purpose of that test at all, but that is for my lack of context, I assume.)

Flags: needinfo?(jstutte) → needinfo?(krosylight)

Kagami Rosylight [:saschanaz] (they/them)

Comment 13

•

4 years ago

•

Edited

When I wrote that test, worker shutdown also shut reportError down (so that locks.request() can run after shutdown in consistent way, just as it did in the corresponding fuzzer testcase). I guess worker shutdown now cannot abort reportError anymore after bug 1746186.

I wonder mgaudet has better understanding since it's related to JS API.

Flags: needinfo?(krosylight) → needinfo?(mgaudet)

Matthew Gaudet (he/him) [:mgaudet]

Comment 14

•

4 years ago

I actually don't entirely understand what's going on here (I don't entirely follow the intent of the test case) also, as this seems to be related to worker shutdown as well, which I'm particularly ill-versed in.

I'd love to chime in with more, but alas, a little too outside my current understanding.

Flags: needinfo?(mgaudet)

Kagami Rosylight [:saschanaz] (they/them)

Comment 15

•

4 years ago

Okay, as this indeed is related to worker shutdown, I wonder Andrew has some idea here... If not I guess I'll have to dig into both APIs 👀

Flags: needinfo?(bugmail)

Kagami Rosylight [:saschanaz] (they/them)

Comment 16

•

4 years ago

•

Edited

I don't entirely follow the intent of the test case

The reportError() is there only to wait for the worker shutdown, assuming the shutdown will abort the long-running reportError call. And it did when I wrote that test, which seemingly is now regressed.

(Yeah that's weird, but fuzzer wrote that, not me 😉)

Andrew Sutherland [:asuth] (he/him)

Comment 17

•

4 years ago

•

Edited

It looks like Jens' observation in comment 12 and :saschanaz's supposition in comment 16 are likely correct. Based on my brief investigation of the pernosco trace from bug 1741186 that I link right here for simplicity, I'm seeing that in that trace, there are 2 calls to reportError, one of which never completes (and is still busy doing an apparently long-running loop of allocating strings and GC-ing), and the one which is on the thread where we see the error where it seems like it returns with "uncaught exception: unknown (can\'t convert to string)" which is a result of the WorkerPrivate::InterruptCallback being called and returning false.

Presumably if reportError was previously interruptible, it no longer is with the changes in bug 1746186. :evilpie, can you confirm if the changes in bug 1746186 would have changed this? It does seem like those changes were desirable, so maybe we can add interruptibility back in? If not, maybe reportError can be made to provide some kind of time and/or size bounds on the error it reports so self.reportError(new Int16Array(2147483648)) can't cause major problems for the browser?

Flags: needinfo?(bugmail) → needinfo?(evilpies)

Andrew Sutherland [:asuth] (he/him)

Comment 18

•

4 years ago

Note that if reportError gets more efficient and we still want the coverage of blocking a worker, you can alternately wedge the worker by having it do a sync XHR to a helper server that just hangs the request until told to release it.

Jens Stutte [:jstutte]

Comment 19

•

4 years ago

•

Edited

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #18)

Note that if reportError gets more efficient and we still want the coverage of blocking a worker, you can alternately wedge the worker by having it do a sync XHR to a helper server that just hangs the request until told to release it.

BTW, would this reduce also the computing power needed for this test? If I read "busy doing an apparently long-running loop of allocating strings and GC-ing" it reads as efficient as a while (true) {do some random alloc and free} and not like a desirable pattern to use?

Edit: Sorry, "(Yeah that's weird, but fuzzer wrote that, not me 😉)" from comment 16 shows that there is no pattern, but copying fuzzer testcases to become regular tests needs some caution, probably.

Tom S. (please needinfo tschuster)

Comment 20

•

4 years ago

Weirdly enough I can't reproduce this behavior? I tried putting reportError(new Int16Array(2147483648)) into a worker and it seemed to run quickly. It sounds like were are trying to stringify the Int16Array (which would allocate a string with "0," x 2147483648), but that really shouldn't be happening when running with NoSideEffects. We should just produce the string "Object".

Flags: needinfo?(evilpies)

Kagami Rosylight [:saschanaz] (they/them)

Comment 21

•

4 years ago

Yeah, reportError(new Int16Array(2147483648)) on the main thread also runs blazingly fast, but then everything got stuck when I tried new Error(new Int16Array(2147483648)), I wonder we are somehow seeing that behavior here, although not sure why.

Kagami Rosylight [:saschanaz] (they/them)

Comment 22

•

4 years ago

Maybe because WPT needs to stringify things to stdout?

Kagami Rosylight [:saschanaz] (they/them)

Comment 23

•

4 years ago

•

Edited

BTW, Chrome throws Uncaught RangeError: Array buffer allocation failed with new Int16Array(2147483648), and crashes early with new Error(new Int16Array(214748364)) (without the last digit).

Ryan VanderMeulen [:RyanVM]

Updated

•

4 years ago

status-firefox97: affected → wontfix

Andrew Sutherland [:asuth] (he/him)

Comment 24

•

4 years ago

Has anyone seen an instance of the failure where we get a stack trace for the worker thread in question? In all the logs I manually went to look at, it seemed like the test infrastructure was throwing errors and not providing such stacks. And a (cursory) investigation of the other artifacts seemed to indicate there was no artifact that had them.

Comment hidden (Intermittent Failures Robot)

Pascal Chevrel:pascalc (PTO until Sept 2)

Updated

•

3 years ago

status-firefox98: affected → wontfix

Comment hidden (Intermittent Failures Robot)

Jens Stutte [:jstutte]

Comment 30

•

3 years ago

IIUC, the test after-worker-termination.https.html is active on all platforms and we do not see any instances of this for 6 months now. Am I overlooking something that could prevent us from closing this bug?

Flags: needinfo?(krosylight)

Kagami Rosylight [:saschanaz] (they/them)

Comment 31

•

3 years ago

•

Edited

I think nothing blocks us, but I wonder why the release management bot is not autoclosing this. Passing NI...

Flags: needinfo?(smujahid)

Flags: needinfo?(krosylight)

Flags: needinfo?(cdenizet)

Suhaib Mujahid [:suhaib]

Comment 32

•

3 years ago

(In reply to Kagami :saschanaz from comment #31)

I think nothing blocks us, but I wonder why the release management bot is not autoclosing this. Passing NI...

The bot did not act on this bug because it does not have the intermittent-failure keyword. It was dropped 10 months ago.

Flags: needinfo?(smujahid)

Flags: needinfo?(cdenizet)

Kagami Rosylight [:saschanaz] (they/them)

Comment 33

•

3 years ago

Thanks! Then I'll close this manually.

Kagami Rosylight [:saschanaz] (they/them)

Updated

•

3 years ago

Status: NEW → RESOLVED

Closed: 3 years ago

Resolution: --- → INCOMPLETE