Closed Bug 1592227 Opened 5 years ago Closed 7 months ago

Worker busy counts probably erroneously reaching zero and causing worker garbage collection (GC) - WASM in WebWorker terminates (crashes?) without notice

Categories

(Core :: DOM: Workers, defect, P2)

70 Branch
defect

Tracking

()

RESOLVED FIXED
Tracking Status
firefox121 --- fixed
firefox122 --- fixed

People

(Reporter: mail, Unassigned)

References

(Blocks 1 open bug)

Details

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0

Steps to reproduce:

Running ffmpeg/WASM video encoding in a WebWorker.

Test case (use the demo file, the link labeled "this one"):
https://phoboslab.org/files/ffmpeg-mt/

Actual results:

Workers terminates after a few seconds without any error message. Reproducible in Firefox70/Win and macOS.

Expected results:

Workers should complete or at least produce an error message.

The call into the WASM _main function never returns and doesn't throw an error. I.e. neither "called _main" nor "error" is logged for:

try {
console.log('calling _main');
module.functions._main(argc, argv);
console.log('called _main');
} catch (e) {
console.log('error', e);
}

Opening the JS Debugger Console in Firefox (and especially setting break points) seems to delay or sometimes completely mitigate the bug.

Using just one encoding Worker makes no difference. The Worker just terminates after a few seconds.

The ffmpeg version used here is compiled to WASM with emcc -O3. It's also reproducible with -Os or -O2. The test case completes just fine when compiled without optimizations or with SAFE_HEAP=1 or STACK_OVERFLOW_CHECK=1, making it hard to debug.

However, I believe this is not a problem with the WASM or JS-glue code. It works fine on Chrome/Win, Chrome/macOS, Chrome/Android, Safari/macOS, Safari/iOS and Edge/Win.

Component: Untriaged → DOM: Workers
Product: Firefox → Core

Thank you for reporting this and providing a page that makes it easy to attempt to reproduce the problem!

Analysis

By inference, it seems like the workers must be getting erroneously garbage collected. This implies the (internal implementation detail) worker's busy count is erroneously reaching zero when the Worker is still active. This allows the cycle collector to decide that the "message" event listener on the worker added via onmessage has a cyclical reference to the Worker that can be ignored and the Worker can be garbage collected. (The "clever" busy tracking is why there's no call to DOMEventTargetHelper::KeepAliveIfHasListenersFor for "message" events.)

This would explain why devtools would eliminate or at least mitigate the problem.

Looking at https://phoboslab.org/files/ffmpeg-mt/ffmpeg-worker-mp4.js it appears that the steady-state execution occurs in the self.onmessage handler for type=="run" after the initial self.postMessage({type: "ready") was sent as a ping that the "run" pongs. The Worker does also echo its own {type: "run"} when starting.

The worker binding's postMessage is explicitly WorkerThreadModifyBusyCount at https://searchfox.org/mozilla-central/rev/1fe0cf575841dbf3b7e159e88ba03260cd1354c0/dom/workers/Worker.cpp#88 so it's not immediately obvious what could be going wrong. The STR should be able to reproduce this under RR without too much hassle (although the inherent serialization of rr will slow things down; thankfully the computation is deterministic!).

Workaround

The workaround for the time being is that you would want to do something like have a global that tracks the workers that should be alive. So something like:

let activeWorkers = new Set();
...
let worker = new Worker("ffmpeg-worker-mp4.js");
activeWorkers.add(worker);
...
worker.terminate();
activeWorkers.delete(worker);
Priority: -- → P2
Summary: WASM in WebWorker terminates (crashes?) without notice → Worker busy counts probably erroneously reaching zero and causing worker garbage collection (GC) - WASM in WebWorker terminates (crashes?) without notice
Severity: normal → S3
Status: UNCONFIRMED → NEW
Ever confirmed: true

I believe this was fixed by our comprehensive overhaul of worker busy tracking in bug 1836700 and its follow-ups (and its predecessors). I am unable to reproduce with the fantastic test case at https://phoboslab.org/files/ffmpeg-mt/ using the "this one" link from comment 0 at this time, and it appears that the page is not attempting to mitigate the problem.

I am going to mark this as fixed rather than WFM or duping the bug because of the test case and because the bug was absolutely valid.

Dominic, thank you again for the test case and reporting this. I'm setting a needinfo if you could help confirm that you don't see this problem or variants of it and/or that your linked test case did not have any mitigations put in place. Note that the fix is not in release yet, but will be soon as the fixes are in Firefox 121 which should begin release rollout on December 19th.

Status: NEW → RESOLVED
Closed: 7 months ago
Depends on: 1836700
Flags: needinfo?(mail)
Resolution: --- → FIXED

The test case at https://phoboslab.org/files/ffmpeg-mt/ is indeed unaltered from when I first reported this problem (I made a separate version with your proposed workaround here https://phoboslab.org/files/ffmpeg-mt-fixed/ )

I'm not sure what to make of this, but for me it works fine with Firefox 120 on Linux/Mac/Win over a number of runs. So I guess there were some mitigations in the meantime!?

Worth noting that the test case consistently failed with Firefox 70. So if it works for you with 121, I believe that we can call it fixed. Thanks!

Flags: needinfo?(mail)
Blocks: 1869046

(In reply to Dominic Szablewski from comment #5)

I'm not sure what to make of this, but for me it works fine with Firefox 120 on Linux/Mac/Win over a number of runs. So I guess there were some mitigations in the meantime!?

It's quite possible that other fixes related to workers may have fixed the problem incidentally for your use case, yes.

Worth noting that the test case consistently failed with Firefox 70. So if it works for you with 121, I believe that we can call it fixed. Thanks!

Great, thank you!

You need to log in before you can comment on or make changes to this bug.