chess.com post-game analysis gets stuck after bug 1728741
Categories
(Core :: DOM: Core & HTML, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr91 | --- | unaffected |
firefox95 | --- | unaffected |
firefox96 | + | fixed |
firefox97 | + | fixed |
People
(Reporter: emilio, Unassigned)
References
(Regression, )
Details
(Keywords: regression)
Attachments
(1 obsolete file)
STR: On a many-core machine (128 cores in this case) go to a random chess.com game like the one from the URL.
AR:
- The "Checking for mistakes..." progress bar gets stuck.
ER:
- The "Checking for mistakes..." process finishes soon enough.
Reporter | ||
Comment 1•3 years ago
|
||
It doesn't happen in Chromium, which also reports navigator.hardwareConcurrency==128
for me, so presumably is our bug, somewhere.
Comment 2•3 years ago
|
||
That's interesting, I thought this analysis happened server-side, but if it regressed after bug 1728741 then I guess not.
I have contacts there if we need specific information.
Comment 3•3 years ago
|
||
Workaround: don't make chess mistake ;)
Comment 4•3 years ago
|
||
emilio, did bug 1745428 fix this or why was the dependency added?
Reporter | ||
Comment 5•3 years ago
|
||
It did not, I added it because it was a pre-requisite I landed for easier debugging... After taking a closer look, it seems like it's trying to launch a bunch of workers and they are getting stuck or something... Can you repro this? If not, do you know how should I go about debugging this?
Comment 6•3 years ago
|
||
I don't really have time to debug this right now, so I was thinking to back out bug 1728741.
But of course if we can sort this out quickly, then no need to back out.
hmm, except that the problematic patch is in beta.
Updated•3 years ago
|
Comment 7•3 years ago
|
||
Set release status flags based on info from the regressing bug 1728741
Comment 8•3 years ago
|
||
I thought it might have been because of dom.workers.maxPerDomain, but that is already quite large
Comment 9•3 years ago
|
||
Backed out Bug 1728741 for causing this regression.
Fixed by backout: https://hg.mozilla.org/releases/mozilla-beta/rev/9f3f27a1bac7f1655a0f0d62a914efd4fd5b9d61
Reporter | ||
Comment 10•3 years ago
|
||
I'll try to poke a bit, but pointers to debug this would be appreciated. Looking at MOZ_LOG=WorkerPrivate:5
seems to imply workers are doing stuff, but not sure any useful stuff, I see a bunch of:
[Child 174156: DOM Worker]: D/WorkerPrivate Worker 7f53f9ed5a00 canceled GC timer because periodic
[Child 174156: DOM Worker]: D/WorkerPrivate Worker 7f53f9ed6000 canceled GC timer because idle
[Child 174156: DOM Worker]: D/WorkerPrivate Worker 7f53f9ed6000 scheduled idle GC timer
[Child 174156: DOM Worker]: D/WorkerPrivate Worker 7f53f9ed5a00 scheduled periodic GC timer
[Child 174156: DOM Worker]: D/WorkerPrivate Worker 7f53f2b6be00 canceled GC timer because periodic
[Child 174156: DOM Worker]: D/WorkerPrivate Worker 7f53f9ed5a00 canceled GC timer because idle
[Child 174156: DOM Worker]: D/WorkerPrivate Worker 7f53f9ed5a00 scheduled idle GC timer
[Child 174156: DOM Worker]: D/WorkerPrivate Worker 7f53f2b6be00 scheduled periodic GC timer
Even way after the page has loaded. Maybe they have some synchronization in their worker code that is falling apart with many workers? But I would hope that'd break similarly in Chrome, and that's not the case, so there must be something else going on...
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Comment 11•3 years ago
|
||
So they create 64 dedicated workers with /bundles/app/js/engine/stockfish-single.830cf9cc.js#/bundles/app/js/engine/stockfish-single.8ffa2b70.wasm
. In the end, a bunch of them are still getting scheduled all the time... But I'm not sure why, and debugging minified/wasm-compiled stockfish isn't particularly trivial.
Gian-Carlo, perhaps you can ping them to see if they can take a quick look and know what might be going on? They probably can log what's going inside of the worker quite more easily than me...
Reporter | ||
Comment 13•3 years ago
|
||
I did some extra digging, though nothing conclusive so far (disclaimer: I'm not the most familiar with our workers code). Andrew mentioned bug 1592227 as a potential cause of this, but the symptoms here are quite different (and I tried to hackily keep the worker alive by artificially bumping by one the count and that made no difference). The symptoms here are just a worker getting scheduled over and over but not terminating, I think.
Something that has made this rather annoying to debug is the issue not reproducing under rr, btw... So it might be some sort of race condition (maybe in our code, maybe in the worker code, not sure), but I'll have to add lots of extra instrumentation to debug this I guess because I can't catch it in rr :/
Reporter | ||
Comment 14•3 years ago
|
||
Ok so digging a bit more before sleeping, I see the worker threads are still doing stuff. In particular, they seem to get stuck always scheduling new timeouts. We process a timeout and then the worker schedules another one 20ms timeout from:
[Child 323820: DOM Worker]: D/WorkerTimeouts Worker 7f2581abba00 has new timeout: delay=20 interval=no
0 ccall() ["https://www.chess.com/bundles/app/js/engine/stockfish-single.830cf9cc.js#/bundles/app/js/engine/stockfish-single.8ffa2b70.wasm":4:2231961]
So we really need to know why are they scheduling that 20ms timeout. Are they waiting for something that hasn't happened yet? Something else? It'd be great to be able to have the unminified source or some sort of debug logging... I've taken a look at https://github.com/nmrugg/stockfish.js but no obvious clue.
Reporter | ||
Comment 15•3 years ago
|
||
Ah, found it by prettifying just the chess.com stuff. It looks something like this:
workerObj = {
postMessage: function sendMessage(str, sync) {
function ccall() {
if (loaded) {
Module.ccall("uci_command", "number", ["string"], [cmds.shift()]);
} else {
setTimeout(ccall, 20);
}
}
cmds.push(str);
if (sync) {
ccall();
} else {
wait(ccall, 1);
}
},
};
wait(function() {
Module = load_stockfish(myConsole, WasmPath);
Module.onRuntimeInitialized = function() {
loaded = true;
};
if (Module.print) {
Module.print = myConsole.log;
}
if (Module.printErr) {
Module.printErr = myConsole.log;
}
}, 1);
So it seems onRuntimeInitialized
is never called on some worker threads... Why, I don't know yet.
Comment 16•3 years ago
|
||
Hi guys, how can I reproduce this issue? Do I need a 128 core machine?
It sounds like we may be getting a little too greedy with the number of engines we are spawning. There may be cases where we spawn more workers than needed on these high core machines and they sit idle. That sounds like an opportunity on our side.
Reporter | ||
Comment 17•3 years ago
|
||
Presumably you can use 128 in the places where right now you use navigator.hardwareConcurrency
and it should reproduce.
Comment 19•3 years ago
|
||
[Tracking Requested - why for this release]:
If this was bad enough to get backed out of beta we should probably track for 97 as well.
Updated•3 years ago
|
Updated•3 years ago
|
Comment 20•3 years ago
|
||
Hi Olli, can you please back it out from beta?
Updated•3 years ago
|
Comment 21•3 years ago
|
||
This was effectively fixed for 97+ by bug 1750868. We can add reverting that change to this bug for whenever the root cause gets sorted.
Updated•3 years ago
|
Comment 22•3 years ago
|
||
Lowering down the severity as this issue was fixed on beta/release by bug 1750868.
Updated•3 years ago
|
Comment 23•3 years ago
|
||
We fixed the issue in beta and release, so this isn't a WebCompat issue at this moment.
Comment 24•1 year ago
|
||
emilio, can you still see this on Nightly?
We've had https://hg.mozilla.org/mozilla-central/rev/c41adb249305400e28f11167907015072dc13cda for quite some time now.
Comment 26•11 months ago
|
||
Emilio, smaug, is there anything we need from chess.com to debug this? Or is it just a matter of finding time on our side?
Reporter | ||
Comment 27•11 months ago
|
||
I don't think we have any evidence to think this is a Firefox bug, off-hand... There's no hint in the devtools about an error or anything of that sort. A chess.com person analyzing this / being able to debug what's going on in there and why clamping the exposed value fixes it would be great.
Updated•11 months ago
|
Updated•11 months ago
|
Description
•