Bug 1745425 (Open), opened 3 years ago, updated 11 months ago

chess.com post-game analysis gets stuck after bug 1728741

Categories

(Core :: DOM: Core & HTML, defect)

Tracking

Tracking Status
firefox-esr91 --- unaffected
firefox95 --- unaffected
firefox96 + fixed
firefox97 + fixed

People

(Reporter: emilio, Unassigned)

References

(Regression)

Details

(Keywords: regression)

Attachments

(1 obsolete file)

STR: On a many-core machine (128 cores in this case), go to a random chess.com game like the one from the URL.

AR:

  • The "Checking for mistakes..." progress bar gets stuck.

ER:

  • The "Checking for mistakes..." process finishes soon enough.

It doesn't happen in Chromium, which also reports navigator.hardwareConcurrency==128 for me, so presumably it's our bug, somewhere.
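
For reference, the value a page sees can be checked directly from the devtools console (this is just the standard API, nothing chess.com-specific):

    // Devtools console: the value the page's scripts will see.
    console.log(navigator.hardwareConcurrency); // 128 on this machine, in both Firefox and Chromium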

Depends on: 1745428

That's interesting, I thought this analysis happened server-side, but if it regressed after bug 1728741 then I guess not.

I have contacts there if we need specific information.

Workaround: don't make chess mistakes ;)

emilio, did bug 1745428 fix this or why was the dependency added?

Flags: needinfo?(emilio)

It did not; I added it because it was a pre-requisite I landed for easier debugging... After taking a closer look, it seems like it's trying to launch a bunch of workers and they are getting stuck or something... Can you repro this? If not, do you know how I should go about debugging this?

Flags: needinfo?(emilio) → needinfo?(bugs)

I don't really have time to debug this right now, so I was thinking of backing out bug 1728741.
But of course, if we can sort this out quickly, then there's no need to back it out.

Hmm, except that the problematic patch is in beta.

Flags: needinfo?(bugs)

Set release status flags based on info from the regressing bug 1728741

I thought it might have been because of dom.workers.maxPerDomain, but that is already quite large.
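
(For reference, the current limit can be read from the Browser Console; a minimal sketch, assuming a chrome-privileged context where the Services global is available:)

    // Browser Console (chrome-privileged): read the current dom.workers.maxPerDomain value.
    console.log(Services.prefs.getIntPref("dom.workers.maxPerDomain"));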

Flags: needinfo?(bugs)

I'll try to poke at it a bit, but pointers for debugging this would be appreciated. The output of MOZ_LOG=WorkerPrivate:5 seems to imply the workers are doing stuff, though I'm not sure it's anything useful; I see a bunch of:

[Child 174156: DOM Worker]: D/WorkerPrivate Worker 7f53f9ed5a00 canceled GC timer because periodic
[Child 174156: DOM Worker]: D/WorkerPrivate Worker 7f53f9ed6000 canceled GC timer because idle
[Child 174156: DOM Worker]: D/WorkerPrivate Worker 7f53f9ed6000 scheduled idle GC timer
[Child 174156: DOM Worker]: D/WorkerPrivate Worker 7f53f9ed5a00 scheduled periodic GC timer
[Child 174156: DOM Worker]: D/WorkerPrivate Worker 7f53f2b6be00 canceled GC timer because periodic
[Child 174156: DOM Worker]: D/WorkerPrivate Worker 7f53f9ed5a00 canceled GC timer because idle
[Child 174156: DOM Worker]: D/WorkerPrivate Worker 7f53f9ed5a00 scheduled idle GC timer
[Child 174156: DOM Worker]: D/WorkerPrivate Worker 7f53f2b6be00 scheduled periodic GC timer

Even long after the page has loaded. Maybe they have some synchronization in their worker code that falls apart with many workers? But I would hope that'd break similarly in Chrome, and that's not the case, so there must be something else going on...

Flags: needinfo?(emilio)

So they create 64 dedicated workers with /bundles/app/js/engine/stockfish-single.830cf9cc.js#/bundles/app/js/engine/stockfish-single.8ffa2b70.wasm. In the end, a bunch of them are still getting scheduled all the time... But I'm not sure why, and debugging minified/wasm-compiled stockfish isn't particularly trivial.
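
For reference, a minimal sketch of what spawning that pool presumably looks like; the worker count and the script URL come from the observation above, while the surrounding pool-management code is an assumption on my part:

    // Sketch only: 64 dedicated workers all loading the same script. The
    // "#..." fragment presumably tells the script which .wasm file to fetch.
    const ENGINE_URL =
      "/bundles/app/js/engine/stockfish-single.830cf9cc.js" +
      "#/bundles/app/js/engine/stockfish-single.8ffa2b70.wasm";
    const pool = Array.from({ length: 64 }, () => new Worker(ENGINE_URL));
    pool.forEach((w) =>
      w.addEventListener("message", (e) => {
        // Engine output (UCI lines) would be handled here.
      })
    );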

Gian-Carlo, perhaps you can ping them to see if they can take a quick look and might know what's going on? They can probably log what's going on inside the worker rather more easily than I can...

Flags: needinfo?(gpascutto)

I reached out with you in CC.

Flags: needinfo?(gpascutto)

I did some extra digging, though nothing conclusive so far (disclaimer: I'm not the most familiar with our workers code). Andrew mentioned bug 1592227 as a potential cause of this, but the symptoms here are quite different (and I tried to hackily keep the worker alive by artificially bumping the count by one, and that made no difference). The symptom here is just a worker getting scheduled over and over but not terminating, I think.

Something that has made this rather annoying to debug is that the issue doesn't reproduce under rr, btw... So it might be some sort of race condition (maybe in our code, maybe in the worker code, not sure), but I guess I'll have to add lots of extra instrumentation to debug this, because I can't catch it in rr :/

OK, so digging a bit more before sleeping, I see the worker threads are still doing stuff. In particular, they seem to get stuck scheduling new timeouts over and over: we process a timeout, and then the worker schedules another 20ms timeout from:

[Child 323820: DOM Worker]: D/WorkerTimeouts Worker 7f2581abba00 has new timeout: delay=20 interval=no
0 ccall() ["https://www.chess.com/bundles/app/js/engine/stockfish-single.830cf9cc.js#/bundles/app/js/engine/stockfish-single.8ffa2b70.wasm":4:2231961]

So we really need to know why they are scheduling that 20ms timeout. Are they waiting for something that hasn't happened yet? Something else? It'd be great to have the unminified source or some sort of debug logging... I've taken a look at https://github.com/nmrugg/stockfish.js but found no obvious clue.

Ah, found it by prettifying just the chess.com stuff. It looks something like this:

    workerObj = {
      postMessage: function sendMessage(str, sync) {
        function ccall() {
          if (loaded) {
            Module.ccall("uci_command", "number", ["string"], [cmds.shift()]);
          } else {
            // Not loaded yet: retry in 20ms. This is the repeating 20ms
            // timeout seen in the logs above; it spins until
            // onRuntimeInitialized flips `loaded` to true.
            setTimeout(ccall, 20);
          }
        }
        cmds.push(str);
        if (sync) {
          ccall();
        } else {
          wait(ccall, 1);
        }
      },
    };
    wait(function() {
      Module = load_stockfish(myConsole, WasmPath);
      // `loaded` only becomes true once the wasm runtime reports it is ready.
      Module.onRuntimeInitialized = function() {
        loaded = true;
      };
      if (Module.print) {
        Module.print = myConsole.log;
      }
      if (Module.printErr) {
        Module.printErr = myConsole.log;
      }
    }, 1);

So it seems onRuntimeInitialized is never called on some worker threads... Why, I don't know yet.
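
One way to confirm that would be to instrument the shim above; a sketch, assuming the prettified worker code can be patched, reusing the Module / loaded / myConsole names from the snippet:

    // Debugging sketch: log whether onRuntimeInitialized ever fires, and how
    // long the 20ms retry loop has been spinning, so stuck workers identify
    // themselves in the log.
    const startedAt = Date.now();
    Module.onRuntimeInitialized = function () {
      loaded = true;
      myConsole.log("runtime initialized after " + (Date.now() - startedAt) + "ms");
    };
    setInterval(function () {
      if (!loaded) {
        myConsole.log("still waiting for the wasm runtime after " +
                      (Date.now() - startedAt) + "ms");
      }
    }, 5000);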

Hi guys, how can I reproduce this issue? Do I need a 128-core machine?

It sounds like we may be getting a little too greedy with the number of engines we are spawning. There may be cases where we spawn more workers than needed on these high-core machines and they sit idle. That sounds like an opportunity on our side.
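
One possible shape for that would be to cap the pool instead of scaling linearly with the core count; a sketch, where the cap of 16 is an arbitrary illustrative value rather than a recommendation:

    // Sketch: clamp the engine count rather than using the raw core count.
    const MAX_ENGINES = 16; // arbitrary illustrative cap
    const engineCount = Math.min(navigator.hardwareConcurrency || 1, MAX_ENGINES);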

Presumably you can hard-code 128 in the places where you currently use navigator.hardwareConcurrency and it should reproduce.
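
Alternatively, something like the following, run before the engine code reads the value, should make the page see 128 logical cores (a sketch, untested here):

    // Sketch: spoof a 128-core machine. Must run before anything reads
    // navigator.hardwareConcurrency.
    Object.defineProperty(Navigator.prototype, "hardwareConcurrency", {
      get: () => 128,
      configurable: true,
    });
    console.log(navigator.hardwareConcurrency); // 128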

Flags: needinfo?(emilio)

fixed for 96 by backout

[Tracking Requested - why for this release]:
If this was bad enough to get backed out of beta, we should probably track it for 97 as well.

Severity: -- → S2

Hi Olli, can you please back it out from beta?

Flags: needinfo?(bugs)

This was effectively fixed for 97+ by bug 1750868. We can track reverting that change in this bug for whenever the root cause gets sorted out.

Has Regression Range: --- → yes

Lowering the severity, as this issue was fixed on beta/release by bug 1750868.

Severity: S2 → S3
Webcompat Priority: --- → ?

We fixed the issue in beta and release, so this isn't a WebCompat issue at the moment.

Webcompat Priority: ? → ---

emilio, can you still see this on Nightly?
We've had https://hg.mozilla.org/mozilla-central/rev/c41adb249305400e28f11167907015072dc13cda for quite some time now.

Flags: needinfo?(emilio)

I just checked and I still see this...

Flags: needinfo?(emilio)

Emilio, smaug, is there anything we need from chess.com to debug this? Or is it just a matter of finding time on our side?

Flags: needinfo?(smaug)
Flags: needinfo?(emilio)

Off-hand, I don't think we have any evidence that this is a Firefox bug... There's no hint of an error or anything of that sort in the devtools. It would be great if someone from chess.com could analyze this / debug what's going on in there and why clamping the exposed value fixes it.

Flags: needinfo?(emilio)
Flags: needinfo?(smaug)
Attachment #9385910 - Attachment is obsolete: true