LanguageDetector.sys.mjs creates a worker without shutdown awareness and this can contribute to shutdown hangs
Categories
(Firefox :: Translations, defect)
People
(Reporter: asuth, Unassigned)
New logging in https://bugzilla.mozilla.org/show_bug.cgi?id=1805613#c73 indicates that we are sometimes seeing content process shutdown hangs involving cld-worker.js, which seems to be started by LanguageDetector.sys.mjs. While ongoing correctness work in workers should eliminate the shutdown hangs seen on that bug, I don't see any indications that the logic in these files is aware of browser shutdown, but it probably should be. If the shutdown concerns are addressed at a higher level, it would be appropriate to add comments near the worker creation/termination that explain how the worker's lifecycle is constrained and how shutdown handling is addressed, or link to documentation elsewhere.
A very brief perusal of apparent calls to detectLanguage suggests the immediate callers aren't directly aware of shutdown either, but searchfox is quite limited when it comes to analyzing JS control-flow, so I could be missing something. Even if there's notionally higher level shutdown logic, I think there's probably an argument to be made for having LanguageDetector.sys.mjs itself just be aware of shutdown and rejecting all outstanding promises at that time while terminating the worker and refusing to launch the worker again.
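The shape proposed in the last sentence might look roughly like the following plain-JS sketch. All names here are invented for illustration; a real module would hook `observeShutdown` up to an AsyncShutdown blocker or observer notification rather than having callers invoke it directly:

```javascript
// Hypothetical sketch: a detector wrapper that rejects outstanding promises
// at shutdown, terminates its worker, and refuses to launch it again.
class ShutdownAwareDetector {
  constructor(createWorker) {
    this._createWorker = createWorker; // factory for the (stand-in) worker
    this._worker = null;
    this._pending = new Set(); // reject callbacks for in-flight requests
    this._shutdown = false;
  }

  detectLanguage(text) {
    // Once shutdown has been observed, never launch the worker again.
    if (this._shutdown) {
      return Promise.reject(new Error("shutting down"));
    }
    if (!this._worker) {
      this._worker = this._createWorker();
    }
    return new Promise((resolve, reject) => {
      this._pending.add(reject);
      this._worker.request(text, result => {
        this._pending.delete(reject);
        resolve(result);
      });
    });
  }

  observeShutdown() {
    this._shutdown = true;
    // Reject everything still in flight so callers don't hang forever.
    for (const reject of this._pending) {
      reject(new Error("shutting down"));
    }
    this._pending.clear();
    if (this._worker) {
      this._worker.terminate();
      this._worker = null;
    }
  }
}
```

The key property is that `observeShutdown` is idempotent and leaves the object in a permanently "refusing" state, so a late caller racing shutdown gets a rejection instead of spawning a fresh worker.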
Comment 1•2 years ago
You might want to look at this. We should verify:
- That we are terminating the language detector when we don't need it.
- That it's aware of shutdown. I'm not sure how to do this exactly, and it requires investigation.
I suspect the Translations worker will also suffer from this.
Updated•2 years ago
Reporter
Comment 2•2 years ago
(In reply to Greg Tatum [:gregtatum] from comment #1)
- That it's aware of shutdown. I'm not sure how to do this exactly, and it requires investigation.
For JS code, https://searchfox.org/mozilla-central/source/toolkit/components/asyncshutdown/AsyncShutdown.sys.mjs is generally the thing to use and it has docs at https://firefox-source-docs.mozilla.org/toolkit/modules/toolkit_modules/AsyncShutdown.html (whereas for C++ https://searchfox.org/mozilla-central/source/toolkit/components/asyncshutdown/nsIAsyncShutdown.idl should be used).
Also useful is https://searchfox.org/mozilla-central/rev/8433b62e54fd30663e82f090c4d31554531a2e66/toolkit/components/startup/public/nsIAppStartup.idl#185-193 for situations where there isn't a need to be notified to cancel in-flight things or delay shutdown, and where a shutdown blocker isn't already in use that results in a field that can be easily checked. (Note that since this is XPCOM accessed from JS via XPConnect, it's only available on the main thread for JS.)
/**
* Check if we entered or passed a specific shutdown phase.
*
* @param aPhase
* The shutdown phase we want to check.
*
* @return true if we are in or beyond the given phase.
*/
bool isInOrBeyondShutdownPhase(in nsIAppStartup_IDLShutdownPhase aPhase);
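As a rough illustration of the guard this check enables (using a plain stand-in, since the real `isInOrBeyondShutdownPhase` lives on XPCOM's `nsIAppStartup` and isn't runnable outside Firefox; the phase names and helpers below are invented):

```javascript
// Hypothetical stand-in for nsIAppStartup.isInOrBeyondShutdownPhase.
// It only models the idea: an ordered list of shutdown phases plus a
// cursor recording how far along shutdown has progressed.
const SHUTDOWN_PHASES = [
  "app-shutdown-confirmed",
  "profile-before-change",
  "xpcom-will-shutdown",
];
let currentPhaseIndex = -1; // -1 means shutdown has not started.

function advanceShutdown(phase) {
  currentPhaseIndex = SHUTDOWN_PHASES.indexOf(phase);
}

function isInOrBeyondShutdownPhase(phase) {
  return currentPhaseIndex >= SHUTDOWN_PHASES.indexOf(phase);
}

// The guard: refuse to launch a new worker once shutdown has begun,
// rather than racing a freshly-created worker against teardown.
function maybeStartWorker(startWorker) {
  if (isInOrBeyondShutdownPhase("profile-before-change")) {
    return null;
  }
  return startWorker();
}
```

This field-check style suits the case described above, where nothing needs to be notified or cancelled; the lazy launch site just declines to start.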
Comment 3•2 years ago
Greg, some of this is confusing to me:
New logging in https://bugzilla.mozilla.org/show_bug.cgi?id=1805613#c73 indicates that we are sometimes seeing content process shutdown hangs involving cld-worker.js, which seems to be started by LanguageDetector.sys.mjs.
This is the old code from the translation directory, not our new translations directory. Any new code that I wrote shouldn't have affected this file at all.
However, I'll definitely look into this! As I believe that my LanguageId worker will suffer from the same issue. I'm marking this as assigned to myself.
Comment 4•2 years ago
Oh, I missed that this was the new code, since before it was LanguageDetector.jsm.
Comment 5•2 years ago
The severity field is not set for this bug.
:anatal, could you have a look please?
For more information, please visit auto_nag documentation.
Updated•2 years ago
Comment 6•2 years ago
Now that the translations code is a consumer, we should probably look into a fix for this.
Comment 7•2 years ago
Here is an example of how to teach our workers about shutdown:
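(The example itself isn't reproduced above.) One common shape for this, sketched here against a minimal stand-in for the Worker API, is for the parent to ask the worker to finish up, then terminate it whether or not it acknowledges within a deadline:

```javascript
// Hypothetical sketch: cooperative worker teardown with a hard deadline.
// `worker` is assumed to expose postMessage/onmessage/terminate like a
// DOM Worker; "shutdown" and "shutdown-ack" are invented message names.
function shutdownWorker(worker, timeoutMs = 1000) {
  return new Promise(resolve => {
    const timer = setTimeout(finish, timeoutMs);

    function finish() {
      clearTimeout(timer);
      worker.terminate(); // safe even if the worker already stopped
      resolve();
    }

    worker.onmessage = event => {
      if (event.data === "shutdown-ack") {
        finish(); // worker flushed its state in time
      }
    };
    // Give the worker a chance to flush caches, close connections, etc.
    worker.postMessage("shutdown");
  });
}
```

A real caller would run this from an AsyncShutdown blocker, so the returned promise keeps the shutdown phase open until the worker is actually gone.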
Comment 8•2 years ago
This recently caused massive test failures in automation because Firefox in most cases didn't shut down. While this is fixed for harnesses like Marionette and the Remote Protocol now, what about real users? Would they experience the same hang during shutdown? If so, should this maybe block the 117 release?
Comment 9•2 years ago
I wasn't able to reproduce locally, and I tried doing a blocking while (true) {} loop in the worker, so I'm unsure of the steps to reproduce.
Reporter
Comment 10•2 years ago
(In reply to Greg Tatum [:gregtatum] from comment #9)
I wasn't able to reproduce locally, and I tried doing a blocking while (true) {} loop in the worker, so I'm unsure of the steps to reproduce.
In general the problem situations are due to starting new workers in a way that races shutdown, not a busy worker. This can most noticeably happen in tests that don't run for very long, where a service whose initialization is deferred or lazily performed can easily end up overlapping with shutdown. The worker runtime will try to shut down all the workers it knows about at xpcom-shutdown, but bug 1805613, which I spun this off from, is an indication that this is not yet perfect[1]. A worker that starts up well before shutdown and has a while (true) {} loop would not cause a problem because it reaches steady-state before shutdown starts.
These situations are infamously hard to reproduce which is why usually the simplest thing is to make the code explicitly aware of shutdown. Automation tends to be pretty good at catching these things, though, like this random log from bug 1805613 has:
[task 2023-08-04T00:27:51.078Z] 00:27:51 INFO - GECKO(22667) | [Child 24266: Main Thread]: D/WorkerShutdownDump Found ActiveWorker (dedicated): resource://gre/modules/translation/cld-worker.js
[task 2023-08-04T00:27:51.080Z] 00:27:51 INFO - GECKO(22667) | [Child 24266: Main Thread]: D/WorkerShutdownDump Principal: [System Principal]
[task 2023-08-04T00:27:51.081Z] 00:27:51 INFO - GECKO(22667) | [Child 24266: Main Thread]: D/WorkerShutdownDump LoadingPrincipal: [System Principal]
[task 2023-08-04T00:27:51.082Z] 00:27:51 INFO - GECKO(22667) | [Child 24266: Main Thread]: D/WorkerShutdownDump BusyCount: 4
[task 2023-08-04T00:27:51.083Z] 00:27:51 INFO - GECKO(22667) | [Child 24266: Main Thread]: D/WorkerShutdownDump CrashInfo: IsChromeWorker(false)|ScriptLoader|XMLHttpRequestWorker
Which shows the hang is due to cld-worker.js apparently performing a synchronous XHR call from within a top-level script load (or deferred sync importScripts call). If that sync XHR is against a wacky channel type, it's possible there's nothing workers can do about the hang if the channel is in a broken state. Since the log is from linux, it's possible that this might be something that could be reproduced under pernosco; I did just try and trigger jobs based on the self-serve API exposed via treeherder, but I'm not sure the automation will actually work because pernosco wants to know a specific test to run and it's being told about the whole directory... even if pernosco runs, it will probably catch some other problem.
1: We are of course trying to make it perfect, but a complication is that worker lifecycles are inherently more complex than main thread lifecycles because the thread can go away. So it's not so much pure worker logic that's the problem as whatever web APIs might be running in the worker, and the fact that most code is not usually tested against running during browser shutdown, where errors can come from a lot of unusual places. That said, we landed a significant pure worker logic fix in bug 1800659 that has eliminated a major source of known problem cases.
Reporter
Comment 11•2 years ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #8)
This caused massive test failures recently for automation because Firefox in most cases didn't shutdown. While this is fixed for harnesses like Marionette and Remote Protocol now, what about real users?
Unless the worker here is explicitly being spun up by something in the shutdown process, users are unlikely to reproduce the problem (translation worker racing shutdown), and they definitely should not experience the implications of any hangs, as what I see reported is only for content processes. Also, see the next para.
Would they experience the same hang during shutdown? If yes, should this maybe block the 117 release?
As I Understand It (AIUI), the tests are turning orange because the test infrastructure is correctly detecting a child content process shutdown hang, but this is only something that's possible for debug/similar builds; in production/non-debug builds we don't care whether content processes shut down cleanly and don't bother even trying to shut them down fully.
Comment 12•2 years ago
I see. Thanks for the feedback. Regarding testing or reproducing the problem, we could always use a Marionette test that immediately shuts down or restarts Firefox a couple of times. The chances of seeing it happen would be fairly high, and such a test could then also be used to verify a fix.