(In reply to Greg Tatum [:gregtatum] from comment #9)
> I wasn't able to reproduce locally, and I tried doing a blocking `while (true) {}` loop in the worker, so I'm unsure on a steps to reproduce.

In general the problem situations are due to starting new workers in a way that races shutdown, not a busy worker. This most noticeably happens in tests that don't run for very long, where a service whose initialization is deferred or lazily performed can easily end up overlapping with shutdown. The worker runtime will try to shut down all the workers it knows about at xpcom-shutdown, but bug 1805613, which I spun this off from, is an indication that this is not yet perfect[1]. A worker that starts up well before shutdown and has a `while (true) {}` loop would not cause a problem because it reaches steady state before shutdown starts.

These situations are infamously hard to reproduce, which is why the simplest thing is usually to make the code explicitly aware of shutdown. Automation tends to be pretty good at catching these things, though; for example, [this random log](https://treeherder.mozilla.org/logviewer?job_id=424879926&repo=autoland&lineNumber=37690) from bug 1805613 has:

```
[task 2023-08-04T00:27:51.078Z] 00:27:51     INFO - GECKO(22667) | [Child 24266: Main Thread]: D/WorkerShutdownDump Found ActiveWorker (dedicated): resource://gre/modules/translation/cld-worker.js
[task 2023-08-04T00:27:51.080Z] 00:27:51     INFO - GECKO(22667) | [Child 24266: Main Thread]: D/WorkerShutdownDump   Principal: [System Principal]
[task 2023-08-04T00:27:51.081Z] 00:27:51     INFO - GECKO(22667) | [Child 24266: Main Thread]: D/WorkerShutdownDump   LoadingPrincipal: [System Principal]
[task 2023-08-04T00:27:51.082Z] 00:27:51     INFO - GECKO(22667) | [Child 24266: Main Thread]: D/WorkerShutdownDump   BusyCount: 4
[task 2023-08-04T00:27:51.083Z] 00:27:51     INFO - GECKO(22667) | [Child 24266: Main Thread]: D/WorkerShutdownDump   CrashInfo: IsChromeWorker(false)|ScriptLoader|XMLHttpRequestWorker
```

This shows the hang is due to cld-worker.js apparently performing a synchronous XHR call from within a top-level script load (or a deferred sync `importScripts` call). If that sync XHR is against a wacky channel type, it's possible there's nothing workers can do about the hang if the channel is in a broken state.

Since the log is from Linux, it's possible this could be reproduced under Pernosco; I did just try to trigger jobs via the self-serve API exposed through Treeherder, but I'm not sure the automation will actually work because Pernosco wants to know a specific test to run and it's being told about the whole directory... even if Pernosco runs, it will probably catch some other problem.

1: We are of course trying to make it perfect, but a complication is that worker lifecycles are inherently more complex than main-thread lifecycles because the thread can go away. So it's not so much pure worker logic that's the problem as whatever web APIs might be running in the worker, plus the fact that most code is not usually tested against running during browser shutdown, where errors can come from a lot of unusual places. That said, we landed a significant pure worker logic fix in bug 1800659 that has eliminated a major source of known problem cases.
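The "make the code explicitly aware of shutdown" idea can be sketched roughly as follows. This is a minimal, hypothetical model in plain JavaScript: the `LazyTranslationService` class, the stubbed worker object, and `observeShutdown()` are all illustrative inventions, standing in for what real chrome code would do with a `Services.obs` observer for `"xpcom-shutdown"` and a `ChromeWorker`, so the race-avoidance logic can be shown in a self-contained, runnable form.

```javascript
// Minimal model of shutdown-aware lazy worker startup (hypothetical names).
// In real chrome code, observeShutdown() would be wired up via
// Services.obs.addObserver(..., "xpcom-shutdown"), and ensureWorker()
// would construct a real ChromeWorker instead of the stub object below.

class LazyTranslationService {
  constructor() {
    this.worker = null;
    this.shuttingDown = false;
  }

  // Called from the (stubbed) "xpcom-shutdown" observer.
  observeShutdown() {
    this.shuttingDown = true;
    if (this.worker) {
      this.worker.terminate();
      this.worker = null;
    }
  }

  // Lazy init: refuse to start a brand-new worker once shutdown has
  // begun, instead of racing the worker runtime's teardown.
  ensureWorker() {
    if (this.shuttingDown) {
      return null; // caller must handle "service unavailable"
    }
    if (!this.worker) {
      // Stand-in for `new ChromeWorker("cld-worker.js")`.
      this.worker = {
        terminated: false,
        terminate() {
          this.terminated = true;
        },
      };
    }
    return this.worker;
  }
}

const service = new LazyTranslationService();
console.log(service.ensureWorker() !== null); // true: worker starts normally
service.observeShutdown();
console.log(service.ensureWorker()); // null: no new worker mid-shutdown
```

The key point is only the ordering guarantee: once the shutdown notification has been observed, lazy initialization paths bail out rather than spinning up a worker that the runtime may already be past tearing down.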
Bug 1826222 Comment 10 Edit History