Crash in [@ IPCError-content | Init Already registered!]
Categories
(Core :: DOM: Service Workers, defect, P3)
Tracking
| Release | Tracking | Status |
|---|---|---|
| firefox-esr115 | --- | unaffected |
| firefox-esr128 | --- | unaffected |
| firefox132 | --- | unaffected |
| firefox133 | + | fixed |
| firefox134 | + | fixed |
People
(Reporter: gsvelto, Assigned: asuth)
References
(Regression)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(2 files)
- 48 bytes, text/x-phabricator-request
- 48 bytes, text/x-phabricator-request (phab-bot: approval-mozilla-beta+)
Crash report: https://crash-stats.mozilla.org/report/index/039d6da7-202d-414a-ad3f-c08d90241026
Reason: DUMP_REQUESTED
Top 10 frames:
0 libc.so libc.so@0x99000
1 ? @0x00006af84f544244
2 ? @0x00002c77bd1c404c
3 libxul.so mozilla::OffTheBooksCondVar::Wait() xpcom/threads/CondVar.h:58
3 libxul.so mozilla::Monitor::Wait() xpcom/threads/Monitor.h:37
3 libxul.so mozilla::detail::BaseMonitorAutoLock<mozilla::Monitor>::Wait() xpcom/threads/Monitor.h:138
3 libxul.so nsAppShell::Queue::Pop(bool) widget/android/nsAppShell.h:196
3 libxul.so nsAppShell::ProcessNextNativeEvent(bool) widget/android/nsAppShell.cpp:662
3 libxul.so nsBaseAppShell::DoProcessNextNativeEvent(bool) widget/nsBaseAppShell.cpp:131
3 libxul.so nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool) widget/nsBaseAppShell.cpp:267
Unfortunately, because of the nature of the crash, there's not much to go on from the stack.
Comment 1•10 months ago
It looks like it first showed up in the 20241024094434 build. Here is the regression range for that build. This crash looks like it comes from here in ClientSourceParent.cpp, and bug 1544232 touched that file and is in the regression range, so maybe this is a regression from that?
Comment 2•10 months ago (Assignee)
Ah, yeah, this is a regression from that. Thank you for the precise identification, :mccr8!
In bug 1544232 I changed ServiceWorkers from using a freshly minted (random) client id to using an unchanging id allocated as part of ServiceWorkerPrivate initialization, to make it possible to map from a Client Id to the underlying ServiceWorker. The assertion is triggering because we can decide to re-spawn a ServiceWorker shortly after its termination is requested but we do not wait for the global to fully die, so the lifetimes of the ClientSources are not inherently disjoint, and I didn't adequately compensate for this.
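For reference, the check that trips here has roughly the following shape; this is a minimal illustrative sketch with made-up names (ClientRegistry is a stand-in, not the actual ClientManagerService code):

```cpp
// Illustrative sketch (not the actual Gecko code): a registry that refuses a
// second registration for the same client id, which is the shape of the check
// behind an IPC error like "Init Already registered!".
#include <cassert>
#include <string>
#include <unordered_map>

struct ClientSource {
  std::string mClientId;  // UUID string identifying this client
};

class ClientRegistry {
 public:
  // Returns false when the id is already present; the real code reports this
  // as an IPC error against the child actor.
  bool AddSource(ClientSource* aSource) {
    return mSourceTable.emplace(aSource->mClientId, aSource).second;
  }

  void RemoveSource(const std::string& aClientId) {
    mSourceTable.erase(aClientId);
  }

 private:
  std::unordered_map<std::string, ClientSource*> mSourceTable;
};

int main() {
  ClientRegistry registry;
  ClientSource dying{"sw-client-uuid"};
  ClientSource successor{"sw-client-uuid"};  // re-spawned SW reusing the id

  assert(registry.AddSource(&dying));
  // If the old global has not finished dying yet, the successor's
  // registration with the same id fails -- the situation this bug describes.
  assert(!registry.AddSource(&successor));
  return 0;
}
```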
It should be noted that ServiceWorkers exist in a weird conceptual space in the Clients API implementation. ServiceWorkers are not clients and in our system are canonically identified by ServiceWorkerDescriptors, and indeed that's how they are exposed to content.
Broadly, the options here are:
- Remove the ClientSourceParent from the ClientManagerService::mSourceTable either when the ServiceWorkerPrivate decides to terminate the SW or when it decides to spawn a new instance of the SW.
- Regenerate the Client Id for the SW UUID either when the SWP decides to terminate the SW or when it decides to spawn a new instance.
- Move to registering ServiceWorker ClientSources through use of IPC endpoints. This is the desired end state but depends on bug 1853706, the high-level bug for changes related to this, with bug 1853726 being the ServiceWorker-specific variation of the bug. Currently the stack on bug 1672493 needs to land before we can move forward with those because it provides foundational refactorings.
Because keeping the Client ID UUIDs random potentially has security benefits, the 2nd option of regenerating the UUID seems preferable to option 1 from that perspective. Additionally, it minimizes behavioral changes relative to the pre-stack behavior.
The other factor to consider between the two is the impact on cases where we use the client id to map back to a ServiceWorker for lifetime-extension purposes, specifically for register and update, which use this lookup (ServiceWorker.postMessage will be using a ServiceWorkerDescriptor for lookup). Happily, a terminated ServiceWorker in fact has no lifetime left to grant an extension for, so it's fine if such a lookup fails. However, there is the question of what should happen if a dying ServiceWorker global calls register() right before it is terminated, but the ServiceWorker's respawn is initiated in the parent process before the call to register() arrives. The 1st option would allow the register call to succeed with effectively a full lifetime grant; the 2nd option entirely forecloses that, regardless of when we regenerate the UUID, since only the new UUID would ever be associated with an effectively full lifetime grant. It's also worth noting that we explicitly require there to be some minimal amount of lifetime left before spawning a ServiceWorker, so it's already the case that a register() call near the end of the SW lifetime intentionally fails.
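To make the lifetime-extension angle concrete, here is a minimal sketch (illustrative names only, not the real ServiceWorkerManager lookup) of how a stale client id simply fails to map back to a worker, so no extension gets granted:

```cpp
// Illustrative sketch: map a client id back to a ServiceWorker scope for
// lifetime-extension purposes. A stale (regenerated-away) id finds no entry,
// so no lifetime grant is handed out -- the intended outcome under option 2.
#include <iostream>
#include <map>
#include <optional>
#include <string>

static std::map<std::string, std::string> sClientIdToScope;

std::optional<std::string> LookupWorkerForLifetimeExtension(
    const std::string& aClientId) {
  auto it = sClientIdToScope.find(aClientId);
  if (it == sClientIdToScope.end()) {
    return std::nullopt;  // terminated/respawned worker: nothing to extend
  }
  return it->second;
}

int main() {
  // After regeneration only the successor's id is registered.
  sClientIdToScope["new-uuid"] = "https://example.com/sw-scope/";

  // A register() call racing in from the dying global still carries "old-uuid".
  std::cout << (LookupWorkerForLifetimeExtension("old-uuid")
                    ? "grant lifetime extension"
                    : "no extension granted")
            << std::endl;
  return 0;
}
```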
So I'm going to go with option 2. Happily, as part of the regressing patch I added nsIServiceWorkerInfo::terminateWorker which makes it fairly easy to add a test case for this.
Comment 3•10 months ago (Assignee)
In terms of where we're seeing the crashes, I did look for non-Fenix crashes and didn't see any. It makes sense that this would be primarily concentrated in Fenix because we don't spawn ServiceWorkers in their own processes on Android, and Android devices can potentially be slower. Note that because we wait for the ServiceWorker global to reach termination before advancing the state machine to report successful termination to the parent process, this is not a question of the isolated processes somehow tearing the process down faster because of aggressive exit() calls or the like; in both cases the ServiceWorker global will be fully torn down, and therefore the ClientSourceParent destroyed, before we would consider tearing down the content process. It's just that on Android the ServiceWorker can and will experience meaningful main-thread contention, which affects how fast it can shut down.
Comment 4•10 months ago
The bug is marked as tracked for firefox133 (beta) and tracked for firefox134 (nightly). However, the bug still has low priority and low severity.
:jstutte, could you please increase the priority and severity for this tracked bug? Given that it is a regression and we know the cause, we could also simply back out the regressor. If you disagree with the tracking decision, please talk with the release managers.
For more information, please visit the BugBot documentation.
Comment 5•10 months ago (Assignee)
> (In reply to Andrew Sutherland [:asuth] (he/him) from comment #2)
> So I'm going to go with option 2. Happily, as part of the regressing patch I added nsIServiceWorkerInfo::terminateWorker which makes it fairly easy to add a test case for this.
I overestimated the simplicity of creating a reliable test for this. The problem is that, as captured in this Pernosco trace, we do proactively start the teardown of the ClientManagerChild (which is the manager of ClientSourceChild) in the IPCWorkerRef callback. It's basically impossible to get this to happen on desktop because we will always pick the same process for ServiceWorkers thanks to our isolation approach, which gives us a consistent ordering. But on Android we pick a random process, which makes it much more feasible to create a race.
I initially added a testing primitive to WorkerTestUtils that would let us add a ThreadSafeWorkerRef that wouldn't be released until we emitted an observer service notification, but that was based on the erroneous assumption that the ClientSource stays alive until the killing transition. We definitely aren't going to change when the ClientSource is torn down just for test purposes, but I can create an alternate testing mechanism that uses a monitor to entirely block the worker until we see the observer notification. (We can't just push a nested event target because control runnables will still be serviced in that case; the thread needs to be absolutely blocked.)
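A minimal sketch of that blocking primitive, using std::condition_variable as a stand-in for mozilla::Monitor (this is illustrative, not the actual WorkerTestUtils implementation):

```cpp
// Sketch: the worker thread parks itself completely until the main-thread
// observer notification arrives. While parked it cannot service any
// runnables, control or otherwise, which is the property the test needs.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

struct BlockUntilNotified {
  std::mutex mMutex;
  std::condition_variable mCondVar;
  bool mNotified = false;

  void Block() {  // called on the worker thread
    std::unique_lock<std::mutex> lock(mMutex);
    mCondVar.wait(lock, [this] { return mNotified; });
  }

  void Notify() {  // called when the observer notification fires
    {
      std::lock_guard<std::mutex> lock(mMutex);
      mNotified = true;
    }
    mCondVar.notify_one();
  }
};

int main() {
  BlockUntilNotified gate;
  std::thread worker([&] {
    gate.Block();  // absolutely blocked: no nested event loop, nothing runs
    std::cout << "worker resumed; it may now advance to Canceling\n";
  });

  gate.Notify();  // stand-in for the observer notification from the successor
  worker.join();
  return 0;
}
```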
Comment 6•10 months ago (Assignee)
Bug 1544232 changed it so ServiceWorker globals used a ClientInfo and
Client Id created by the ServiceWorkerPrivate rather than creating a
random client id. This allows the ServiceWorkerManager to reliably map
a ServiceWorker Client Id back to the underlying ServiceWorker.
The problem with this was that ClientManagerService is not okay with
there being multiple ClientSources using the same id and it results in
an IPC_FAIL. This was not a problem in desktop testing because under
fission the potential race window is incredibly small for a
ServiceWorker and its spawned successor to have a live ClientSource at
the same time because the ClientSource will be torn down by the
ClientManager WorkerRef on the transition to canceling and both SWs
will be spawned in the same process. But on Android where there is no
fission, SWs spawn randomly with no affinity and so a successor can be
spawned on a different, more responsive process.
The fix here is to regenerate the Client Id whenever we terminate the
SW so we are prepared for the next time we spawn the SW.
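A minimal sketch of that shape, with illustrative names and a hypothetical UUID helper (the real change lives in ServiceWorkerPrivate and uses Gecko's own id machinery):

```cpp
// Sketch of the fix's shape: terminating the worker mints a fresh client id,
// so the next spawn registers under an id that cannot collide with the entry
// still held by the dying global's ClientSource.
#include <iostream>
#include <random>
#include <sstream>
#include <string>

// Hypothetical stand-in for a UUID generator; Gecko has its own nsID helpers.
std::string GenerateRandomClientId() {
  static std::mt19937_64 rng{std::random_device{}()};
  std::ostringstream out;
  out << std::hex << rng() << "-" << rng();
  return out.str();
}

class ServiceWorkerHandle {
 public:
  ServiceWorkerHandle() : mClientId(GenerateRandomClientId()) {}

  const std::string& ClientId() const { return mClientId; }

  // Called when we decide to terminate the worker: the old id stays with the
  // dying global; any future spawn uses the freshly generated id.
  void TerminateWorker() {
    // ... ask the current global to shut down (asynchronously) ...
    mClientId = GenerateRandomClientId();
  }

 private:
  std::string mClientId;
};

int main() {
  ServiceWorkerHandle sw;
  const std::string oldId = sw.ClientId();
  sw.TerminateWorker();
  std::cout << "ids differ: " << (oldId != sw.ClientId()) << "\n";
  return 0;
}
```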
This patch adds an additional test case to
browser_sw_lifetime_extension.js that is able to reproduce the crash
case on desktop by artificially blocking the ServiceWorker thread with
a monitor so that the ServiceWorker can't transition to Canceling until
its successor has already been spawned. This reliably reproduces the
bug (when the fix is not in place). This required adding some new test
infrastructure to WorkerTestUtils.
The new WorkerTestUtils methods provide 2 ways to hang the worker in a
controlled fashion until an observer notification fires on the main
thread; both use a shared helper class:
- Use a monitor to completely block the thread until notified. This
  prevents control runnables from running and thereby prevents worker
  refs from being notified.
- Acquire a ThreadSafeWorkerRef and hold it until notified. This lets
  the worker advance to Canceling but prevents progressing to Killing.
I added the WorkerRef mechanism first but it wasn't sufficient, so I
added the monitor mechanism, slightly generalizing the shared helper in
the process.
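A sketch of what such a shared helper could look like, again with standard-library stand-ins for Monitor and ThreadSafeWorkerRef (illustrative only, not the actual WorkerTestUtils code):

```cpp
// Sketch of a dual-mode "hang until notified" helper: kHoldWorkerRef only
// keeps a strong reference (the worker can reach Canceling but not Killing),
// while kBlockThread parks the worker thread outright.
#include <condition_variable>
#include <memory>
#include <mutex>

enum class HangMode { kHoldWorkerRef, kBlockThread };

struct WorkerRefToken {};  // stand-in for a ThreadSafeWorkerRef

class HangWorkerUntilNotified {
 public:
  HangWorkerUntilNotified(HangMode aMode, std::shared_ptr<WorkerRefToken> aRef)
      : mMode(aMode), mHeldRef(std::move(aRef)) {}

  // Invoked on the worker thread by the test primitive.
  void Start() {
    if (mMode == HangMode::kHoldWorkerRef) {
      // Just keep holding mHeldRef; the thread itself keeps running.
      return;
    }
    // kBlockThread: park the thread outright so nothing at all runs on it,
    // not even control runnables.
    std::unique_lock<std::mutex> lock(mMutex);
    mCondVar.wait(lock, [this] { return mNotified; });
  }

  // Invoked when the observer notification fires.
  void Notify() {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      mNotified = true;
      mHeldRef.reset();  // releases the ref for the kHoldWorkerRef mode
    }
    mCondVar.notify_all();
  }

 private:
  const HangMode mMode;
  std::shared_ptr<WorkerRefToken> mHeldRef;
  std::mutex mMutex;
  std::condition_variable mCondVar;
  bool mNotified = false;
};

int main() {
  auto ref = std::make_shared<WorkerRefToken>();
  HangWorkerUntilNotified hang(HangMode::kHoldWorkerRef, ref);
  hang.Start();   // returns immediately; the held ref keeps the worker alive
  hang.Notify();  // drops the ref so the worker can proceed to Killing
  return 0;
}
```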
A mechanism to generate an observer notification on the main thread is
also added so that the successor ServiceWorker can notify the
predecessor SW without us needing to involve JSActors or other means
of running arbitrary JS in the process hosting the SWs. This does mean
that when we are in non-fission mode we do need to limit the browser to
a single process in order to ensure both workers are spawned in the
same process.
Comment 7•10 months ago
:asuth, next week is the final week of beta. A couple of questions:
- Are there any concerns with this patch in terms of landing and risk of uplift?
- So far there are no crashes reported in 133 beta. Wondering if this is Nightly-only or if there is some other reason?
Comment 9•10 months ago (Assignee)
> (In reply to Donal Meehan [:dmeehan] from comment #7)
> - Are there any concerns with this patch in terms of landing and risk of uplift?
No landing or uplift concerns. The functional changes in the patch are fairly minimal and effectively restore the pre-bug 1544232 behavior, so there won't be any surprising ripples. The rest of the patch is a test that reliably reproduced the race-based crash (without the fix).
> - So far there are no crashes reported in 133 beta. Wondering if this is nightly only or some other reason?
The race should exist on beta. The key ingredients (edit:) for a high-probability crash are:
- No fission. So, Android. (edit: but per :mccr8 below we are seeing crashes on desktop, which means the probability distribution is different than I thought.)
- Multiple content processes running at the time of the ServiceWorker being spawned. The spawning logic picks a random process when we don't have fission and the race is really only feasible if it picks a different content process than it picked last time.
- The content process the old ServiceWorker was running in needs to be experiencing main thread jank.
Comment 10•10 months ago
[@ IPCError-browser | Init Already registered!] is happening on desktop. Should I file a new bug for that if it is expected to be a different issue? Sorry for muddling things up at the last minute here.
Comment 11•10 months ago (Assignee)
It's the same issue and the fix fixes it on desktop too. (The test in fact is a "browser" mochitest that only runs on desktop.) I just hadn't seen any crashes for it at initial patch authoring time, so I had been framing things in terms of Android.
Comment 12•10 months ago
bugherder
Comment 13•10 months ago (Assignee)
[Same commit message as comment 6.]
Original Revision: https://phabricator.services.mozilla.com/D227446
Comment 14•10 months ago
beta Uplift Approval Request
- User impact if declined: occasional content process crashes
- Code covered by automated testing: yes
- Fix verified in Nightly: yes
- Needs manual QE test: no
- Steps to reproduce for manual QE testing: n/a
- Risk associated with taking this patch: Low
- Explanation of risk level: as noted in https://bugzilla.mozilla.org/show_bug.cgi?id=1927247#c9, I created an automated test that reliably verifies the fix, and the fix also restores the behavior from before the regressing bug
- String changes made/needed: n/a
- Is Android affected?: yes
Comment 15•10 months ago
uplift