Firefox in a mostly unresponsive state, creating new tabs, new window -> new tab works but loading content fails
Categories
(Core :: DOM: Content Processes, defect, P3)
Tracking
| | Tracking | Status |
|---|---|---|
| firefox68 | --- | affected |
People
(Reporter: ritu, Unassigned)
Attachments
(2 files)
User Agent Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0
Nightly build ID: 20190518102559
Merge day: May 20th
BITS enabled (I opted in), pref name app.update.BITS.enabled = true
STR:
- Open a new tab (5/22 ~am PST morning)
- In the URL bar: enter github.com
ER: github.com loads on new tab
AR: Firefox client becomes mostly unresponsive but not completely.
With DHolbert's help, I figured out that some things still work: loading the browser console, opening a new window, and opening a new tab.
Based on Nika's investigation, the bug manifests as two problems:
i) it seems the process that creates new tabs was in a messed up state and preventing content from loading.
ii) I could not use the pre-existing tabs (from before getting into the busted state)
Browser console shows many errors like "TypeError: initialBrowser.frameLoader.remoteTab is null"
We were able to work around problem i) by changing dom.ipc.processCount from 8 to 6. After doing so, new window -> new tab -> content loading worked.
However, ii) was still a problem.
| Reporter | ||
Comment 1•6 years ago
Hi Nika, thanks for your help and investigation. Please lemme know if I need to add any more console logs or info from about:support here.
Comment 2•6 years ago
Here's a quick summary of what I've figured out:
- Something occurred which caused new process spawns to fail without triggering the "Nightly needs to restart" dialog. I think this is likely related in some way or another to BITS updating (which is enabled on :ritu's profile), and the recent nightly version bump (:ritu's nightly was still on 68). Unfortunately, I don't know enough about the mechanisms behind these updates to comment further.
- Because the process spawn failed, nsFrameLoader::mBrowserParent is null, and nothing is displayed in the tab. The tab was in the middle of a process switch, so the loading progress indicator is also never cleared. (https://searchfox.org/mozilla-central/rev/952521e6164ddffa3f34bc8cfa5a81afc5b859c4/dom/base/nsFrameLoader.h#495)
- Code in AsyncTabSwitcher.jsm (https://searchfox.org/mozilla-central/rev/952521e6164ddffa3f34bc8cfa5a81afc5b859c4/browser/modules/AsyncTabSwitcher.jsm#151-152) assumes that if the remote attribute is set on the browser, and it is inserted, then it has a non-null remoteTab property on its nsFrameLoader. This is incorrect if the tab's process failed to start, so the "remoteTab is null" exception mentioned above fires.
- The above exception causes the tab switch process to be aborted whenever it is attempted, causing the user to become locked in to the current tab and unable to switch tabs.
A few notes which back up my process:
- New windows can be created, and they successfully show the new tab page. This is because that page is drawn in the Privileged content process, and doesn't require starting a new process to show.
- It is also possible to navigate in one of these new windows to a parent-process document, such as about:support or about:config, as those also do not require process spawning.
- Loading a web URI in one of these new windows triggers this lockup to occur again.
- In about:support, the Remote Processes category showed 1/1 Privileged, 1/1 Extension, 1/1 GPU, and 6/8 Web Content processes.
  - When dom.ipc.processCount was reduced to 6, it became possible to browse the web, as existing processes are being re-used rather than new ones being spawned.
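To illustrate why lowering dom.ipc.processCount worked around the problem, here is a minimal sketch of the reasoning. It is a toy model, not Firefox's actual process selection code: pickContentProcess, the tabs field, and the "fewest tabs" reuse policy are all assumptions made for illustration.

```javascript
// Hypothetical model of content process selection; not Firefox's real code.
// liveProcesses: array of { id, tabs } objects for running web content processes.
// processCount:  the configured limit (the role dom.ipc.processCount plays).
// spawn:         function that returns a new process object, or null on failure.
function pickContentProcess(liveProcesses, processCount, spawn) {
  if (liveProcesses.length < processCount) {
    // Below the limit: try to spawn. In the broken state this always fails.
    const spawned = spawn();
    if (spawned !== null) {
      liveProcesses.push(spawned);
    }
    return spawned;
  }
  if (liveProcesses.length === 0) {
    return null;
  }
  // At the limit: reuse the live process hosting the fewest tabs (assumed policy).
  return liveProcesses.reduce((least, p) => (p.tabs < least.tabs ? p : least));
}

// Six live processes, and every new spawn fails, as observed in about:support.
const failingSpawn = () => null;
const sixLive = [3, 1, 2, 4, 2, 5].map((tabs, id) => ({ id, tabs }));

const withLimit8 = pickContentProcess(sixLive, 8, failingSpawn); // tries to spawn
const withLimit6 = pickContentProcess(sixLive, 6, failingSpawn); // reuses a process
```

With a limit of 8 and only 6 live processes, the model always takes the spawn path and gets nothing back, which matches the dead tabs. With a limit of 6, it never attempts a spawn and hands back an existing process.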
So, in general, I think there are two issues here:
- For some reason the browser got into a state where it couldn't spawn content processes (possibly connected to updates?)
  - I'm no expert on this, so I'm going to have to leave this one to people who are, unfortunately.
- When in this state, the browser became unusable, and didn't prompt the user to restart using a "Nightly needs to restart" dialog.
  - This isn't great, and I think we can do better.
  - As it is unlikely that all code will be defensively written to deal with dead nsFrameLoaders due to process start failures inside of the primary xul:browser, as that is a hard case to test, we should instead try to get the browser into a more stable, well known, state when this happens.
  - I'd like us to detect that this failure to spawn a process has occurred in the platform level, and notify frontend about it. In this case we should probably switch the browser from being a remote browser to being a non-remote browser, and display an error page to the user. This is a situation which code is better at dealing with than a completely dead frameLoader.
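As a rough illustration of the proposal above, the following sketch shows a tab switch that tolerates a missing remoteTab instead of throwing. It uses plain mock objects and is not the AsyncTabSwitcher.jsm implementation: getRemoteTabOrNull, switchToTab, isRemote, and showSpawnFailureError are hypothetical names; only frameLoader and remoteTab come from the error quoted in this bug.

```javascript
// Hypothetical sketch using mock objects; not Firefox's actual frontend code.

// Returns the remoteTab for a browser, or null when the content process
// failed to start and the frameLoader never received one.
function getRemoteTabOrNull(browser) {
  const frameLoader = browser.frameLoader;
  return frameLoader && frameLoader.remoteTab ? frameLoader.remoteTab : null;
}

// Switches to the given tab. If the target is a remote browser without a
// remoteTab, the switch still completes, and the tab is flagged so the
// frontend could show an error page rather than leave the user stuck.
function switchToTab(state, tab) {
  const remoteTab = getRemoteTabOrNull(tab.browser);
  if (tab.browser.isRemote && remoteTab === null) {
    tab.showSpawnFailureError = true; // hypothetical flag
  }
  state.selectedTab = tab;
  return state;
}

// A tab whose process never started, and a healthy tab.
const deadTab = { browser: { isRemote: true, frameLoader: { remoteTab: null } } };
const liveTab = { browser: { isRemote: true, frameLoader: { remoteTab: {} } } };

const state = { selectedTab: liveTab };
switchToTab(state, deadTab);
```

The design choice here mirrors the comment: rather than making every caller defensive, the failure is detected once and turned into a well-defined state that the rest of the code can handle.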
| Reporter | ||
Comment 4•6 years ago
I ran into this problem again this morning, had 1 window with 8 tabs and creating the 9th tab triggered this.
Comment 5•6 years ago
I just ran into this problem as well, when opening a new Bugzilla tab.
In case it's handy, here are the first two related-looking errors from my browser console, with backtraces.
Comment 6•6 years ago
Hey bytesized, is BITS enabled on all platforms or just Windows? And on which channels?
Comment 7•6 years ago
> I'd like us to detect that this failure to spawn a process has occurred in the platform level, and notify frontend about it. In this case we should probably switch the browser from being a remote browser to being a non-remote browser, and display an error page to the user. This is a situation which code is better at dealing with than a completely dead frameLoader.
This sounds very sensible to me. I suspect it might be worth collecting Telemetry on this kind of failure too.
Comment 8•6 years ago
(My busted session is using Nightly 2019-05-22, and I do see an "uparrow" on my hamburger menu to indicate that an update is ready. It's possible that the update was already installed in the background, via me starting another fresh-profile Nightly instance on the same machine to test something.)
Comment 9•6 years ago
Hey rstrong, has something changed recently with how updater is working that might cause this? Is there a way we can locally kick off the updating code or try to simulate an update applying, to see if we can reproduce this?
Comment 10•6 years ago
The simplest is to download the previous build and update it.
Comment 11•6 years ago
(In reply to Robert Strong (Robert they/them) [:rstrong] (use needinfo to contact me) from comment #10)
> The simplest is to download the previous build and update it.
Unfortunately, that won't allow us to perform modifications to the pre-update version of the binary to test fixes easily. Is there a way to test the update path with a local build?
Comment 12•6 years ago
Would the existing tests suffice?
The browser chrome tests with stage or staging in the name stage updates:
https://searchfox.org/mozilla-central/source/toolkit/mozapps/update/tests/browser
The tests that start with mar test the entire update process. Also, the tests that start with marAppApply launch Firefox to verify the update is applied.
https://searchfox.org/mozilla-central/source/toolkit/mozapps/update/tests/unit_base_updater
On Windows, several security measures are in place, so you'll need to make the following changes to bypass a couple of the checks when running locally:
diff --git a/toolkit/components/maintenanceservice/moz.build b/toolkit/components/maintenanceservice/moz.build
--- a/toolkit/components/maintenanceservice/moz.build
+++ b/toolkit/components/maintenanceservice/moz.build
@@ -14,17 +14,17 @@ SOURCES += [
'workmonitor.cpp',
]
USE_LIBS += [
'updatecommon',
]
# For debugging purposes only
-#DEFINES['DISABLE_UPDATER_AUTHENTICODE_CHECK'] = True
+DEFINES['DISABLE_UPDATER_AUTHENTICODE_CHECK'] = True
DEFINES['UNICODE'] = True
DEFINES['_UNICODE'] = True
DEFINES['NS_NO_XPCOM'] = True
# Pick up nsWindowsRestart.cpp
LOCAL_INCLUDES += [
'/mfbt',
diff --git a/toolkit/mozapps/update/tests/moz.build b/toolkit/mozapps/update/tests/moz.build
--- a/toolkit/mozapps/update/tests/moz.build
+++ b/toolkit/mozapps/update/tests/moz.build
@@ -46,17 +46,17 @@ for var in ('MOZ_APP_VENDOR', 'MOZ_APP_B
DEFINES['NS_NO_XPCOM'] = True
DisableStlWrapping()
if CONFIG['MOZ_MAINTENANCE_SERVICE']:
DEFINES['MOZ_MAINTENANCE_SERVICE'] = CONFIG['MOZ_MAINTENANCE_SERVICE']
# For debugging purposes only
-#DEFINES['DISABLE_UPDATER_AUTHENTICODE_CHECK'] = True
+DEFINES['DISABLE_UPDATER_AUTHENTICODE_CHECK'] = True
if CONFIG['CC_TYPE'] == 'clang-cl':
WIN32_EXE_LDFLAGS += ['-ENTRY:wmainCRTStartup']
if CONFIG['OS_ARCH'] == 'WINNT':
DEFINES['UNICODE'] = True
DEFINES['_UNICODE'] = True
USE_STATIC_LIBS = True
You will also need to add a test key to the registry for the maintenance service tests. I'll attach a reg file with the additions.
As for what app update does as it relates to this bug I suspect that you could just have a local build running and replace the existing files. On Windows, if there is a file in use just rename it and add the new file.
If you want to simulate the entire update process instead there are numerous steps that will need to be taken including creating the mar file that releng creates, changing the app.update.url pref to point to a server with a custom xml file since balrog won't know about it, etc. etc. I can give you instructions for it but it would likely be overkill for what you are trying to check here.
Comment 13•6 years ago
Comment 14•6 years ago
Another option might be to use the oak branch. It will require additions to your mozconfig but you could use the update advertisements or just MAR files that it creates.
As a side note, I'm hoping to be able to work on bug 1553982 in the next few weeks which should significantly lessen how often this happens.
Comment 15•6 years ago
I'm having a very tough time understanding how this could be related to BITS. BITS is involved with downloading an update, but should have very little to do with installing the update. Installation of an update downloaded with or without BITS should be pretty much identical.
I'm also confused because I thought that we had code to handle the situation where Firefox's binary is updated out from under it. It was added in Bug 1366808 and should show this page when it detects that situation. How sure are we that this problem is due to an update installation? Is there a reason why that page isn't being shown?
@mconley BITS is enabled for half of our users on Nightly and Beta. And no, it is not enabled for users that are not on Windows, as BITS is a Windows component.
Comment 16•6 years ago
Thanks, bytesized. Sounds like BITS is not our culprit here - just us stabbing around in the dark.
Comment 17•6 years ago
(In reply to Kirk Steuber (he/him) [:bytesized] from comment #15)
> I'm having a very tough time understanding how this could be related to BITS. BITS is involved with downloading an update, but should have very little to do with installing the update. Installation of an update downloaded with or without BITS should be pretty much identical.
Yeah, I'm pretty sure that BITS is not the culprit - we thought it might be because it was related to updates, but there are people encountering this without BITS, so it's not that.
> I'm also confused because I thought that we had code to handle the situation where Firefox's binary is updated out from under it. It was added in Bug 1366808 and should show this page when it detects that situation. How sure are we that this problem is due to an update installation? Is there a reason why that page isn't being shown?
We're not sure why the page isn't being shown. My theory is that something is failing somehow too early in the subprocess startup lifecycle, so that we never get around to creating the BrowserParent, and thus never detect the version mismatch.
Comment 18•6 years ago
The priority flag is not set for this bug.
:jimm, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 19•6 years ago
Sounds like a hung content process causing issues with content display.
Comment 20•4 years ago
Hello Jim, is this issue still valid in the latest versions of Firefox? If not, can we close it?
Comment 21•4 years ago
No activity here, I think we can close this.