Closed Bug 1553644 Opened 6 years ago Closed 4 years ago

Firefox in a mostly unresponsive state, creating new tabs, new window -> new tab works but loading content fails

Categories

(Core :: DOM: Content Processes, defect, P3)

68 Branch
defect

RESOLVED WORKSFORME
Tracking Status
firefox68 --- affected

People

(Reporter: ritu, Unassigned)

Attachments

(2 files)

User Agent Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0

Nightly build ID: 20190518102559
Merge day: May 20th
BITS enabled (I opted in), pref name app.update.BITS.enabled = true

STR:

  1. Open a new tab (5/22, morning PST)
  2. In the URL bar: enter github.com

ER: github.com loads on new tab

AR: Firefox client becomes mostly unresponsive but not completely.

With DHolbert's help, I figured out that some things still work: loading the browser console, opening a new window, and opening a new tab.

Based on Nika's investigation, the bug manifests as two problems:
i) The process that creates new tabs appears to be in a broken state, which prevents content from loading.
ii) I could not use the pre-existing tabs (from before getting into the busted state).

Browser console shows many errors like "TypeError: initialBrowser.frameLoader.remoteTab is null"

We were able to work around problem i) by changing dom.ipc.processCount from 8 to 6. After doing so, new window -> new tab -> content loading worked.

However, ii) was still a problem.

Hi Nika, thanks for your help and investigation. Please lemme know if I need to add any more console logs or info from about:support here.

Severity: normal → critical
Flags: needinfo?(nika)

Here's a quick summary of what I've figured out:

  1. Something occurred which caused new process spawns to fail without triggering the "Nightly needs to restart" dialog. I think this is likely related in some way or another to BITS updating (which is enabled on :ritu's profile), and the recent nightly version bump (:ritu's nightly was still on 68). I don't know enough about the mechanisms behind these updates to comment more unfortunately.
  2. Because the process spawn failed, nsFrameLoader::mBrowserParent is null, and nothing is displayed in the tab. The tab was in the middle of a process switch, so the loading progress indicator is also never cleared. (https://searchfox.org/mozilla-central/rev/952521e6164ddffa3f34bc8cfa5a81afc5b859c4/dom/base/nsFrameLoader.h#495)
  3. Code in AsyncTabSwitcher.jsm (https://searchfox.org/mozilla-central/rev/952521e6164ddffa3f34bc8cfa5a81afc5b859c4/browser/modules/AsyncTabSwitcher.jsm#151-152) assumes that if the remote attribute is set on the browser, and it is inserted, then it has a non-null remoteTab property on its nsFrameLoader. This is incorrect if the tab's content process failed to start, so the "remoteTab is null" exception mentioned above fires.
  4. The above exception causes the tab switch process to be aborted whenever it is attempted, causing the user to become locked-in to the current tab, and unable to switch tabs.
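
A minimal sketch of the kind of null guard that would avoid the exception described in points 3 and 4. This is not the actual AsyncTabSwitcher.jsm code: the helper names, the mock browser objects, and the `active` field are all hypothetical, chosen only to show a failure being reported instead of a TypeError being thrown.

```javascript
// Hypothetical helper, not the real AsyncTabSwitcher.jsm logic.
// Returns the remoteTab for a browser, or null when the frame loader or
// its remoteTab is missing (for example, the content process never started).
function getRemoteTabOrNull(browser) {
  const frameLoader = browser && browser.frameLoader;
  return frameLoader && frameLoader.remoteTab ? frameLoader.remoteTab : null;
}

// Try to mark a tab as active. Instead of dereferencing a null remoteTab,
// report whether the switch could proceed.
function trySetActive(browser, active) {
  const remoteTab = getRemoteTabOrNull(browser);
  if (remoteTab === null) {
    return false; // dead frame loader: the caller can fall back gracefully
  }
  remoteTab.active = active; // `active` is a stand-in field on the mock
  return true;
}

// Mock objects standing in for <xul:browser> elements.
const healthyBrowser = { frameLoader: { remoteTab: { active: false } } };
const deadBrowser = { frameLoader: { remoteTab: null } };

const healthyResult = trySetActive(healthyBrowser, true);
const deadResult = trySetActive(deadBrowser, true);
```

With a guard like this, a tab whose process failed to start would not abort the whole tab switch, so the user would not be locked into the current tab.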

A few notes which back up my process:

  1. New windows can be created, and they successfully show the new tab page. This is because that page is drawn in the Privileged content process, and doesn't require starting a new process to show.
  2. It is also possible to navigate in one of these new windows to a parent-process document, such as about:support or about:config, as those also do not require process spawning.
  3. Loading a web URI in one of these new windows triggers this lockup to occur again.
  4. In about:support, the Remote Processes category showed 1/1 Privileged, 1/1 Extension, 1/1 GPU, and 6/8 Web Content processes.
    • When dom.ipc.processCount was reduced to 6, it became possible to browse the web, as existing processes are being re-used rather than new ones being spawned.
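
A toy model of why lowering dom.ipc.processCount from 8 to 6 worked around the problem. The function and variable names are hypothetical, and the real process selection policy in Gecko is more involved. The sketch only captures the one relevant rule: below the limit a new process is spawned, and at the limit an existing one is reused.

```javascript
// Toy model of content process selection; all names are hypothetical.
function pickContentProcess(liveProcesses, maxProcesses, spawn) {
  if (liveProcesses.length < maxProcesses) {
    // Below the limit: attempt to spawn. In the broken state this fails.
    return spawn();
  }
  // At the limit: reuse an existing process (here, simply the first one).
  return liveProcesses[0];
}

// Six live Web Content processes, as reported in about:support (6/8).
const liveWebProcesses = ["web1", "web2", "web3", "web4", "web5", "web6"];
const failingSpawn = () => null; // models the failed spawn seen in this bug

// Limit 8: a spawn is attempted and fails, so the tab has no process.
const withLimit8 = pickContentProcess(liveWebProcesses, 8, failingSpawn);
// Limit 6: an existing process is reused, so content can load.
const withLimit6 = pickContentProcess(liveWebProcesses, 6, failingSpawn);
```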

So, in general, I think there are two issues here:

  1. For some reason the browser got into a state where it couldn't spawn content processes (possibly connected to updates?)
    • I'm no expert on this, so I'm going to have to leave this one to people who are, unfortunately.
  2. When in this state, the browser became unusable, and didn't prompt the user to restart using a "Nightly needs to restart" dialog.
    • This isn't great, and I think we can do better.
    • It is unlikely that all code will be written defensively enough to deal with dead nsFrameLoaders caused by process start failures inside the primary xul:browser, as that is a hard case to test. We should instead try to get the browser into a more stable, well-known state when this happens.
      • I'd like us to detect at the platform level that this failure to spawn a process has occurred, and notify the frontend about it. In this case we should probably switch the browser from being a remote browser to being a non-remote browser, and display an error page to the user. This is a situation which code is better at dealing with than a completely dead frameLoader.
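
A rough sketch of the proposed recovery path, assuming the platform can notify the frontend when a spawn fails. Nothing here is an existing Firefox API: the handler name, the `remote` and `currentURI` fields on the mock browser, and the error page URI are placeholders for illustration only.

```javascript
// Placeholder URI; no such page is claimed to exist in Firefox.
const SPAWN_ERROR_PAGE = "about:blank#process-spawn-failed";

// Hypothetical frontend handler, called when the platform reports that a
// content process could not be started for this browser.
function handleProcessSpawnFailure(browser) {
  // Switch to non-remote so frontend code no longer expects a remoteTab.
  browser.remote = false;
  // Show an error page instead of leaving a dead frame loader on screen.
  browser.currentURI = SPAWN_ERROR_PAGE;
  return browser;
}

// Mock browser stuck mid process switch, as described in this bug.
const stuckBrowser = { remote: true, currentURI: "https://github.com/" };
const recoveredBrowser = handleProcessSpawnFailure(stuckBrowser);
```

The point of the design is that a non-remote browser showing an error page is a state the frontend already handles, whereas a remote browser with no remoteTab is not.
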
Flags: needinfo?(nika) → needinfo?(mconley)

NI Kirk as BITS feature owner

Flags: needinfo?(ksteuber)

I ran into this problem again this morning. I had 1 window with 8 tabs, and creating the 9th tab triggered this.

I just ran into this problem as well, when opening a new Bugzilla tab.

In case it's handy, here are the first two related-looking errors from my browser console, with backtraces.

Hey bytesized, is BITS enabled on all platforms or just Windows? And on which channels?

I'd like us to detect at the platform level that this failure to spawn a process has occurred, and notify the frontend about it. In this case we should probably switch the browser from being a remote browser to being a non-remote browser, and display an error page to the user. This is a situation which code is better at dealing with than a completely dead frameLoader.

This sounds very sensible to me. I suspect it might be worth collecting Telemetry on this kind of failure too.

Flags: needinfo?(mconley)

(My busted session is using Nightly 2019-05-22, and I do see an "uparrow" on my hamburger menu to indicate that an update is ready. It's possible that the update was already installed in the background, via me starting another fresh-profile Nightly instance on the same machine to test something.)

Hey rstrong, has something changed recently with how updater is working that might cause this? Is there a way we can locally kick off the updating code or try to simulate an update applying, to see if we can reproduce this?

Flags: needinfo?(robert.strong.bugs)

The simplest is to download the previous build and update it.

Flags: needinfo?(robert.strong.bugs)

(In reply to Robert Strong (Robert they/them) [:rstrong] (use needinfo to contact me) from comment #10)

The simplest is to download the previous build and update it.

Unfortunately, that won't allow us to perform modifications to the pre-update version of the binary to test fixes easily. Is there a way to test the update path with a local build?

Flags: needinfo?(robert.strong.bugs)

Would the existing tests suffice?

The browser chrome tests with "stage" or "staging" in the name stage updates:
https://searchfox.org/mozilla-central/source/toolkit/mozapps/update/tests/browser

The tests that start with mar test the entire update process. Also, the tests that start with marAppApply launch Firefox to verify the update is applied:
https://searchfox.org/mozilla-central/source/toolkit/mozapps/update/tests/unit_base_updater

On Windows, since there are several security measures in place, you'll need to make the following changes to bypass a couple of the checks when running locally:

diff --git a/toolkit/components/maintenanceservice/moz.build b/toolkit/components/maintenanceservice/moz.build
--- a/toolkit/components/maintenanceservice/moz.build
+++ b/toolkit/components/maintenanceservice/moz.build
@@ -14,17 +14,17 @@ SOURCES += [
     'workmonitor.cpp',
 ]
 
 USE_LIBS += [
     'updatecommon',
 ]
 
 # For debugging purposes only
-#DEFINES['DISABLE_UPDATER_AUTHENTICODE_CHECK'] = True
+DEFINES['DISABLE_UPDATER_AUTHENTICODE_CHECK'] = True
 
 DEFINES['UNICODE'] = True
 DEFINES['_UNICODE'] = True
 DEFINES['NS_NO_XPCOM'] = True
 
 # Pick up nsWindowsRestart.cpp
 LOCAL_INCLUDES += [
     '/mfbt',
diff --git a/toolkit/mozapps/update/tests/moz.build b/toolkit/mozapps/update/tests/moz.build
--- a/toolkit/mozapps/update/tests/moz.build
+++ b/toolkit/mozapps/update/tests/moz.build
@@ -46,17 +46,17 @@ for var in ('MOZ_APP_VENDOR', 'MOZ_APP_B
 DEFINES['NS_NO_XPCOM'] = True
 
 DisableStlWrapping()
 
 if CONFIG['MOZ_MAINTENANCE_SERVICE']:
     DEFINES['MOZ_MAINTENANCE_SERVICE'] = CONFIG['MOZ_MAINTENANCE_SERVICE']
 
 # For debugging purposes only
-#DEFINES['DISABLE_UPDATER_AUTHENTICODE_CHECK'] = True
+DEFINES['DISABLE_UPDATER_AUTHENTICODE_CHECK'] = True
 
 if CONFIG['CC_TYPE'] == 'clang-cl':
     WIN32_EXE_LDFLAGS += ['-ENTRY:wmainCRTStartup']
 
 if CONFIG['OS_ARCH'] == 'WINNT':
     DEFINES['UNICODE'] = True
     DEFINES['_UNICODE'] = True
     USE_STATIC_LIBS = True

You will also need to add a test key to the registry for the maintenance service tests. I'll attach a reg file with the additions.

As for what app update does as it relates to this bug, I suspect that you could just have a local build running and replace the existing files. On Windows, if a file is in use, just rename it and add the new file.

If you want to simulate the entire update process instead there are numerous steps that will need to be taken including creating the mar file that releng creates, changing the app.update.url pref to point to a server with a custom xml file since balrog won't know about it, etc. etc. I can give you instructions for it but it would likely be overkill for what you are trying to check here.

Flags: needinfo?(robert.strong.bugs)

Another option might be to use the oak branch. It will require additions to your mozconfig but you could use the update advertisements or just MAR files that it creates.

As a side note, I'm hoping to be able to work on bug 1553982 in the next few weeks which should significantly lessen how often this happens.

I'm having a very tough time understanding how this could be related to BITS. BITS is involved with downloading an update, but should have very little to do with installing the update. Installation of an update downloaded with or without BITS should be pretty much identical.

I'm also confused because I thought that we had code to handle the situation where Firefox's binary is updated out from under it. It was added in Bug 1366808 and should show this page when it detects that situation. How sure are we that this problem is due to an update installation? Is there a reason why that page isn't being shown?

@mconley BITS is enabled for half of our users on Nightly and Beta. And no, it is not enabled for users that are not on Windows, as BITS is a Windows component.

Flags: needinfo?(ksteuber)

Thanks, bytesized. Sounds like BITS is not our culprit here - just us stabbing around in the dark.

(In reply to Kirk Steuber (he/him) [:bytesized] from comment #15)

I'm having a very tough time understanding how this could be related to BITS. BITS is involved with downloading an update, but should have very little to do with installing the update. Installation of an update downloaded with or without BITS should be pretty much identical.

Yeah, I'm pretty sure that BITS is not the culprit - we thought it might be because it was related to updates, but there are people encountering this without BITS, so it's not that.

I'm also confused because I thought that we had code to handle the situation where Firefox's binary is updated out from under it. It was added in Bug 1366808 and should show this page when it detects that situation. How sure are we that this problem is due to an update installation? Is there a reason why that page isn't being shown?

We're not sure why the page isn't being shown. My theory is that something is failing somehow too early in the subprocess startup lifecycle, so that we never get around to creating the BrowserParent, and thus never detect the version mismatch.
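
A toy model of the ordering in that theory, to make it concrete. All names are hypothetical and this does not reflect the real startup code. It only shows that a check which runs after the child process is up can never fire when the spawn itself fails.

```javascript
// Toy model of the startup ordering; all names are hypothetical.
function launchContentProcess(parentBuildID, spawn) {
  const child = spawn();
  if (child === null) {
    // Spawn failed before any BrowserParent existed, so the version
    // check below is never reached and no restart prompt is shown.
    return { status: "spawn-failed", mismatchDetected: false };
  }
  // Only reached once the child process is running.
  const mismatch = child.buildID !== parentBuildID;
  return {
    status: mismatch ? "restart-required" : "ok",
    mismatchDetected: mismatch,
  };
}

const sameBuild = launchContentProcess("20190518102559", () => ({ buildID: "20190518102559" }));
const newerBuild = launchContentProcess("20190518102559", () => ({ buildID: "20190522000000" }));
const earlyFailure = launchContentProcess("20190518102559", () => null);
```

In this model only the second case leads to a restart prompt. The third case, which matches the theory above, fails silently.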

The priority flag is not set for this bug.
:jimm, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jmathies)

Sounds like a hung content process causing issues with content display.

Flags: needinfo?(jmathies)
Priority: -- → P3

Hello Jim, is this issue still valid in the latest versions of Firefox? If not, can we close it?

Flags: needinfo?(jmathies)

No activity here, I think we can close this.

Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(jmathies)
Resolution: --- → WORKSFORME