Closed Bug 1618936 Opened 4 years ago Closed 4 years ago

An increase in the number of tabs not correctly connecting to content processes are being reported for recent Nightlies

Categories

(Firefox :: Tabbed Browser, defect, P1)

Tracking

VERIFIED FIXED
Firefox 76
Tracking Status
firefox-esr68 --- unaffected
firefox74 --- unaffected
firefox75 --- verified
firefox76 --- verified

People

(Reporter: mconley, Assigned: mconley)

References

(Regression)

Details

(Keywords: regression, regressionwindow-wanted)

Attachments

(8 files)

Both tbabos and shorlander have been seeing this behaviour over the past two days or so.

From what I can gather, tabs don't have nsIRemoteTab available off of their browser's frameloaders, so a bunch of stuff in the front-end breaks because it's not designed to handle the case where the frameloader didn't successfully connect to a content process.

Putting this in DOM :: Content Processes for now, but perhaps this could also belong in Firefox :: Tabbed Browser, in the event that we're not handling a failure case properly here.

Apparently, both tbabos and shorlander are experiencing this on Windows - version 10, presumably.

Mike, is there an existing bug to improve frontend error handling of process launch failures?

This bug might be a regression from Yoric's async process launching changes (in bug 1602712), but it's unclear whether the bug is caused by actual process launch failures or frontend code getting confused by launching processes asynchronously.

Flags: needinfo?(mconley)
See Also: → 1602712
Attached image Browser Console error

Did not encounter this again so far on Windows 10. Attaching the screenshot I did when it happened.

Attached image 2020-03-06_16h20_53.png

I also encountered this issue. At the time, I was connected via NordVPN and loaded amazon.com using the latest Nightly 75.0a1 on Windows 10 x64. I'm attaching a screenshot of the browser console.

I'm not aware of a pre-existing bug, no. Presuming content process launch is at fault here, I'm also not 100% certain what the appropriate response should be from the parent - should we retry? Show an error message? Something else?

Flags: needinfo?(mconley)

This bug can happen without Fission (because e10s content process launching can fail), but this bug becomes more likely with Fission's many iframe processes and async process launching.

The frontend code needs to handle tab process launch failure more robustly. Moving to the Firefox frontend component.

Component: DOM: Content Processes → Tabbed Browser
Product: Core → Firefox

(In reply to Chris Peterson [:cpeterson] from comment #8)

The frontend code needs to handle tab process launch failure more robustly.

Thanks, cpeterson. Do you know how we should be handling this case? Who do we talk to about that?

Flags: needinfo?(cpeterson)

(In reply to Mike Conley (:mconley) (:⚙️) from comment #9)

(In reply to Chris Peterson [:cpeterson] from comment #8)

The frontend code needs to handle tab process launch failure more robustly.

Thanks, cpeterson. Do you know how we should be handling this case? Who do we talk to about that?

Nika will know. She recommended a new test be written to verify the frontend's handling of process launch failures doesn't regress.

Flags: needinfo?(cpeterson) → needinfo?(nika)
Priority: -- → P1
Summary: An increase in the number of tabs not correctly connecting to content processes are being reported for recent Nightly's → An increase in the number of tabs not correctly connecting to content processes are being reported for recent Nightlies

The root of many of these issues which cause frontend code to lock up is that a browser element may both be considered "remote" and not have any remoteTab associated with it, as the remote tab has already crashed. There is then code in important code paths, such as the tab switching code, which doesn't null-check this value before accessing it, causing an exception.

There are 2 main ways that frontend code can handle this better:

  1. Catch all of the places where we access remoteTab without null-checking it, and add null-checks to them.
  2. Ensure that any browser elements which contain a crashed tab are promptly replaced with an error document, so that general frontend code doesn't have to deal with the potentially-broken state.

The second case seems to be the most likely solution, as it can be done locally, and doesn't require writing tests for every frontend functionality operating correctly on a crashed tab. It also lets us load browser crashed UI into the crashed tab, which is a nicer UX than a blank document.
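For illustration only, option 1 amounts to a guard like the one below. This is a minimal sketch, not the actual tab switching code: `BrowserLike`, `RemoteTabLike` and `setRenderLayers` are hypothetical names made up for this example, and the only thing assumed about the real objects is that `remoteTab` can be null on a browser that is still flagged as remote.

```typescript
// Hypothetical shapes for illustration; these are not the real Firefox interfaces.
interface RemoteTabLike {
  renderLayers: boolean;
}

interface BrowserLike {
  isRemoteBrowser: boolean;
  // Assumed to be null when the content process crashed or never launched.
  remoteTab: RemoteTabLike | null;
}

// Returns true if the flag was applied, false if there was no remote tab to touch.
function setRenderLayers(browser: BrowserLike, enabled: boolean): boolean {
  const remoteTab = browser.remoteTab;
  if (remoteTab === null) {
    // "Remote" but disconnected: bail out instead of throwing.
    return false;
  }
  remoteTab.renderLayers = enabled;
  return true;
}
```

Every caller would need an equivalent guard, which is the cost that makes option 2 look more attractive.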

Right now, frontend code tries to show a crashed UI when the "oop-browser-crashed" or "oop-browser-buildid-mismatch" events are fired. If the crashing browser is selected, the onSelectedBrowserCrash method adds the browser into a queue, and doesn't properly swap it to a tab crashed page until an "ipc:content-shutdown" event is fired with the ChildID from browser.frameLoader.childID. I think this is the wrong behaviour in the case where the process failed to start at all, rather than crashing, however. In that case, the childID will be 0, and no "ipc:content-shutdown" observer notification will be fired, leaving the browser permanently in the queue. We probably want to immediately switch to a tab crashed page if childID was 0.
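To make the stuck state concrete, here is a toy model of the queueing described above. It is a sketch under stated assumptions and not the real implementation: `CrashedBrowserQueue`, `QueuedCrash`, the string browser ids and `onContentShutdown` (standing in for the "ipc:content-shutdown" observer) are all hypothetical. Only the method name onSelectedBrowserCrash and the meaning of childID 0 come from the description above.

```typescript
// Hypothetical model of the queueing behaviour; not Firefox code.
interface QueuedCrash {
  browserId: string;
  childID: number;
}

class CrashedBrowserQueue {
  private pending: QueuedCrash[] = [];
  readonly showingCrashPage: string[] = [];

  // The selected browser crashed: park it until its process reports shutdown.
  onSelectedBrowserCrash(browserId: string, childID: number): void {
    this.pending.push({ browserId, childID });
  }

  // Stand-in for the "ipc:content-shutdown" notification for one process.
  onContentShutdown(childID: number): void {
    const remaining: QueuedCrash[] = [];
    for (const entry of this.pending) {
      if (entry.childID === childID) {
        this.showingCrashPage.push(entry.browserId);
      } else {
        remaining.push(entry);
      }
    }
    this.pending = remaining;
  }

  stillQueued(): string[] {
    return this.pending.map((entry) => entry.browserId);
  }
}

const queue = new CrashedBrowserQueue();
queue.onSelectedBrowserCrash("tab-crashed", 42); // process 42 existed, then crashed
queue.onSelectedBrowserCrash("tab-never-launched", 0); // launch failed, no process
queue.onContentShutdown(42); // only a process that existed can report shutdown
```

In this model "tab-crashed" ends up in `showingCrashPage`, while "tab-never-launched" stays in `stillQueued()` indefinitely, because no shutdown notification ever arrives for childID 0.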

I worry there may also be some issues if the crashing browser is not selected. In the case of a normal "oop-browser-crashed", it seems we immediately try to restore it in the background, but in an "oop-browser-buildid-mismatch" case, we appear to do nothing. I worry this could also lead to issues if the tab which required starting a new process was in the background, and then was switched to the foreground.

Flags: needinfo?(nika) → needinfo?(mconley)

I now get this behavior on the latest Nightly; I was just switching tabs. Hovering over the tab shows "New Tab(pid 19060)".

Reproduction steps:
Set up:
dom.payments.request.enabled to "true"
region: US or CA

  1. Open several tabs
  2. Go to https://rsolomakhin.github.io/pr/us/
  3. Click on Buy
    Payment widget is not displayed
  4. Start switching tabs and occasionally refresh the Payments test page

For this one, I get a different error in the browser console, but it's basically the same behavior. Attached the recording for repro steps.

Hm, no luck reproducing this on my Windows 10 machine with a recent Nightly.

Hey tbabos, if you're able to reproduce this semi-reliably, any chance you could help us find a regression range?

Flags: needinfo?(mconley) → needinfo?(tbabos)

I'm able to reproduce if I happen to cause an update to install by opening a separate profile using the same instance of Firefox, and then having a background tab attempt to migrate from one process to another.

Presuming something like this is the underlying cause, I'm taking Nika's advice here - my plan is to immediately switch to the error pages regardless of foreground state if:

  1. It's a build-id mismatch crash
  2. It's a crash where we never had a childID (so likely that we didn't correctly launch a content process)

These are speculative fixes, because we're theorizing that this is what's causing this behaviour out in the wild, but it's the best we can do without more clues.
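As a rough sketch of that plan (not the patch itself), the decision could be expressed as a single predicate. `CrashInfo` and `shouldShowErrorPageImmediately` are hypothetical names for this example; the two event names are the ones mentioned earlier in this bug, and treating childID 0 as "never launched" is the assumption stated above.

```typescript
type CrashEventType = "oop-browser-crashed" | "oop-browser-buildid-mismatch";

// Hypothetical summary of a crash event; not a real Firefox type.
interface CrashInfo {
  eventType: CrashEventType;
  // Assumed: 0 means no content process was ever attached to the browser.
  childID: number;
  isSelected: boolean;
}

function shouldShowErrorPageImmediately(crash: CrashInfo): boolean {
  const buildIdMismatch = crash.eventType === "oop-browser-buildid-mismatch";
  const neverLaunched = crash.childID === 0;
  // isSelected is deliberately ignored: background tabs get the same treatment.
  return buildIdMismatch || neverLaunched;
}
```

An ordinary crash with a real childID returns false here, so it would keep following the existing path.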

Assignee: nobody → mconley
Status: NEW → ASSIGNED

I am getting different regression ranges and quite different behaviors on different Windows machines. :(
The only common thing is that the regression ranges are both from mid-February 2020.
However, both behaviors show the console error mentioned in Comment 14; I can't reproduce the error mentioned in Comment 4.
I will check whether the fix solves my scenario once the patch has landed in Nightly, fingers crossed.

Flags: needinfo?(tbabos)

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression
Pushed by mconley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/59cef5b69286
Make the front-end more robust in how it handles content process launch failures. r=dao
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 76

The speculative patch has been in Nightly for a few days now. tbabos, are you still seeing this?

Flags: needinfo?(tbabos)

Do we have any idea of when this regressed?

Flags: needinfo?(mconley)
Flags: in-testsuite+

I don't. Perhaps tbabos has a timeline?

Flags: needinfo?(mconley)

For the timeline it is somewhere mid February, I first saw it on 02-28 and Mike submitted the issue for it.
As good news, I can't reproduce it anymore using the steps from Comment 14 on latest Nightly!

As bad news, saw it yesterday on Beta but couldn't reproduce it again: new profile, open private window, close private window, open new private window.

Leaving this open a bit until I get to work more around Nightly and be confident about it as it is very intermittent but annoying when it happens.

Flags: needinfo?(tbabos)

This problem is older than February. I posted on reddit about this issue 4 months ago.
https://www.reddit.com/r/firefox/comments/doetiz/firefox_is_not_loading_pages/

Unfortunately I don't remember which version I was running back then.

I posted about the same issue a month ago and finally opened a bug. That bug was closed as a duplicate of this one.

Hi Mathieu,
Thanks for reaching out to us again! Could you also check it on Nightly and see if it happens? We have hopes that the fix that landed in Nightly could solve this issue. You can download it from here: https://www.mozilla.org/en-US/firefox/channel/desktop/

Flags: needinfo?(mathieu.carpentier)

Fresh installation of nightly on Fedora 31. "First Run" tab didn't load.

Flags: needinfo?(mathieu.carpentier)

I installed the latest nightly 76.0a1 on Fedora 31. The same problem occurred in the first 10 minutes of usage. On a new start of Firefox a new tab "Firefox Nightly First Run Page" opened. I noticed the following:

  • this tab remains empty
  • there is no loading animation
  • the "Home" button does nothing
  • typing a URL does nothing
  • other tabs are working correctly

I could not get the content of the browser console after recording my screen.

A few minutes before that I had the same problem with the stable release from the Fedora repo (v74.0): one of my pinned tabs didn't load.

Regressed by: 1242912
Has Regression Range: --- → yes

Did that happen before you signed in to sync? The addons that were synced could cause this too, please disable them and check it out once more.

Flags: needinfo?(mathieu.carpentier)
Flags: needinfo?(mathieu.carpentier)

Yesterday it did happen after I signed in to sync.

Today I started nightly with a new profile. This time I did not sign in to sync and did not install any addon. It took less than 5 minutes before I got a tab that was stuck with the loading animation.

The patch landed in nightly and beta is affected.
:mconley, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(mconley)

Hi mathieu.carpentier,

I suspect that this patch then is not solving things for you, and that you're likely experiencing a slightly different issue.

Do you happen to have the ESET Nod32 antivirus product installed and enabled on your Linux box?

Flags: needinfo?(mconley) → needinfo?(mathieu.carpentier)

Comment on attachment 9133606 [details]
Bug 1618936 - Make the front-end more robust in how it handles content process launch failures. r?jaws!

Beta/Release Uplift Approval Request

  • User impact if declined: Background tabs that fail to get a content process associated with them due to launch failures (build ID mismatch, or other launch failures), might result in a broken tab.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The code being changed is nicely isolated, and also has automated tests. There is enough coverage here to make me confident that this can be uplifted safely.
  • String changes made/needed: None.
Attachment #9133606 - Flags: approval-mozilla-beta?

(In reply to Mike Conley (:mconley) (:⚙️) from comment #35)

Hi mathieu.carpentier,

I suspect that this patch then is not solving things for you, and that you're likely experiencing a slightly different issue.

Do you happen to have the ESET Nod32 antivirus product installed and enabled on your Linux box?

Yes! I do have ESET Nod32 v4 installed and running on my Linux machine.

Flags: needinfo?(mathieu.carpentier)

(In reply to mathieu.carpentier from comment #37)

Yes! I do have ESET Nod32 v4 installed and running on my Linux machine.

In that case, I suspect you're hitting bug 1604218. According to bug 1604218 comment 34, ESET is shipping an update that will fix this issue.

Comment on attachment 9133606 [details]
Bug 1618936 - Make the front-end more robust in how it handles content process launch failures. r?jaws!

approved for 75 rc1

Attachment #9133606 - Flags: approval-mozilla-beta? → approval-mozilla-release+

I spent a whole day opening and loading a lot of tabs with heavy content and no longer experienced this issue on Windows. I'm now comfortable saying it is fixed on the latest Nightly 76.0a1 (2020-03-30) (64-bit).

Verified-fixed on latest Beta 75.0 (64-bit) on Windows 10 x64 as well. Didn't encounter the issue during a day of surfing around on Beta.

Status: RESOLVED → VERIFIED
See Also: → 1630403