Open Bug 1879387 Opened 2 years ago Updated 2 months ago

[meta] Fenix fails to gracefully handle network transition during pageload

Categories

(Core :: Networking, defect, P2)

People

(Reporter: acreskey, Assigned: acreskey)

References

(Depends on 4 open bugs, Blocks 2 open bugs)

Details

(Keywords: meta, Whiteboard: [necko-triaged])

Attachments

(5 files)

Attached image geckoview_pageload.jpg

When we transition networks midway through page load we are sometimes left with a partially loaded page.

For example:

• Share network via wifi from a host desktop machine.
• Connect to this wifi from geckoview_example
• Start loading a large website (e.g. www.washingtonpost.com)
• Midway through the pageload (after first paint), disable the wifi sharing
• The Android device will transition to the next available wifi network

Result:
The page will be left in an incomplete state.
Only reloading it will fix it.

Expectation:
In-progress connections are automatically retried, allowing for a complete pageload?
Chrome does often seem to handle this gracefully, but certainly not always.

Opening this bug for discussion -- it's not clear if we could do this better.

To make the partially loaded page behaviour easier to reproduce, I add packet-level network throttling on the wifi host desktop (e.g. 2 Mbps bandwidth and 50 ms latency).

I did ensure that network.http.http2.move_to_pending_list_after_network_change from bug 1706377 was enabled.
From remote DevTools, the content did look to be almost all HTTP/2 requests.

Summary: Android: Changing networks midway through pageload leads to incomplete load → Android: changing networks midway through pageload leads to incomplete load
Whiteboard: [necko-triaged]
Duplicate of this bug: 1879388

Profile:
https://share.firefox.dev/42AWuAU

Not sure how long it takes for requests to time out?

Severity: -- → S4
Priority: -- → P2

I haven't been able to pinpoint scenarios where Chrome consistently outperforms Firefox in this test.

Setup:
• Packet-level network throttling on desktop (using Network Link Conditioner on macOS), with higher latency and limited bandwidth
• Desktop shares network via wifi
• Connect Android device to this shared network
• Ensure that the Android device also has a secondary Wifi network that it will transition to if the shared desktop wifi is dropped
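
For anyone without a macOS host, a rough equivalent of the throttling step might look like the sketch below. It assumes a Linux desktop with `tc` and the netem qdisc available, which is not what was used in this bug. The interface name passed in is hypothetical, and the defaults simply mirror the 2 Mbps / 50 ms example from the first comment.

```shell
#!/usr/bin/env bash
# Sketch only: Linux stand-in for the macOS Network Link Conditioner setup.
# TC can be overridden (e.g. TC=echo) to print the arguments instead of
# running tc. Running tc for real normally needs root privileges.
TC="${TC:-tc}"

# Add latency and a bandwidth cap to traffic leaving the given interface.
# $1 = interface sharing the network (hypothetical, e.g. wlan0)
# $2 = added delay in ms (default 50), $3 = rate in Mbit/s (default 2)
throttle_on() {
  local iface="$1" delay_ms="${2:-50}" rate_mbit="${3:-2}"
  "$TC" qdisc add dev "$iface" root netem delay "${delay_ms}ms" rate "${rate_mbit}mbit"
}

# Remove the root qdisc again, restoring normal behaviour on the interface.
throttle_off() {
  local iface="$1"
  "$TC" qdisc del dev "$iface" root
}
```

Note that a root netem qdisc only shapes traffic leaving that interface, so this is an approximation of the setup above rather than an exact match.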

Scenario 1:

  • Start a pageload on Fenix/Chrome
  • Once the navigation has begun and the view is cleared, disable network sharing from the host desktop
  • The Android device will automatically transition to the next auto-join Wifi network

Behaviour: both Fenix and Chrome will stall the pageload with an empty document.
A reload is required to complete the page load.
(Note that Chrome implements "Pull to refresh" on Android, so the drag down motion will trigger a refresh.)

Scenario 2:

  • Start a pageload on Fenix/Chrome
  • Once the navigation has begun and the first non-blank paint has been made, disable network sharing from the host desktop
  • The Android device will automatically transition to the next auto-join Wifi network

Behaviour: both Fenix and Chrome will stall the pageload with a partially rendered document.
A reload is required to complete the page load.
(Again, note that Chrome implements "Pull to refresh" on Android, so the drag down motion will trigger a refresh.)

The only difference that I can consistently see is the UI gesture "Pull to refresh" which Chrome triggers quite readily.

Blocks: perf-android

Andrew asked me to add some notes about an issue I've been seeing for some time as well.
FTR I see this on both Android and Desktop, though not in the same context: on Android it happens more often in the subway, while on Desktop it happens when I'm on a high-speed train with bad connectivity.

Sometimes I connect to a website at a point where my connection is probably quite bad (like in some subway station, or on the high-speed train), so the connection doesn't really succeed. Maybe (just a wild guess) packets are lost and maybe they're not resent, or they are resent but they're lost again, and the TCP window increases, so they're not resent again or not often enough.

By the next subway station, the connection is back to a very good state, but the connection to the site still doesn't seem to succeed. Even worse, because I'm aware of this, I try to reload the page, but this still doesn't work (I think Firefox knows a connection is ongoing to that server, so we don't try to start a new one?).
In the end I have to wait for the timeout (which gives me a blank page BTW, not even an error page), or I can kill Firefox entirely and restart it, and then it generally works.

It's not clear that this is exactly the same issue as the one outlined here, though, especially since in my case it is more about the initial connection to the website. It's also possible that the issue Andrew described is similar: when wifi routing is disabled while the page is loading, there may be new connections to different domains being opened at that exact moment, leading to the same process.

Related to bug 1906323, in which Fenix fails to show any error when the access point has no WAN connectivity.

See Also: → 1906323

Renamed the bug as I think I have a very reproducible scenario.

Steps to reproduce

  1. Connect the Android device to a wifi network that you can easily disable, or that you can walk out of range of
  2. Ensure that there is a secondary network that your device will automatically transition to (i.e. it has stored credentials and is configured to auto-connect)
  3. Initiate a page load by clicking on a link
  4. Disable the currently connected network (or walk out of range), so that the device will automatically transition to the new network
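
A scripted approximation of steps 3 and 4 could look like the sketch below. It relies on two assumptions that are not from this bug: that GeckoViewActivity accepts a VIEW intent carrying a URL, and that turning Wi-Fi off on the device (rather than disabling the access point) is an acceptable stand-in. The second assumption changes the scenario, because the device then falls back to whatever other network it has (for example mobile data) instead of a second wifi network. The delay value is a hypothetical default.

```shell
#!/usr/bin/env bash
# Sketch only: start a page load, then drop Wi-Fi part-way through.
# ADB can be overridden (e.g. ADB=echo) to print the adb arguments
# instead of talking to a device.
ADB="${ADB:-adb}"
PKG="org.mozilla.geckoview_example"

# $1 = URL to load, $2 = seconds to wait before dropping Wi-Fi (default 2)
load_then_drop_wifi() {
  local url="$1" delay_s="${2:-2}"
  # Assumption: the activity handles a VIEW intent with the URL as data.
  "$ADB" shell am start -a android.intent.action.VIEW -d "$url" \
    -n "${PKG}/${PKG}.GeckoViewActivity"
  sleep "$delay_s"
  # Turn the device's Wi-Fi radio off to force a network change.
  "$ADB" shell svc wifi disable
}

# Turn Wi-Fi back on after the test run.
restore_wifi() {
  "$ADB" shell svc wifi enable
}
```

Calling `load_then_drop_wifi https://www.bbc.com 3` and then `restore_wifi` would be one way to repeat the test with a consistent delay.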

Expected behaviour

As the device connects to the new network, the pageload resumes

Actual behaviour

The pageload will generally stall for a long period of time and then silently fail.
This leaves the user with a blank page (even though the device has successfully transitioned networks).

In Chrome, this is handled gracefully:
• The user is briefly notified of the loss of network
• Once the device connects to the new network the pageload resumes and the page is successfully loaded

(See attached videos).

Note: this can be made easier to reproduce by introducing additional latency on the wifi access point (e.g. +300 ms RTT)

Here's an example profile of a page load that stalled after the initial network was disabled (nsHttp logs as markers)
https://share.firefox.dev/3xO8CDR

Summary: Android: changing networks midway through pageload leads to incomplete load → Fenix fails to gracefully handle network transition during pageload
Attached video Chrome_bbc_load.mov

Pageload while transitioning networks in Chrome.
Note how the change of network is messaged to the user followed by the graceful resumption of the pageload.

Attached video fenix_bbc_trimmed.mov

Fenix transitioning networks during pageload.
Note that the page load never completes and the user is left with a blank document.
Video trimmed for size, but it takes over 30 seconds before the loading bar stops.

Screenshot of final view after network transition, Fenix nightly.

Severity: S4 → S3
Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-queue]
Assignee: nobody → acreskey
See Also: → 1909562
Depends on: 1910991

The BBC.com scenario looks to be caused by bug 1910991 since we don't yet have logic to resume HTTP/3 connections on change of networks.

For the most common connection types, HTTP/2, we do have logic in place to establish new connections, bug 1706377.
network.http.http2.move_to_pending_list_after_network_change is enabled. I'm still investigating whether it's working as expected in all cases.
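
To rule out the pref being flipped on a test device, one option might be to pin it explicitly through GeckoView's debug configuration file. The sketch below assumes the `<package>-geckoview-config.yaml` mechanism under `/data/local/tmp` and the `set-debug-app` step described in the GeckoView automation docs; please check those docs before relying on it. The local file name is an arbitrary choice, and the pref is only set to the value it is already reported to have in this bug.

```shell
#!/usr/bin/env bash
# Sketch only: write a GeckoView config file that pins the pref and enables
# HTTP logging, then push it to the device.
# ADB can be overridden (e.g. ADB=echo) to print the adb arguments instead.
ADB="${ADB:-adb}"
PKG="org.mozilla.geckoview_example"
CONFIG_NAME="${PKG}-geckoview-config.yaml"

# Write the YAML config to a local file (defaults to CONFIG_NAME).
write_config() {
  local out_file="${1:-$CONFIG_NAME}"
  cat > "$out_file" <<'EOF'
env:
  MOZ_LOG: "nsHttp:5,nsSocketTransport:5"
prefs:
  network.http.http2.move_to_pending_list_after_network_change: true
EOF
}

# Push the file and mark the package as the debug app so the file is read.
push_config() {
  local file="${1:-$CONFIG_NAME}"
  "$ADB" push "$file" "/data/local/tmp/${CONFIG_NAME}"
  "$ADB" shell am set-debug-app --persistent "$PKG"
}
```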

Blocks: 1913418

On the chance that this Gecko bug is what we occasionally see with debug Fenix builds (default Gecko settings) instrumented on Firebase Test Lab emulators: it is intermittent (not daily) but common enough. Is there anything logged to logcat by default that we can check to at least confirm that a network transition happened (e.g. anything from GeckoNetworkManager/Session)?

We too see complete stalls on partially loaded pages (stuck progress bar) in Fenix on URLs that should be accessible (e.g. storage.googleapis.com). Again, to confirm, this is very intermittent.

Emulator video attached.
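
If a logcat dump is available from such a run, a first pass could be to filter it for network-related tags, as in the sketch below. The default pattern is only a guess: `ConnectivityService` is an Android system tag and `GeckoNetworkManager` is the class named above, but whether either logs a network transition by default is exactly the open question here.

```shell
#!/usr/bin/env bash
# Sketch only: read logcat text on stdin and keep lines matching a pattern.
# $1 = extended regex of tags to keep (the default is an unconfirmed guess).
filter_network_events() {
  local pattern="${1:-ConnectivityService|GeckoNetworkManager}"
  grep -E "$pattern"
}
```

For example, `adb logcat -d -v time | filter_network_events` would print only the matching lines from the current log buffer.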

(In reply to Aaron Train [:aaronmt] from comment #12)

That scenario also looks similar to bug 1906323.

Do we have any way of capturing Firefox profiles from the Fenix instances running in the Firebase Test lab?

Flags: needinfo?(aaron.train)

I'm not aware of any method for doing so. Firebase Test Lab is meant for UIAutomator/Espresso instrumentation of the clients and requires: a signed debug APK and a signed test (androidTest) APK. There's no root access on their devices either. I would need a reproducible scenario too.

Flags: needinfo?(aaron.train)

(In reply to Aaron Train [:aaronmt] from comment #15)

Understood. If we do have the ability to launch geckoview_example via ADB, it provides a way to capture networking logs via adb.

adb shell am start --es env0 MOZ_LOG=nsHttp:5,nsSocketTransport:5 org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity

adb logcat
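
The two commands above could be wrapped as in the sketch below, so the log buffer is cleared before launch and dumped to a file afterwards. The function names and the default output file name are illustrative, not anything that exists in the tree.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the adb commands from this comment.
# ADB can be overridden (e.g. ADB=echo) to print the adb arguments
# instead of talking to a device.
ADB="${ADB:-adb}"
PKG="org.mozilla.geckoview_example"
MOZ_LOG_MODULES="nsHttp:5,nsSocketTransport:5"

# Clear the log buffer, then launch geckoview_example with MOZ_LOG passed
# through the env0 intent extra.
launch_with_http_logs() {
  "$ADB" logcat -c
  "$ADB" shell am start --es env0 "MOZ_LOG=${MOZ_LOG_MODULES}" \
    "${PKG}/${PKG}.GeckoViewActivity"
}

# Dump the current log buffer to a file and return (no live tailing).
# $1 = output file (hypothetical default: necko_logcat.txt)
dump_logs() {
  local out_file="${1:-necko_logcat.txt}"
  "$ADB" logcat -d > "$out_file"
}
```

Run `launch_with_http_logs`, reproduce the network transition on the device, then run `dump_logs` to save what was logged.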

I think it's best to make this a [meta] bug because the addressable issues are logged in specific bugs, e.g. bug 1910991, bug 1914416, bug 1909562.

Keywords: meta
Summary: Fenix fails to gracefully handle network transition during pageload → [meta] Fenix fails to gracefully handle network transition during pageload

Bug turned into a meta to cover cases that affect Fenix; moving out of priority queue to be replaced with actionable bugs like bug 1910991

Whiteboard: [necko-triaged][necko-priority-queue] → [necko-triaged]

(In reply to Julien Wajsberg [:julienw] from comment #4)

At least for sites loaded over HTTP/3, this looks like it might be one of the biggest issues: https://bugzilla.mozilla.org/show_bug.cgi?id=1910991#c6

See Also: → 1962316
No longer blocks: necko-perf