[meta] Fenix fails to gracefully handle network transition during pageload
Categories
(Core :: Networking, defect, P2)
Tracking
()
People
(Reporter: acreskey, Assigned: acreskey)
References
(Depends on 4 open bugs, Blocks 2 open bugs)
Details
(Keywords: meta, Whiteboard: [necko-triaged])
Attachments
(5 files)
When we transition networks midway through page load we are sometimes left with a partially loaded page.
For example:
• Share network via wifi from a host desktop machine.
• Connect to this wifi from geckoview_example
• Start loading a large website (e.g. www.washingtonpost.com)
• Midway through the pageload (after first paint), disable the wifi sharing
• The Android device will transition to the next available wifi network
Result:
The page will be left in an incomplete state.
Only reloading it will fix it.
Expectation:
In-progress connections are automatically retried, allowing the pageload to complete?
Chrome does often seem to handle this gracefully, but certainly not always.
Opening this bug for discussion -- it's not clear if we could do this better.
To make the partially loaded page behaviour easier to reproduce, I added packet-level network throttling on the wifi host desktop (e.g. 2 Mbps and 50 ms latency).
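As a rough illustration of why throttling makes the failure easier to hit: a slower link stretches the pageload, which widens the window in which the network can be switched. The sketch below is a back-of-the-envelope estimate only. The 5 MB page weight is a hypothetical figure, the 50 ms is treated as a round-trip time, and `transfer_seconds` is an illustrative helper, not part of any tool used here.

```python
def transfer_seconds(payload_bytes: int, bandwidth_bps: float,
                     rtt_s: float, round_trips: int = 1) -> float:
    """Rough lower bound: serialization time plus a number of round trips."""
    return payload_bytes * 8 / bandwidth_bps + round_trips * rtt_s


# Hypothetical 5 MB page over the throttled link (2 Mbps, 50 ms).
page_bytes = 5_000_000
throttled = transfer_seconds(page_bytes, 2_000_000, 0.050)
print(f"throttled pageload lower bound: {throttled:.2f} s")
```

Under these assumptions the transfer alone takes about 20 seconds, which leaves plenty of time to disable the wifi sharing mid-load.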
I did ensure that network.http.http2.move_to_pending_list_after_network_change from bug 1706377 was enabled.
From remote DevTools, the content did look to be almost all HTTP/2 requests.
Comment 2•2 years ago (Assignee)
Profile:
https://share.firefox.dev/42AWuAU
Not sure how long it takes for requests to time out?
Comment 3•2 years ago (Assignee)
I haven't been able to pinpoint scenarios where Chrome consistently outperforms Firefox in this test.
Setup:
• Packet-level network throttling on desktop (using Network Link Conditioner on macOS), with higher latency and limited bandwidth
• Desktop shares network via wifi
• Connect Android device to this shared network
• Ensure that the Android device also has a secondary Wifi network that it will transition to if the shared desktop wifi is dropped
Scenario 1:
- Start a pageload on Fenix/Chrome
- Once the navigation has begun and the view is cleared, disable network sharing from the host desktop
- The Android device will automatically transition to the next auto-join Wifi network
Behaviour: both Fenix and Chrome will stall the pageload with an empty document.
A reload is required to complete the page load.
(Note that Chrome implements "Pull to refresh" on Android, so the drag down motion will trigger a refresh.)
Scenario 2:
- Start a pageload on Fenix/Chrome
- Once the navigation has begun and the first non-blank paint has been made, disable network sharing from the host desktop
- The Android device will automatically transition to the next auto-join Wifi network
Behaviour: both Fenix and Chrome will stall the pageload with a partially rendered document.
A reload is required to complete the page load.
(Again, note that Chrome implements "Pull to refresh" on Android, so the drag down motion will trigger a refresh.)
The only difference that I can consistently see is the UI gesture "Pull to refresh" which Chrome triggers quite readily.
Comment 4•1 year ago
Andrew asked me to add some notes about an issue I've been seeing for some time as well.
FTR I see this on both Android and Desktop, though not in the same context: on Android it happens more often in the subway, while on Desktop it happens when I'm on the high-speed train with bad connectivity.
Sometimes I connect to a website at a point where my connection is probably quite bad (like in some subway station, or on the high-speed train), so the connection doesn't really succeed. Maybe (just a wild guess) packets are lost and maybe they're not resent, or they are resent but they're lost again, and the TCP window increases, so they're not resent again or not often enough.
But at the next subway station, the connection is back to a very good state, yet the connection still doesn't seem to succeed. Even worse, because I'm aware of this, I try to reload the page, but this still doesn't work (I think Firefox knows a connection is ongoing to that server, so we don't try to start a new one?).
In the end I have to wait for the timeout (which gives me a blank page BTW, not even an error page), or I can kill all of Firefox and restart it, and then it generally works.
It's not clear this is the exact same issue as the one outlined here though, especially since in my case this is more about the initial connection to the website. It's also possible that the issue Andrew described is similar: when disabling the wifi routing while the page is loading, there may be new connections to different domains happening at that exact moment, leading to the same process.
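For illustration only: one standard TCP mechanism that fits the "not resent often enough" guess above is retransmission timer back-off, where the sender doubles its retransmission timeout after each unanswered retransmission (RFC 6298). The sketch assumes the RFC's 1 second initial timeout and a 60 second cap; real stacks use their own values, and nothing here has been confirmed as the cause of the stall. `retransmission_schedule` is a hypothetical helper name.

```python
def retransmission_schedule(initial_rto_s: float = 1.0,
                            max_rto_s: float = 60.0,
                            retries: int = 6):
    """Return (attempt, wait_s, elapsed_s) tuples under exponential back-off."""
    schedule = []
    rto, elapsed = initial_rto_s, 0.0
    for attempt in range(1, retries + 1):
        elapsed += rto                      # time spent waiting for this attempt
        schedule.append((attempt, rto, elapsed))
        rto = min(rto * 2, max_rto_s)       # double the timer, up to the cap
    return schedule


for attempt, wait_s, elapsed_s in retransmission_schedule():
    print(f"retransmit #{attempt}: waited {wait_s:.0f} s, total {elapsed_s:.0f} s")
```

With these assumed values the gap between attempts grows quickly, so a connection that went quiet in a tunnel may not probe the network again for tens of seconds after connectivity returns.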
Comment 5•1 year ago (Assignee)
Related to bug 1906323, in which Fenix fails to show any error when the access point has no WAN connectivity.
Comment 6•1 year ago (Assignee)
Renamed the bug as I think I have a very reproducible scenario.
Steps to reproduce
- Connect the Android device to a wifi network which you can easily disable or walk out of range of
- Ensure that there is a secondary network that your device will automatically transition to (i.e. has stored credentials and is configured to auto-connect)
- Initiate a page load by clicking on a link
- Disable the connected network (or walk out of range), so that the device will automatically transition to the new network
Expected behaviour
As the device connects to the new network, the pageload resumes
Actual behaviour
The pageload will generally stall for a long period of time and then silently fail.
This leaves the user with a blank page (even though the device has successfully transitioned networks).
In Chrome, this is handled gracefully:
• The user is briefly notified of the loss of network
• Once the device connects to the new network the pageload resumes and the page is successfully loaded
(See attached videos).
Note: this can be made easier to reproduce by introducing additional latency on the wifi access point (e.g. +300 ms RTT).
Here's an example profile of a page load that stalled after the initial network was disabled (nsHttp logs as markers)
https://share.firefox.dev/3xO8CDR
Comment 7•1 year ago (Assignee)
Pageload while transitioning networks in Chrome.
Note how the change of network is messaged to the user followed by the graceful resumption of the pageload.
Comment 8•1 year ago (Assignee)
Fenix transitioning networks during pageload.
Note that the page load never completes and the user is left with a blank document.
Video trimmed for size, but it takes over 30 seconds before the loading bar stops.
Comment 9•1 year ago (Assignee)
Screenshot of final view after network transition, Fenix nightly.
Comment 10•1 year ago (Assignee)
The BBC.com scenario looks to be caused by bug 1910991 since we don't yet have logic to resume HTTP/3 connections on change of networks.
Comment 11•1 year ago (Assignee)
For the most common connection type, HTTP/2, we do have logic in place to establish new connections (bug 1706377).
network.http.http2.move_to_pending_list_after_network_change is enabled. I'm still investigating whether it's working as expected in all cases.
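As the pref name suggests, the idea is that transactions in flight when the network changes are moved back to a pending list so they can be dispatched again on a fresh connection. The sketch below is a conceptual model of that idea only, not Gecko's implementation: `Transaction`, `ConnectionManager`, and the integer network ids are all hypothetical names invented for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Transaction:
    url: str
    network_id: Optional[int] = None  # network the transaction was dispatched on


@dataclass
class ConnectionManager:
    current_network: int = 0
    pending: deque = field(default_factory=deque)
    active: list = field(default_factory=list)

    def submit(self, txn: Transaction) -> None:
        self.pending.append(txn)

    def dispatch(self) -> None:
        # Start every pending transaction on the current network.
        while self.pending:
            txn = self.pending.popleft()
            txn.network_id = self.current_network
            self.active.append(txn)

    def on_network_change(self, new_network: int) -> None:
        self.current_network = new_network
        stale = [t for t in self.active if t.network_id != new_network]
        self.active = [t for t in self.active if t.network_id == new_network]
        # Requeue stale transactions at the front, preserving their order.
        self.pending.extendleft(reversed(stale))
        self.dispatch()


mgr = ConnectionManager()
mgr.submit(Transaction("https://example.com/a"))
mgr.submit(Transaction("https://example.com/b"))
mgr.dispatch()
mgr.on_network_change(1)  # simulated wifi transition
print([(t.url, t.network_id) for t in mgr.active])
```

In this model a request is never left attached to a connection on the old network; whether the real code path achieves that in all cases is exactly what is still being investigated.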
Comment 12•1 year ago
On the chance that this Gecko bug is what we occasionally see with debug Fenix builds (default Gecko settings) instrumented on Firebase Test Lab emulators (intermittently, not daily, but commonly enough): is there anything logged to logcat by default that we can check to at least confirm there was a network transition (e.g. anything from GeckoNetworkManager/Session)?
We too see partially loaded pages that stall completely (progress bar) in Fenix on URLs that should be accessible (e.g. storage.googleapis.com). Again, this is very intermittent.
Emulator video attached.
Comment 13•1 year ago
Comment 14•1 year ago (Assignee)
(In reply to Aaron Train [:aaronmt] from comment #12)
On the chance that this Gecko bug is what we occasionally see with debug Fenix builds (default Gecko settings) instrumented on Firebase Test Lab emulators (intermittently, not daily, but commonly enough): is there anything logged to logcat by default that we can check to at least confirm there was a network transition (e.g. anything from GeckoNetworkManager/Session)?
We too see partially loaded pages that stall completely (progress bar) in Fenix on URLs that should be accessible (e.g. storage.googleapis.com). Again, this is very intermittent.
Emulator video attached.
That scenario also looks similar to bug 1906323.
Do we have any way of capturing Firefox profiles from the Fenix instances running in Firebase Test Lab?
Comment 15•1 year ago
I'm not aware of any method for doing so. Firebase Test Lab is meant for UIAutomator/Espresso instrumentation of the clients and requires: a signed debug APK and a signed test (androidTest) APK. There's no root access on their devices either. I would need a reproducible scenario too.
Comment 16•1 year ago (Assignee)
(In reply to Aaron Train [:aaronmt] from comment #15)
I'm not aware of any method for doing so. Firebase Test Lab is meant for UIAutomator/Espresso instrumentation of the clients and requires: a signed debug APK and a signed test (androidTest) APK. There's no root access on their devices either. I would need a reproducible scenario too.
Understood. If we do have the ability to launch geckoview_example via adb, that provides a way to capture networking logs:
adb shell am start --es env0 MOZ_LOG=nsHttp:5,nsSocketTransport:5 org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity
adb logcat
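The MOZ_LOG module list in the command above can be varied. Here is a small sketch of a helper that builds the same `am start` invocation from a module-to-level mapping; `geckoview_log_command` is a hypothetical name, and only the flags already shown above are used.

```python
import shlex


def geckoview_log_command(
    modules: dict,
    package: str = "org.mozilla.geckoview_example",
    activity: str = "org.mozilla.geckoview_example.GeckoViewActivity",
) -> list:
    """Build the adb argument list that launches the activity with MOZ_LOG set."""
    moz_log = ",".join(f"{name}:{level}" for name, level in modules.items())
    return [
        "adb", "shell", "am", "start",
        "--es", "env0", f"MOZ_LOG={moz_log}",
        f"{package}/{activity}",
    ]


cmd = geckoview_log_command({"nsHttp": 5, "nsSocketTransport": 5})
print(shlex.join(cmd))  # paste into a terminal, or pass cmd to subprocess.run
```

Keeping the command as a list avoids quoting problems if it is later run from a script rather than typed by hand.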
Comment 17•1 year ago (Assignee)
I think it's best to make this a [meta] bug because the addressable issues are logged in specific bugs, e.g. bug 1910991, bug 1914416, bug 1909562.
Comment 18•1 year ago (Assignee)
Bug turned into a meta to cover cases that affect Fenix; moving it out of the priority queue, to be replaced with actionable bugs like bug 1910991.
Comment 19•1 year ago (Assignee)
(In reply to Julien Wajsberg [:julienw] from comment #4)
Andrew asked me to add some notes about an issue I've been seeing for some time as well.
FTR I see this on both Android and Desktop, though not in the same context: on Android it happens more often in the subway, while on Desktop it happens when I'm on the high-speed train with bad connectivity.
Sometimes I connect to a website at a point where my connection is probably quite bad (like in some subway station, or on the high-speed train), so the connection doesn't really succeed. Maybe (just a wild guess) packets are lost and maybe they're not resent, or they are resent but they're lost again, and the TCP window increases, so they're not resent again or not often enough.
But at the next subway station, the connection is back to a very good state, yet the connection still doesn't seem to succeed. Even worse, because I'm aware of this, I try to reload the page, but this still doesn't work (I think Firefox knows a connection is ongoing to that server, so we don't try to start a new one?).
In the end I have to wait for the timeout (which gives me a blank page BTW, not even an error page), or I can kill all of Firefox and restart it, and then it generally works.
It's not clear this is the exact same issue as the one outlined here though, especially since in my case this is more about the initial connection to the website. It's also possible that the issue Andrew described is similar: when disabling the wifi routing while the page is loading, there may be new connections to different domains happening at that exact moment, leading to the same process.
At least for sites loaded over HTTP/3, this looks like it might be one of the biggest issues: https://bugzilla.mozilla.org/show_bug.cgi?id=1910991#c6