Open Bug 1956356 Opened 5 months ago Updated 22 days ago

Fenix - create a speculative connection to the app link host

Categories

(Core :: Networking, enhancement, P2)

Unspecified
Android
enhancement

Tracking

()

People

(Reporter: acreskey, Assigned: acreskey)

References

(Depends on 1 open bug, Blocks 4 open bugs)

Details

(Whiteboard: [necko-triaged])

Attachments

(3 files)

In profiling the Fenix applink startup test we see that we only start loading the applink URL after ContentParent::CreateBrowser() is run and we've loaded chrome://geckoview/content/geckoview.xhtml

But this can be over a second after other network requests have started loading (location services, telemetry, etc).
https://share.firefox.dev/420hq4z

One remedy is to initiate a speculative connection to the applink host before we initialize the browser.
In the above profile (~100ms of network latency), the savings from TCP + TLS would be over 500ms.

(The code change likely won't be in Necko, so we will have to move this bug)

Summary: Fenix - create a speculative connection to the app link url → Fenix - create a speculative connection to the app link host

This may be an S2, 500ms for applink is Huge

Severity: -- → S3
Priority: -- → P2
Whiteboard: [necko-triaged][necko-priority-new]

Experiment: create speculative connection from Fenix layer

Based on the idea (and code :) in Bug 1880491 (Warm up DNS on android Intent invocation) I prototyped this and will evaluate it asap.

See Also: → 1880491

From profiling the WIP patch, comment #2, so far the new speculative connection also appears to be executed very late and is not yet providing benefit.
https://share.firefox.dev/4hVZOwS

Whiteboard: [necko-triaged][necko-priority-new] → [necko-triaged][necko-priority-next]
Whiteboard: [necko-triaged][necko-priority-next] → [necko-triaged]
Assignee: nobody → acreskey

After changing the attached patch to also pass engine = engine where we create the TabIntentProcessor, the patch indeed works! Here's a profile showing the speculative connection: https://share.firefox.dev/3SurNJr

(In reply to Markus Stange [:mstange] from comment #5)

After changing the attached patch to also pass engine = engine where we create the TabIntentProcessor, the patch indeed works! Here's a profile showing the speculative connection: https://share.firefox.dev/3SurNJr

Thanks for profiling this!
I'll measure the impact on the shopify test locally.
From a perf compare I didn't see any improvements in CI, but I will look more closely now that I know it's working.

From a local profile of the shopify startup test, I can see the speculative connection being initiated about 1 second before the actual load:
https://share.firefox.dev/4k5uME3
However in the network request for the https://theme-crave-demo.myshopify.com/ load I'm not seeing the preconnect.
So network partitioning may lead to this connection not being used in this context. Will debug.

The symbols are missing, but this is a profile that captures the problem w/ Moz logs on the socket thread
https://share.firefox.dev/4kl1lhz

We can see that while the early speculative connection is attempted at 1.221 seconds, it is ultimately not used when we go to load the page at 2.242 seconds

Some potential issues:

2.243 LogMessages — (nsHostResolver)   No usable record in cache for host [theme-crave-demo.myshopify.com] type 0.

This looked promising:

2.265 LogMessages — (nsHttp) GetH2orH3ActiveConn() request for ent 7015629790 .S........[tlsflags0x00000000]theme-crave-demo.myshopify.com:443^partitionKey=%28https%2Ctheme-crave-demo.myshopify.com%29 found an active but inexperienced connection 6f4ff09700 in native connection entry

But ultimately it looks like we want to connect over HTTP/3 and so make a new connection

2.371 LogMessages — (nsHttp) ConnectionEntry::ConnectionEntry this=6f4c545940 key=.S........[tlsflags0x00000000]theme-crave-demo.myshopify.com:443 <ROUTE-via theme-crave-demo.myshopify.com:443> {NPN-TOKEN h3}^partitionKey=%28https%2Ctheme-crave-demo.myshopify.com%29
Depends on: 1965871
Depends on: 1965894
Blocks: 1966774

I took the in-progress patch from this bug as well as from bug 1965871 and bug 1903116 and built it all.

This looks like it's working as desired:

https://share.firefox.dev/45qB3Gc

See https://theme-crave-demo.myshopify.com/

1.371 seconds: initiate speculative connection to theme-crave-demo.myshopify.com

And after the lengthy content process creation when we finally go to load the URL, the connection is already active:

2.28 seconds:
LogMessages — (nsHttp) GetH2orH3ActiveConn() request for ent 7001f8e240 .S........[tlsflags0x00000000]theme-crave-demo.myshopify.com:443^partitionKey=%28https%2Ctheme-crave-demo.myshopify.com%29 found an active but inexperienced connection 7001f9bb00 in native connection entry

In this particular profile, DNS + TCP + TLS took 200ms, so that's a great savings.

Currently evaluating this locally on wifi as the shopify test is not stable in bitbar (although early results from lambdaTest look promising).

But I think would be ideal to measure the impact is the telemetry described in bug 1966167

Depends on: 1903116

It looks like the speculative connection is still may not being used if we ultimately connect over HTTP/3 (comment 9 was an HTTP/2 connection).

See profile
https://share.firefox.dev/3ZzrSzy

I could be misreading this but, I see:

Creating the speculative connection 1.162S

LogMessages — (nsHttp) Creating DnsAndConnectSocket [this=70356fca80 trans=7151284300 ent=theme-crave-demo.myshopify.com key=.S........[tlsflags0x00000000]theme-crave-demo.myshopify.com:443^partitionKey=%28https%2Ctheme-crave-demo.myshopify.com%29]

2.047 - finding it at document load time

LogMessages — (nsHttp) GetH2orH3ActiveConn() request for ent 700b08bb00 .S........[tlsflags0x00000000]theme-crave-demo.myshopify.com:443^partitionKey=%28https%2Ctheme-crave-demo.myshopify.com%29 found an active but inexperienced connection 6f6aefed00 in native connection entry

But then shortly after we are actually looking for a connection with {NPN-TOKEN h3}

LogMessages — (nsHttp) GetH2orH3ActiveConn() request for ent 6f6ad6a560 .S........[tlsflags0x00000000]theme-crave-demo.myshopify.com:443 <ROUTE-via theme-crave-demo.myshopify.com:443> {NPN-TOKEN h3}^partitionKey=%28https%2Ctheme-crave-demo.myshopify.com%29 did not find an active connection

And in the profile network track we don't see the speculative connection for the document load.

I'll followup in bug 1965871, but we may not be supporting the scenario when HTTP/3 is negotiated during the TLS handshake.

From the profile, I can confirm that the speculative connection is still not used.
The reason is that we skip looking up alt-svc cache because the data storage is not ready.

2025-05-27 17:58:18.309261718 UTC - [Parent Process 22952: GeckoMain]: D/nsHttp AltSvcCache::LookupMapping 700b0d0680 https:theme-crave-demo.myshopify.com:443:.:^partitionKey=%28https%2Ctheme-crave-demo.myshopify.com%29:3
2025-05-27 17:58:18.309355224 UTC - [Parent Process 22952: GeckoMain]: D/nsHttp AltSvcCache::LookupMapping 700b0d0680 skip when storage is not ready

So, the hash key for the speculative connection is:

2025-05-27 17:58:18.309533447 UTC - [Parent Process 22952: GeckoMain]: D/nsHttp MaybeSpeculativeConnectWithHTTPSRR for ci=.S........[tlsflags0x00000000]theme-crave-demo.myshopify.com:443^partitionKey=%28https%2Ctheme-crave-demo.myshopify.com%29

Then, the request used to load the document uses HTTP/3:

2025-05-27 17:58:19.186610107 UTC - [Parent Process 22952: GeckoMain]: D/nsHttp MaybeSpeculativeConnectWithHTTPSRR for ci=.S........[tlsflags0x00000000]theme-crave-demo.myshopify.com:443 <ROUTE-via theme-crave-demo.myshopify.com:443> {NPN-TOKEN h3}^partitionKey=%28https%2Ctheme-crave-demo.myshopify.com%29
2025-05-27 17:58:19.221854980 UTC - [Parent Process 22952: GeckoMain]: D/nsHttp nsHttpChannel 6f6ae0c800 Alt Service Mapping Found https://theme-crave-demo.myshopify.com:443 [https:theme-crave-demo.myshopify.com:443:.:^partitionKey=%28https%2Ctheme-crave-demo.myshopify.com%29:3]
2025-05-27 17:58:19.221916015 UTC - [Parent Process 22952: GeckoMain]: D/nsHttp nsHttpChannel 6f6ae0c800 Using connection info from altsvc mapping

I filed bug 1966017 and hope to fix this issue completely.

Thanks for debugging this Kershaw. Can I add bug 1966017 as a blocker to this bug, or is there another solution for HTTP/2 -> HTTP/3 speculative connections?

Flags: needinfo?(kershaw)

By disabling HTTP/3 via pref and increasing the number of sockets available we can poke at the impact of the applink speculative connection.

Running the cvne-shopify.sh test on a high latency network, 200ms rtt (to highlight the improvement), I am measuring an improvement of 400ms (median startup from ~2900ms to ~2500ms).
This seems reasonable -- two round trips saved.

Notes:
• Using 100 iterations of the test instead of 5

--- Statistics for spec_connect_h2_200ms_rtt.txt ---
Count: 100
Mean: 2600.26
Median: 2521.62
Min: 1977.47
Max: 4999.00
Std Dev: 333.70
25th Percentile: 2456.52
75th Percentile: 2651.04

--- Statistics for baseline_h2_200ms_rtt.txt ---
Count: 100
Mean: 2926.86
Median: 2932.95
Min: 1426.16
Max: 4907.54
Std Dev: 370.03
25th Percentile: 2803.23
75th Percentile: 3091.71

(In reply to Andrew Creskey [:acreskey] from comment #12)

Thanks for debugging this Kershaw. Can I add bug 1966017 as a blocker to this bug, or is there another solution for HTTP/2 -> HTTP/3 speculative connections?

Sure. :)

Depends on: 1966017
Flags: needinfo?(kershaw)

Thanks, and for reference here is a profile on the happy path, http/2, early speculative connection which is used by the url load that follows about one second later.
https://share.firefox.dev/4kn5TnX

Early speculative connection:

1.216s LogMessages — (nsHttp) nsHttpConnectionMgr::OnMsgSpeculativeConnect [ci=.S........[tlsflags0x00000000]theme-crave-demo.myshopify.com:443^partitionKey=%28https%2Ctheme-crave-demo.myshopify.com%29, mFetchHTTPSRR=1]

... content/geckoview.xhtml is loaded ...

And later when the applink load arrives,
nsIOService::NewChannelFromURIWithProxyFlagsInternal() -> nsHttpChannel::Init(), main thread of parent

2.209s LogMessages — (nsHttp) host=theme-crave-demo.myshopify.com port=-1
2.211 Load 8: https://theme-crave-demo.myshopify.com/

On the socket thread, we find the active connection

2.211s LogMessages — (nsHttp) GetH2orH3ActiveConn() request for ent 70239ac920 .S........[tlsflags0x00000000]theme-crave-demo.myshopify.com:443^partitionKey=%28https%2Ctheme-crave-demo.myshopify.com%29 found an active but inexperienced connection 6f58751e00 in native connection entry

And then we appear to spend 75ms with URL classifier (very slow profile because of profilermarkers:5,profilerstacks) and, as far as I can tell, waiting for the Cache check

Main thread

2.286s LogMessages — (nsHttp) nsHttpChannel::OnCacheEntryAvailable [this=70243d9800 entry=6f53b20860 new=0 status=0] for https://theme-crave-demo.myshopify.com/

And we ultimately make the request right after:

Main thread

2.287s LogMessages — (nsHttp) http request [
2.287s LogMessages — (nsHttp)   GET / HTTP/1.1
2.288s LogMessages — (nsHttp)   Host: theme-crave-demo.myshopify.com

I can measure improvements resulting from a speculative connection on applink on networks down to about 50ms of additional rtt.

But at that point the change blends into the noise even though I thought I would still see the improvement in the mean and medians.

But I think I know what's happening.

See this marker only profile (to reduce overhead), A54
https://share.firefox.dev/4mUC3bU

Early speculative connection

1.595s SpeculativeConnect - https://theme-crave-demo.myshopify.com

A bit over a second later the actual load is started (kicks off a second speculative connection)

2.823s LogMessages — (nsHttp) nsHttpConnectionMgr::SpeculativeConnect [ci=.S........[tlsflags0x00000000]theme-crave-demo.myshopify.com:443^partitionKey=%28https%2Ctheme-crave-demo.myshopify.com%29]
Load 9: https://theme-crave-demo.myshopify.com/

After about 43ms the channel classifier completes:

2.866s Runnable - AsyncUrlChannelClassifier::CheckChannel - return - priority: Normal (4) task: 57d80365928b78da0
2.866s (nsHttp) nsHttpChannel::ResolveProxy [this=7028465600]

And the cache lookup is initiated:

2.868s LogMessages — (nsHttp) nsHttpChannel::OpenCacheEntry [this=7028465600]
2.868s LogMessages — (cache2) CacheStorageService::AddStorageEntry [entryKey=:https://theme-crave-demo.myshopify.com/, contextKey=O^partitionKey=%28https%2Ctheme-crave-demo.myshopify.com%29,]
2.868s LogMessages — (cache2)   new storage entries table for context 'O^partitionKey=%28https%2Ctheme-crave-demo.myshopify.com%29,'

It completes about 18ms later

2.886s LogMessages — (nsHttp) nsHttpChannel::OnCacheEntryAvailable [this=7028465600 entry=7028097740 new=1 status=0] for https://theme-crave-demo.myshopify.com/

And then we make the actual request

2.887s

LogMessages — (nsHttp) http request [
LogMessages — (nsHttp)   GET / HTTP/1.1
LogMessages — (nsHttp)   Host: theme-crave-demo.myshopify.com

So we receive the url load message, initiate another speculative connection (was there previously) and then hit a 67ms bottleneck waiting for classification + cache check.
Without the Fenix layer speculative connection, we're already benefiting from the onLoad speculative connection during the channel classification and cache check.
(Except when the network is slower, which is why the impact is visible with induced rtt).

So this patch will be helpful in the field especially on slower networks, but the bottlenecks in URL classification and cache retrieval look to be good areas to optimize.

I had made a mistake in my previous profile as it included stack sampling.

This one shouldn't have any stack sampling: https://share.firefox.dev/3ZtIxVk

Early Speculative connection:

0.454s LogMessages — (nsHttp) uri=https://theme-crave-demo.myshopify.com/

Actual load once gecko session is ready

1.314s Load 9: https://theme-crave-demo.myshopify.com/

Adding a cache entry with no context key, which I don't understand

1.365s LogMessages — (cache2) CacheStorageService::AddStorageEntry [entryKey=:https://theme-crave-demo.myshopify.com/, contextKey=]

Classification complete, 69ms

1.383s Runnable - AsyncUrlChannelClassifier::CheckChannel - return - priority: Normal (4) task: 858c8d80330d16db0

Adding a cache entry with context key

1.391s LogMessages — (cache2) CacheStorageService::AddStorageEntry [entryKey=:https://theme-crave-demo.myshopify.com/, contextKey=O^partitionKey=%28https%2Ctheme-crave-demo.myshopify.com%29,]

Cache entry available, make the actual request:

1.414s LogMessages — (nsHttp) nsHttpChannel::OnCacheEntryAvailable [this=6fb230fe00 entry=6f54b102e0 new=1 status=0] for https://theme-crave-demo.myshopify.com/

1.416s LogMessages — (nsHttp)   GET / HTTP/1.1

In this case about 102ms to get the request out, even though the speculative connection was established.

Jeff pointed out yesterday that the Url classifier is initializing at this point, visible here on the new profile (43ms), https://share.firefox.dev/44dApeg

So I'm now evaluating this with Markus' patch to initialize earlier and Dimi's pref to defer CRC32 checks

(In reply to Andrew Creskey [:acreskey] from comment #17)

So I'm now evaluating this with Markus' patch to initialize earlier and Dimi's pref to defer CRC32 checks

https://share.firefox.dev/4kzGmI5
Can see the URL classifier initialize before the load
Classification took 16ms in this run
And the request was out the door in 33ms
Will see how this behaves in longer running test.

I ran this test overnight, 200 cold navigations to the shopify site, i.e. cvne-shopify.sh with device clearing before each iteration.
Fastest network available to me (home wifi), no latency added.

Combining the URL classifier patches with the Fenix speculative connection is looking very promising.
~114 ms improvement in the median and ~160ms in the mean.
And the noise looks to be reduced as well, with standard deviation cut to between 1/2 and 1/3. (Visible on the box plot).

I'll run these again to confirm reproducibility and to measure each URL Classifier patch separately.
Randell also a wip-patch to do URL Classification and CacheLookup in parallel, and I think that would shave another 16+ms off these numbers.

Variants:
baseline_h2 - latest Fenix
classifier_patches_h2 - Markus' patch to initialize classifier earlier and Dimi's pref to defer CRC32 checks, bug 1956920
spec_connect_h2 - Fenix speculative connect on receipt of applink intent
spec_connect_classifier_patches_h2 - Fenix speculative connect and both classifier changes

Results:

By median startup time, fastest to slowest
---------------------------------------------
 1. spec_connect_classifier_patches_h2.txt    2416.87ms
 2. spec_connect_h2.txt                       2443.15ms
 3. classifier_patches_h2.txt                 2462.45ms
 4. baseline_h2.txt                           2530.77ms

Statistics:
--------------------

File: baseline_h2.txt
  Count: 200
  Mean: 2582.21ms
  Median: 2530.77ms
  Min: 2195.46ms
  Max: 5220.56ms
  Std Dev: 328.18ms
  25th Percentile: 2419.68ms
  75th Percentile: 2638.19ms

File: classifier_patches_h2.txt
  Count: 200
  Mean: 2459.04ms
  Median: 2462.45ms
  Min: 2113.61ms
  Max: 2790.13ms
  Std Dev: 111.63ms
  25th Percentile: 2386.07ms
  75th Percentile: 2529.25ms

File: spec_connect_classifier_patches_h2.txt
  Count: 200
  Mean: 2420.46ms
  Median: 2416.87ms
  Min: 1260.52ms
  Max: 2745.16ms
  Std Dev: 138.00ms
  25th Percentile: 2358.37ms
  75th Percentile: 2495.55ms

File: spec_connect_h2.txt
  Count: 200
  Mean: 2440.25ms
  Median: 2443.15ms
  Min: 2002.24ms
  Max: 3295.13ms
  Std Dev: 143.74ms
  25th Percentile: 2362.85ms
  75th Percentile: 2510.15ms

Notes: HTTP/3 disabled while we work on speculative connections for HTTP/3. (hence the _h2)

Boxplot from the spec connect/url classifier test. (medians)

Marking as blocked on landing telemetry for this scenario, bug 1966167, so we can measure the impact for Fenix users.

Depends on: 1966167
Attachment #9475363 - Attachment description: WIP: Bug 1956356 Fenix - create a speculative connection to the app link host → Bug 1956356 Fenix - create a speculative connection to the applink host r=#android-reviewers,kaya
No longer depends on: 1966167
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: