[meta] Determine what improvements we can make to our use and measurement of speculative connect
Categories
(Core :: Performance: General, enhancement, P2)
Tracking
()
Performance Impact | low |
People
(Reporter: acreskey, Assigned: acreskey)
References
(Depends on 3 open bugs)
Details
(Keywords: meta)
Attachments
(1 file: image/png, 168.64 KB)
We discovered in bug 1813618 that most* speculative connections in necko are being effectively aborted (desktop only).
This includes speculative connections in the traditional sense - from the url bar, the dom (e.g. preconnect, mouse down), and the predictor.
*Note that every httpChannel attempts a speculative connection as part of nsHttpChannel::Connect()
and these are not being aborted (although the pool is small and can easily fill up.)
We also discovered in Bug 1816545 that the number of sockets available for speculative connections in our performance test infrastructure is set to 0.
This impacts the realism of our tests by effectively disabling an important performance feature.
So this bug is to track improvements to both how we make speculative connections and how we measure them in our performance tests.
Comment 1•2 years ago
|
||
The severity field for this bug is set to S3. However, the Performance Impact field flags this bug as having a high impact on the performance.
:acreskey, could you consider increasing the severity of this performance-impacting bug? Alternatively, if you think the performance impact is lower than previously assessed, could you request a re-triage from the performance team by setting the Performance Impact flag to "?"?
For more information, please visit auto_nag documentation.
Assignee | ||
Comment 2•2 years ago
|
||
Even if we re-enable the speculative connection pool in performance tests (Bug 1816545), many attempts to connect (those that go through nsIOService) will still be aborted, because nsIOService stops speculative connections when connected via a proxy (which is how our performance tests run).
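To make the two independent gates explicit, here is a toy model in Python. It is only a sketch of the behaviour described in this bug, not Necko code: the function and the parameter names `speculative_limit`, `via_proxy`, and `through_io_service` are hypothetical stand-ins for the pool size from Bug 1816545 and the proxy check mentioned above.

```python
def speculative_connect_allowed(speculative_limit: int,
                                via_proxy: bool,
                                through_io_service: bool) -> bool:
    """Toy model of the two gates described in this bug (hypothetical names)."""
    # Gate 1: a pool size of 0 leaves no socket for any speculative connection.
    if speculative_limit <= 0:
        return False
    # Gate 2: attempts routed through the IO service are dropped behind a proxy.
    if through_io_service and via_proxy:
        return False
    return True


# Performance-test setup as described: pool disabled and traffic proxied.
print(speculative_connect_allowed(0, via_proxy=True, through_io_service=True))
# Pool re-enabled but still proxied: IO-service attempts are still dropped.
print(speculative_connect_allowed(6, via_proxy=True, through_io_service=True))
# Pool re-enabled and no proxy: the attempt can proceed.
print(speculative_connect_allowed(6, via_proxy=False, through_io_service=True))
```

Under these assumptions, both gates have to be opened before the performance tests exercise the same paths as a typical desktop user.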
Assignee | ||
Comment 3•2 years ago
|
||
The fix for Bug 1813618 has landed.
This query compares some of the metrics in the pageload_event before and after the fix.
While this is only in the nightly population, there look to be some significant gains. Some examples:
- Cross-origin links
  - ~6% improvement in response time
  - 38ms improvement in fcp at median, 59ms @ 95th percentile
- Normal page loads
  - 30ms improvement in response time @ 50th percentile
  - 115ms improvement in response time @ 75th percentile
However, we discovered yesterday that there was a secondary problem with speculative connections; see Bug 1543990.
A patch is already up for that one.
So, given all of the other performance work that is landing these days, I think that if we want to fully quantify the impact of the speculative connection fixes, we will likely need a regression experiment.
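For reference, the kind of before/after percentile comparison described in this comment can be sketched as follows. This is not the actual telemetry query: the samples are synthetic placeholders, and the helper names `percentile` and `improvement` are made up for illustration.

```python
import random
import statistics


def percentile(samples, p):
    """Return the p-th percentile (1..99) using inclusive interpolation."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[p - 1]


def improvement(before, after, p):
    """Absolute (ms) and relative improvement at percentile p; positive = faster."""
    b = percentile(before, p)
    a = percentile(after, p)
    return b - a, (b - a) / b


# Synthetic placeholder data, NOT telemetry.
rng = random.Random(0)
before = [rng.lognormvariate(6.2, 0.6) for _ in range(10_000)]
after = [x * 0.95 for x in before]  # pretend every load got 5% faster

for p in (50, 75, 95):
    delta_ms, ratio = improvement(before, after, p)
    print(f"p{p}: {delta_ms:.0f} ms ({ratio:.1%})")
```

A comparison like this cannot separate the speculative connection fixes from other changes landing in the same window, which is why a dedicated experiment is still needed.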
Assignee | ||
Comment 4•2 years ago
|
||
We may just be seeing the improvement from speculative DNS, but I have not fully verified that.
Assignee | ||
Comment 5•2 years ago
|
||
Some speculative connect scenarios, including rel=preconnect, were also broken a few years ago by site partitioning.
Good discussion here: https://phabricator.services.mozilla.com/D177047
The fix looks to be a bit more complex.
Assignee | ||
Comment 6•2 years ago
|
||
A quick summary of where we are.
There are three code changes that are expected to improve performance.
- In Dana's fix for Bug 1813618, she removed the logic preventing a large class of speculative connections from running.
This includes speculative connections made from the front end (e.g. Awesome bar, bookmark selection, tab hover), some of those from the dom (clicking on links), and those from the Necko predictor.
This code change has been in nightly for a little over a month.
Using the pageload event we can compare various pageload metrics before and after the fix.
And we are seeing overwhelming signs of improvement in pageload metrics.
Metric | Load type | ~Improvement at median | ~Improvement at 75th percentile |
---|---|---|---|
fcp (first contentful paint) | Link | 49ms (4.5%) | 72ms (4.5%) |
fcp (first contentful paint) | Normal | 45ms (4.0%) | 111ms (6%) |
response time | Link | 31ms (6.2%) | 46ms (5.2%) |
response time | Normal | 29ms (5%) | 88ms (8%) |
pageload time | Link | 54ms (3.4%) | 102ms (3.5%) |
pageload time | Normal | 25ms (1.3%) | 71ms (3%) |
We don't have telemetry for response times from Awesome bar navigations specifically, but we can see from profiling that the impact can be huge (over 100ms in the following attachment).
Ideally we would measure this off/on in Beta and Release, but due to the nature of the fix (specifically code removal), that would involve re-introducing a dead codepath.
This landed in Fx 114.
- The next set of improvements will come from fixing rel="preconnect", Bug 1543990.
This is widely used: it appears in the markup on almost half of sites, according to Chrome usage stats.
This fix is expected to land in Fx 115.
Although dependent on experiment design and discussions, we plan to measure the impact of this fix via the Nimbus experiment described in the next item.
- Currently we have a global maximum of 6 sockets available for speculative connections. In Bug 1816539 we will experiment to see what performance gains, if any, can be obtained by increasing this number.
This constant, 6, is widely believed to be too small (our global limit is 900). That we hit this limit can easily be verified by local testing and is also seen in telemetry.
Depending on the outcome of the experiment, we anticipate landing a patch that increases the number of sockets available.
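To illustrate why a cap of 6 fills up quickly, here is a minimal simulation of a capped speculative pool. It is a toy model rather than Necko's implementation: the `SpeculativePool` class, the burst size of 10, and the alternative limit of 20 are all assumptions chosen for illustration.

```python
class SpeculativePool:
    """Toy cap on concurrent speculative sockets (not Necko's implementation)."""

    def __init__(self, limit: int):
        self.limit = limit
        self.active = 0
        self.dropped = 0

    def try_connect(self) -> bool:
        # Once the cap is reached, further speculative attempts are dropped.
        if self.active >= self.limit:
            self.dropped += 1
            return False
        self.active += 1
        return True


def simulate(limit: int, burst: int) -> int:
    """Fire `burst` speculative attempts at once; return how many were dropped."""
    pool = SpeculativePool(limit)
    for _ in range(burst):
        pool.try_connect()
    return pool.dropped


print(simulate(limit=6, burst=10))   # 4 attempts dropped
print(simulate(limit=20, burst=10))  # 0 attempts dropped
```

In this model a single page with ten preconnect targets already overflows the current cap, which is the effect the experiment in Bug 1816539 is meant to quantify.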
Assignee | ||
Comment 7•2 years ago
|
||
Local profile showing an example of the impact of re-enabling speculative connections on Awesome bar search results.
Assignee | ||
Comment 8•2 years ago
|
||
While other changes are included in Beta 114, we do see a very strong signal from telemetry that we've improved the network-dependent time_to_response_start.
Assignee | ||
Comment 9•2 years ago
|
||
This is a project, not a bug; setting the severity to NA.
Assignee | ||
Comment 10•1 year ago
|
||
Reducing perf impact as we believe we've addressed the highest impact use cases.