Open Bug 1821691 Opened 3 years ago Updated 1 year ago

Cancel remaining timers when http connections are all resolved

Categories

(Core :: Networking: HTTP, enhancement, P3)

enhancement
Points:
8

Tracking

()

Performance Impact low

People

(Reporter: jesup, Unassigned)

References

(Depends on 1 open bug, Blocks 3 open bugs)

Details

(Keywords: perf:resource-use, Whiteboard: [necko-triaged])

https://share.firefox.dev/3yv1dWG
This shows a number of timers that should be possible to remove once the http connections are all resolved.
This can reduce power usage in idle cases (especially on mobile, but also probably laptops).

This would fall into the performance and tech-debt categories

Performance Impact: --- → ?

Here is a profile where the 'AddTimer' markers have stacks for the 'nsHttpConnectionMgr' timers: https://share.firefox.dev/3ZVF67x

Blocks: power-usage
Severity: -- → N/A
Priority: -- → P3
Whiteboard: [necko-triaged]

Likely we'll need to retain one timer to close the socket after <whenever>

I can't see anything here hinting about meaningful performance impact. Clearing the flag

Performance Impact: ? → ---

ok, perhaps impact: low. The calculator isn't really good with this kinds of issues.

Performance Impact: --- → low
See Also: → 1829841
Duplicate of this bug: 1829841
Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-next]
Depends on: 1889621
Points: --- → 8
Whiteboard: [necko-triaged][necko-priority-next] → [necko-triaged][necko-priority-queue]
Assignee: nobody → acreskey

The timer I see firing most often is nsHttpConnectionMgr::mTimer, the timer for pruning of dead connections

There is logic in ConditionallyStopPruneDeadConnectionsTimer() to cancel the timer when there are no more idle or active connections.

That being said, I'm generally seeing that the timer is continuing to fire, so I'll find out what's triggering it.

Here's a longer running profile with callstacks on AddTimer and also profilermarkers for nsHttp:5
https://share.firefox.dev/3ZjhKdd

We don't seem to reach a point where we resolve all the connections as there are quite a few that look to be persistent:

ads.mozilla.org
private.canadianshield.cira.ca
firefox-settings-attachments.cdn.mozilla.net
mozilla.cloudflare-dns.com
firefox.settings.services.mozilla.com
detectportal.firefox.com
content-signature-2.cdn.mozilla.net

So what I'm seeing is a series of connections with UINT64_MAX for time to live.

time to live: 4294967295 .SA.....F.[tlsflags0x00000000]ads.mozilla.org:443 
time to live: 4294967295 .S........[tlsflags0x00000000]push.services.mozilla.com:443 
time to live: 4294967295 .SA...C...[tlsflags0x00000000]aus5.mozilla.org:443 
time to live: 4294967295 .SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1] 
time to live: 4294967295 .SA...C...[tlsflags0x00000000]content-signature-2.cdn.mozilla.net:443 
time to live: 4294967295 .SA.......[tlsflags0x00000000]private.canadianshield.cira.ca:443[TRR:1]

And in general we update the nsHttpConnectionMgr::mTimer, prune dead connections timer, via Init()
https://searchfox.org/mozilla-central/rev/7fb746f0be47ce0135af7bca9fffdb5cd1c4b1d5/netwerk/protocol/http/nsHttpConnectionMgr.cpp#257
And Init() ultimately adds a new timer via this codepath.

mozilla::net::nsHttpConnectionMgr::PruneDeadConnectionsAfter(unsigned int) 
nsTimerImpl::Init(nsIObserver*, unsigned int, unsigned int)
nsTimerImpl::InitCommon(mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const&, unsigned int, mozilla::Variant<nsTimerImpl::UnknownCallback, nsCOMPtr<nsITimerCallback>, nsCOMPtr<nsIObserver>, nsTimerImpl::FuncCallback, nsTimerImpl::ClosureCallback>&&, mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&)
TimerThreadWrapper::AddTimer(nsTimerImpl*, mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&)

This particular timer will actually delete itself if we have no idle connections (but I don't know if that situation is likely to happen).
https://searchfox.org/mozilla-central/rev/7fb746f0be47ce0135af7bca9fffdb5cd1c4b1d5/netwerk/protocol/http/nsHttpConnectionMgr.cpp#275-277

I'm guessing those are long-pull connections, plus our DoH connection. Not sure about canadianshield.cira.ca

There may be no time when we have no connections, even without DoH, and even if we rewrite ads, etc to not need permanent connections.

Do we stop the timers on Android if the browser is backgrounded? If not, we probably should and then recheck on foregrounding.

We might also be able to notice if all connections have MAX TTL... not sure if that helps though.

Thanks, this gives me some ideas.
(And canadianshield.cira.ca is a Canadian TRR, by the way.)

Let me find out what's happening on Android.

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #9)

There may be no time when we have no connections, even without DoH, and even if we rewrite ads, etc to not need permanent connections.

The profile in comment 0 has plenty of times where we have no nsHttpConnectionMgr timer running, the profile from comment 7 has none, do you think something has changed, or is the test case different? It's likely that the profile from comment 0 came from me when I was profiling what Firefox does when it is left for a full day with a single about:blank tab.

(In reply to Florian Quèze [:florian] from comment #11)

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #9)

There may be no time when we have no connections, even without DoH, and even if we rewrite ads, etc to not need permanent connections.

The profile in comment 0 has plenty of times where we have no nsHttpConnectionMgr timer running, the profile from comment 7 has none, do you think something has changed, or is the test case different? It's likely that the profile from comment 0 came from me when I was profiling what Firefox does when it is left for a full day with a single about:blank tab.

I think the test cases are likely quite different.
For comment 7 I captured a profile from a fresh profile of a nightly build, with the only changes being that I turned off the Pocket, sponsored tiles, and weather on the new tab.
Since I live in Canada, I was automatically enrolled in DoH.
But otherwise it was a profile from just after launch.

Good news: I've marked up the source to output to ADB when a new timer is added on Android and I don't see any being added when the app is backgrounded (using Fenix debug).

When the app is in the foreground, I see the prune dead connection timer being updated periodically because of what also look to be long-poll connections with max time to live:
(two from the amazon, which I had just loaded, and one from firefox.settings.services.mozilla.com, and incoming.telemetry.mozilla.org)

nsHttpConnectionMgr::OnMsgPruneDeadConnections  timeToNextExpire: 4294967295 UINT32_MAX 4294967295 , [ci=.SA...C...[tlsflags0x00000000]incoming.telemetry.mozilla.org:443] 
nsHttpConnectionMgr::OnMsgPruneDeadConnections  timeToNextExpire: 4294967295 UINT32_MAX 4294967295 , [ci=.SA.......[tlsflags0x00000000]images-na.ssl-images-amazon.com:443 <ROUTE-via images-na.ssl-images-amazon.com:443> {NPN-TOKEN h3}^partitionKey=%28https%2Camazon.ca%29] 
nsHttpConnectionMgr::OnMsgPruneDeadConnections  timeToNextExpire: 169 UINT32_MAX 4294967295 , [ci=.S........[tlsflags0x00000000]fls-na.amazon.ca:443^partitionKey=%28https%2Camazon.ca%29] 
nsHttpConnectionMgr::OnMsgPruneDeadConnections  timeToNextExpire: 4294967295 UINT32_MAX 4294967295 , [ci=.SA.......[tlsflags0x00000000]firefox.settings.services.mozilla.com:443] 

On Desktop, with DoH disabled, leaving the browser idle on about:blank for a few hours, I see the following long lived connections:

nsHttpConnectionMgr::OnMsgPruneDeadConnections  timeToNextExpire 4294967295 [ci=.S........[tlsflags0x00000000]addons.mozilla.org:443^partitionKey=%28https%2Cmozilla.org%29] 
nsHttpConnectionMgr::OnMsgPruneDeadConnections  timeToNextExpire 4294967295 [ci=.S........[tlsflags0x00000000]push.services.mozilla.com:443] 
nsHttpConnectionMgr::OnMsgPruneDeadConnections  timeToNextExpire 4294967295 [ci=..A.......[tlsflags0x00000000]detectportal.firefox.com:80[TRR:1][!v4]] 
nsHttpConnectionMgr::OnMsgPruneDeadConnections  timeToNextExpire 4294967295 [ci=.S........[tlsflags0x00000000]encrypted-tbn0.gstatic.com:443 <ROUTE-via encrypted-tbn0.gstatic.com:443> {NPN-TOKEN h3}] 
nsHttpConnectionMgr::OnMsgPruneDeadConnections  timeToNextExpire 4294967295 [ci=.SA...C...[tlsflags0x00000000]aus5.mozilla.org:443] 
nsHttpConnectionMgr::OnMsgPruneDeadConnections  timeToNextExpire 4294967295 [ci=.S........[tlsflags0x00000000]www.google.com:443^partitionKey=%28https%2Cgoogle.com%29] 
nsHttpConnectionMgr::OnMsgPruneDeadConnections  timeToNextExpire 4294967295 [ci=.S........[tlsflags0x00000000]firefox-settings-attachments.cdn.mozilla.net:443^partitionKey=%28about%2Cabout.ef2a7dd5-93bc-417f-a698-142c3116864f.mozilla%29] 
nsHttpConnectionMgr::OnMsgPruneDeadConnections  timeToNextExpire 4294967295 [ci=.SA.......[tlsflags0x00000000]ads.mozilla.org:443 <ROUTE-via ads.mozilla.org:443> {NPN-TOKEN h3}] 
nsHttpConnectionMgr::OnMsgPruneDeadConnections  timeToNextExpire 4294967295 [ci=.SA.......[tlsflags0x00000000]www.google.com:443 <ROUTE-via www.google.com:443> {NPN-TOKEN h3}^firstPartyDomain=google-b-d.search.suggestions.mozilla] 
nsHttpConnectionMgr::OnMsgPruneDeadConnections  timeToNextExpire 4294967295 [ci=.SA...C...[tlsflags0x00000000]firefox.settings.services.mozilla.com:443] 
nsHttpConnectionMgr::OnMsgPruneDeadConnections  timeToNextExpire 4294967295 [ci=..A.......[tlsflags0x00000000]detectportal.firefox.com:80[TRR:1]] 
nsHttpConnectionMgr::OnMsgPruneDeadConnections  timeToNextExpire 4294967295 [ci=..A.......[tlsflags0x00000000]detectportal.firefox.com:80[TRR:1][!v6]] 
nsHttpConnectionMgr::OnMsgPruneDeadConnections  timeToNextExpire 4294967295 [ci=.SA...C...[tlsflags0x00000000]content-signature-2.cdn.mozilla.net:443] 

We decided to pull this one from the Necko priority queue because it doesn't appear to be actionable since we never actually reach a state where all requests are resolved. (Because of the number of long lived connections (e.g. addons.mozilla.org, firefox.settings.services.mozilla, TRR's etc).

But reducing timer wakeups is important for energy usage.

Does anyone know of any other bugs in this area that we can consider to replace this one?

Whiteboard: [necko-triaged][necko-priority-queue] → [necko-triaged]

I grabbed another long profile today and I still see the same pattern as in the profile from comment 0: https://share.firefox.dev/40cIH2H

Now that we have a bandwidth track, I also see that we are transferring 34 bytes most times there are timer wake-ups. Are there some servers Firefox is reaching out to that are improperly configured to keep connections alive? Should the client code more explicitly close the connection?

Flags: needinfo?(acreskey)

(In reply to Florian Quèze [:florian] from comment #16)

I grabbed another long profile today and I still see the same pattern as in the profile from comment 0: https://share.firefox.dev/40cIH2H

Now that we have a bandwidth track, I also see that we are transferring 34 bytes most times there are timer wake-ups. Are there some servers Firefox is reaching out to that are improperly configured to keep connections alive? Should the client code more explicitly close the connection?

I'm looking at your profile and from what I can tell the recorded bandwidth transfers show up around network transactions (e.g. response from https://ads.mozilla.org/v1/ads or https://safebrowsing.googleapis.com/v4/threatListUpdates:fetch?... or OCSP responses).
So those seem to be expected?

We could also use Wireshark to pinpoint the source of the connections that use the most bandwidth when Firefox is idle.
Related: Just today I started looking at decrypted Wireshark captures, https://my.f5.com/manage/s/article/K50557518

And then follow up with the teams that are making the requests (e.g. push.services.mozilla.com) and see if we can change the connection settings.

Assignee: acreskey → nobody
Flags: needinfo?(acreskey)
You need to log in before you can comment on or make changes to this bug.