Don't poll for captive portal detection

RESOLVED FIXED in Firefox -esr52

Status

()

defect
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: rnewman, Assigned: valentin)

Tracking

(Blocks 1 bug)

Trunk
mozilla55
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(relnote-firefox 53+, firefox-esr5254+ fixed, firefox53+ verified, firefox54+ verified, firefox55 verified)

Details

(Whiteboard: [power][necko-active])

Attachments

(1 attachment)

Reporter

Description

2 years ago
I got a new Mac. It complained that Nightly was using lots of power. As well as finding all the usual suspects (add-ons, Test Pilot), I noticed that we were hitting the captive portal endpoint every few minutes.

This was happening:

- When Firefox was backgrounded.
- With no page loads happening.
- On a stable wireless network on which this machine has spent 100% of its life.
- On an OS with its own reliable captive portal detection.

This is just bad enough to keep the machine from going into a low power state, but just useless enough that it's statistically unlikely to beat the OS (or the user!) to detecting a captive portal. To do that we'd need to poll very frequently.

Can someone explain why we poll at all? Is this to catch the hotel 24-hour logout case?

Possible solutions:

- Don't poll; wait for OS network transition events.
- Don't poll when Firefox is not the foreground application.
- Don't do captive portal detection on Mac OS or modern Windows by default; they have good captive portal detection systems already.
- Don't poll on a separate schedule; instead coalesce these polls with a failed telemetry upload, blocklist ping, update check, or one of the countless other network hits we do.
after the initial check passes (i.e. you're not on CP), the algorithm is not expected to continue to poll unless something else suspicious is observed (cert failures iirc).. valentin will take a look. he'll probably need logs.
Flags: needinfo?(valentin.gosu)
Assignee: nobody → valentin.gosu
Whiteboard: [power] → [power][necko-active]
Reporter

Comment 2

2 years ago
Definitely happening frequently. I have about 10 sitting in the log right now, after I restarted to flip the pref back.

The stack for each is the same:


	URLFetcher jar:file:///Applications/FirefoxNightly.app/Contents/Resources/omni.ja!/components/captivedetect.js:63:3
	_startDetection jar:file:///Applications/FirefoxNightly.app/Contents/Resources/omni.ja!/components/captivedetect.js:312:22
	finishPreparation jar:file:///Applications/FirefoxNightly.app/Contents/Resources/omni.ja!/components/captivedetect.js:282:5
	_applyDetection jar:file:///Applications/FirefoxNightly.app/Contents/Resources/omni.ja!/components/captivedetect.js:301:7
	_runNextRequest jar:file:///Applications/FirefoxNightly.app/Contents/Resources/omni.ja!/components/captivedetect.js:408:7
	_addRequest jar:file:///Applications/FirefoxNightly.app/Contents/Resources/omni.ja!/components/captivedetect.js:416:7
	checkCaptivePortal jar:file:///Applications/FirefoxNightly.app/Contents/Resources/omni.ja!/components/captivedetect.js:266:5

Let me know what you need, Valentin.
So, the polling logic is like this:
The initial polling time is network.captive-portal-service.minInterval
If nothing happens to  trigger a recheck, this time exponentially grows to network.captive-portal-service.maxInterval
So we start polling every 60s the first 10 times, then every 5 minutes 10 more times, and so on until we reach the maxInterval.

If something happens that calls RecheckCaptivePortal - redirect to local IP address, cert errors, coming back from stand-by, or network state changes, we start polling using the minInterval.

I agree with you that maybe this isn't the most energy efficient way of checking. I am open to suggestions on how to improve this. My first ideas would be to add a pref for how quickly we grow the polling interval.

(In reply to Richard Newman [:rnewman] from comment #0)
> were hitting the captive portal endpoint every few minutes.

Do you know how much this affects your battery life? Does it cut it down by half? 10%? less?

> - When Firefox was backgrounded.

We would like the CP portal state to be detected by the time the user brings Firefox to the foreground. I don't know if we can do this without polling.

> - On an OS with its own reliable captive portal detection.

Some CP implementations block the native Mac/Windows detection endpoints, so they don't get detected properly.

> This is just bad enough to keep the machine from going into a low power
> state, but just useless enough that it's statistically unlikely to beat the
> OS (or the user!) to detecting a captive portal. To do that we'd need to
> poll very frequently.

Our current behaviour is a compromise, to avoid polling too frequently, but still allow detection to happen before a user needs to navigate to something.

> Can someone explain why we poll at all? Is this to catch the hotel 24-hour
> logout case?

Yes. Some CPs lock you out after 30 minutes, or 1 hour, etc. There is currently no way of knowing how much time the user has left.

> Possible solutions:
> 
> - Don't poll; wait for OS network transition events.

We trigger detection when network changes happen.

> - Don't poll when Firefox is not the foreground application.

This would reduce the usefulness of our detection.  The polling is specifically used to address the background case, when the user isn't actively using the network.

> - Don't do captive portal detection on Mac OS or modern Windows by default;

This would reduce its usefulness. There are corner cases which Win/Mac doesn't catch, but we do.
Figuring out a better strategy seems better than simply disabling it.

> - Don't poll on a separate schedule; instead coalesce these polls with a
> failed telemetry upload, blocklist ping, update check, or one of the
> countless other network hits we do.

We kind of do this already. If a background connection happens, and it fails (cert error or redirect) we trigger detection immediately.
Flags: needinfo?(valentin.gosu)
my understanding was that if no CP was ever detected in the session (based on the initial check) we would only poll again based on some kind of signal such as link change or maybe cert fail. but there wouldn't be a polling interval at all. That was for exactly the reason in this bug.

That's different than a session that has seen a CP - inevitably its going to have to do some polling.
Reporter

Comment 5

2 years ago
(In reply to Valentin Gosu [:valentin] from comment #3)

> I agree with you that maybe this isn't the most energy efficient way of
> checking. I am open to suggestions on how to improve this. 

Perhaps a better option is to not reduce the polling interval on most of these? That is, it's fine to _bring forward_ the next check if I switch networks or return from idle, but just because I returned from idle doesn't mean we should start polling every minute. For many users that'll effectively cause them to spend most of their time polling every minute.

I'm not really sure what the point of frequent initial polling is. Isn't it enough to poll every 30 minutes, and also to poll when the user foregrounds Firefox + shortly after waking from sleep + shortly after a network change?

IMO a successful poll should _immediately_ kick us into the longest polling duration. There's no point going "yup, no captive portal! better double-check in 60 seconds".

I also wonder if there's a return-from-standby/idle/idle timer bug here.


> My first ideas
> would be to add a pref for how quickly we grow the polling interval.

I don't think adding a pref really helps; users will not open about:config, they'll simply switch away from Firefox when Apple suggests to them that their battery will last four hours longer if they use Safari. I felt a strong temptation to do so myself. Captive portal polling is, I'm sure, not the main offender, but every little helps.


> Do you know how much this affects your battery life? Does it cut it down by
> half? 10%? less?

It's hard to measure.

A Firefox with a handful of static pages open in tabs is still often present in macOS's "apps using significant power" list. Certainly polling once per minute or two over the ~5h I'd get out of a 15" MBP will put the wlan card into a high-power mode quite often, and unnecessarily.

More importantly, this is just about the worst possible timing for a user tethered to a cellular device: if I were idle on my iPhone's hotspot, Firefox would _kill_ my phone's battery, by keeping it in high-power radio mode for 25-50% of the time just to poll.


> We would like the CP portal state to be detected by the time the user brings
> Firefox to the foreground. I don't know if we can do this without polling.

Isn't it sufficient to check as soon as Firefox is foregrounded? I mean, the user isn't going to interact with the CP page without switching to the browser, and milliseconds after foreground is just as good as a minute before.

Perhaps a reasonable compromise, as I suggest above, is using a longer polling interval plus event-based checks.


> > - Don't poll on a separate schedule; instead coalesce these polls with a
> > failed telemetry upload, blocklist ping, update check, or one of the
> > countless other network hits we do.
> 
> We kind of do this already. If a background connection happens, and it fails
> (cert error or redirect) we trigger detection immediately.

OK, so this is even more evidence that rapid polling is unnecessary. Firefox Sync runs every few minutes, plus on network state changes. Telemetry uploads, blocklist ping, GMP updates, safe browsing downloads, etc. all run, too.

It's tempting to say "easy, just wake up and see if we recently made a successful network request, and skip this check". However, ideally a solution doesn't even involve waking up to figure out if we should poll -- that might be enough to kick this big quad-core CPU out of a low-power mode, which is also going to affect my battery life.

Do you have telemetry to suggest how often a timed captive portal detection found a CP?

Do you have telemetry to suggest what proportion of those would not have been found in time by one of the other mechanisms?

My hypothesis is that between background operations like Sync, a long-timer poll, and checks based on user activity and network transitions, the short-timer poll doesn't really help.
We shipped captive portal stuff on FF53, right?  Will the changes here be upliftable to FF54?

Comment 8

2 years ago
mozreview-review
Comment on attachment 8864445 [details]
Bug 1359697 - Make CaptivePortalService not poll for status when it is in the NOT_CAPTIVE state

https://reviewboard.mozilla.org/r/136124/#review139352
Attachment #8864445 - Flags: review?(mcmanus) → review+
(In reply to Ben Kelly [not reviewing due to deadline][:bkelly] from comment #7)
> We shipped captive portal stuff on FF53, right?  Will the changes here be
> upliftable to FF54?

I think so. We haven't made a lot of changes to the CaptivePortalService since then.

Comment 10

2 years ago
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/8a1206836aa4
Make CaptivePortalService not poll for status when it is in the NOT_CAPTIVE state r=mcmanus

Comment 11

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/8a1206836aa4
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Comment on attachment 8864445 [details]
Bug 1359697 - Make CaptivePortalService not poll for status when it is in the NOT_CAPTIVE state

Approval Request Comment
[Feature/Bug causing the regression]:
Initial captive portal impl - bug 1048131
[User impact if declined]:
Unnecessary network requests will be made leading to extra battery use and extra load on our endpoint.
[Is this code covered by automated tests?]:
No.
[Has the fix been verified in Nightly?]:
Yes.
[Needs manual test from QE? If yes, steps to reproduce]:
Yes. Start wireshark on a normal network. Start firefox. Keep watching for 10 minutes. Only one or two requests to the captive portal URL should be made.
We should also check the main use cases for captive portal detection.
[List of other uplifts needed for the feature/fix]: none.
[Is the change risky?]: no.
[Why is the change risky/not risky?]: This is a fairly simple change.
[String changes made/needed]: none
Attachment #8864445 - Flags: approval-mozilla-beta?
Attachment #8864445 - Flags: approval-mozilla-aurora?
Blocks: 1363651
Hi Brindusa, could you help find someone to verify if this issue was fixed as expected on a latest Nightly build? Thanks!
Flags: qe-verify+
Flags: needinfo?(brindusa.tot)
(In reply to Gerry Chang [:gchang] from comment #13)
> Hi Brindusa, could you help find someone to verify if this issue was fixed
> as expected on a latest Nightly build? Thanks!

I can verify this fix.

(In reply to Valentin Gosu [:valentin] from comment #12)
> [Needs manual test from QE? If yes, steps to reproduce]:
> Yes. Start wireshark on a normal network. Start firefox. Keep watching for
> 10 minutes. Only one or two requests to the captive portal URL should be
> made.
> We should also check the main use cases for captive portal detection.
 Valentin, could you please help me out with some hints on what to look for in wireshark? I'm not that experienced at using it.

Also, I'd probably include a very brief smoke on the detection on Win/Linux/Mac, since we know that the behavior of the CP varies with the OS. Please let me know if you think this is unnecessary work for this fix.
Flags: needinfo?(valentin.gosu)
Flags: needinfo?(brindusa.tot)
Flags: needinfo?(adrian.florinescu)
(In reply to Adrian Florinescu [:AdrianSV] from comment #14)
> > We should also check the main use cases for captive portal detection.
>  Valentin, could you please help me out with some hints on what to look for
> in wireshark? I'm not that experienced at using it.

Filter by http requests, there shouldn't be too many of them.
I think you can even apply a filter such as http.host=="detectportal.firefox.com" and it should only show you the requests that interest you.
Make sure there are no more than 2 requests when starting Firefox, and that no other requests are made when NOT in a captive portal. Make sure to use a clean profile just in case.
 
> Also, I'd probably include a very brief smoke on the detection on
> Win/Linux/Mac, since we know that the behavior of the CP varies with the OS.
> Please let me know if you think this is unnecessary work for this fix.

I think only one platform should be enough.
Flags: needinfo?(valentin.gosu)
Sounds like this affects 53, let's track it.
Valentin, if you want to request uplift to mozilla-release here we can try to include this fix in the upcoming 53.0.3 release.
Flags: needinfo?(valentin.gosu)
See Also: → 1365732
Comment on attachment 8864445 [details]
Bug 1359697 - Make CaptivePortalService not poll for status when it is in the NOT_CAPTIVE state

Based on conversation on irc, email, and with :mcmanus let's just take this patch on m-r for the 53 dot release as it sounds pretty safe, and should decrease the frequency that we are polling.
Attachment #8864445 - Flags: approval-mozilla-release+
Attachment #8864445 - Flags: approval-mozilla-beta?
Attachment #8864445 - Flags: approval-mozilla-beta+
Attachment #8864445 - Flags: approval-mozilla-aurora?
Attachment #8864445 - Flags: approval-mozilla-aurora+
Attachment #8864445 - Flags: approval-mozilla-aurora+ → approval-mozilla-aurora-
I think this also would affect esr52, but I think the fix could wait for the next ESR (52.2.0) in June.
Tomcat, can you land this when you get a chance? Julien, this should land for the dot release.
Flags: needinfo?(jcristau)
Flags: needinfo?(cbook)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #20)
> Tomcat, can you land this when you get a chance? Julien, this should land
> for the dot release.

will do it first thing today
Flags: needinfo?(cbook)
Flags: needinfo?(jcristau)
Attachment #8864445 - Flags: approval-mozilla-esr52?
Verified the fix on: Windows 10 x64 using:

FF Version  53.0.3   20170518000419
FF Version  55.0a1   20170518100156

Also checked basic captive portal detection and everything seems in good shape.
Flags: needinfo?(adrian.florinescu)
Comment on attachment 8864445 [details]
Bug 1359697 - Make CaptivePortalService not poll for status when it is in the NOT_CAPTIVE state

let's get this fix in esr52 (default branch, for 52.2esr)
Attachment #8864445 - Flags: approval-mozilla-esr52? → approval-mozilla-esr52+
Included in 53.0.3 relnotes.
I also verified this issue using Firefox 54 beta 10 on Windows 10 64 bit. I could only see 2 requests from http.host=="detectportal.firefox.com" in Wireshark, using bad build 53.0.2 I could see 11 requests.
Flags: qe-verify+
Flags: needinfo?(valentin.gosu)
You need to log in before you can comment on or make changes to this bug.