Closed Bug 1456005 Opened 7 years ago Closed 6 years ago

TRR deadlock when captive portal detection fails

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: mt, Assigned: kershaw)

References

Details

(Whiteboard: [necko-triaged][trr][mode 3])

Attachments

(2 files)

Bug 1456005 - Set LOAD_DISABLE_TRR flag in CaptiveDetect.jsm r=mayhemer 6 years ago Valentin Gosu [:valentin] (he/him) {{ AFK in July }} 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1456005 - Add captive portal URI to the TRR exclusion list 6 years ago Kershaw Chang [:kershaw] 47 bytes, text/x-phabricator-request		Details \| Review

Martin Thomson [:mt:]

Reporter

Description

•

7 years ago

In an airport lounge and Windows detects the captive portal, but Firefox does not. This leads to a window being opened in Firefox (the default browser), which proceeds to use TRR (mode=3), which naturally fails. The tab seems stuck in a limbo state: no content, no error page, just a blank white page. It's possible that this is intended behaviour. I think that is the point of bug 1446352. But the complete lack of feedback is a concern.

Daniel Stenberg [:bagder]

Comment 1

•

7 years ago

This is curious. In mode 3 (trronly), TRR doesn't care about the captive portal check but will keep on retrying to resolve the confirmation NS domain with an exponential backoff up to a 60 seconds interval. When in trronly, Firefox can't do the CP check itself since it will fail all host names until the CP is up. But if the CP state is cleared externally, like by another browser (session), it should then detect it working again and pop back to life. Did you try that, or did you just spot Firefox getting into limbo? I would of course agree that the lack of feedback in this scenario makes it really hard for anyone to actually understand what's going on and it'll require that you really understand the low level network doings not to just get confused by this. I think trronly mode is still a bit experimental and for the adventurous only. If TRR is enabled by pref (any mode), it would be nice with some sort of feedback somewhere saying if it actually works or not...

Assignee: nobody → daniel

Priority: -- → P2

Whiteboard: [necko-triaged][trr]

Martin Thomson [:mt:]

Reporter

Comment 2

•

7 years ago

I could open another browser and unblock things, which would unbind things, certainly. But the indefinite wait is a problem. Maybe it's just a matter of having the TRR resolutions fail at a normal rate while TRR is waiting for confirmation rather than waiting on that process indefinitely. Maybe if TRR has hit a certain point in its backoff, then attempts to use TRR can fail immediately rather than block (you know the code, so might have another idea that is better, of course). I want to make mode 3 viable, not limit it to use by people who understand how this all fits together. Maybe there are two bugs to investigate here, the second to spin off in another bug: 1. the feedback (faster, better, etc...) 2. CP+TRR interaction - maybe we should let the CP check use the default name server (or cap TRR at mode 2 for those checks); it seems like that check isn't a privacy exposure, and that exemption would make the entire process more robust.

Daniel Stenberg [:bagder]

Comment 3

•

7 years ago

(Cc'ing Valentin, "mr CP")

Daniel Stenberg [:bagder]

Comment 4

•

7 years ago

We should probably consider making trronly respect the "wait for the captive portal" pref, or perhaps have a more laxed "trronly-after-cp" mode. For trronly, simply whitelisting the CP host we use will work to at least confirm that we are in a CP (since it will fail to get the correct data), but it would still most often fail to load the subsequent CP login page unless the native resolver is used for that as well. The one issue with a "trronly-after-cp" mode is for those users who strictly never want a native resolve to happen, since while waiting for the CP to signal OK in a browser-startup even without a CP present, there's likely to be a few native name resolves. If you for example run your desktop in a fixed known environment where there's no CP, you wouldn't have to accept that compromise.

Jonathan Kingston [:jkt] he/him

Comment 5

•

7 years ago

In mode 2 the captive portal test netwerk/test/unit/test_captive_portal_service.js is permafailing. This will be an issue of the rollout even if it's just a broken test. https://treeherder.mozilla.org/#/jobs?repo=try&revision=26d5ce9b665329222522aeebbb13b9f3198c2397 is the try push failures (I'm not sure why try didn't capture my commit with the extension also, but that looks like this: https://hg.mozilla.org/try/rev/4aa199208703239f3b82d92958600a97a565cf62)

Daniel Stenberg [:bagder]

Comment 6

•

7 years ago

jkt: I think that warrants its own separate bug. It's certainly not the same issue as this CP deadlock in mode 3...

Jonathan Kingston [:jkt] he/him

Updated

•

7 years ago

Updated

•

7 years ago

Assignee: daniel → nobody

Tony Cinotto [:tcinotto] (UTC-8)

Updated

•

6 years ago

Whiteboard: [necko-triaged][trr] → [necko-triaged][trr][mode3]

Tony Cinotto [:tcinotto] (UTC-8)

Updated

•

6 years ago

Priority: P2 → P3

Valentin Gosu [:valentin] (he/him) {{ AFK in July }}

Comment 7

•

6 years ago

Actually this probably affects mode2 as well, in the split horizon case.
There's a chance that a network is using a remote captive portal service. If that service is set up to return different results based on the recursive resolver's IP, the user might be unable to log into the captive portal, or we might detect a captive portal even though it has already been cleared.

Solution:

set the LOAD_DISABLE_TRR flag on the xhr in CaptiveDetect.jsm
make sure the tab opened from ensureCaptivePortalTab also gets passed that flag
Add the captive portal endpoint domain name to the list of excluded domains

Depending on the weird ways the CP is configured, we might have a problem if the user navigates to example.com but is redirected to the captive portal login page in a regular tab. But at least they'd have a way out by clicking the notification bar.

This needs a bunch of unit tests that might be difficult to write at the moment - especially for what happens if we have a DoH connection active and the CP decides to lock itself during that session.

Assignee: nobody → valentin.gosu

Whiteboard: [necko-triaged][trr][mode3] → [necko-triaged][trr]

Valentin Gosu [:valentin] (he/him) {{ AFK in July }}

Updated

•

6 years ago

Keywords: leave-open

Valentin Gosu [:valentin] (he/him) {{ AFK in July }}

Comment 8

•

6 years ago

•

Edited

Attached file Bug 1456005 - Set LOAD_DISABLE_TRR flag in CaptiveDetect.jsm r=mayhemer — Details

This is useful to prevent issues in TRR mode 2 when a captive portal redirect occurs via DNS.
For example, if we are in an unlocked CP and in suddenly locks, we must make sure that we don't use a cached DNS entry from when the portal was unlocked.
This is especially relevant in split horizon situations.

This isn't useful in TRR mode 3, as the flag would just cause the resolution to fail.
For that we need to add the captive portal URI to the TRR exclusion list.

Valentin Gosu [:valentin] (he/him) {{ AFK in July }}

Comment 9

•

6 years ago

(In reply to Valentin Gosu [:valentin] from comment #8)

This isn't useful in TRR mode 3, as the flag would just cause the resolution to fail.
For that we need to add the captive portal URI to the TRR exclusion list.

Also note, that it would be only marginally useful with captive portals in mode 3, since it would only deal with DNS based captive portals, and would fail for CP that do a HTTP redirect to yet another domain (be it private or public).

Phabricator Automation

Updated

•

6 years ago

Attachment #9054878 - Attachment description: Bug 1456005 - Set LOAD_DISABLE_TRR flag in CaptiveDetect.jsm r=dragana → Bug 1456005 - Set LOAD_DISABLE_TRR flag in CaptiveDetect.jsm r=mayhemer

Pulsebot

Comment 10

•

6 years ago

Pushed by valentin.gosu@gmail.com: https://hg.mozilla.org/integration/autoland/rev/f002275ecaca Set LOAD_DISABLE_TRR flag in CaptiveDetect.jsm r=mayhemer

Bogdan Tara[:bogdan_tara | bogdant]

Comment 11

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/f002275ecaca

Nhi Nguyen (:nhi)

Updated

•

6 years ago

Whiteboard: [necko-triaged][trr] → [necko-triaged][trr][mode 3]

Nhi Nguyen (:nhi)

Updated

•

6 years ago

Assignee: valentin.gosu → nobody

Priority: P3 → P1

Nhi Nguyen (:nhi)

Comment 12

•

6 years ago

Kershaw, could you pick this up? Comment 8 seems to say we still need to fix it for mode 3.

Assignee: nobody → kershaw

Kershaw Chang [:kershaw]

Assignee

Comment 13

•

6 years ago

Attached file Bug 1456005 - Add captive portal URI to the TRR exclusion list — Details

Pulsebot

Comment 14

•

6 years ago

Pushed by kjang@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b3a7ade91b61 Add captive portal URI to the TRR exclusion list r=dragana

Natalia Csoregi [:nataliaCs]

Comment 15

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/b3a7ade91b61

Nhi Nguyen (:nhi)

Comment 16

•

6 years ago

Kershaw, can this be closed now? Looks like the leave-open keyword is set before your patch.

Flags: needinfo?(kershaw)

Kershaw Chang [:kershaw]

Assignee

Comment 17

•

6 years ago

(In reply to Nhi Nguyen (:nhi) from comment #16)

Kershaw, can this be closed now? Looks like the leave-open keyword is set before your patch.

Yeah, I think we can close this.

Status: NEW → RESOLVED

Closed: 6 years ago

Flags: needinfo?(kershaw)

Keywords: leave-open

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.