Closed Bug 1456005 Opened 2 years ago Closed 2 months ago

TRR deadlock when captive portal detection fails

Categories

(Core :: Networking: DNS, defect, P1)

x86_64
Windows 10
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: mt, Assigned: kershaw)

References

Details

(Whiteboard: [necko-triaged][trr][mode 3])

Attachments

(2 files)

In an airport lounge and Windows detects the captive portal, but Firefox does not.  This leads to a window being opened in Firefox (the default browser), which proceeds to use TRR (mode=3), which naturally fails.  The tab seems stuck in a limbo state: no content, no error page, just a blank white page.

It's possible that this is intended behaviour.  I think that is the point of bug 1446352.  But the complete lack of feedback is a concern.
This is curious. In mode 3 (trronly), TRR doesn't care about the captive portal check but will keep on retrying to resolve the confirmation NS domain with an exponential backoff up to a 60 seconds interval. When in trronly, Firefox can't do the CP check itself since it will fail all host names until the CP is up.

But if the CP state is cleared externally, like by another browser (session), it should then detect it working again and pop back to life.

Did you try that, or did you just spot Firefox getting into limbo? 

I would of course agree that the lack of feedback in this scenario makes it really hard for anyone to actually understand what's going on and it'll require that you really understand the low level network doings not to just get confused by this. I think trronly mode is still a bit experimental and for the adventurous only.

If TRR is enabled by pref (any mode), it would be nice with some sort of feedback somewhere saying if it actually works or not...
Assignee: nobody → daniel
Priority: -- → P2
Whiteboard: [necko-triaged][trr]
I could open another browser and unblock things, which would unbind things, certainly.  But the indefinite wait is a problem.

Maybe it's just a matter of having the TRR resolutions fail at a normal rate while TRR is waiting for confirmation rather than waiting on that process indefinitely.  Maybe if TRR has hit a certain point in its backoff, then attempts to use TRR can fail immediately rather than block (you know the code, so might have another idea that is better, of course).

I want to make mode 3 viable, not limit it to use by people who understand how this all fits together.  

Maybe there are two bugs to investigate here, the second to spin off in another bug:

1. the feedback (faster, better, etc...)

2. CP+TRR interaction - maybe we should let the CP check use the default name server (or cap TRR at mode 2 for those checks); it seems like that check isn't a privacy exposure, and that exemption would make the entire process more robust.
(Cc'ing Valentin, "mr CP")
We should probably consider making trronly respect the "wait for the captive portal" pref, or perhaps have a more laxed "trronly-after-cp" mode.

For trronly, simply whitelisting the CP host we use will work to at least confirm that we are in a CP (since it will fail to get the correct data), but it would still most often fail to load the subsequent CP login page unless the native resolver is used for that as well.

The one issue with a "trronly-after-cp" mode is for those users who strictly never want a native resolve to happen, since while waiting for the CP to signal OK in a browser-startup even without a CP present, there's likely to be a few native name resolves. If you for example run your desktop in a fixed known environment where there's no CP, you wouldn't have to accept that compromise.
In mode 2 the captive portal test netwerk/test/unit/test_captive_portal_service.js is permafailing. This will be an issue of the rollout even if it's just a broken test.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=26d5ce9b665329222522aeebbb13b9f3198c2397 is the try push failures (I'm not sure why try didn't capture my commit with the extension also, but that looks like this: https://hg.mozilla.org/try/rev/4aa199208703239f3b82d92958600a97a565cf62)
jkt: I think that warrants its own separate bug. It's certainly not the same issue as this CP deadlock in mode 3...
See Also: → 1498525
Assignee: daniel → nobody
Whiteboard: [necko-triaged][trr] → [necko-triaged][trr][mode3]
Priority: P2 → P3

Actually this probably affects mode2 as well, in the split horizon case.
There's a chance that a network is using a remote captive portal service. If that service is set up to return different results based on the recursive resolver's IP, the user might be unable to log into the captive portal, or we might detect a captive portal even though it has already been cleared.

Solution:

  1. set the LOAD_DISABLE_TRR flag on the xhr in CaptiveDetect.jsm
  2. make sure the tab opened from ensureCaptivePortalTab also gets passed that flag
  3. Add the captive portal endpoint domain name to the list of excluded domains

Depending on the weird ways the CP is configured, we might have a problem if the user navigates to example.com but is redirected to the captive portal login page in a regular tab. But at least they'd have a way out by clicking the notification bar.

This needs a bunch of unit tests that might be difficult to write at the moment - especially for what happens if we have a DoH connection active and the CP decides to lock itself during that session.

Assignee: nobody → valentin.gosu
Whiteboard: [necko-triaged][trr][mode3] → [necko-triaged][trr]

This is useful to prevent issues in TRR mode 2 when a captive portal redirect occurs via DNS.
For example, if we are in an unlocked CP and in suddenly locks, we must make sure that we don't use a cached DNS entry from when the portal was unlocked.
This is especially relevant in split horizon situations.

This isn't useful in TRR mode 3, as the flag would just cause the resolution to fail.
For that we need to add the captive portal URI to the TRR exclusion list.

(In reply to Valentin Gosu [:valentin] from comment #8)

This isn't useful in TRR mode 3, as the flag would just cause the resolution to fail.
For that we need to add the captive portal URI to the TRR exclusion list.

Also note, that it would be only marginally useful with captive portals in mode 3, since it would only deal with DNS based captive portals, and would fail for CP that do a HTTP redirect to yet another domain (be it private or public).

Attachment #9054878 - Attachment description: Bug 1456005 - Set LOAD_DISABLE_TRR flag in CaptiveDetect.jsm r=dragana → Bug 1456005 - Set LOAD_DISABLE_TRR flag in CaptiveDetect.jsm r=mayhemer
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/f002275ecaca
Set LOAD_DISABLE_TRR flag in CaptiveDetect.jsm r=mayhemer
Whiteboard: [necko-triaged][trr] → [necko-triaged][trr][mode 3]
Assignee: valentin.gosu → nobody
Priority: P3 → P1

Kershaw, could you pick this up? Comment 8 seems to say we still need to fix it for mode 3.

Assignee: nobody → kershaw
Pushed by kjang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b3a7ade91b61
Add captive portal URI to the TRR exclusion list r=dragana

Kershaw, can this be closed now? Looks like the leave-open keyword is set before your patch.

Flags: needinfo?(kershaw)

(In reply to Nhi Nguyen (:nhi) from comment #16)

Kershaw, can this be closed now? Looks like the leave-open keyword is set before your patch.

Yeah, I think we can close this.

Status: NEW → RESOLVED
Closed: 2 months ago
Flags: needinfo?(kershaw)
Keywords: leave-open
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.