Closed Bug 1441391 Opened 6 years ago Closed 6 years ago

TRR: Suspended browser can't resolve any names

Categories

(Core :: Networking, defect, P1)

RESOLVED FIXED
mozilla60
Tracking Status
firefox60 --- fixed

People

(Reporter: valentin, Assigned: bagder)

References

Details

(Whiteboard: [necko-triaged][trr])

Attachments

(1 file)

mcmanus: I came back to a laptop that had been suspended for an hr in mode 3 and couldn't resolve any names.. toggling the mode to 0 back to 3 fixed it.. can you file a bug? (today's nightly)
Hm, that has to somehow have botched the HTTPS requests themselves. It feels like we don't reset state back properly to set it up again when the DOH server's name is not in the cache anymore.
I have a log of this from overnight.

resolving host foo
no usable address in cache for host [foo]
trrlookup:: foo service not enabled


and this is mode 3, so we're stuck (but in any other mode I suspect we wouldn't be using trr at that point)

so why won't the service get re-enabled?
Priority: -- → P2
I had this happen on my desktop today for the first time. The internet was unusually flaky, though, so it's certainly possible that some kind of failed connection was the common thread.
Changing mode doesn't in itself trigger anything, but it will cause the regular resolver to get used. Since that then helped mode 3, it could imply that an address needed to get added to the DNS cache first, and then it worked? Do you have bootstrapAddress set? If you get it stuck again like that, can you see if mode 2 or 1 also gets it back on track? Presumably they do.

Of course, if the connection is so bad that the HTTP requests fail, then that could explain it as well, since the NS confirm might then fail and you end up stranded similar to how you describe. But that seems implausible as it would require a *really* bad connection situation.

(PS: a subject for more thinking is certainly what TRR can do to report exactly why it doesn't work/behave. I've personally managed to fill in the URL wrong, forgotten to set "useGET" etc, and when that happens TRR is just silent and it is far from obvious to a user why it isn't working correctly... your case is yet another version of "TRR doesn't do anything, why?")
bootstrap is indeed set.

this is a bit worse than not doing anything - it was doing something and then stopped and got stuck there. It's certainly possible that it tried to do the NS confirm when coming back from suspend and that failed due to interface coming up wonkery.. but I think we would need to be robust to that.. and it doesn't really explain the desktop issue, where the NS should have already been checked and I don't see any reason that it would do it again.
daniel and I explored this a bit:

* various things reset the dns service upon problems like a network change
* that would reset the trr service too
* normally the NS check is gated on the cap-portal green light, but
* in TRR-only mode (mode 3) that is bypassed because cap-portal has a dns dependency
* if the ns check fails due to connectivity it stays perma-failed until the service is reset
* failure is possible in this scenario because cap-portal hasn't confirmed anything
* resetting the mode forces the check to be redone, that's why things work.
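
To make the stuck state in the list above concrete, here is a minimal standalone sketch of the pre-fix behaviour. It is a simplified model, not the real TRRService code: `TrrModel`, `MaybeConfirm`, `SetMode`, `Enabled`, `CONFIRM_INIT`, `CONFIRM_OK`, `MODE_OFF` and the `networkUp` flag are all hypothetical names invented for illustration. Only `CONFIRM_FAILED` and `MODE_TRRONLY` appear in the patch under review.

```cpp
#include <cassert>

// Hypothetical, simplified model of the behaviour described in the list above.
// It is not the real TRRService.
enum Mode { MODE_OFF = 0, MODE_TRRONLY = 3 };
enum ConfirmState { CONFIRM_INIT, CONFIRM_OK, CONFIRM_FAILED };

struct TrrModel {
  Mode mode = MODE_TRRONLY;
  ConfirmState state = CONFIRM_INIT;
  bool captivePortalOk = false;

  // Run the NS check if it is allowed to run. `networkUp` stands in for
  // whether the HTTPS request to the DoH server would succeed.
  void MaybeConfirm(bool networkUp) {
    if (state != CONFIRM_INIT) {
      return;  // a finished check (OK or FAILED) is never redone on its own
    }
    if (mode == MODE_OFF) {
      return;  // TRR is not in use
    }
    if (mode != MODE_TRRONLY && !captivePortalOk) {
      return;  // other modes wait for the captive portal green light
    }
    // TRR-only mode gets here without any connectivity confirmation.
    state = networkUp ? CONFIRM_OK : CONFIRM_FAILED;
  }

  // Changing the mode resets the service, so the check can run again.
  void SetMode(Mode newMode) {
    mode = newMode;
    state = CONFIRM_INIT;
  }

  bool Enabled() const { return state == CONFIRM_OK; }
};
```

In this model a single `MaybeConfirm(false)` right after resume leaves the state at `CONFIRM_FAILED`, and later calls with a working network change nothing until `SetMode` is called, which matches the "toggle to 0 and back to 3" workaround.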

the fix is to set a backoff timer when the ns check fails and mode = only, and try it again. that's basically what cap-portal would do.
My suggested patch here adds a retry mechanism and adds/removes some log output to help with future diagnosis of what's going on...
Assignee: valentin.gosu → daniel
Priority: P2 → P1
Comment on attachment 8957475 [details]
bug 1441391 - TRR: restart failed NS confirms in TRR-only mode

https://reviewboard.mozilla.org/r/226380/#review232292

::: netwerk/dns/TRR.cpp
(Diff revision 1)
>  
>  NS_IMETHODIMP
>  TRR::Notify(nsITimer *aTimer)
>  {
>    if (aTimer == mTimeout) {
> -    LOG(("TRR request for %s timed out\n", mHost.get()));

Did you mean to remove this?

::: netwerk/dns/TRRService.cpp:531
(Diff revision 1)
> +    if ((mConfirmationState == CONFIRM_FAILED) && (mMode == MODE_TRRONLY)) {
> +      // in TRR-only mode; retry failed confirmations
> +      NS_NewTimerWithCallback(getter_AddRefs(mRetryConfirmTimer),
> +                              this, mRetryConfirmInterval,
> +                              nsITimer::TYPE_ONE_SHOT);
> +      if (mRetryConfirmInterval < 64000) {

Should we reset the interval when confirmation succeeds?
Attachment #8957475 - Flags: review?(valentin.gosu) → review+
Comment on attachment 8957475 [details]
bug 1441391 - TRR: restart failed NS confirms in TRR-only mode

https://reviewboard.mozilla.org/r/226380/#review232292

> Did you mean to remove this?

Yes, it's on purpose. This log output is not helpful and in fact mostly quite spammy. There's already another log output if the timeout actually cancels the HTTP channel, which is what will interest log readers.

> Should we reset the interval when confirmation succeeds?

Yes, good catch!
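
For readers following the review, the retry interval handling discussed above could look roughly like the standalone sketch below. It is only an illustration under stated assumptions, not the landed patch: the `RetryBackoff` class and its method names are hypothetical, the 1000 ms starting value is made up, and doubling the interval is an assumption, since the quoted diff only shows the `< 64000` check and not what happens inside it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper; the real patch keeps this state in TRRService members
// such as mRetryConfirmInterval.
class RetryBackoff {
 public:
  // Returns the delay (in ms) to arm the one-shot retry timer with, then
  // grows the interval that the following retry will use.
  uint32_t NextDelayMs() {
    uint32_t delay = mIntervalMs;
    if (mIntervalMs < kMaxIntervalMs) {
      mIntervalMs *= 2;  // assumption: the interval doubles per failure
    }
    return delay;
  }

  // Called when a confirmation succeeds, so that a later failure starts
  // again from the short interval (the reset agreed on in the review).
  void OnConfirmSucceeded() { mIntervalMs = kInitialIntervalMs; }

 private:
  static constexpr uint32_t kInitialIntervalMs = 1000;  // assumed start value
  static constexpr uint32_t kMaxIntervalMs = 64000;     // threshold from the diff
  uint32_t mIntervalMs = kInitialIntervalMs;
};
```

Without the reset in `OnConfirmSucceeded`, a service that had once backed off all the way would wait the longest interval before its first retry after any later failure.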
Pushed by daniel@haxx.se:
https://hg.mozilla.org/integration/autoland/rev/798a47cd74d5
TRR: restart failed NS confirms in TRR-only mode r=valentin
I hate those compiler errors that 'mach build' on my machine doesn't show... :-/
Flags: needinfo?(daniel)
Pushed by daniel@haxx.se:
https://hg.mozilla.org/integration/autoland/rev/558353d9fc61
TRR: restart failed NS confirms in TRR-only mode r=valentin
https://hg.mozilla.org/mozilla-central/rev/558353d9fc61
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
Depends on: 1521639