Closed Bug 1640091 Opened 5 months ago Closed 5 months ago

dnsResolve() failure in Proxy Auto-Configuration (PAC)

Categories

(Core :: Networking: DNS, defect, P1)

78 Branch
defect

Tracking

()

VERIFIED FIXED
mozilla79
Tracking Status
firefox-esr68 --- unaffected
firefox76 --- unaffected
firefox77 --- unaffected
firefox78 + verified
firefox79 --- verified

People

(Reporter: eros_uk, Assigned: valentin)

References

(Regression)

Details

(Keywords: regression, Whiteboard: [necko-triaged][trr])

Attachments

(3 files)

User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0

Steps to reproduce:

Within the last few Nightly updates, it appears that the behaviour of synchronous dnsResolve() has changed.

dnsResolve() seems to get the value from DNS cache and thus fails on un-cached DNS.

If the DNS has been previously cached by Firefox, it synchronously returns the value but if not, it actually stops the PAC from functioning.

It might be that the behaviour has become asynchronous which breaks the synchronous PAC.

Component: Untriaged → Networking: DNS
Product: Firefox → Core

Hello reporter,
This looks bad.
Could you please find the regression window given it's broken within days?
https://mozilla.github.io/mozregression/

cc Valentin to see if he has a quick idea what's going on recently.

Flags: needinfo?(eros_uk)
Priority: -- → P1
Whiteboard: [necko-triaged]

Wasn't there some work done recently on TRR & DNS-over-HTTPS, setting mozilla.cloudflare-dns.com as target?!!
(I think Shane might have been working on it)
Could it be related to above?!

Flags: needinfo?(eros_uk)

I think this is the same issue as in bug 1592248 comment 6, but we'd have to confirm. The fact that it only started happening recently for you on nightly means maybe there's a slightly different scenario here.

First of all check what the value of network.trr.mode is for you.
Then please run mozregression to find the change that introduced this problem.
Eventually we may need you to also gather some HTTP logs to see exactly what's happening.

Flags: needinfo?(eros_uk)

First of all check what the value of network.trr.mode is for you.
It is 2. I have not edited it.

Doesnt mozregression install various version of FF?
I use FF Nightly as my development platform and do not want to mess it up.

I debugged from within the PAC to come to the above mentioned conclusion.

Flags: needinfo?(eros_uk)

(In reply to erosman from comment #4)

First of all check what the value of network.trr.mode is for you.
It is 2. I have not edited it.

Doesnt mozregression install various version of FF?
I use FF Nightly as my development platform and do not want to mess it up.

Mozregression can be set up to use a different profile, so you don't have to worry about that.
You can also make a backup of your profile before just to be sure.
The --profile-persistence clone-first option of mozregression will make it clone the profile so you don't get any issues.

I debugged from within the PAC to come to the above mentioned conclusion.

So, please check if the bug still reproduces if network.trr.mode is set to 0. I expect it will not.
Then we'll also need a regression range using mozregression to figure out what caused this issue.

Flags: needinfo?(eros_uk)

I am unable to run Mozregression

Fatal error detected

Failed to execute script pyi_rth_multiprocessing

However ....

So, please check if the bug still reproduces if network.trr.mode is set to 0. I expect it will not.

Correct, When network.trr.mode is 0, dnsResolve(host) functions normally.
When network.trr.mode is 2, dnsResolve(host) works ONLY if the DNS is already cached (i.e. it doesn't do the NS Lookup) and fails otherwise.
The dnsResolve(host) failure breaks the PAC and results in tab waiting to load forever.

I imagine, the regression would be when Nightly set network.trr.mode to 2.

Flags: needinfo?(eros_uk)

(In reply to erosman from comment #6)

I am unable to run Mozregression

Fatal error detected

Failed to execute script pyi_rth_multiprocessing

Could you also post a stack trace?

Correct, When network.trr.mode is 0, dnsResolve(host) functions normally.
When network.trr.mode is 2, dnsResolve(host) works ONLY if the DNS is already cached (i.e. it doesn't do the NS Lookup) and fails otherwise.
The dnsResolve(host) failure breaks the PAC and results in tab waiting to load forever.

I imagine, the regression would be when Nightly set network.trr.mode to 2.

If you are in the US, the pref was probably turned on for you on Nightly a couple of releases ago.
If it only started failing recently, most likely it was a recent change that caused it.

If you're not able to run mozregression, we'll need you to gather some HTTP logs for us to figure out what's wrong:
https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging

Flags: needinfo?(eros_uk)

Could you also post a stack trace?

What is that?
PS. Tested v4.0.5 mozregression-gui.exe

If you're not able to run mozregression, we'll need you to gather some HTTP logs for us to figure out what's wrong:

Well, I am not familiar with the process.

  • I found a bug and reported it.
  • I provide specific function that causes a PAC to fail (i.e. dnsResolve(host))
  • I tested the bug with the network.trr.mode to further point towards the possible location of the problem.
  • I provide a pointer to timeline.

PAC is synchronous. If dnsResolve(host) is not resolved synchronously, it fails to conform to PAC standard (as well as MDN documentation).

It would be better to let people who are more knowledgeable than me to take it from here.

STR
Here is a sample PAC to test. You can save it as a file and enter file location in Network Settings.
It will not have any proxy. It passes everything without a proxy (i.e. DIRECT).

All it does is for testing dnsResolve(host).
The alert() will be logged in Browser Console as PAC-alert: .........................

function FindProxyForURL(url, host) {
  alert(host);
  alert(dnsResolve(host));
  return 'DIRECT';
}
Flags: needinfo?(eros_uk)
Assignee: nobody → valentin.gosu
Severity: -- → S3
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [necko-triaged] → [necko-triaged][trr]

There are two problems here.

  1. is that cancelling the TRR channel does not cancel the proxy resolution, and TRRServiceChannel::OnStopRequest doesn't end up getting called.
  2. PAC resolution is stuck at this line
  • I'm not yet sure why commenting it makes it be unstuck, but I suspect we need that too if we still want proxies to work with TRR

Fixing any of these two issues makes the problem go away. I only have a patch for 1. Still investigating 2.

Regressed by: 1539819

[Tracking Requested - why for this release]:

People with a PAC script might see their DNS stops working when using TRR.

FWIW, I can also see DNS resolution failed without using TRRServiceChannel.

I think this is some kind of a dead lock problem.

  1. In PAC script, dnsResolve() is called. Note that at this moment, proxy resolution is not done.
  2. We create a TRRServiceChannel (or nsHttpChannel) to do TRR.
  3. The channel needs to do proxy resolution again.
  4. I am not sure what happens here, but the new channel might wait for the proxy resolution to be completed.

I think maybe we should use native DNS resolver for the DNS requests coming from PAC script.

(In reply to Kershaw Chang [:kershaw] from comment #11)

FWIW, I can also see DNS resolution failed without using TRRServiceChannel.

I think this some kind of a dead lock problem.

  1. In PAC script, dnsResolve() is called. Note that at this moment, proxy resolution is not done.
  2. We create a TRRServiceChannel (or nsHttpChannel) to do TRR.
  3. The channel needs to do proxy resolution again.
  4. I am not sure what happens here, but the new channel might wait for the proxy resolution to be completed.

The problem is that the PAC thread is only supposed to deal with one resolution at a time.
When dnsResolve generates another TRRServiceChannel for the resolution, this channel will wait until the first PAC resolve is complete, which it won't be.

I think maybe we should use native DNS resolver for the DNS requests coming from PAC script.

I agree this seems like the best approach.

Can we back out the regressing change for 78?

I also considered setting a flag if a PAC query originates from a
TRR service channel but that doesn't work because of this:

  1. We try to load test.com -> PAC script
  2. We try to load TRR channel -> PAC script with DISABLE_TRR flag
  3. PAC thread is still blocked on resolving test.com - we exit early
    regardless of flag

Depends on D78525

Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/f481d4ec255a
Add test for using PAC script with TRR r=kershaw,necko-reviewers
https://hg.mozilla.org/integration/autoland/rev/a1d86c177d79
Make sure we TRRServiceChannel::mProxyRequest is initialized and used r=kershaw,necko-reviewers
https://hg.mozilla.org/integration/autoland/rev/03c4c517ee37
Make sure DNS requests originating from PAC scripts use TRR_DISABLED_MODE r=kershaw,necko-reviewers

(In reply to Julien Cristau [:jcristau] from comment #14)

Can we back out the regressing change for 78?

I don't think that would make for a good user experience.
If we backout the regressing commit (may be complicated to do) then we don't get the deadlock, but every DNS request will have a 1.5 second delay when using a PAC that calls dnsResolve when using TRR.
The delay is because the dnsResolve will still be blocked, but the TRR timer will fire and cancel the DNS resolution after the timeout expires.

Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/9ffa44ac58b9
Add test for using PAC script with TRR r=kershaw,necko-reviewers
https://hg.mozilla.org/integration/autoland/rev/5a4fdac6f0eb
Make sure we TRRServiceChannel::mProxyRequest is initialized and used r=kershaw,necko-reviewers
https://hg.mozilla.org/integration/autoland/rev/d9fd4f2d1835
Make sure DNS requests originating from PAC scripts use TRR_DISABLED_MODE r=kershaw,necko-reviewers
Tracking Flags:

status-firefox-esr68:unaffected
status-firefox76:unaffected
status-firefox77:unaffected
tracking-firefox78:+
status-firefox78:affected
status-firefox79:fixed

Does above mean PAC will fail once Firefox 78 is released?
What happens to PAC users & Firefox 78?

It doesn't say anything about when 78 is released, it says the fix is not currently in the 78 branch (beta).

It doesn't say anything about when 78 is released, it says the fix is not currently in the 78 branch (beta).

I see... I misunderstood, since the bug was closed, I thought no further action would be taken.

Comment on attachment 9154579 [details]
Bug 1640091 - Make sure DNS requests originating from PAC scripts use TRR_DISABLED_MODE r=kershaw

Beta/Release Uplift Approval Request

  • User impact if declined: Potential deadlock when using a PAC script that calls dnsResolve while TRR is also enabled.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: Load PAC file specified in comment 8.
    Turn TRR on via about:preferences or about:config network.trr.mode = 2
    Browse some websites
    See if the pages load.
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The risk is low, but non-zero. We have tests for this, but our code coverage isn't great.
    The alternatives to this aren't great; we could disable TRR when PAC scripts are used, but that seems even riskier. And we'd still have a bug that proxy lookups wouldn't be cancelled when the channel is.
  • String changes made/needed:
Flags: needinfo?(valentin.gosu)
Attachment #9154579 - Flags: approval-mozilla-beta?
Attachment #9154577 - Flags: approval-mozilla-beta?
Attachment #9154578 - Flags: approval-mozilla-beta?
Flags: qe-verify+
QA Whiteboard: [qa-triaged]

Comment on attachment 9154579 [details]
Bug 1640091 - Make sure DNS requests originating from PAC scripts use TRR_DISABLED_MODE r=kershaw

approved for 78.0b6

Attachment #9154579 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9154577 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9154578 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

fix mistake in conflict resolution. a=bustage
https://hg.mozilla.org/releases/mozilla-beta/rev/d9989f30a754

I think we also need the patches in bug 1641496 to make the test pass. Maybe the best way is to disable this test for socket process on beta.
To do this, just add skip-if = socketprocess_networking below the test.

Flags: needinfo?(valentin.gosu)

Hi,

I'm having trouble reproducing the issue, would you mind verifying the fix Erosman?

Thanks!

Flags: needinfo?(eros_uk)

Testing:

79.0a1 (2020-06-11) (64-bit)
network.trr.mode = 2
dnsResolve() in PAC works as excepted (synchronously).

Note: (maybe not relevant)
There is still a call to https://mozilla.cloudflare-dns.com/ which resolves synchronously to 104.16.249.249

Flags: needinfo?(eros_uk)

Thanks!

When you find the time, could you also check the fix on the latest Firefox Beta 78?

Flags: needinfo?(eros_uk)

Testing:

78.0b6 (64-bit)
network.trr.mode = 0 (default)
dnsResolve() in PAC works as excepted (synchronously).

network.trr.mode = 2
dnsResolve() in PAC works as excepted (synchronously).

Note: (For info only)
It seems only when network.trr.mode = 2 Firefox makes calls to https://mozilla.cloudflare-dns.com/

Flags: needinfo?(eros_uk)

Thank you for verifying !

Status: RESOLVED → VERIFIED
QA Whiteboard: [qa-triaged]
Flags: qe-verify+
See Also: → 1641155
You need to log in before you can comment on or make changes to this bug.