dnsResolve() failure in Proxy Auto-Configuration (PAC)
Categories
(Core :: Networking: DNS, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr68 | --- | unaffected |
firefox76 | --- | unaffected |
firefox77 | --- | unaffected |
firefox78 | + | verified |
firefox79 | --- | verified |
People
(Reporter: eros_uk, Assigned: valentin)
References
(Regression)
Details
(Keywords: regression, Whiteboard: [necko-triaged][trr])
Attachments
(3 files)
47 bytes,
text/x-phabricator-request
|
jcristau
:
approval-mozilla-beta+
|
Details | Review |
47 bytes,
text/x-phabricator-request
|
jcristau
:
approval-mozilla-beta+
|
Details | Review |
47 bytes,
text/x-phabricator-request
|
jcristau
:
approval-mozilla-beta+
|
Details | Review |
User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0
Steps to reproduce:
Within the last few Nightly updates, it appears that the behaviour of synchronous dnsResolve()
has changed.
dnsResolve()
seems to get the value from DNS cache and thus fails on un-cached DNS.
If the DNS has been previously cached by Firefox, it synchronously returns the value but if not, it actually stops the PAC from functioning.
It might be that the behaviour has become asynchronous which breaks the synchronous PAC.
Updated•5 years ago
|
Comment 1•5 years ago
|
||
Hello reporter,
This looks bad.
Could you please find the regression window given it's broken within days?
https://mozilla.github.io/mozregression/
cc Valentin to see if he has a quick idea what's going on recently.
Wasn't there some work done recently on TRR & DNS-over-HTTPS, setting mozilla.cloudflare-dns.com
as target?!!
(I think Shane might have been working on it)
Could it be related to above?!
Assignee | ||
Comment 3•5 years ago
•
|
||
I think this is the same issue as in bug 1592248 comment 6, but we'd have to confirm. The fact that it only started happening recently for you on nightly means maybe there's a slightly different scenario here.
First of all check what the value of network.trr.mode
is for you.
Then please run mozregression to find the change that introduced this problem.
Eventually we may need you to also gather some HTTP logs to see exactly what's happening.
First of all check what the value of
network.trr.mode
is for you.
It is 2. I have not edited it.
Doesnt mozregression install various version of FF?
I use FF Nightly as my development platform and do not want to mess it up.
I debugged from within the PAC to come to the above mentioned conclusion.
Assignee | ||
Comment 5•5 years ago
|
||
(In reply to erosman from comment #4)
First of all check what the value of
network.trr.mode
is for you.
It is 2. I have not edited it.Doesnt mozregression install various version of FF?
I use FF Nightly as my development platform and do not want to mess it up.
Mozregression can be set up to use a different profile, so you don't have to worry about that.
You can also make a backup of your profile before just to be sure.
The --profile-persistence clone-first
option of mozregression will make it clone the profile so you don't get any issues.
I debugged from within the PAC to come to the above mentioned conclusion.
So, please check if the bug still reproduces if network.trr.mode
is set to 0. I expect it will not.
Then we'll also need a regression range using mozregression to figure out what caused this issue.
I am unable to run Mozregression
Fatal error detected
Failed to execute script pyi_rth_multiprocessing
However ....
So, please check if the bug still reproduces if network.trr.mode is set to 0. I expect it will not.
Correct, When network.trr.mode
is 0
, dnsResolve(host)
functions normally.
When network.trr.mode
is 2
, dnsResolve(host)
works ONLY if the DNS is already cached (i.e. it doesn't do the NS Lookup) and fails otherwise.
The dnsResolve(host)
failure breaks the PAC and results in tab waiting to load forever.
I imagine, the regression would be when Nightly set network.trr.mode
to 2
.
Assignee | ||
Comment 7•5 years ago
|
||
(In reply to erosman from comment #6)
I am unable to run Mozregression
Fatal error detected
Failed to execute script pyi_rth_multiprocessing
Could you also post a stack trace?
Correct, When
network.trr.mode
is0
,dnsResolve(host)
functions normally.
Whennetwork.trr.mode
is2
,dnsResolve(host)
works ONLY if the DNS is already cached (i.e. it doesn't do the NS Lookup) and fails otherwise.
ThednsResolve(host)
failure breaks the PAC and results in tab waiting to load forever.I imagine, the regression would be when Nightly set
network.trr.mode
to2
.
If you are in the US, the pref was probably turned on for you on Nightly a couple of releases ago.
If it only started failing recently, most likely it was a recent change that caused it.
If you're not able to run mozregression, we'll need you to gather some HTTP logs for us to figure out what's wrong:
https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging
Assignee | ||
Updated•5 years ago
|
Could you also post a stack trace?
What is that?
PS. Tested v4.0.5 mozregression-gui.exe
If you're not able to run mozregression, we'll need you to gather some HTTP logs for us to figure out what's wrong:
Well, I am not familiar with the process.
- I found a bug and reported it.
- I provide specific function that causes a PAC to fail (i.e.
dnsResolve(host)
) - I tested the bug with the
network.trr.mode
to further point towards the possible location of the problem. - I provide a pointer to timeline.
PAC is synchronous. If dnsResolve(host)
is not resolved synchronously, it fails to conform to PAC standard (as well as MDN documentation).
It would be better to let people who are more knowledgeable than me to take it from here.
STR
Here is a sample PAC to test. You can save it as a file and enter file location in Network Settings.
It will not have any proxy. It passes everything without a proxy (i.e. DIRECT).
All it does is for testing dnsResolve(host)
.
The alert()
will be logged in Browser Console as PAC-alert: .........................
function FindProxyForURL(url, host) {
alert(host);
alert(dnsResolve(host));
return 'DIRECT';
}
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 9•5 years ago
|
||
There are two problems here.
- is that cancelling the TRR channel does not cancel the proxy resolution, and TRRServiceChannel::OnStopRequest doesn't end up getting called.
- PAC resolution is stuck at this line
- I'm not yet sure why commenting it makes it be unstuck, but I suspect we need that too if we still want proxies to work with TRR
Fixing any of these two issues makes the problem go away. I only have a patch for 1. Still investigating 2.
Updated•5 years ago
|
Assignee | ||
Comment 10•5 years ago
|
||
[Tracking Requested - why for this release]:
People with a PAC script might see their DNS stops working when using TRR.
Comment 11•5 years ago
•
|
||
FWIW, I can also see DNS resolution failed without using TRRServiceChannel
.
I think this is some kind of a dead lock problem.
- In PAC script,
dnsResolve()
is called. Note that at this moment, proxy resolution is not done. - We create a TRRServiceChannel (or nsHttpChannel) to do TRR.
- The channel needs to do proxy resolution again.
- I am not sure what happens here, but the new channel might wait for the proxy resolution to be completed.
I think maybe we should use native DNS resolver for the DNS requests coming from PAC script.
Assignee | ||
Comment 12•5 years ago
|
||
(In reply to Kershaw Chang [:kershaw] from comment #11)
FWIW, I can also see DNS resolution failed without using
TRRServiceChannel
.I think this some kind of a dead lock problem.
- In PAC script,
dnsResolve()
is called. Note that at this moment, proxy resolution is not done.- We create a TRRServiceChannel (or nsHttpChannel) to do TRR.
- The channel needs to do proxy resolution again.
- I am not sure what happens here, but the new channel might wait for the proxy resolution to be completed.
The problem is that the PAC thread is only supposed to deal with one resolution at a time.
When dnsResolve generates another TRRServiceChannel for the resolution, this channel will wait until the first PAC resolve is complete, which it won't be.
I think maybe we should use native DNS resolver for the DNS requests coming from PAC script.
I agree this seems like the best approach.
Assignee | ||
Comment 13•5 years ago
|
||
Comment 14•5 years ago
|
||
Can we back out the regressing change for 78?
Updated•5 years ago
|
Assignee | ||
Comment 15•5 years ago
|
||
Depends on D78524
Assignee | ||
Comment 16•5 years ago
|
||
I also considered setting a flag if a PAC query originates from a
TRR service channel but that doesn't work because of this:
- We try to load test.com -> PAC script
- We try to load TRR channel -> PAC script with DISABLE_TRR flag
- PAC thread is still blocked on resolving test.com - we exit early
regardless of flag
Depends on D78525
Comment 17•5 years ago
|
||
Assignee | ||
Comment 18•5 years ago
|
||
(In reply to Julien Cristau [:jcristau] from comment #14)
Can we back out the regressing change for 78?
I don't think that would make for a good user experience.
If we backout the regressing commit (may be complicated to do) then we don't get the deadlock, but every DNS request will have a 1.5 second delay when using a PAC that calls dnsResolve
when using TRR.
The delay is because the dnsResolve will still be blocked, but the TRR timer will fire and cancel the DNS resolution after the timeout expires.
Comment 19•5 years ago
|
||
Backed out 3 changesets for causing xpcshell failures in nsCOMPtr.h
Backout link: https://hg.mozilla.org/integration/autoland/rev/6e31fbe78c081b2841d92aabdf0e48db38856ea1
Failure logs:
Comment 20•5 years ago
|
||
Comment 21•5 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/9ffa44ac58b9
https://hg.mozilla.org/mozilla-central/rev/5a4fdac6f0eb
https://hg.mozilla.org/mozilla-central/rev/d9fd4f2d1835
Reporter | ||
Comment 22•5 years ago
|
||
Tracking Flags:
status-firefox-esr68:unaffected
status-firefox76:unaffected
status-firefox77:unaffected
tracking-firefox78:+
status-firefox78:affected
status-firefox79:fixed
Does above mean PAC will fail once Firefox 78 is released?
What happens to PAC users & Firefox 78?
Comment 23•5 years ago
|
||
It doesn't say anything about when 78 is released, it says the fix is not currently in the 78 branch (beta).
Reporter | ||
Comment 24•5 years ago
|
||
It doesn't say anything about when 78 is released, it says the fix is not currently in the 78 branch (beta).
I see... I misunderstood, since the bug was closed, I thought no further action would be taken.
Assignee | ||
Comment 25•5 years ago
|
||
Comment on attachment 9154579 [details]
Bug 1640091 - Make sure DNS requests originating from PAC scripts use TRR_DISABLED_MODE r=kershaw
Beta/Release Uplift Approval Request
- User impact if declined: Potential deadlock when using a PAC script that calls
dnsResolve
while TRR is also enabled. - Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: Yes
- If yes, steps to reproduce: Load PAC file specified in comment 8.
Turn TRR on via about:preferences or about:confignetwork.trr.mode
= 2
Browse some websites
See if the pages load. - List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The risk is low, but non-zero. We have tests for this, but our code coverage isn't great.
The alternatives to this aren't great; we could disable TRR when PAC scripts are used, but that seems even riskier. And we'd still have a bug that proxy lookups wouldn't be cancelled when the channel is. - String changes made/needed:
Assignee | ||
Updated•5 years ago
|
Updated•5 years ago
|
Comment 26•5 years ago
|
||
Comment on attachment 9154579 [details]
Bug 1640091 - Make sure DNS requests originating from PAC scripts use TRR_DISABLED_MODE r=kershaw
approved for 78.0b6
Updated•5 years ago
|
Updated•5 years ago
|
Comment 27•5 years ago
|
||
bugherder uplift |
Comment 28•5 years ago
|
||
bugherder uplift |
fix mistake in conflict resolution. a=bustage
https://hg.mozilla.org/releases/mozilla-beta/rev/d9989f30a754
Comment 29•5 years ago
|
||
Valentin, the test seems to fail on beta, can you take a look?
https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&selectedTaskRun=erLU6ap3QlKpHOejrAvyFw.1&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunning%2Cpending%2Crunnable&revision=d9989f30a75498808e09ee0edf9f36f54c87a219
https://firefoxci.taskcluster-artifacts.net/erLU6ap3QlKpHOejrAvyFw/1/public/logs/live_backing.log
Comment 30•5 years ago
|
||
I think we also need the patches in bug 1641496 to make the test pass. Maybe the best way is to disable this test for socket process on beta.
To do this, just add skip-if = socketprocess_networking
below the test.
Updated•5 years ago
|
Comment 31•5 years ago
|
||
bugherder uplift |
Comment 32•5 years ago
|
||
Hi,
I'm having trouble reproducing the issue, would you mind verifying the fix Erosman?
Thanks!
Reporter | ||
Comment 33•5 years ago
|
||
Testing:
79.0a1 (2020-06-11) (64-bit)
network.trr.mode
= 2
dnsResolve()
in PAC works as excepted (synchronously).
Note: (maybe not relevant)
There is still a call to https://mozilla.cloudflare-dns.com/
which resolves synchronously to 104.16.249.249
Comment 34•5 years ago
|
||
Thanks!
When you find the time, could you also check the fix on the latest Firefox Beta 78?
Reporter | ||
Comment 35•5 years ago
|
||
Testing:
78.0b6 (64-bit)
network.trr.mode
= 0
(default)
dnsResolve()
in PAC works as excepted (synchronously).
network.trr.mode
= 2
dnsResolve()
in PAC works as excepted (synchronously).
Note: (For info only)
It seems only when network.trr.mode
= 2
Firefox makes calls to https://mozilla.cloudflare-dns.com/
Comment 36•5 years ago
|
||
Thank you for verifying !
Description
•