Closed Bug 1620824 Opened 5 years ago Closed 5 years ago

Tabs do not load after restart with network.trr.fetch_off_main_thread=true and under specific condition

Categories

(Core :: Networking: DNS, defect, P1)

75 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla76
Tracking Status
firefox-esr68 --- unaffected
firefox73 --- unaffected
firefox74 --- unaffected
firefox75 --- fixed
firefox76 --- fixed

People

(Reporter: Fanolian+BMO, Assigned: kershaw)

References

(Regression)

Details

(Keywords: nightly-community, regression, reproducible, Whiteboard: [necko-triaged])

Attachments

(8 files)

Attached video site load fails.mp4

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0
Build ID: 20200307213339

Tabs may not load properly only when right after starting the browser.
Here are the requirements which I cannot reduce any further:

  1. network.trr.fetch_off_main_thread is true
  2. Firefox Private Network extension is logged in (I use it outside US so your result may vary)
  3. uMatrix is enabled. Settings (and rules) leave as default.
  4. (Maybe) a slow computer, network, or Firefox profile

Steps to reproduce

  1. Use Nightly 2020-03-07 build in a new profile. Prepare the above stuff.
  2. Open https://www.reddit.com. Pinning the tab is not necessary but easier to test since the tab will automatically load right after starting the browser.
    ** I think any (heavy) site would work as long as it does not load instantly.
  3. Focus the tab, restart browser either by:
    a. Ctrl-Alt-R in Browser Console; or
    b. Enable Restore previous session, close with X and start browser again.
  4. Repeat Step 3 if you fail to reproduce the issue.

Actual result

(Please refer to the attached video.)
The tab fails to load. Reloading it brings everything back to normal.
The issue can be reproduced 100% on my main profile, which I do not want to share, but only occasionally in a new profile, as seen in the attached video.

The maximum number of pinned tabs failed to load I have seen is 3 which is the maximum number of concurrent restores after restarting the browser.

Notes

I cannot reproduce the issue if either requirement 1, 2, or 3 is omitted. Or maybe I did not re-test those conditions enough.

Other sites that fail to load include but not limited to logged-in Gmail, Google Calendar, and Pocket.

Kershaw, please have a look at this bug.

Flags: needinfo?(kershaw)
Priority: -- → P1
Whiteboard: [necko-triaged]

I toggle network.trr.fetch_off_main_thread to true manually and test in previous builds. The regression range is:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=d64dfccd44ed694c4f323ac7c27afbe327f497ca&tochange=4c8620247a5e626d21fa5818d9a636d813d29dd9

This should be regressed by bug 1615335.

I got this too, now with Build ID 20200308095244. But I use network.trr.mode = 3 (required) with a third-party pihole-based DoH resolver.

Upon restarting after the update, everything failed to load. Looking at the Browser Console window there is a GET for the page but no GET for the DNS to the TRR resolver like there usually is - and the page load times out without DNS resolved. Only sites in network.trr.excluded-domains work.

Conversely when I turn it off and reload the a page, I get an appropriate DNS request showing up (four, in fact - two before the page is loaded and two after).

In case it matters, this is a 2011-era netbook running a dual-core AMD Bobcat E-350. Pretty slow, but it normally works (though, sometimes when loading many tabs, I get an alternating pattern of one tab loading and one not for a while).

network.trr.bootstrapAddress 2a00:d880:5:bf0::7c93
network.trr.uri https://dns.aaflalo.me/dns-query
doesn't seem to matter if DNS requests are GET or POST.

I guess somehow the TRR background thread is not created.
Could you try to capture the http log?

Thanks.

Assignee: nobody → kershaw
Flags: needinfo?(kershaw)
Flags: needinfo?(greenreaper)
Flags: needinfo?(Fanolian+BMO)

It seems like something is happening, but when it's broken we get "TRR bad incoming DOH, eject!" and "TRR::On200Response DohDecode 80070057" (ERROR_INVALID_PARAMETER).

Flags: needinfo?(greenreaper)

When it's working sometimes there is a "TRR::On200Response DohDecode 80004005" (unspecified error) but later it works.

Maybe there is something different about how the HTTP part is handled, e.g. gzip processing?

This is a log using command line arguments. I start the browser with the profile used in comment 0, and close the browser when I see the favicon changes to a warning sign indicating the page cannot be loaded.

Flags: needinfo?(Fanolian+BMO) → needinfo?(kershaw)

I think the gzip is the issue. By default network.http.accept-encoding.secure == "gzip, deflate, br", but nsHttpCompresssConv is not invoked in the background thread. If I blank out network.http.accept-encoding.secur, the request succeeds.

The sizes are different in this case and the response does not contain the characteristic GZ header (1F 8B 08 00).

This trace was a force-refresh to cause DNS to be re-retrieved after setting network.http.accept-encoding.secure=''

(In reply to Laurence "GreenReaper" Parry from comment #10)

Created attachment 9131748 [details]
Log when network.trr.fetch_off_main_thread=true and network.http.accept-encoding.secure='' (working without gzip response)

The sizes are different in this case and the response does not contain the characteristic GZ header (1F 8B 08 00).

This trace was a force-refresh to cause DNS to be re-retrieved after setting network.http.accept-encoding.secure=''

Thanks for the log!
Yes, the problem is that TRRServiceChannel doesn't handle gzip encoding. I'll try to fix this.

Flags: needinfo?(kershaw)

Here is an easier way to reproduce that requires Firefox Private Network (FPN) only. This may be the same issue.

Steps to reproduce

  1. In a new profile in Nightly 2020-03-07 or 08 build, install FPN extension, log in and activate it.
  2. Go to about:addons.

Actual result

The recommendations cannot be loaded. Please refer to the attached video.

Workaround

Toggle network.trr.fetch_off_main_thread to false and reload about:addons.

Notes

Turning FPN off (and enable DoH in about:preferences) while leaving network.trr.fetch_off_main_thread = true can also fix the issue. The bug, however, does not present in Nightly Mar 05 build which introduced network.trr.fetch_off_main_thread but was defaulted to false.

(In reply to Fanolian from comment #12)

Created attachment 9131754 [details]
comment #12: conflict with FPN.mp4

Here is an easier way to reproduce that requires Firefox Private Network (FPN) only. This may be the same issue.

Steps to reproduce

  1. In a new profile in Nightly 2020-03-07 or 08 build, install FPN extension, log in and activate it.
  2. Go to about:addons.

Actual result

The recommendations cannot be loaded. Please refer to the attached video.

Workaround

Toggle network.trr.fetch_off_main_thread to false and reload about:addons.

Notes

Turning FPN off (and enable DoH in about:preferences) while leaving network.trr.fetch_off_main_thread = true can also fix the issue. The bug, however, does not present in Nightly Mar 05 build which introduced network.trr.fetch_off_main_thread but was defaulted to false.

Thanks! I can reproduce with this STR.
From the log, I saw that the TRRServiceChannel failed to connect to the proxy, since the proxy returned 407 response.
I think the root cause is that the FPN addon is unable to add Proxy-Authorization header to TRRServiceChannel, since TRRServiceChannel doesn't send http-on-* notifications at all.
There could be two workarounds:

  1. Let TRRServiceChannel ignore proxy settings, so TRRServiceChannel always connects to DoH server directly.
  2. Turn off network.trr.fetch_off_main_thread by FPN addon.

Valentin, Dragana, what do you think?

Flags: needinfo?(valentin.gosu)
Flags: needinfo?(dd.mozilla)

Taking a step back here (and apologies if it's not the right place): does it actually make sense to request compression for these (usually) very small requests?

If I'm reasing this correctly, we will now be switching back to the main thread to do decompression, before returning to the TRR thread, only to come back again - with context-switching eating into the gains of threading. And it's likely similar on the other end to compress it, perhaps increasing latency.

Moreover, the header specifying encodings likely increases the amount sent upstream (though maybe less for HTTP/2), which in common home environments can be an order of magnitude lower in bandwidth thank downstream.

Maybe it'd be more sensible to just remove Accept-Encoding from the request, unless there are some specific queries likely to be of a larger size? Or better, to specify a blank Accept-Encoding, as:

"If no Accept-Encoding field is in the request, any content-coding is considered acceptable by the user agent."

...so it could just decide gzip is fine, but:
"An Accept-Encoding header field with a combined field-value that is empty implies that the user agent does not want any Content-Encoding in response. If an Accept-Encoding header field is present in a request and none of the available representations for the response have a content-coding that is listed as acceptable, the origin server SHOULD send a response without any content-coding."
(In practice I suspect most servers would just send it plain without Accept-Encoding. But maybe that isn't universal.)

Obviously SHOULD is not MUST, and we should handle the situation where the response has a content encoding. But perhaps it should not be the default that we request.

(In reply to Laurence "GreenReaper" Parry from comment #15)

Taking a step back here (and apologies if it's not the right place): does it actually make sense to request compression for these (usually) very small requests?

If I'm reasing this correctly, we will now be switching back to the main thread to do decompression, before returning to the TRR thread, only to come back again - with context-switching eating into the gains of threading. And it's likely similar on the other end to compress it, perhaps increasing latency.

Moreover, the header specifying encodings likely increases the amount sent upstream (though maybe less for HTTP/2), which in common home environments can be an order of magnitude lower in bandwidth thank downstream.

Maybe it'd be more sensible to just remove Accept-Encoding from the request, unless there are some specific queries likely to be of a larger size? Or better, to specify a blank Accept-Encoding, as:

"If no Accept-Encoding field is in the request, any content-coding is considered acceptable by the user agent."

...so it could just decide gzip is fine, but:
"An Accept-Encoding header field with a combined field-value that is empty implies that the user agent does not want any Content-Encoding in response. If an Accept-Encoding header field is present in a request and none of the available representations for the response have a content-coding that is listed as acceptable, the origin server SHOULD send a response without any content-coding."
(In practice I suspect most servers would just send it plain without Accept-Encoding. But maybe that isn't universal.)

Obviously SHOULD is not MUST, and we should handle the situation where the response has a content encoding. But perhaps it should not be the default that we request.

This is a good suggestion. Thanks.
I'll update my patch and remove the Accept-Encoding header.

Thanks! I can reproduce with this STR.
From the log, I saw that the TRRServiceChannel failed to connect to the proxy, since the proxy returned 407 response.
I think the root cause is that the FPN addon is unable to add Proxy-Authorization header to TRRServiceChannel, since TRRServiceChannel doesn't send http-on-* notifications at all.
There could be two workarounds:

  1. Let TRRServiceChannel ignore proxy settings, so TRRServiceChannel always connects to DoH server directly.
  2. Turn off network.trr.fetch_off_main_thread by FPN addon.

Valentin, Dragana, what do you think?

Turn off network.trr.fetch_off_main_thread. TRR is not really used a lot with FPN. There are just special cases, like WebRTC, that are using it.

Flags: needinfo?(dd.mozilla)

(In reply to Dragana Damjanovic [:dragana] from comment #17)

Thanks! I can reproduce with this STR.
From the log, I saw that the TRRServiceChannel failed to connect to the proxy, since the proxy returned 407 response.
I think the root cause is that the FPN addon is unable to add Proxy-Authorization header to TRRServiceChannel, since TRRServiceChannel doesn't send http-on-* notifications at all.
There could be two workarounds:

  1. Let TRRServiceChannel ignore proxy settings, so TRRServiceChannel always connects to DoH server directly.
  2. Turn off network.trr.fetch_off_main_thread by FPN addon.

Valentin, Dragana, what do you think?

Turn off network.trr.fetch_off_main_thread. TRR is not really used a lot with FPN. There are just special cases, like WebRTC, that are using it.

I filed an issue for this.

Flags: needinfo?(valentin.gosu)
Pushed by kjang@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/306587cfab12 Add content encoding support for TRRServiceChannel r=valentin https://hg.mozilla.org/integration/autoland/rev/3db2193eebc1 Test if TRRServiceChannel can handle gzip content encoding,r=valentin

Leave this open until the fix in FPN extension is verified.

Keywords: leave-open

Comment on attachment 9131755 [details]
Bug 1620824 - Add content encoding support for TRRServiceChannel

Beta/Release Uplift Approval Request

  • User impact if declined: DoH is not working if the server returns a response content with gzip encoding.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Most of the patch did is sending an empty Accept-Encoding header, so it's not risky.
    The code is already verified by the trr unit test.
  • String changes made/needed:
Attachment #9131755 - Flags: approval-mozilla-beta?
Attachment #9132204 - Flags: approval-mozilla-beta?
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

(In reply to Kershaw Chang [:kershaw] from comment #21)

Leave this open until the fix in FPN extension is verified.

Where/when is that happening?

Flags: needinfo?(kershaw)

(In reply to Julien Cristau [:jcristau] from comment #24)

(In reply to Kershaw Chang [:kershaw] from comment #21)

Leave this open until the fix in FPN extension is verified.

Where/when is that happening?

I am not sure when the FPN extension will be updated.
baku, do you know if we have a nightly version of FPN that can be used to verify if turning off network.trr.fetch_off_main_thread is working?
Thanks.

Flags: needinfo?(kershaw) → needinfo?(amarchesini)

I can provide an unsigned version. I'll share an unsigned version on slack.

Flags: needinfo?(amarchesini)

Hi Fanolian,

Would you like test the unsigned FPN extension?
I can send it to you via email.

Thanks.

Flags: needinfo?(Fanolian+BMO)

(In reply to Kershaw Chang [:kershaw] from comment #27)

Hi Fanolian,

Would you like test the unsigned FPN extension?
I can send it to you via email.

Thanks.

Sure. Thanks. You can send it to my bugzilla account email.

Flags: needinfo?(Fanolian+BMO) → needinfo?(kershaw)

Fanolian, in order to test the unsigned version, there are a few things to do:

  1. set pref extensions.experiments.enabled to true
  2. install the XPI from about:debugging, as a temporary addon.

If you have any question, you can write me directly on slack/matrix/email. Thanks.

Flags: needinfo?(Fanolian+BMO)

The issue in about:addons (see comment 12) is fixed with FPN v21. Thanks.
I cannot test the issue in comment 0. Extension installed in about:debugging is removed after a restart.

However I cannot get to FPN's login page if I try to install it as a permanent extension, with extensions.experiments.enabled=true and xpinstall.signatures.required=false.
FPN says Connection timeout, check your internet connection. when I click the extension icon on toolbar. In Browser Console (with Show Content Messages checked in the gear icon) there is an error regarding the problem:

Error: Attempt to postMessage on disconnected port   view.js:194:24

Flags: needinfo?(Fanolian+BMO)

(In reply to Fanolian from comment #30)

The issue in about:addons (see comment 12) is fixed with FPN v21. Thanks.
I cannot test the issue in comment 0. Extension installed in about:debugging is removed after a restart.

However I cannot get to FPN's login page if I try to install it as a permanent extension, with extensions.experiments.enabled=true and xpinstall.signatures.required=false.
FPN says Connection timeout, check your internet connection. when I click the extension icon on toolbar. In Browser Console (with Show Content Messages checked in the gear icon) there is an error regarding the problem:

Error: Attempt to postMessage on disconnected port   view.js:194:24

Thanks for testing.
I think we can close this bug, since it's verified that network.trr.fetch_off_main_thread is set to false by FPN extension and network.trr.fetch_off_main_thread is the required condition for the issue in comment 0.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(kershaw)
Resolution: --- → FIXED
Target Milestone: --- → mozilla76

Comment on attachment 9131755 [details]
Bug 1620824 - Add content encoding support for TRRServiceChannel

approved for 75.0b5

Attachment #9131755 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9132204 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: