Closed Bug 1649127 Opened 1 month ago Closed 1 month ago

Make sure we only set AddrHostRecord::mTRRUsed = true when we actually send the request

Categories

(Core :: Networking: DNS, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla80
Tracking Status
firefox79 --- fixed
firefox80 --- fixed

People

(Reporter: valentin, Assigned: valentin, NeedInfo)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged][trr])

Attachments

(2 files)

Previously we'd set mTRRUsed to true here even if no TRR lookups actually happened.

This has the potential to screw with our telemetry data.
We should instead make sure this is only set when we actually write the bytes to the socket.

Blocks: doh

Looking at the usage of mTRRUsed it may be that is a bit overloaded.

mTRRUsed is used for telemetry and in my opinion it is fine as it is now.
mTRRUsed is also used for BlockList hostnames: in this case we may need a more fine grain distinguish: blocklist if server returns an error, but do not block if connection to the TRR server fails.

One more note: our blockList is not really very broken, because it checks that the corresponding native lookup has succeeded. We only wrongly block something if TRR server is blocked (and maybe in some corner cases that should be rare, e.g. network comes up in between TRR and native request)

(In reply to Dragana Damjanovic [:dragana] from comment #3)

One more note: our blockList is not really very broken, because it checks that the corresponding native lookup has succeeded. We only wrongly block something if TRR server is blocked (and maybe in some corner cases that should be rare, e.g. network comes up in between TRR and native request)

I mentioned this in https://phabricator.services.mozilla.com/D81517#2499399 but I should probably explain it further.
I think it's broken, especially when IPv6 support is missing, because we will call into TrrLookup for an AAAA type, set mTRRUsed = true, exit because there's no IPv6 support, do a native lookup which succeeds, then blocklist the domain because it thinks that the TRR request failed, even though we didn't perform one.

This would explain a large part of the missing TRR records.

(In reply to Valentin Gosu [:valentin] (he/him) from comment #4)

(In reply to Dragana Damjanovic [:dragana] from comment #3)

One more note: our blockList is not really very broken, because it checks that the corresponding native lookup has succeeded. We only wrongly block something if TRR server is blocked (and maybe in some corner cases that should be rare, e.g. network comes up in between TRR and native request)

I mentioned this in https://phabricator.services.mozilla.com/D81517#2499399 but I should probably explain it further.
I think it's broken, especially when IPv6 support is missing, because we will call into TrrLookup for an AAAA type, set mTRRUsed = true, exit because there's no IPv6 support, do a native lookup which succeeds, then blocklist the domain because it thinks that the TRR request failed, even though we didn't perform one.

This would explain a large part of the missing TRR records.

OK, that is probably true, but the patch is not doing the right think.

What we want to know:
Temporary exclude hostnames that:

  • TRR request was send but TRR resolver could not resolve it (this mean that we have send a request and request and receive a response from the TRR server), (Question: should request that have timed out also be excluded? we should maybe try this out on nightly only)
  • the corresponding native request succeeded

We need more telemetry:

  • TRR was OK
  • TRR responded but response was corrupted,
  • TRR server responded with an error - host not found
  • TRR request timed out
  • connection to TRR server could not be established (we may collect a couple of errors here, e.g. connection reset, connection timeout, TRR server's hostname could not be resolved)
  • a host is temporary excluded
  • a host is permanently excluded (the list in the pref)
  • ipv6 not perform because ipv6 is disabled.
    Am I missing something?
    we will need to separate the above probe for ipv4 and IPv6.

We already have some of this telemetry: DNS_TRR_SUCCESS, DNS_TRR_BLACKLISTED2, etc. it may be useful to combine them or at least audit them.

When analyzing telemetry we should exclude users that have TRR disabled because of the OS parental control and VPNs. We should also collect telemetry about that as well.

We should have telemetry about the highest number of host in the temporary exclusion list at any point in time during a session.

For the temporal exclusion list we may check that both ipv6 and ipv4 has failed.

(In reply to Dragana Damjanovic [:dragana] from comment #5)

(In reply to Valentin Gosu [:valentin] (he/him) from comment #4)

This would explain a large part of the missing TRR records.

OK, that is probably true, but the patch is not doing the right think.

Could you explain why it is wrong?
Looking at where it is used: https://searchfox.org/mozilla-central/search?q=mTRRUsed&path=&case=false&regexp=false
It seems all the sites want to check if we actually got a response back from TRR or not. Do you mean the check should be even more strict, as to exclude errors that may occur after we sent the request?

Flags: needinfo?(dd.mozilla)

Sorry for late response.

First I have not seen your comment in phabricator when I wrote my comment above (discussion at too places...)

So I am ok in changing the behavior of TRRUsed if you change the telemetry as well. The comments in the patch and bugzilla did not say anything about that. I was saying it is wrong because it breaks the telemetry completely.

I would suggest to change the name of the variable, actually to add a new one so that telemetry is not corrupted. Also mTRRUsed is not anymore a good name for this variable.

You should double check if the timing information are set in the channel before OnStartRequest. Timing information are a bit weird and only used for diagnostics(not an critical functionality) so may be missing or wrong. If you are not sure that they are set, please propagate this information to the channel in a different way. If you want to test this please test it with http1.1 and http/2. Thinking of it ... maybe it is better to get this information in another way, because I am not sure but we may send the NS_NET_STATUS_SENDING_TO event when we actually do not send data because a connection the TRR server failed.

Flags: needinfo?(dd.mozilla)
Attachment #9160048 - Attachment description: Bug 1649127 - Make sure we only set AddrHostRecord::mTRRUsed = true when we actually send the request r=dragana → Bug 1649127 - Make sure we only set AddrHostRecord::mTRRUsed = true when TRRServiceChannel::AsyncOpen succeeds r=dragana

(In reply to Dragana Damjanovic [:dragana] from comment #9)

Sorry for late response.

First I have not seen your comment in phabricator when I wrote my comment above (discussion at too places...)

So I am ok in changing the behavior of TRRUsed if you change the telemetry as well. The comments in the patch and bugzilla did not say anything about that. I was saying it is wrong because it breaks the telemetry completely.

I would suggest to change the name of the variable, actually to add a new one so that telemetry is not corrupted. Also mTRRUsed is not anymore a good name for this variable.

You should double check if the timing information are set in the channel before OnStartRequest.

OnStartRequest is emited when we have responseHeaders, so surely the request has already been sent?

Timing information are a bit weird and only used for diagnostics(not an critical functionality) so may be missing or wrong.

We enable timing here

If you are not sure that they are set, please propagate this information to the channel in a different way. If you want to test this please test it with http1.1 and http/2. Thinking of it ... maybe it is better to get this information in another way, because I am not sure but we may send the NS_NET_STATUS_SENDING_TO event when we actually do not send data because a connection the TRR server failed.

I don't know whether we need to check http1.1 - we should never actually use that, right?
I was considering checking the status code of the channel in OnStartRequest, but that would not work if we did send the request, but the channel times out waiting for a response. 🙁

I'll also write some tests that the telemetry looks as we expect it to.

Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/16ff0a677ab2
Make sure we only set AddrHostRecord::mTRRUsed = true when TRRServiceChannel::AsyncOpen succeeds r=dragana,necko-reviewers

Due to a change in timing in this patch, when we reset the confirmation pref
at the end of the test, a TRR request would happen after we changed the prefs
leading to accessing a non-local IP in testing and causing a crash.
This should be gated on being in the correct mode anyway

Depends on D81517

Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/17e5b3f78f03
Make sure we only set AddrHostRecord::mTRRUsed = true when TRRServiceChannel::AsyncOpen succeeds r=dragana,necko-reviewers
https://hg.mozilla.org/integration/autoland/rev/d967c48fa147
Only trigger confirmation in TRR-first and TRR-only modes r=dragana,necko-reviewers

Let's try this again :)

Flags: needinfo?(valentin.gosu)
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/6054a345cd86
Make sure we only set AddrHostRecord::mTRRUsed = true when TRRServiceChannel::AsyncOpen succeeds r=dragana,necko-reviewers
https://hg.mozilla.org/integration/autoland/rev/ad363ca253e7
Only trigger confirmation in TRR-first and TRR-only modes r=dragana,necko-reviewers
Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla80

Comment on attachment 9160048 [details]
Bug 1649127 - Make sure we only set AddrHostRecord::mTRRUsed = true when TRRServiceChannel::AsyncOpen succeeds r=dragana

Beta/Release Uplift Approval Request

  • User impact if declined: TRR is incorrectly disabled for some host names.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This flag only controls whether to temporary disable TRR for a host name. It does not change functionaly in the way that could break DNS resolution. It will just decrease the number of hostnames for which TRR is temporary disabled.
  • String changes made/needed: none
Attachment #9160048 - Flags: approval-mozilla-beta?
Attachment #9161385 - Flags: approval-mozilla-beta?

Comment on attachment 9160048 [details]
Bug 1649127 - Make sure we only set AddrHostRecord::mTRRUsed = true when TRRServiceChannel::AsyncOpen succeeds r=dragana

Approved for 79.0b7.

Attachment #9160048 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9161385 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: qe-verify+
QA Whiteboard: [qa-triaged]

Hello,

Can you please provide some steps for QA to verify this issue?

Thank you!

Flags: needinfo?(valentin.gosu)
Flags: needinfo?(dd.mozilla)

I'm not sure manual verification is really possible with this.

Flags: needinfo?(valentin.gosu)
You need to log in before you can comment on or make changes to this bug.