Closed Bug 1696138 (CVE-2021-29986) Opened 4 years ago Closed 4 years ago

DNS Resolver thread crashes Firefox

Categories

(Core :: Networking: DNS, defect, P2)

Firefox 86
defect

Tracking

()

RESOLVED FIXED
91 Branch
Tracking Status
firefox-esr78 91+ fixed
firefox90 --- wontfix
firefox91 + fixed

People

(Reporter: pahhur, Assigned: valentin)

Details

(Keywords: csectype-uaf, reporter-external, sec-high, Whiteboard: [necko-triaged][reporter-external][sec-survey][adv-main91+][adv-esr78.13+])

Crash Data

Attachments

(3 files)

Attached file trouble.txt

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0

Steps to reproduce:

Linux Mint 20 64 bit, last apt upgrade Mar 1

Goto tiktok.com and scroll videos
(did that with versions 82, 84, and 86)

Actual results:

After a time, Firefox crashes totally (SIGSEGV), see bug ids in trouble.txt
Last bug report bp-9e9034a9-f2bf-42e5-98a3-a1b300210303

DNS over HTTPS is switched off, but when it is switched on (Cloudfare), this problem occurs more often

The reports show a problem in the DNS Resolver thread, mostly in the call to PR_GetAddrInfoByName

Tests were also done with new profiles

Troubleshooting information attached (trouble.txt)

Mint 20 was installed in Sep 20, and was regularly updated/upgraded
(No user changes to resolver functionality)

Expected results:

I expect Firefox not to crash, and handle the malfunctioning call correctly ;)

The Bugbug bot thinks this bug should belong to the 'Core::Networking: DNS' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → Networking: DNS
Product: Firefox → Core

Valentin, do you have an idea here?
Thanks.

Flags: needinfo?(valentin.gosu)

Add on: Slimjet does not crash on the same system, so it's
more likely to me that FF itself causes the SIGSEGV.

Thank you for the report!

(In reply to pahhur from comment #0)

(did that with versions 82, 84, and 86)

Where did you get the builds?
Could you try with an official nightly build from here? https://nightly.mozilla.org/

(In reply to pahhur from comment #3)

Add on: Slimjet does not crash on the same system, so it's
more likely to me that FF itself causes the SIGSEGV.

It may be that it uses a different system API for DNS resolution.
It seems from the crash reports that failures occur in the glibc code sysdeps/posix/getaddrinfo.c:1039

Assignee: nobody → valentin.gosu
Severity: -- → S3
Crash Signature: [@ gaih_inet.constprop.0]
Flags: needinfo?(valentin.gosu) → needinfo?(pahhur)
Priority: -- → P2
Whiteboard: [necko-triaged]

The builds were the official versions from the mozilla download sites.

I will try with the above mentioned nighly build.

The mentioned source code at lines 1039 contains a
canon = __strdup (canon);

which of course can cause a SIGSEV whenever canon is not well-formed
before the call. It remains the question if that is due to some illegal
parameter values to the function when it is called by FF.

Flags: needinfo?(pahhur)

(In reply to pahhur from comment #5)

The builds were the official versions from the mozilla download sites.

Thanks for letting me know. I just wanted to exclude the possibility of distro-packaged-builds being the cause of the issue.

The mentioned source code at lines 1039 contains a
canon = __strdup (canon);

which of course can cause a SIGSEV whenever canon is not well-formed
before the call. It remains the question if that is due to some illegal
parameter values to the function when it is called by FF.

Right. That is my question too.
Could you see if setting the network.dns.disableIPv6 to true improves things?

Flags: needinfo?(pahhur)

Could you see if setting the network.dns.disableIPv6 to true improves things?

Until now, no crash with that setting.
I keep you updated

Flags: needinfo?(pahhur)

Two more days without crashes.
Is anyone going to investigate which part of the now deactivated code might cause the crashes?

This crash is on Ubuntu 20.10 - which I assume has libc 2.32
Crashes at getaddrinfo.c:1027 - https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/posix/getaddrinfo.c;h=ed04e564f9711298fd7bb14fd53f13c8053694d2;hb=499a92df8b9fc64a054cf3b7f728f8967fc1da7d#l1027

This crash has libc 2.31 (as listed in the path) and crashes at 1039 - https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/posix/getaddrinfo.c;h=538691a59867e2868ce3a5a9da69f5e26472c4b5;hb=8b222fa38700422b4da6731806835f0bbf40920d#l1039

It seems the problem would be for family = at2->family; But it's all wrapped in a while (at2 != NULL).
I have no idea how this could crash.

Yep, it seems I looked at an outdated source version.
gnu_get_libc_version() returns 2.31 on my system.
Thus, it should be the same sources than listed in the second example.

The family = at2->family; can not only fail, if at2 initially contains malformed data.
It can also fail as a result of memory corruption that occured before that line.

Confirming and restricting access to this bug. This is a use-after-free crash, it seems like we're accessing a dead object because r13 contains the poison pattern in almost every crash under these signatures.

Group: core-security
Status: UNCONFIRMED → NEW
Crash Signature: [@ gaih_inet.constprop.0] → [@ gaih_inet.constprop.0] [@ gaih_inet]
Ever confirmed: true
Keywords: csectype-uaf
Group: core-security → network-core-security
Keywords: sec-high
Whiteboard: [necko-triaged] → [necko-triaged][reporter-external]

Quick update on this: currently we don't really have much of a guess as to the mechanism of the crash, but I just started looking into whether we can identify any correlations here between the crash dumps (writing a script since there's apparently not enough crashes for the correlations tab to show anything). I'll post my findings when they're available. (cheers to gsvelto for giving me some pointers!)

FWIW from some preliminary tinkering:

  1. libc version distribution across the 120 crashes: { '2.31' => 77, '2.32' => 28, '2.33' => 15 }
  2. libc 2.31 was released on Feb 1 2020: https://sourceware.org/legacy-ml/libc-announce/2020/msg00001.html
  3. Oldest Fx version that crashed: 78.5.0esr (Nov 2020)

(In reply to Nihanth Subramanya [:nhnt11] from comment #13)

FWIW from some preliminary tinkering:

  1. libc version distribution across the 120 crashes: { '2.31' => 77, '2.32' => 28, '2.33' => 15 }
  2. libc 2.31 was released on Feb 1 2020: https://sourceware.org/legacy-ml/libc-announce/2020/msg00001.html
  3. Oldest Fx version that crashed: 78.5.0esr (Nov 2020)

Also for posterity the first reported crash was in Sept 2020, and then nothing until mid-Dec 2020.

I analyzed a bunch of crashes and came to the conclusion that this is a Linux-specific duplicate of bug 1459680 - which also proved elusive. Here's what both bugs have in common: there are several DNS resolver threads in each crash and at least two of them are resolving an address. It seems like there must be a race in our DNS resolution code that is causing the UAF. This also explains why we don't always see an UAF: the register that contains the poison pattern ends up containing garbage which might look like valid data or a pointer, but it's not, it's just junk read from some re-used memory.

I had a look at nsHostResolver.cpp and it feels like this could be an issue in the synchronization in there. This is well over my head so I'll leave a more detailed analysis to Nihanth.

Daniel, this bug is hidden but bug 1459680 is not. I'm not leaving any comments in there - nor I want to link it to this bug - because this looks dangerous. Given that the issue has to do with DNS resolution, it's a race and it involves an UAF, I feel like it's something that an attacker might be able to use (for example by controlling which domains are being queried). How should we proceed?

Flags: needinfo?(dveditz)

CC'ing Kris Wright and Christian Holler who work on TSan because this might have tripped some errors there too.

It would be good if we had some proof this is a real UAF, but right now I'm doubtful.

The only variable that could potentially be freed when the getaddrinfo call is done is the hostname.
If that were the case, I think the crash would be different.

Secondly, in comment 6 the reporter mentions that disabling IPv6 fixes the crash. That may very well be a coincidence, but it could very well mean there's a bug in getaddrinfo when resolving IPv6 domains.

(In reply to Valentin Gosu [:valentin] (he/him) from comment #18)

It would be good if we had some proof this is a real UAF, but right now I'm doubtful.

There are ample crash reports where the failing address is the poison pattern, in others it's garbage so we' re definitely accessing an object/buffer that was freed or overwritten.

The only variable that could potentially be freed when the getaddrinfo call is done is the hostname.
If that were the case, I think the crash would be different.

Why do you think so? The host string here uses ref-counted storage, so we're not making a copy of the contents in that assignment. If whoever owns the original string alters the storage in another thread then all bets are off because nsTSubstring isn't thread-safe.

Secondly, in comment 6 the reporter mentions that disabling IPv6 fixes the crash. That may very well be a coincidence, but it could very well mean there's a bug in getaddrinfo when resolving IPv6 domains.

That wouldn't explain why we see the same issue on Windows though.

(In reply to Gabriele Svelto [:gsvelto] from comment #16)

Daniel, this bug is hidden but bug 1459680 is not. I'm not leaving any comments in there - nor I want to link it to this bug - because this looks dangerous.

I think I agree with toshi that the current scary symptoms in bug 1459680 are due to something hooking our process and screwing it up. The original symptoms ("they mostly appear to be wild-pointer issues") may have been more like this bug, but unfortunately those crashes are gone.

Given that the issue has to do with DNS resolution, it's a race and it involves an UAF, I feel like it's something that an attacker might be able to use (for example by controlling which domains are being queried). How should we proceed?

I only see 9 crashes with this bug's signature in the past month. 4 of those show the UAF poison value, always in r13 if that's a clue. Scary, but low volume. That makes it hard for both us AND attackers to figure out, at least.

Flags: needinfo?(dveditz)

(In reply to Gabriele Svelto [:gsvelto] from comment #19)

Why do you think so? The host string here uses ref-counted storage, so we're not making a copy of the contents in that assignment. If whoever owns the original string alters the storage in another thread then all bets are off because nsTSubstring isn't thread-safe.

If the string gets freed that's either something messing up the string's refcount (double-free of a hostRecord), or writing to its storage (which is more difficult to figure out).
This patch will make a copy of the hostname before calling into getaddrinfo.
If the crashes move to this location, it validates the double-free theory.
If the crashes keep happening in getaddrinfo that means there's a bug in there.

Comment on attachment 9227401 [details]
Bug 1696138 - Make a copy of the host before calling getaddrinfo r=#necko

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: Unclear given that we're unsure what's the cause of the crashes.
    This is a speculative patch that would move the crashes to the location of the copy, if we are using a released string, or if the crashes keep happening in libc it means there's a bug in the implementation of getaddrinfo
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: Unknown
  • Which older supported branches are affected by this flaw?: all
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: Yes
  • If not, how different, hard to create, and risky will they be?: Trivial patch. Should apply cleanly or otherwise easy to backport.
  • How likely is this patch to cause regressions; how much testing does it need?: Low chance of a perf regression since we're now making copies of the string (which is limited to a max of 253 chars).
    The code guarded by a pref so we can revert at any time.
Attachment #9227401 - Flags: sec-approval?

Comment on attachment 9227401 [details]
Bug 1696138 - Make a copy of the host before calling getaddrinfo r=#necko

Approved to land

Attachment #9227401 - Flags: sec-approval? → sec-approval+
Keywords: leave-open

Make a copy of the host before calling getaddrinfo r=necko-reviewers,kershaw
https://hg.mozilla.org/integration/autoland/rev/b6a926e5deeccbe3c59af07a42e65869e940b5e5

Comment on attachment 9227401 [details]
Bug 1696138 - Make a copy of the host before calling getaddrinfo r=#necko

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: This is sec-high :)
  • User impact if declined:
  • Fix Landed on Version: 91
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): We're copying a string before passing into a system API, which is not very risky, and this is also behind a pref which can be flipped if needed. The change has been baking in 91 and seems effective.
  • String or UUID changes made by this patch:
Attachment #9227401 - Flags: approval-mozilla-esr78?
Status: NEW → RESOLVED
Closed: 4 years ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → 91 Branch

Comment on attachment 9227401 [details]
Bug 1696138 - Make a copy of the host before calling getaddrinfo r=#necko

Approved for 78.13esr.

Attachment #9227401 - Flags: approval-mozilla-esr78? → approval-mozilla-esr78+
Group: network-core-security → core-security-release

As part of a security bug pattern analysis, we are requesting your help with a high level analysis of this bug. It is our hope to develop static analysis (or potentially runtime/dynamic analysis) in the future to identify classes of bugs.

Please visit this google form to reply.

Flags: needinfo?(valentin.gosu)
Whiteboard: [necko-triaged][reporter-external] → [necko-triaged][reporter-external][sec-survey]
Flags: needinfo?(valentin.gosu)
QA Whiteboard: [post-critsmash-triage]
Flags: qe-verify-
Whiteboard: [necko-triaged][reporter-external][sec-survey] → [necko-triaged][reporter-external][sec-survey][adv-main91+]
Attached file advisory.txt
Alias: CVE-2021-29986
Whiteboard: [necko-triaged][reporter-external][sec-survey][adv-main91+] → [necko-triaged][reporter-external][sec-survey][adv-main91+][adv-esr78.13+]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: