1696138 - (CVE-2021-29986) DNS Resolver thread crashes Firefox

pahhur

Reporter

Description

•

4 years ago

Attached file trouble.txt — Details

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0

Steps to reproduce:

Linux Mint 20 64 bit, last apt upgrade Mar 1

Goto tiktok.com and scroll videos
(did that with versions 82, 84, and 86)

Actual results:

After a time, Firefox crashes totally (SIGSEGV), see bug ids in trouble.txt
Last bug report bp-9e9034a9-f2bf-42e5-98a3-a1b300210303

DNS over HTTPS is switched off, but when it is switched on (Cloudfare), this problem occurs more often

The reports show a problem in the DNS Resolver thread, mostly in the call to PR_GetAddrInfoByName

Tests were also done with new profiles

Troubleshooting information attached (trouble.txt)

Mint 20 was installed in Sep 20, and was regularly updated/upgraded
(No user changes to resolver functionality)

Expected results:

I expect Firefox not to crash, and handle the malfunctioning call correctly ;)

BugBot [:suhaib / :marco/ :calixte]

Comment 1

•

4 years ago

The Bugbug bot thinks this bug should belong to the 'Core::Networking: DNS' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → Networking: DNS

Product: Firefox → Core

Kershaw Chang [:kershaw]

Comment 2

•

4 years ago

Valentin, do you have an idea here?
Thanks.

Flags: needinfo?(valentin.gosu)

pahhur

Reporter

Comment 3

•

4 years ago

Add on: Slimjet does not crash on the same system, so it's
more likely to me that FF itself causes the SIGSEGV.

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 4

•

4 years ago

Thank you for the report!

(In reply to pahhur from comment #0)

(did that with versions 82, 84, and 86)

Where did you get the builds?
Could you try with an official nightly build from here? https://nightly.mozilla.org/

(In reply to pahhur from comment #3)

Add on: Slimjet does not crash on the same system, so it's
more likely to me that FF itself causes the SIGSEGV.

It may be that it uses a different system API for DNS resolution.
It seems from the crash reports that failures occur in the glibc code sysdeps/posix/getaddrinfo.c:1039

Assignee: nobody → valentin.gosu

Severity: -- → S3

Crash Signature: [@ gaih_inet.constprop.0]

Flags: needinfo?(valentin.gosu) → needinfo?(pahhur)

Priority: -- → P2

Whiteboard: [necko-triaged]

pahhur

Reporter

Comment 5

•

4 years ago

The builds were the official versions from the mozilla download sites.

I will try with the above mentioned nighly build.

The mentioned source code at lines 1039 contains a
canon = __strdup (canon);

which of course can cause a SIGSEV whenever canon is not well-formed
before the call. It remains the question if that is due to some illegal
parameter values to the function when it is called by FF.

Flags: needinfo?(pahhur)

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 6

•

4 years ago

(In reply to pahhur from comment #5)

The builds were the official versions from the mozilla download sites.

Thanks for letting me know. I just wanted to exclude the possibility of distro-packaged-builds being the cause of the issue.

The mentioned source code at lines 1039 contains a
canon = __strdup (canon);

which of course can cause a SIGSEV whenever canon is not well-formed
before the call. It remains the question if that is due to some illegal
parameter values to the function when it is called by FF.

Right. That is my question too.
Could you see if setting the network.dns.disableIPv6 to true improves things?

Flags: needinfo?(pahhur)

pahhur

Reporter

Comment 7

•

4 years ago

Could you see if setting the network.dns.disableIPv6 to true improves things?

Until now, no crash with that setting.
I keep you updated

Flags: needinfo?(pahhur)

pahhur

Reporter

Comment 8

•

4 years ago

Two more days without crashes.
Is anyone going to investigate which part of the now deactivated code might cause the crashes?

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 9

•

4 years ago

This crash is on Ubuntu 20.10 - which I assume has libc 2.32
Crashes at getaddrinfo.c:1027 - https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/posix/getaddrinfo.c;h=ed04e564f9711298fd7bb14fd53f13c8053694d2;hb=499a92df8b9fc64a054cf3b7f728f8967fc1da7d#l1027

This crash has libc 2.31 (as listed in the path) and crashes at 1039 - https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/posix/getaddrinfo.c;h=538691a59867e2868ce3a5a9da69f5e26472c4b5;hb=8b222fa38700422b4da6731806835f0bbf40920d#l1039

It seems the problem would be for family = at2->family; But it's all wrapped in a while (at2 != NULL).
I have no idea how this could crash.

pahhur

Reporter

Comment 10

•

4 years ago

Yep, it seems I looked at an outdated source version.
gnu_get_libc_version() returns 2.31 on my system.
Thus, it should be the same sources than listed in the second example.

The family = at2->family; can not only fail, if at2 initially contains malformed data.
It can also fail as a result of memory corruption that occured before that line.

Gabriele Svelto [:gsvelto]

Comment 11

•

4 years ago

Confirming and restricting access to this bug. This is a use-after-free crash, it seems like we're accessing a dead object because r13 contains the poison pattern in almost every crash under these signatures.

Group: core-security

Status: UNCONFIRMED → NEW

Crash Signature: [@ gaih_inet.constprop.0] → [@ gaih_inet.constprop.0] [@ gaih_inet]

Ever confirmed: true

Keywords: csectype-uaf

Andrew McCreight [:mccr8]

Updated

•

4 years ago

Group: core-security → network-core-security

Daniel Veditz [:dveditz]

Updated

•

4 years ago

Keywords: sec-high

Daniel Veditz [:dveditz]

Updated

•

4 years ago

Whiteboard: [necko-triaged] → [necko-triaged][reporter-external]

(inactive) Nihanth Subramanya [:nhnt11]

Comment 12

•

4 years ago

Quick update on this: currently we don't really have much of a guess as to the mechanism of the crash, but I just started looking into whether we can identify any correlations here between the crash dumps (writing a script since there's apparently not enough crashes for the correlations tab to show anything). I'll post my findings when they're available. (cheers to gsvelto for giving me some pointers!)

(inactive) Nihanth Subramanya [:nhnt11]

Comment 13

•

4 years ago

FWIW from some preliminary tinkering:

libc version distribution across the 120 crashes: { '2.31' => 77, '2.32' => 28, '2.33' => 15 }
libc 2.31 was released on Feb 1 2020: https://sourceware.org/legacy-ml/libc-announce/2020/msg00001.html
Oldest Fx version that crashed: 78.5.0esr (Nov 2020)

(inactive) Nihanth Subramanya [:nhnt11]

Comment 14

•

4 years ago

(In reply to Nihanth Subramanya [:nhnt11] from comment #13)

FWIW from some preliminary tinkering:

libc version distribution across the 120 crashes: { '2.31' => 77, '2.32' => 28, '2.33' => 15 }

libc 2.31 was released on Feb 1 2020: https://sourceware.org/legacy-ml/libc-announce/2020/msg00001.html

Oldest Fx version that crashed: 78.5.0esr (Nov 2020)

Also for posterity the first reported crash was in Sept 2020, and then nothing until mid-Dec 2020.

Gabriele Svelto [:gsvelto]

Comment 15

•

4 years ago

I analyzed a bunch of crashes and came to the conclusion that this is a Linux-specific duplicate of bug 1459680 - which also proved elusive. Here's what both bugs have in common: there are several DNS resolver threads in each crash and at least two of them are resolving an address. It seems like there must be a race in our DNS resolution code that is causing the UAF. This also explains why we don't always see an UAF: the register that contains the poison pattern ends up containing garbage which might look like valid data or a pointer, but it's not, it's just junk read from some re-used memory.

Gabriele Svelto [:gsvelto]

Comment 16

•

4 years ago

I had a look at nsHostResolver.cpp and it feels like this could be an issue in the synchronization in there. This is well over my head so I'll leave a more detailed analysis to Nihanth.

Daniel, this bug is hidden but bug 1459680 is not. I'm not leaving any comments in there - nor I want to link it to this bug - because this looks dangerous. Given that the issue has to do with DNS resolution, it's a race and it involves an UAF, I feel like it's something that an attacker might be able to use (for example by controlling which domains are being queried). How should we proceed?

Flags: needinfo?(dveditz)

Gabriele Svelto [:gsvelto]

Comment 17

•

4 years ago

CC'ing Kris Wright and Christian Holler who work on TSan because this might have tripped some errors there too.

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 18

•

4 years ago

It would be good if we had some proof this is a real UAF, but right now I'm doubtful.

The only variable that could potentially be freed when the getaddrinfo call is done is the hostname.
If that were the case, I think the crash would be different.

Secondly, in comment 6 the reporter mentions that disabling IPv6 fixes the crash. That may very well be a coincidence, but it could very well mean there's a bug in getaddrinfo when resolving IPv6 domains.

Gabriele Svelto [:gsvelto]

Comment 19

•

4 years ago

(In reply to Valentin Gosu [:valentin] (he/him) from comment #18)

It would be good if we had some proof this is a real UAF, but right now I'm doubtful.

There are ample crash reports where the failing address is the poison pattern, in others it's garbage so we' re definitely accessing an object/buffer that was freed or overwritten.

The only variable that could potentially be freed when the getaddrinfo call is done is the hostname.
If that were the case, I think the crash would be different.

Why do you think so? The host string here uses ref-counted storage, so we're not making a copy of the contents in that assignment. If whoever owns the original string alters the storage in another thread then all bets are off because nsTSubstring isn't thread-safe.

Secondly, in comment 6 the reporter mentions that disabling IPv6 fixes the crash. That may very well be a coincidence, but it could very well mean there's a bug in getaddrinfo when resolving IPv6 domains.

That wouldn't explain why we see the same issue on Windows though.

Daniel Veditz [:dveditz]

Comment 20

•

4 years ago

(In reply to Gabriele Svelto [:gsvelto] from comment #16)

Daniel, this bug is hidden but bug 1459680 is not. I'm not leaving any comments in there - nor I want to link it to this bug - because this looks dangerous.

I think I agree with toshi that the current scary symptoms in bug 1459680 are due to something hooking our process and screwing it up. The original symptoms ("they mostly appear to be wild-pointer issues") may have been more like this bug, but unfortunately those crashes are gone.

Given that the issue has to do with DNS resolution, it's a race and it involves an UAF, I feel like it's something that an attacker might be able to use (for example by controlling which domains are being queried). How should we proceed?

I only see 9 crashes with this bug's signature in the past month. 4 of those show the UAF poison value, always in r13 if that's a clue. Scary, but low volume. That makes it hard for both us AND attackers to figure out, at least.

Flags: needinfo?(dveditz)

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 21

•

4 years ago

Attached file Bug 1696138 - Make a copy of the host before calling getaddrinfo r=#necko — Details

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 22

•

4 years ago

(In reply to Gabriele Svelto [:gsvelto] from comment #19)

Why do you think so? The host string here uses ref-counted storage, so we're not making a copy of the contents in that assignment. If whoever owns the original string alters the storage in another thread then all bets are off because nsTSubstring isn't thread-safe.

If the string gets freed that's either something messing up the string's refcount (double-free of a hostRecord), or writing to its storage (which is more difficult to figure out).
This patch will make a copy of the hostname before calling into getaddrinfo.
If the crashes move to this location, it validates the double-free theory.
If the crashes keep happening in getaddrinfo that means there's a bug in there.

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 23

•

4 years ago

Comment on attachment 9227401 [details]
Bug 1696138 - Make a copy of the host before calling getaddrinfo r=#necko

Security Approval Request

How easily could an exploit be constructed based on the patch?: Unclear given that we're unsure what's the cause of the crashes.
This is a speculative patch that would move the crashes to the location of the copy, if we are using a released string, or if the crashes keep happening in libc it means there's a bug in the implementation of getaddrinfo
Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: Unknown
Which older supported branches are affected by this flaw?: all
If not all supported branches, which bug introduced the flaw?: None
Do you have backports for the affected branches?: Yes
If not, how different, hard to create, and risky will they be?: Trivial patch. Should apply cleanly or otherwise easy to backport.
How likely is this patch to cause regressions; how much testing does it need?: Low chance of a perf regression since we're now making copies of the string (which is limited to a max of 253 chars).
The code guarded by a pref so we can revert at any time.

Attachment #9227401 - Flags: sec-approval?

Tom Ritter [:tjr]

Comment 24

•

4 years ago

Comment on attachment 9227401 [details]
Bug 1696138 - Make a copy of the host before calling getaddrinfo r=#necko

Approved to land

Attachment #9227401 - Flags: sec-approval? → sec-approval+

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

4 years ago

Keywords: leave-open

Julien Cristau [:jcristau]

Comment 25

•

4 years ago

Make a copy of the host before calling getaddrinfo r=necko-reviewers,kershaw
https://hg.mozilla.org/integration/autoland/rev/b6a926e5deeccbe3c59af07a42e65869e940b5e5

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 26

•

4 years ago

https://hg.mozilla.org/mozilla-central/rev/b6a926e5deec

(inactive) Nihanth Subramanya [:nhnt11]

Comment 27

•

4 years ago

Comment on attachment 9227401 [details]
Bug 1696138 - Make a copy of the host before calling getaddrinfo r=#necko

ESR Uplift Approval Request

If this is not a sec:{high,crit} bug, please state case for ESR consideration: This is sec-high :)
User impact if declined:
Fix Landed on Version: 91
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): We're copying a string before passing into a system API, which is not very risky, and this is also behind a pref which can be flipped if needed. The change has been baking in 91 and seems effective.
String or UUID changes made by this patch:

Attachment #9227401 - Flags: approval-mozilla-esr78?

Pascal Chevrel:pascalc

Updated

•

4 years ago

status-firefox90: --- → affected

status-firefox91: --- → fixed

status-firefox92: --- → fixed

status-firefox-esr78: --- → affected

status-firefox-esr91: --- → fixed

Ryan VanderMeulen [:RyanVM]

Updated

•

4 years ago

Status: NEW → RESOLVED

Closed: 4 years ago

status-firefox90: affected → wontfix

status-firefox92: fixed → ---

status-firefox-esr91: fixed → ---

tracking-firefox91: --- → +

tracking-firefox-esr78: --- → 91+

Keywords: leave-open

Resolution: --- → FIXED

Target Milestone: --- → 91 Branch

Ryan VanderMeulen [:RyanVM]

Comment 28

•

4 years ago

Comment on attachment 9227401 [details]
Bug 1696138 - Make a copy of the host before calling getaddrinfo r=#necko

Approved for 78.13esr.

Attachment #9227401 - Flags: approval-mozilla-esr78? → approval-mozilla-esr78+

Ryan VanderMeulen [:RyanVM]

Comment 29

•

4 years ago

uplift

https://hg.mozilla.org/releases/mozilla-esr78/rev/65d51d00fb9e

status-firefox-esr78: affected → fixed

Ryan VanderMeulen [:RyanVM]

Updated

•

4 years ago

Group: network-core-security → core-security-release

BugBot [:suhaib / :marco/ :calixte]

Comment 30

•

4 years ago

As part of a security bug pattern analysis, we are requesting your help with a high level analysis of this bug. It is our hope to develop static analysis (or potentially runtime/dynamic analysis) in the future to identify classes of bugs.

Please visit this google form to reply.

Flags: needinfo?(valentin.gosu)

Whiteboard: [necko-triaged][reporter-external] → [necko-triaged][reporter-external][sec-survey]

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

4 years ago

Flags: needinfo?(valentin.gosu)

Cornel Ionce [:noni]

Updated

•

4 years ago

QA Whiteboard: [post-critsmash-triage]

Flags: qe-verify-

Tom Ritter [:tjr]

Updated

•

4 years ago

Whiteboard: [necko-triaged][reporter-external][sec-survey] → [necko-triaged][reporter-external][sec-survey][adv-main91+]

Tom Ritter [:tjr]

Comment 31

•

4 years ago

Attached file advisory.txt — Details

Tom Ritter [:tjr]

Updated

•

4 years ago

Alias: CVE-2021-29986

Tom Ritter [:tjr]

Updated

•

4 years ago

Whiteboard: [necko-triaged][reporter-external][sec-survey][adv-main91+] → [necko-triaged][reporter-external][sec-survey][adv-main91+][adv-esr78.13+]

Daniel Veditz [:dveditz]

Updated

•

3 years ago

Group: core-security-release

David Lawrence [:dkl]

Updated

•

1 year ago

Keywords: reporter-external

trouble.txt 4 years ago pahhur 22.73 KB, text/plain		Details
Bug 1696138 - Make a copy of the host before calling getaddrinfo r=#necko 4 years ago Valentin Gosu [:valentin] (he/him) 48 bytes, text/x-phabricator-request	RyanVM : approval-mozilla-esr78+ tjr : sec-approval+	Details \| Review
advisory.txt 4 years ago Tom Ritter [:tjr] 316 bytes, text/plain		Details