DNS Resolver thread crashes Firefox
Categories
(Core :: Networking: DNS, defect, P2)
Tracking
()
People
(Reporter: pahhur, Assigned: valentin)
Details
(Keywords: csectype-uaf, reporter-external, sec-high, Whiteboard: [necko-triaged][reporter-external][sec-survey][adv-main91+][adv-esr78.13+])
Crash Data
Attachments
(3 files)
22.73 KB,
text/plain
|
Details | |
48 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-esr78+
tjr
:
sec-approval+
|
Details | Review |
316 bytes,
text/plain
|
Details |
User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0
Steps to reproduce:
Linux Mint 20 64 bit, last apt upgrade Mar 1
Goto tiktok.com and scroll videos
(did that with versions 82, 84, and 86)
Actual results:
After a time, Firefox crashes totally (SIGSEGV), see bug ids in trouble.txt
Last bug report bp-9e9034a9-f2bf-42e5-98a3-a1b300210303
DNS over HTTPS is switched off, but when it is switched on (Cloudfare), this problem occurs more often
The reports show a problem in the DNS Resolver thread, mostly in the call to PR_GetAddrInfoByName
Tests were also done with new profiles
Troubleshooting information attached (trouble.txt)
Mint 20 was installed in Sep 20, and was regularly updated/upgraded
(No user changes to resolver functionality)
Expected results:
I expect Firefox not to crash, and handle the malfunctioning call correctly ;)
Comment 1•4 years ago
|
||
The Bugbug bot thinks this bug should belong to the 'Core::Networking: DNS' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.
Add on: Slimjet does not crash on the same system, so it's
more likely to me that FF itself causes the SIGSEGV.
Assignee | ||
Comment 4•4 years ago
|
||
Thank you for the report!
(In reply to pahhur from comment #0)
(did that with versions 82, 84, and 86)
Where did you get the builds?
Could you try with an official nightly build from here? https://nightly.mozilla.org/
(In reply to pahhur from comment #3)
Add on: Slimjet does not crash on the same system, so it's
more likely to me that FF itself causes the SIGSEGV.
It may be that it uses a different system API for DNS resolution.
It seems from the crash reports that failures occur in the glibc code sysdeps/posix/getaddrinfo.c:1039
The builds were the official versions from the mozilla download sites.
I will try with the above mentioned nighly build.
The mentioned source code at lines 1039 contains a
canon = __strdup (canon);
which of course can cause a SIGSEV whenever canon is not well-formed
before the call. It remains the question if that is due to some illegal
parameter values to the function when it is called by FF.
Assignee | ||
Comment 6•4 years ago
|
||
(In reply to pahhur from comment #5)
The builds were the official versions from the mozilla download sites.
Thanks for letting me know. I just wanted to exclude the possibility of distro-packaged-builds being the cause of the issue.
The mentioned source code at lines 1039 contains a
canon = __strdup (canon);which of course can cause a SIGSEV whenever canon is not well-formed
before the call. It remains the question if that is due to some illegal
parameter values to the function when it is called by FF.
Right. That is my question too.
Could you see if setting the network.dns.disableIPv6
to true
improves things?
Could you see if setting the
network.dns.disableIPv6
totrue
improves things?
Until now, no crash with that setting.
I keep you updated
Two more days without crashes.
Is anyone going to investigate which part of the now deactivated code might cause the crashes?
Assignee | ||
Comment 9•4 years ago
|
||
This crash is on Ubuntu 20.10 - which I assume has libc 2.32
Crashes at getaddrinfo.c:1027 - https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/posix/getaddrinfo.c;h=ed04e564f9711298fd7bb14fd53f13c8053694d2;hb=499a92df8b9fc64a054cf3b7f728f8967fc1da7d#l1027
This crash has libc 2.31 (as listed in the path) and crashes at 1039 - https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/posix/getaddrinfo.c;h=538691a59867e2868ce3a5a9da69f5e26472c4b5;hb=8b222fa38700422b4da6731806835f0bbf40920d#l1039
It seems the problem would be for family = at2->family;
But it's all wrapped in a while (at2 != NULL)
.
I have no idea how this could crash.
Reporter | ||
Comment 10•4 years ago
|
||
Yep, it seems I looked at an outdated source version.
gnu_get_libc_version() returns 2.31 on my system.
Thus, it should be the same sources than listed in the second example.
The family = at2->family;
can not only fail, if at2 initially contains malformed data.
It can also fail as a result of memory corruption that occured before that line.
Comment 11•4 years ago
|
||
Confirming and restricting access to this bug. This is a use-after-free crash, it seems like we're accessing a dead object because r13 contains the poison pattern in almost every crash under these signatures.
Updated•4 years ago
|
Updated•4 years ago
|
Comment 12•4 years ago
|
||
Quick update on this: currently we don't really have much of a guess as to the mechanism of the crash, but I just started looking into whether we can identify any correlations here between the crash dumps (writing a script since there's apparently not enough crashes for the correlations tab to show anything). I'll post my findings when they're available. (cheers to gsvelto for giving me some pointers!)
Comment 13•4 years ago
|
||
FWIW from some preliminary tinkering:
- libc version distribution across the 120 crashes:
{ '2.31' => 77, '2.32' => 28, '2.33' => 15 }
- libc 2.31 was released on Feb 1 2020: https://sourceware.org/legacy-ml/libc-announce/2020/msg00001.html
- Oldest Fx version that crashed: 78.5.0esr (Nov 2020)
Comment 14•4 years ago
|
||
(In reply to Nihanth Subramanya [:nhnt11] from comment #13)
FWIW from some preliminary tinkering:
- libc version distribution across the 120 crashes:
{ '2.31' => 77, '2.32' => 28, '2.33' => 15 }
- libc 2.31 was released on Feb 1 2020: https://sourceware.org/legacy-ml/libc-announce/2020/msg00001.html
- Oldest Fx version that crashed: 78.5.0esr (Nov 2020)
Also for posterity the first reported crash was in Sept 2020, and then nothing until mid-Dec 2020.
Comment 15•4 years ago
|
||
I analyzed a bunch of crashes and came to the conclusion that this is a Linux-specific duplicate of bug 1459680 - which also proved elusive. Here's what both bugs have in common: there are several DNS resolver threads in each crash and at least two of them are resolving an address. It seems like there must be a race in our DNS resolution code that is causing the UAF. This also explains why we don't always see an UAF: the register that contains the poison pattern ends up containing garbage which might look like valid data or a pointer, but it's not, it's just junk read from some re-used memory.
Comment 16•4 years ago
|
||
I had a look at nsHostResolver.cpp and it feels like this could be an issue in the synchronization in there. This is well over my head so I'll leave a more detailed analysis to Nihanth.
Daniel, this bug is hidden but bug 1459680 is not. I'm not leaving any comments in there - nor I want to link it to this bug - because this looks dangerous. Given that the issue has to do with DNS resolution, it's a race and it involves an UAF, I feel like it's something that an attacker might be able to use (for example by controlling which domains are being queried). How should we proceed?
Comment 17•4 years ago
|
||
CC'ing Kris Wright and Christian Holler who work on TSan because this might have tripped some errors there too.
Assignee | ||
Comment 18•4 years ago
|
||
It would be good if we had some proof this is a real UAF, but right now I'm doubtful.
The only variable that could potentially be freed when the getaddrinfo call is done is the hostname.
If that were the case, I think the crash would be different.
Secondly, in comment 6 the reporter mentions that disabling IPv6 fixes the crash. That may very well be a coincidence, but it could very well mean there's a bug in getaddrinfo when resolving IPv6 domains.
Comment 19•4 years ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #18)
It would be good if we had some proof this is a real UAF, but right now I'm doubtful.
There are ample crash reports where the failing address is the poison pattern, in others it's garbage so we' re definitely accessing an object/buffer that was freed or overwritten.
The only variable that could potentially be freed when the getaddrinfo call is done is the hostname.
If that were the case, I think the crash would be different.
Why do you think so? The host string here uses ref-counted storage, so we're not making a copy of the contents in that assignment. If whoever owns the original string alters the storage in another thread then all bets are off because nsTSubstring isn't thread-safe.
Secondly, in comment 6 the reporter mentions that disabling IPv6 fixes the crash. That may very well be a coincidence, but it could very well mean there's a bug in getaddrinfo when resolving IPv6 domains.
That wouldn't explain why we see the same issue on Windows though.
Comment 20•4 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #16)
Daniel, this bug is hidden but bug 1459680 is not. I'm not leaving any comments in there - nor I want to link it to this bug - because this looks dangerous.
I think I agree with toshi that the current scary symptoms in bug 1459680 are due to something hooking our process and screwing it up. The original symptoms ("they mostly appear to be wild-pointer issues") may have been more like this bug, but unfortunately those crashes are gone.
Given that the issue has to do with DNS resolution, it's a race and it involves an UAF, I feel like it's something that an attacker might be able to use (for example by controlling which domains are being queried). How should we proceed?
I only see 9 crashes with this bug's signature in the past month. 4 of those show the UAF poison value, always in r13
if that's a clue. Scary, but low volume. That makes it hard for both us AND attackers to figure out, at least.
Assignee | ||
Comment 21•4 years ago
|
||
Assignee | ||
Comment 22•4 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #19)
Why do you think so? The host string here uses ref-counted storage, so we're not making a copy of the contents in that assignment. If whoever owns the original string alters the storage in another thread then all bets are off because nsTSubstring isn't thread-safe.
If the string gets freed that's either something messing up the string's refcount (double-free of a hostRecord), or writing to its storage (which is more difficult to figure out).
This patch will make a copy of the hostname before calling into getaddrinfo.
If the crashes move to this location, it validates the double-free theory.
If the crashes keep happening in getaddrinfo that means there's a bug in there.
Assignee | ||
Comment 23•4 years ago
|
||
Comment on attachment 9227401 [details]
Bug 1696138 - Make a copy of the host before calling getaddrinfo r=#necko
Security Approval Request
- How easily could an exploit be constructed based on the patch?: Unclear given that we're unsure what's the cause of the crashes.
This is a speculative patch that would move the crashes to the location of the copy, if we are using a released string, or if the crashes keep happening in libc it means there's a bug in the implementation of getaddrinfo - Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: Unknown
- Which older supported branches are affected by this flaw?: all
- If not all supported branches, which bug introduced the flaw?: None
- Do you have backports for the affected branches?: Yes
- If not, how different, hard to create, and risky will they be?: Trivial patch. Should apply cleanly or otherwise easy to backport.
- How likely is this patch to cause regressions; how much testing does it need?: Low chance of a perf regression since we're now making copies of the string (which is limited to a max of 253 chars).
The code guarded by a pref so we can revert at any time.
Comment 24•4 years ago
|
||
Comment on attachment 9227401 [details]
Bug 1696138 - Make a copy of the host before calling getaddrinfo r=#necko
Approved to land
Assignee | ||
Updated•4 years ago
|
Comment 25•4 years ago
|
||
Make a copy of the host before calling getaddrinfo r=necko-reviewers,kershaw
https://hg.mozilla.org/integration/autoland/rev/b6a926e5deeccbe3c59af07a42e65869e940b5e5
![]() |
||
Comment 26•4 years ago
|
||
Comment 27•4 years ago
|
||
Comment on attachment 9227401 [details]
Bug 1696138 - Make a copy of the host before calling getaddrinfo r=#necko
ESR Uplift Approval Request
- If this is not a sec:{high,crit} bug, please state case for ESR consideration: This is sec-high :)
- User impact if declined:
- Fix Landed on Version: 91
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): We're copying a string before passing into a system API, which is not very risky, and this is also behind a pref which can be flipped if needed. The change has been baking in 91 and seems effective.
- String or UUID changes made by this patch:
Updated•4 years ago
|
Updated•4 years ago
|
Comment 28•4 years ago
|
||
Comment on attachment 9227401 [details]
Bug 1696138 - Make a copy of the host before calling getaddrinfo r=#necko
Approved for 78.13esr.
Comment 29•4 years ago
|
||
uplift |
Updated•4 years ago
|
Comment 30•4 years ago
|
||
As part of a security bug pattern analysis, we are requesting your help with a high level analysis of this bug. It is our hope to develop static analysis (or potentially runtime/dynamic analysis) in the future to identify classes of bugs.
Please visit this google form to reply.
Assignee | ||
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Comment 31•4 years ago
|
||
Updated•4 years ago
|
Updated•4 years ago
|
Updated•3 years ago
|
Updated•9 months ago
|
Description
•