Hard to explain macOS-specific use-after-free crashes
Categories
(Core :: Memory Allocator, defect)
Tracking
()
People
(Reporter: gsvelto, Assigned: gsvelto)
References
Details
(Keywords: csectype-race, sec-moderate, Whiteboard: [sec-survey][adv-main83-])
Attachments
(2 files, 1 obsolete file)
47 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
RyanVM
:
approval-mozilla-esr78+
|
Details | Review |
9.34 KB,
text/plain
|
Details |
Filing this bug to allow discussing the issue in a non-security-sensitive bug. I've linked all the related bugs as "See Also". In a nutshell what we're dealing with here are:
- Crashes that manifest themselves only on macOS but that affect code that is not platform-specific
- They crash with apparently impossible conditions. Bug 1654335 where we did most of the investigation points to a race in some mutex-protected code. However the locking is sound and if it weren't we should see crashes on other platforms too, but we don't.
- They all show some form of UAF, either in the access that caused the crash itself, in an index that was used and which lead to the crash or on the stack itself
- In the stack trace the function where the UAF access happened is usually the last one found via CFI stack-walking, all frames below that frame are found via stack-scanning which suggests the frame is somehow corrupt
I'm filing this under the Memory Allocator component because all the crashes seem to happen close to an allocation point (often resizing an nsTArray
instance) so the allocator might be involved somehow - but it also might not, or it might not be the actual cause of the crash. Locking, memory barriers and code generation also come to mind given those are the only platform-specific parts involved in the crashes.
Assignee | ||
Comment 1•4 years ago
|
||
FYI I'm currently trying to reproduce this by browsing one of the sites that have been reported more often in crash reports with a debugger attached, in the hope of inspecting the live browser state (or at least finding some kind of STR):
https://www.sfgate.com/
https://torontosun.com
https://www.mysanantonio.com/
Assignee | ||
Comment 2•4 years ago
|
||
Many hours of testing but no dice. Tomorrow I'll try writing a stress test that uses plenty of JS timers and see if it gets us somewhere.
Comment 3•4 years ago
|
||
Could it be a mutex pthread bug?
a while back, we encountered what looked like an impossible bug, and it turned out to be a bug in Window's CriticalSection code, we had to modify our Monitor implementation because of it.
In the media code, we have seen completely impossible crashes happening in code with sound locking mechanism, where the only explanation was that tasks that supposedly ran serially would have run concurrently, I suggested at the time was a race in a code involving mutexes that dispatches our tasks in the TaskQueue code
https://searchfox.org/mozilla-central/source/xpcom/threads/TaskQueue.cpp#134-178
never found the explanation.
I may be completely off track here, however i thought it was worth mentioning.
Comment 4•4 years ago
|
||
Shot in the dark: What about a use-after-destroy in the monitor itself? Or a use after failure to initialize?
Assignee | ||
Comment 5•4 years ago
|
||
(In reply to Jean-Yves Avenard [:jya] from comment #3)
Could it be a mutex pthread bug?
Yes, that's a possibility. One of my guesses is that there might be a leaky memory barrier somewhere so a thread sees a version of the object that's already been freed. The only explanation for that is that a write in one thread "escaped" the memory barrier that should be associated with the mutex and that's only possible if the mutex is buggy.
(In reply to Steven Michaud [:smichaud] (Retired) from comment #4)
Shot in the dark: What about a use-after-destroy in the monitor itself? Or a use after failure to initialize?
We check for initialization errors and MOZ_CRASH()
if we fail so that can't happen, same for the calls to pthread_mutex_trylock()
, if one fails in an unexpected way Firefox will crash but it will be a controlled crash triggered via MOZ_CRASH()
.
That being said your question prompted me to look at our platform-specific implementation and there's something I had forgotten: the Mutex implementation on macOS is different than on other POSIX-compatible platforms. We basically implement an adaptive mutex on top of a regular one because the platform lacks PTHREAD_MUTEX_ADAPTIVE_NP
-style mutexes (see bug 1457882). Our implementation appears sound and shouldn't leak anything, but it does use pthread_mutex_trylock()
instead of pthread_mutex_lock()
which gets us in a different path in Apple's code.
I've poked around Apple's implementation (the sources are available here) but nothing stands out.
It's a shame we don't have crash reports reaching back to bug 1457882 because that might have provided a hint about the cause if the crashes started after that change.
Assignee | ||
Comment 6•4 years ago
|
||
I dug around a bit to figure out the reasoning behind bug 1457882 and it turns out that most of the slowness in macOS mutexes is that they call into the kernel even for the uncontested case (see the discussion here). So, if Apple's mutex code was buggy there would have been no way to see it before bug 1457882.
I think that the only way to figure this out is to try and write a piece of code that brutally exercises our Mutex implementation with multiple threads and see if it leaks stuff.
Comment 7•4 years ago
|
||
I dug around a bit to figure out the reasoning behind bug 1457882 and it turns out that most of the slowness in macOS mutexes is that they call into the kernel even for the uncontested case (see the discussion here). So, if Apple's mutex code was buggy there would have been no way to see it before bug 1457882.
Per that same thread, it looks like the default behavior changed with respect to uncontended locks requiring switching into the kernel in macOS 10.14. Is it possible to conditionally disable the change from bug 1457882 based on macOS version? It would be interesting to see whether this issue persists without it.
Assignee | ||
Comment 8•4 years ago
|
||
(In reply to Jon Bauman [:jbauman:] from comment #7)
Per that same thread, it looks like the default behavior changed with respect to uncontended locks requiring switching into the kernel in macOS 10.14. Is it possible to conditionally disable the change from bug 1457882 based on macOS version? It would be interesting to see whether this issue persists without it.
We could get the same behavior on every version using PTHREAD_MUTEX_POLICY_FIRSTFIT_NP
explicitly. We'd have to first check that this doesn't regress performance though. If it provides satisfactory performance I'd be much in favor of using it, the less workarounds for platform deficiencies we have the better.
Assignee | ||
Comment 9•4 years ago
|
||
I haven't had time to work on this yet but there's something I remember and that might be relevant. We have crashes on macOS where we fail to destroy mutexes, see this query. This is something we don't have on other platforms, I wonder if it's related.
Assignee | ||
Comment 10•4 years ago
|
||
Some more investigation unveiled that we already tried using first-fit mutex but failed, see bug 1353787.
Comment 11•4 years ago
|
||
Your query doesn't work, presumably because https://crash-stats.mozilla.org doesn't recognize moz_crash_reason_raw
. The following does work, though:
(In reply to Gabriele Svelto [:gsvelto] from comment #9)
Comment 12•4 years ago
|
||
Actually, those crashes also happen on other platforms than the Mac:
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 13•4 years ago
|
||
Assignee | ||
Comment 14•4 years ago
|
||
(In reply to Steven Michaud [:smichaud] (Retired) from comment #12)
Actually, those crashes also happen on other platforms than the Mac:
Yes but they're not very common and many are from Android which uses bionic rather than glibc, that probably has different failure-modes for mutexes.
Assignee | ||
Comment 15•4 years ago
|
||
Given we haven't established yet what's causing this but there's strong hints that some of our locks are "leaking" threads into the sections they protect I've prepared a patch that replaces our homegrown spin-lock with a first-fit lock. My plan is to land this and then observe if the issue goes away in the associated bugs in the nightly channel. If it does we can let it ride the trains and uplift to ESR. If this doesn't address the issue on nightly I plan on reverting the change, more on this later.
Some background on the patch: I've investigated how these locks work on various versions of macOS and it's unclear where they are supported. The relevant header bits were introduced in macOS 10.7 and they're gated by checks that ensure it's not used in 10.6 or earlier. However Apple's POSIX threads documentation claims that these locks weren't available before 10.14 and setting the appropriate policy will fail on 10.13 or earlier. If this wasn't confusing enough the underlying OS-specific lock used to implement them (os_unfair_lock
) has been introduced in 10.12. All in all it's hard to tell what's going on so I made the code ignore failures if we can't get the lock we want. We'll fall back to the old, slower locks.
As for performance the first-fit locks appear to be no slower than our homegrown spinlock. On macOS 10.13 or earlier (or 10.11 depending on what documentation we believe) we're going to take a performance hit but I think that's acceptable if the issue is fixed. This is the reason why I would back this out in case the problem isn't fixed.
Mike what do you think of this plan? My patch is in WIP state but I've already tested it on try. If this plan sounds good to you I'll ask for review on it.
Assignee | ||
Comment 16•4 years ago
|
||
Redirecting the NI? to :spohl. See comment 15.
Updated•4 years ago
|
Updated•4 years ago
|
Comment 17•4 years ago
|
||
Comment 18•4 years ago
|
||
Backed out for SM bustages on ProtectedData.h
Backout link: https://hg.mozilla.org/integration/autoland/rev/a48bbbeadfbbb163e9c75ec27994962624c0b98c
Log link: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=319059751&repo=autoland&lineNumber=5734
Assignee | ||
Comment 19•4 years ago
|
||
Ouch, will test all the builds before landing next time.
Comment 20•4 years ago
|
||
Comment 21•4 years ago
|
||
bugherder |
Assignee | ||
Comment 22•4 years ago
|
||
There haven't been any crashes on nightly in all the linked bugs since I landed the patch. It's still too early to be sure this issue is fixed but this is certainly promising.
Comment 23•4 years ago
|
||
Look forward to see if this fixes some rare crashes in the media MSE stack that mainly occurred on mac and that I could only explain with a faulty mutex.
Assignee | ||
Comment 24•4 years ago
|
||
There have been no crashes in nightly since this landed on all the linked bugs. I think we nailed it. Time for uplifts.
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 25•4 years ago
|
||
Comment on attachment 9181217 [details]
Bug 1665411 - Use first-fit mutexes on macOS r=spohl
Beta/Release Uplift Approval Request
- User impact if declined: Firefox crashes randomly
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): We haven't seen any new crashes coming out of this change - in fact crash volume on macOS for the nightly channel seems to have dropped - but this patch changes a fair bit of code that is at the heart of our synchronization primitives so I don't feel like we can consider it low-risk.
- String changes made/needed: none
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 26•4 years ago
|
||
Comment on attachment 9181217 [details]
Bug 1665411 - Use first-fit mutexes on macOS r=spohl
ESR Uplift Approval Request
- If this is not a sec:{high,crit} bug, please state case for ESR consideration: This bug is causing random crashes in Firefox. Most of the crashes manifest themselves as UAF since we're potentially leaking dead objects out of critical sections protected by mutexes.
- User impact if declined: Firefox crashes randomly.
- Fix Landed on Version: 84
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): See comment 25
- String or UUID changes made by this patch: none
Assignee | ||
Comment 27•4 years ago
|
||
A side-effect of this patch is that our Mutex primitives will be quite a bit slower for users of macOS 10.13 and older, but somewhat faster for 10.14+
Assignee | ||
Comment 28•4 years ago
|
||
Here's the patch adapted so that it applies to the ESR78 sources
Comment 29•4 years ago
|
||
Comment on attachment 9181217 [details]
Bug 1665411 - Use first-fit mutexes on macOS r=spohl
Let's take it on Beta first, approved for 83 beta 5, thanks.
Comment 30•4 years ago
|
||
bugherder uplift |
Comment 31•4 years ago
|
||
Backed out for causing deadlock on older OS X versions (e.g. OS X 10.12), see Bug 1673826
https://hg.mozilla.org/integration/autoland/rev/595a2bc2f25aa2a6ef4ce123f1ad88b374c48a3e
Comment 32•4 years ago
|
||
Backout merged to beta:
https://hg.mozilla.org/releases/mozilla-beta/rev/0c12232ef3de
Comment 33•4 years ago
|
||
Comment on attachment 9181217 [details]
Bug 1665411 - Use first-fit mutexes on macOS r=spohl
Cancelling uplift request for ESR78 due to issue which caused backout on central and beta.
Pascal, please revert the approval for beta to prevent this bug from getting listed in the list of bugs with patches waiting for uplift.
Updated•4 years ago
|
Comment 34•4 years ago
|
||
Backout merged: https://hg.mozilla.org/mozilla-central/rev/595a2bc2f25a
Assignee | ||
Comment 35•4 years ago
|
||
Oh boy, this gets better and better. To set the mutex attribute that requests first-fit mutexes I used the _PTHREAD_MUTEX_POLICY_FIRSTFIT
define instead of PTHREAD_MUTEX_POLICY_FIRSTFIT_NP
because we don't have the latter in the macOS 10.11 SDK (used in automation) and the latest SDK defines the latter as the former... But the actual values of that definition differ in different SDKs! The attribute ends up being defined as 2
in 10.11 (and I think up to 10.13) and 3
in 10.14+. So the previous patch enabled buggy first-fit mutexes on macOS 10.13 and older but did not enable them on 10.14+ which is the exact opposite of what I wanted to achieve. Back to the drawing board.
Assignee | ||
Comment 36•4 years ago
|
||
I've prepared a new patch that should avoid the issue we experienced in bug 1673826 but I don't have a 10.12 machine to test it on. Alexandru, could you please give it a spin and report back? The build with the patch applied is available here:
Comment 37•4 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #36)
I've prepared a new patch that should avoid the issue we experienced in bug 1673826 but I don't have a 10.12 machine to test it on. Alexandru, could you please give it a spin and report back? The build with the patch applied is available here:
I see no hangs while testing the above build on macOS 10.12. I tried the same scenarios: loading multiple pages, restarting/closing Firefox, use session restore, enable/disable tracking protection, left Firefox opened for like 10mins or so (I usually got a hang here) and it seems that Firefox it's working fine. If there are some different scenarios that should I test please let me know. Thank you!
Assignee | ||
Comment 38•4 years ago
|
||
Thanks a lot for the super-quick check! Landing this
Assignee | ||
Comment 39•4 years ago
|
||
Refreshed the patch for uplifting to ESR78. We'll have to wait a bit longer to be sure it's stable given the previous backout.
Comment 40•4 years ago
|
||
Comment 41•4 years ago
|
||
bugherder |
Comment 42•4 years ago
|
||
Since the status are different for nightly and release, what's the status for beta?
For more information, please visit auto_nag documentation.
Assignee | ||
Updated•4 years ago
|
Comment 43•4 years ago
|
||
The patch landed in nightly and beta is affected.
:gsvelto, is this bug important enough to require an uplift?
If not please set status_beta
to wontfix
.
For more information, please visit auto_nag documentation.
Assignee | ||
Comment 44•4 years ago
|
||
This is important enough to uplift but I won't do it until I'm sure we fixed the issue which IMHO requires at least a week without crashes on nightly.
Comment 45•4 years ago
|
||
Are we going to assign a CVE for this bug and publish a generic security advisory instead of individually discussing the specific cases that we know this fixes?
Comment 46•4 years ago
|
||
Unless Dan disagrees, I think yes, this would get a CVE but the individual issues would not. I'm not sure if this is a sec-high or a sec-moderate, the latter possibly because exploitation of this seems very difficult given the difficulty of reproducing the issue reliably.
Assignee | ||
Comment 47•4 years ago
|
||
Note that I don't have a reduced case that where I can reliably reproduce the issue. I'll try writing one tomorrow.
Comment 48•4 years ago
|
||
Gabriele, if you want to request the uplift, don't forget that we build our last beta before RC week tomorrow!
Assignee | ||
Comment 49•4 years ago
|
||
It's a bit early but I'd like to try. I'll first look at all the crash signatures, if there's no more reports on nightlies from the past three days I'll ask for uplift.
Updated•4 years ago
|
Updated•4 years ago
|
Comment 50•4 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #45)
Are we going to assign a CVE for this bug and publish a generic security advisory instead of individually discussing the specific cases that we know this fixes?
CVEs enumerate the underlying bugs/fixes. Symptoms don't get separate CVEs unless they require additional fixes.
Assignee | ||
Comment 51•4 years ago
|
||
Comment on attachment 9181217 [details]
Bug 1665411 - Use first-fit mutexes on macOS r=spohl
Beta/Release Uplift Approval Request
- User impact if declined: Firefox crashes randomly
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): We haven't seen any new crashes coming out of this change - in fact crash volume on macOS for the nightly channel seems to have dropped - but this patch changes a fair bit of code that is at the heart of our synchronization primitives so I don't feel like we can consider it low-risk. The previous landing caused regressions but the root cause for those was in the macOS SDK we use in automation for release builds, not the patch itself. That issue has been addressed.
- String changes made/needed: none
Comment 52•4 years ago
|
||
(In reply to Daniel Veditz [:dveditz] from comment #50)
CVEs enumerate the underlying bugs/fixes. Symptoms don't get separate CVEs unless they require additional fixes.
I asked mainly because I don't see any CVE-related tracking flags on this bug.
Comment 54•4 years ago
|
||
Comment on attachment 9181217 [details]
Bug 1665411 - Use first-fit mutexes on macOS r=spohl
Approved for beta, thanks.
Comment 55•4 years ago
|
||
bugherder uplift |
Comment 56•4 years ago
•
|
||
I ran a Try push of the ESR78 patch and everything looks fine with the CI results.
Alexandru, can you please test the build below on various macOS versions (especially <10.13) to see how it does there as well? It would be really helpful if we had some insight into that before we potentially uplift and build the 78.5esr RC next week. Thanks!
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/KmudXzK1QPqbxA0DpZcuyQ/runs/0/artifacts/public/build/target.dmg
Updated•4 years ago
|
Comment 57•4 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #56)
I ran a Try push of the ESR78 patch and everything looks fine with the CI results.
Alexandru, can you please test the build below on various macOS versions (especially <10.13) to see how it does there as well? It would be really helpful if we had some insight into that before we potentially uplift and build the 78.5esr RC next week. Thanks!
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/KmudXzK1QPqbxA0DpZcuyQ/runs/0/artifacts/public/build/target.dmg
Hello Ryan!
We did an exploratory spot-check on macOS 10.13, 10.15, and macOS 11.0.1 (20B5022a) with the above build and everything looks good.
Also, we tested using macOS 10.10/10.11/10.12 on two machines, covering the scenarios from bug 1673826 and several random ones that we had in mind. It seems that the provided build is stable, no crashes or hangs encountered.
As a side note, the build provided in comment 56 has an available update on macOS 10.11 and 10.10. Downloading the update results in an update failed error in the About Nightly window and attempting to stage update: Nightly 79.0a1
in the browser console. This update does not occur on macOS 10.12 with the same build. I know that these are not related but maybe worth knowing.
The fix was verified with 83.0b9 (20201105203649) and 84.0a1 (20201105213748) as well with a quick exploratory spot-check on macOS 10.12, 10.13, 10.15 and 11.0.1 (20B5022a). No crashes or hangs encountered. Leaving the qe+ flag to verify the fix on the official esr build.
Updated•4 years ago
|
Assignee | ||
Comment 59•4 years ago
|
||
Comment on attachment 9181217 [details]
Bug 1665411 - Use first-fit mutexes on macOS r=spohl
ESR Uplift Approval Request
- If this is not a sec:{high,crit} bug, please state case for ESR consideration: This bug is causing random crashes in Firefox. Most of the crashes manifest themselves as UAF since we're potentially leaking dead objects out of critical sections protected by mutexes
- User impact if declined: Firefox crashes randomly
- Fix Landed on Version: 84
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): See comment 51
- String or UUID changes made by this patch: none
Comment 60•4 years ago
|
||
Comment on attachment 9181217 [details]
Bug 1665411 - Use first-fit mutexes on macOS r=spohl
Looking through the various dependent bugs, I'm not entirely convinced we've got this completely sorted (or maybe they're not all due to this same root cause), but given QA's positive report and no sign of any new issues caused by this patch, let's give it a shot in 78.5esr and see what effect it has there.
Comment 61•4 years ago
|
||
bugherder uplift |
Assignee | ||
Comment 62•4 years ago
|
||
There still appear to be some crashes in 83.0b9 so I don't think we can call this fixed yet :-( We'll have to wait and see what happens on ESR where the volume is higher. My guess is that the issue is mitigated if not fixed for macOS 10.11, 10.12 and 10.13 but appears to still be an issue with more recent version (we'll have to see the volume to understand how bad it is).
Comment 63•4 years ago
|
||
Hello!
We performed a spot-check with Firefox 78.5.0esr(20201108144729) from comment 61 on macOS 10.10/10.11/10.12/10.13/10.15 and macOS 11(20B5022a). No crashes or hangs encountered.
Assignee | ||
Comment 64•4 years ago
|
||
Looking at the remaining crashes in linked bugs it appears that my fix effectively mitigated the issue on macOS 10.13 and older but did not solve it completely on 10.15+. I will file a follow up to address the remaining issues in 10.15+. The upside is that this confirmed my guess that this is an issue within the synchronization primitives we use so we now what we have to deal with.
Comment 65•4 years ago
|
||
As part of a security bug pattern analysis, we are requesting your help with a high level analysis of this bug. It is our hope to develop static analysis (or potentially runtime/dynamic analysis) in the future to identify classes of bugs.
Please visit this google form to reply.
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Description
•