startup Crash in [@ mozilla::freestanding::SharedSection::Layout::Resolve | mozilla::freestanding::SharedSection::EnsureWriteCopyViewOnce]
Categories
(Firefox :: Launcher Process, defect, P2)
Tracking
()
People
(Reporter: aryx, Assigned: rkraesig)
References
(Regression)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(3 files)
48 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
pascalc
:
approval-mozilla-esr115+
|
Details | Review |
3.71 KB,
patch
|
Details | Diff | Splinter Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review |
Startup crash new in Firefox 117.0. 11 crashes from 10 installations for that (+30 for beta and Nightly builds)
Crash report: https://crash-stats.mozilla.org/report/index/c463b8fd-0e5f-4ebb-bb04-3decf0230831
Reason: EXCEPTION_ACCESS_VIOLATION_READ
Top 2 frames of crashing thread:
0 firefox.exe mozilla::freestanding::SharedSection::Layout::Resolve browser/app/winlauncher/freestanding/SharedSection.cpp:264
0 firefox.exe mozilla::freestanding::SharedSection::EnsureWriteCopyViewOnce browser/app/winlauncher/freestanding/SharedSection.cpp:247
Assignee | ||
Comment 2•9 months ago
|
||
The crash is at least adjacent to the third-party DLL blocklist code (it's happening in a worker thread while the main thread is in our patched_LdrLoadDll
), and virtually all of the crashes also seem to have one of the following third-party DLLs loaded:
wslbdhm64.dll
(by "TPZ SOLUCOES DIGITAIS LTDA") (as seen previously),ExploitProtection64.dll
(by "G DATA Software AG") (as seen previously), orWRusr.dll
(by "Webroot") (as seen previously).
Just based on the versions and timing, this seems to have been something introduced in 117 Nightly and uplifted to no later than 115.2.0esr.
:gstoll/:yannis, does this suggest anything? 115.2.0esr does have bug 1842088 and bug 1845111, but it's not obvious to me how those would cause issues.
Assignee | ||
Updated•9 months ago
|
Reporter | ||
Updated•9 months ago
|
Comment 3•9 months ago
|
||
Hmm. I'm pretty sure bug 1845111 is unrelated. I suppose bug 1842088 could have had some effect here.
I took a look at this crash report - it has WRusr.dll loaded. (note that earlier versions of WRusr.dll are on our blocklist, but this one is late enough that it shouldn't be getting blocked) The value in rax
is 5, which suggests that NtMapViewOfSection
might be returning ERROR_ACCESS_DENIED
. (although ZwMapViewOfSection
is supposed to return an NTSTATUS
, not sure if this is true for the "Nt" version as well?)
The minidump indicates that sSectionHandle
is null, although I'm not sure how much we can trust that. Maybe we should just be checking for that case in SharedSection::EnsureWriteCopyViewOnce()
?
It also seems a bit racy that we're trying to set up the sWriteCopyView
while a DLL is loading on another thread...
Assignee | ||
Comment 4•9 months ago
•
|
||
(In reply to Greg Stoll from comment #3)
The minidump indicates that
sSectionHandle
is null, although I'm not sure how much we can trust that.
We can't trust static variables values' at all, I'm afraid. (I assume they're not actually in the minidump, and that they only appear to have a value because the minidump specifies where the .data
or .bss
segments get loaded. :gsvelto would know.)
I took a look at this crash report - it has WRusr.dll loaded. (note that earlier versions of WRusr.dll are on our blocklist, but this one is late enough that it shouldn't be getting blocked) The value in
rax
is 5, which suggests thatNtMapViewOfSection
might be returningERROR_ACCESS_DENIED
. (althoughZwMapViewOfSection
is supposed to return anNTSTATUS
, not sure if this is true for the "Nt" version as well?)
It is, but we immediately convert it; interpreting it as ERROR_ACCESS_DENIED
seems to be the right call, assuming it went through that code path at all...
... which it almost certainly must have done. If it didn't, then rax
would be a value left over from the __security_cookie
computation -- which could be anything, but is vastly more likely to have junk in the high bits than to be 0x0000000000000005
.
On the other hand, if it did go through that code path, that means that NtMapViewOfSection
is storing a value into mView
(here, r14 = 0000025D24870000
), despite also returning a failure code. Which would be very odd, but I can't trivially rule it out...
Ah.
NtMapViewOfSection
itself probably never does that, but we're hooking it -- and bug 1733532 / patch D182982, which was also added in 117 and uplifted to 115.0.2esr, added a return path to our hook which returns ERROR_ACCESS_DENIED
without zeroing *aBaseAddress
.
Marking as regression accordingly.
Assignee | ||
Comment 5•9 months ago
|
||
Assignee | ||
Comment 6•9 months ago
|
||
NtMapViewOfSection doesn't write to its out-pointer arguments if it
returns an error. Mimic its behavior a little more closely by ensuring
that whatever values were present are restored if we unmap and abort.
Avoid taking any pointers to the stack while doing so.
Updated•9 months ago
|
Comment 7•9 months ago
•
|
||
(In reply to Ray Kraesig [:rkraesig] from comment #4)
NtMapViewOfSection
itself probably never does that, but we're hooking it -- and bug 1733532 / patch D182982, which was also added in 117 and uplifted to 115.0.2esr, added a return path to our hook which returnsERROR_ACCESS_DENIED
without zeroing*aBaseAddress
.
Good catch! This is definitely what is happening here, patched_NtMapViewOfSection
intended to block the mapping, but mView
is still pointing the view that we just unmapped.
(In reply to Ray Kraesig [:rkraesig] from comment #5)
(Addendum: in fairness, there are a number of existing code paths (e.g., [1], [2]) in our hook that do the same thing. That they seem never to have triggered this crash is possibly relevant?)
The patch you attached should prevent the crash but it is indeed concerning that we would be blocking our own intentional call to NtMapViewOfSection
here and I'm not sure how we would react to that. It would be nice to understand why we would decide to block here. The new logic implemented in patched_NtMapViewOfSection
with bug 1842088 may be too restrictive and need an adjustment.
Most likely, as you pointed out, we take the following new failure path:
PUBLIC_OBJECT_BASIC_INFORMATION obi;
NTSTATUS ntStatus = ::NtQueryObject(aSection, ObjectBasicInformation, &obi,
sizeof(obi), nullptr);
if (!NT_SUCCESS(ntStatus)) {
::NtUnmapViewOfSection(aProcess, *aBaseAddress);
return STATUS_ACCESS_DENIED;
}
I wonder if we may be observing a race condition where, due to interaction with third-party products, we somehow end up having another thread close the section handle between the call to NtMapViewOfSection
and the call to NtQueryObject
? Could we sometimes have another thread reach SharedSection::Reset
while we are in SharedSection::EnsureWriteCopyViewOnce
?
Edit: In most crashes the main thread is in mozilla::CreateAndStorePreXULSkeletonUI
, which is called right after mozilla::freestanding::gSharedSection.ConvertToReadOnly();
and takes some time to complete, so I would suspect the following. On most computers, we only have one thread when we reach the code that converts the shared section to read-only, so no risk of a race condition. But third-party products can start a thread in our process before we reach that point, in which case they make the race condition possible.
I will try to see if I can artificially force the race condition to occur with artificial delays, and see if I get the same crash by doing that.
Assignee | ||
Comment 8•9 months ago
|
||
On most computers, we only have one thread when we reach the code that converts the shared section to read-only, so no risk of a race condition. But third-party products can start a thread in our process before we reach that point, in which case they make the race condition possible.
I will try to see if I can artificially force the race condition to occur with artificial delays, and see if I get the same crash by doing that.
Hm. Even if you can't force it that way, that does sound like a plausible- and worrying-enough scenario that it might be a good idea to just go ahead and protect sSectionHandle
(etc.?) with a SRWLock or the like, instead of (in addition to?) the current RtlRunOnce setup.
Assignee | ||
Updated•9 months ago
|
Pushed by rkraesig@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ae60c822514a Save off and restore values in patched_NtMapViewOfSection's arguments r=gstoll
Comment 10•9 months ago
|
||
bugherder |
Comment 11•8 months ago
•
|
||
I think the sequence of events below leads to the NtQueryObject
failure and the follow-up crash:
MainThread | Third-Party Thread
---------- | ------------------
main() |
| LoadLibrary("thirdparty.dll")
| patched_NtMapViewOfSection()
| stub_NtMapViewOfSection()
| AfterMapViewOfExecutableSection()
| SharedSection::GetKernel32Exports()
| SharedSection::EnsureWriteCopyView()
SharedSection::ConvertToReadOnly() |
SharedSection::Reset() |
RtlRunOnceInitialize(&sEnsureOnce) |
| RtlRunOnceExecuteOnce(&sEnsureOnce)
| SharedSection::EnsureWriteCopyViewOnce()
| nt::AutoMappedView(sSectionHandle)
| patched_NtMapViewOfSection(aSection)
| stub_NtMapViewOfSection(aSection)
CloseHandle(sSectionHandle); |
| NtQueryObject(aSection)
v
By forcing events to occur in this order with the attached patch (where I call SharedSection::GetKernel32Exports
directly and not through LoadLibrary
), I do get a very similar crash (note: this is before D187737):
1:005> r
rax=0000000000000005 rbx=0000000000000000 rcx=69b7c8805b380000
rdx=0000000000000000 rsi=00007ff6d3e646e8 rdi=000001fb889e0000
rip=00007ff6d3e1841f rsp=000000ccea1ffad0 rbp=00007ff6d3e18380
r8=0000000000000005 r9=0000000000000061 r10=0000000000000004
r11=00000000000002c0 r12=0000000000000000 r13=0000000000000000
r14=0000000000000001 r15=0000000000000000
iopl=0 nv up ei pl nz na po nc
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010206
firefox!mozilla::freestanding::SharedSection::Layout::Resolve [inlined in firefox!mozilla::freestanding::SharedSection::EnsureWriteCopyViewOnce+0x9f]:
00007ff6`d3e1841f 8b07 mov eax,dword ptr [rdi] ds:000001fb`889e0000=????????
1:005> k
# Child-SP RetAddr Call Site
00 (Inline Function) --------`-------- firefox!mozilla::freestanding::SharedSection::Layout::Resolve [C:\mozilla-source-alt\mozilla-unified\browser\app\winlauncher\freestanding\SharedSection.cpp @ 303]
01 000000cc`ea1ffad0 00007ffb`1ed823ea firefox!mozilla::freestanding::SharedSection::EnsureWriteCopyViewOnce+0x9f [C:\mozilla-source-alt\mozilla-unified\browser\app\winlauncher\freestanding\SharedSection.cpp @ 285]
02 000000cc`ea1ffb50 00007ff6`d3e1670b ntdll!RtlRunOnceExecuteOnce+0x9a
03 (Inline Function) --------`-------- firefox!mozilla::freestanding::SharedSection::EnsureWriteCopyView+0x46 [C:\mozilla-source-alt\mozilla-unified\browser\app\winlauncher\freestanding\SharedSection.cpp @ 291]
04 000000cc`ea1ffb90 00007ff6`d3e019a0 firefox!mozilla::freestanding::SharedSection::GetKernel32Exports+0x4b [C:\mozilla-source-alt\mozilla-unified\browser\app\winlauncher\freestanding\SharedSection.cpp @ 365]
05 000000cc`ea1ffbc0 00007ffb`1cd3257d firefox!RacyThreadFunc+0x10 [C:\mozilla-source-alt\mozilla-unified\browser\app\nsBrowserApp.cpp @ 278]
06 000000cc`ea1ffbf0 00007ffa`c50576d4 KERNEL32!BaseThreadInitThunk+0x1d
07 (Inline Function) --------`-------- mozglue!mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared>,void (*)(int, void *, void *)>::operator()+0x15 [C:\mozilla-source-alt\mozilla-unified\toolkit\xre\dllservices\mozglue\nsWindowsDllInterceptor.h @ 150]
08 000000cc`ea1ffc20 00007ffb`1edaaa68 mozglue!patched_BaseThreadInitThunk+0x94 [C:\mozilla-source-alt\mozilla-unified\toolkit\xre\dllservices\mozglue\WindowsDllBlocklist.cpp @ 566]
09 000000cc`ea1ffc90 00000000`00000000 ntdll!RtlUserThreadStart+0x28
A non-crashing variant of the race condition can also occur with stub_NtMapViewOfSection
failing if it received the handle that was just closed on the main thread by SharedSection::Reset
. This variant was present before bug 1842088 and is still present today. I think the consequence could be that we would silently fail to load and use the blocklist? I will propose a patch to fix this.
Comment 12•8 months ago
|
||
Updated•8 months ago
|
Comment 13•8 months ago
|
||
Third-party products can start threads in our main process, which can load DLLs before the main thread has gone past SharedSection::ConvertToReadOnly. This patch therefore protects the use of the shared section within patched_NtMapViewOfSection, to guarantee that there is no race condition where the shared section could get converted while being used.
Assignee | ||
Comment 14•8 months ago
|
||
:RyanVM has asked me if this should be picked up for uplift to release. I'm tentatively going to say the first patch can be, but the second shouldn't yet:
- The first patch should avert the crash, and has little risk of inducing new obvious issues. It would leave the DLL blocklist silently unapplied when the crash would have occurred; but that's probably preferable, especially since it's a) very unlikely and b) temporary (until the second patch comes through).
- The second patch should fix the failure-to-apply, but it involves adding a new mutex, which I fear (perhaps more out of paranoia than out of reason) could induce hangs or deadlock at startup. I'd feel more comfortable letting it bake in Nightly for a week.
However, there's an obvious possible bias there, seeing as how I wrote the first patch and not the second. 🙃 I could probably be talked out of either half of my position if :yannis or :gstoll have an argument on the matter. ni?ing them for comment or abstention.
Comment 15•8 months ago
|
||
It already missed that deadline. We'd be targeting a Beta uplift for 118 at this point.
Assignee | ||
Comment 16•8 months ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #15)
It already missed that deadline. We'd be targeting a Beta uplift for 118 at this point.
Ah, well. Cancelling ni?s; sorry for the spam.
Updated•8 months ago
|
Comment 17•8 months ago
|
||
Pushed by yjuglaret@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d65c86dc2295 Protect patched_NtMapViewOfSection against SharedSection::ConvertToReadOnly. r=gstoll
Comment 18•8 months ago
|
||
Adding other signatures for this crash, including one with quite high volume.
Updated•8 months ago
|
Comment 20•8 months ago
|
||
bugherder |
Assignee | ||
Comment 21•8 months ago
|
||
Comment on attachment 9352130 [details]
Bug 1850969 - Save off and restore values in patched_NtMapViewOfSection's arguments r?yjuglaret,gstoll
Beta/Release Uplift Approval Request
- User impact if declined: Transient crashes at startup when initializing the blocklist while a thread injected by third-party software is loading a DLL.
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Patch simply saves and restores some variables, hopefully preventing a crash on failure.
Actually preventing the failure is the other patch's job; it'll be considered for uplift after some time has passed for testing.
- String changes made/needed:
- Is Android affected?: Yes
Comment 22•8 months ago
|
||
Talking to Ray, it sounds like we might want to take the mitigation patch soon but let Yannis' patch bake more before trying to uplift. If that's the case, it would probably be better to split this into two different bugs to make it easier to keep track of what's landed where and when. It gets messy otherwise.
Assignee | ||
Comment 23•8 months ago
|
||
Going ahead and requesting the first patch for uplift, since it's low-risk on its own.
Also re-ni?ing :yannis for the same reason as last time: do you think this calls for an accelerated uplift of the second patch? (Sorry: I misunderstood where we were in the cycle, and was too hasty in cancelling it yesterday.)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #22)
[...] it would probably be better to split this into two different bugs to make it easier to keep track of what's landed where and when.
Now that I look again, that might be tricky: the second patch has already landed in-tree with this bug number, so I'm not sure we can reasonably remove it from this bug. Having a separate bug specifically for uplift consideration should be no trouble, though.
Comment 24•8 months ago
•
|
||
(In reply to Ray Kraesig [:rkraesig] from comment #23)
Also re-ni?ing :yannis for the same reason as last time: do you think this calls for an accelerated uplift of the second patch? (Sorry: I misunderstood where we were in the cycle, and was too hasty in cancelling it yesterday.)
Taking in the first patch without the second should bring us back to a similar situation as before bug 1842088, where the race condition would still exist but with mostly benign consequences. For people who would have crashed here, I think we would now instead be failing to honor the dynamic DLL blocklist (the static blocklist should still work), because sWriteCopyView
would stay nullptr
in the main process.
I am a bit worried that there are scenarios where the race condition could theoretically, I think, lead to different crashes (the second patch already fixes these scenarios as well). However, if these variations could realistically occur timing-wise, I think we should already be seeing them, and I don't find crashes that would match. Here is what they would look like:
MainThread | Third-Party Thread
---------- | ------------------
main() |
| LoadLibrary("thirdparty.dll")
| patched_NtMapViewOfSection()
| stub_NtMapViewOfSection()
| AfterMapViewOfExecutableSection()
| SharedSection::GetKernel32Exports()
| return &writeCopyView->mK32Exports
SharedSection::ConvertToReadOnly() |
SharedSection::Reset() |
nt::~AutoMappedView |
nt::AutoMappedView::Unmap |
| ... in AfterMapViewOfExecutableSection(),
| k32Exports points to unmapped memory ...
v
MainThread | Third-Party Thread
---------- | ------------------
main() |
| LoadLibrary("thirdparty.dll")
| patched_NtMapViewOfSection()
| stub_NtMapViewOfSection()
| AfterMapViewOfExecutableSection()
| SharedSection::SearchBlocklist()
| writeCopyView->SearchBlocklist()
SharedSection::ConvertToReadOnly() |
SharedSection::Reset() |
nt::~AutoMappedView |
nt::AutoMappedView::Unmap |
| ... in SharedSection::Layout::SearchBlocklist(),
| the this pointer points to unmapped memory,
| and/or we return a pointer to unmapped memory ...
v
Overall, I hope both patches ultimately make it to 118 release, but waiting for a week before asking the second patch's uplift sounds fine to me.
Comment 25•8 months ago
|
||
The bug is marked as tracked for firefox118 (beta) and tracked for firefox119 (nightly). However, the bug still has low severity.
:gcp, could you please increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.
For more information, please visit BugBot documentation.
Updated•8 months ago
|
Comment 26•8 months ago
|
||
Comment on attachment 9352130 [details]
Bug 1850969 - Save off and restore values in patched_NtMapViewOfSection's arguments r?yjuglaret,gstoll
Approved for 118.0b9 (last beta) thanks.
Comment 27•8 months ago
|
||
uplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/0cd1c75a4e19
Comment 28•8 months ago
|
||
(marking as fixed for 118 for tracking, even though this is just a mitigation)
Assignee | ||
Comment 29•8 months ago
|
||
Comment on attachment 9352130 [details]
Bug 1850969 - Save off and restore values in patched_NtMapViewOfSection's arguments r?yjuglaret,gstoll
Removing stray esr115? flag.
:pascalc/:RyanVM -- would you prefer to keep ESR and mainline in sync here, or just uplift both patches to ESR in one shot later?
Comment 30•8 months ago
•
|
||
i am not seeing significant crashes on ESR115 (a couple) and this bug was marked as regressed by a patch that landed in 117, is ESR really affected?
Nevermind, I just saw that we did uplift the regressor to the 115 branch. In that case, yes we probably want to fix ESR115 at the same time as 118, the low volume may just be that we don't have migrated yet our 102 population to the new ESR.
Assignee | ||
Comment 31•8 months ago
|
||
Comment on attachment 9352130 [details]
Bug 1850969 - Save off and restore values in patched_NtMapViewOfSection's arguments r?yjuglaret,gstoll
ESR Uplift Approval Request
- If this is not a sec:{high,crit} bug, please state case for ESR consideration: Per beta uplift request.
- User impact if declined: Per beta uplift request.
- Fix Landed on Version: 118
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Per beta uplift request.
Comment 32•8 months ago
|
||
Comment on attachment 9352130 [details]
Bug 1850969 - Save off and restore values in patched_NtMapViewOfSection's arguments r?yjuglaret,gstoll
Approved for ESR 115.3, thanks.
Comment 33•8 months ago
|
||
uplift |
https://hg.mozilla.org/releases/mozilla-esr115/rev/72057601fb07
Comment 34•8 months ago
|
||
bugherder uplift |
Comment 35•8 months ago
|
||
Should this bug be closed or are there more patches pending?
Assignee | ||
Comment 36•8 months ago
|
||
There are no more patches pending; this bug can be closed, assuming that it's possible to uplift :yannis' patch from the other bug... which I'm not seeing a way to do, now that I look at it. :RyanVM, since you suggested it back in comment 22: what did you have in mind?
Comment 37•8 months ago
|
||
Yeah, this is pretty messy since Yannis' patch already landed in this bug :(. Let's close this bug out since nothing else is going to happen here. I also resolved bug 1852841 since we already did fix that bug on m-c as well by way of this one. But if we can get a patch attached to that bug for uplift nominations, that would be good I guess.
Updated•8 months ago
|
Description
•