Closed Bug 1850969 Opened 9 months ago Closed 8 months ago

startup Crash in [@ mozilla::freestanding::SharedSection::Layout::Resolve | mozilla::freestanding::SharedSection::EnsureWriteCopyViewOnce]

Categories

(Firefox :: Launcher Process, defect, P2)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
119 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox-esr115 --- fixed
firefox117 --- wontfix
firefox118 + fixed
firefox119 + fixed

People

(Reporter: aryx, Assigned: rkraesig)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(3 files)

Startup crash new in Firefox 117.0. 11 crashes from 10 installations for that (+30 for beta and Nightly builds)

Crash report: https://crash-stats.mozilla.org/report/index/c463b8fd-0e5f-4ebb-bb04-3decf0230831

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 2 frames of crashing thread:

0  firefox.exe  mozilla::freestanding::SharedSection::Layout::Resolve  browser/app/winlauncher/freestanding/SharedSection.cpp:264
0  firefox.exe  mozilla::freestanding::SharedSection::EnsureWriteCopyViewOnce  browser/app/winlauncher/freestanding/SharedSection.cpp:247

Ray, could you take a look into this?

Flags: needinfo?(rkraesig)

The crash is at least adjacent to the third-party DLL blocklist code (it's happening in a worker thread while the main thread is in our patched_LdrLoadDll), and virtually all of the crashes also seem to have one of the following third-party DLLs loaded:

Just based on the versions and timing, this seems to have been something introduced in 117 Nightly and uplifted to no later than 115.2.0esr.

:gstoll/:yannis, does this suggest anything? 115.2.0esr does have bug 1842088 and bug 1845111, but it's not obvious to me how those would cause issues.

Flags: needinfo?(yjuglaret)
Flags: needinfo?(gstoll)
Flags: needinfo?(rkraesig)

Hmm. I'm pretty sure bug 1845111 is unrelated. I suppose bug 1842088 could have had some effect here.

I took a look at this crash report - it has WRusr.dll loaded. (note that earlier versions of WRusr.dll are on our blocklist, but this one is late enough that it shouldn't be getting blocked) The value in rax is 5, which suggests that NtMapViewOfSection might be returning ERROR_ACCESS_DENIED. (although ZwMapViewOfSection is supposed to return an NTSTATUS, not sure if this is true for the "Nt" version as well?)

The minidump indicates that sSectionHandle is null, although I'm not sure how much we can trust that. Maybe we should just be checking for that case in SharedSection::EnsureWriteCopyViewOnce()?

It also seems a bit racy that we're trying to set up the sWriteCopyView while a DLL is loading on another thread...

Flags: needinfo?(gstoll)

(In reply to Greg Stoll from comment #3)

The minidump indicates that sSectionHandle is null, although I'm not sure how much we can trust that.

We can't trust static variables values' at all, I'm afraid. (I assume they're not actually in the minidump, and that they only appear to have a value because the minidump specifies where the .data or .bss segments get loaded. :gsvelto would know.)

I took a look at this crash report - it has WRusr.dll loaded. (note that earlier versions of WRusr.dll are on our blocklist, but this one is late enough that it shouldn't be getting blocked) The value in rax is 5, which suggests that NtMapViewOfSection might be returning ERROR_ACCESS_DENIED. (although ZwMapViewOfSection is supposed to return an NTSTATUS, not sure if this is true for the "Nt" version as well?)

It is, but we immediately convert it; interpreting it as ERROR_ACCESS_DENIED seems to be the right call, assuming it went through that code path at all...

... which it almost certainly must have done. If it didn't, then rax would be a value left over from the __security_cookie computation -- which could be anything, but is vastly more likely to have junk in the high bits than to be 0x0000000000000005.

On the other hand, if it did go through that code path, that means that NtMapViewOfSection is storing a value into mView (here, r14 = 0000025D24870000), despite also returning a failure code. Which would be very odd, but I can't trivially rule it out...

Ah.

NtMapViewOfSection itself probably never does that, but we're hooking it -- and bug 1733532 / patch D182982, which was also added in 117 and uplifted to 115.0.2esr, added a return path to our hook which returns ERROR_ACCESS_DENIED without zeroing *aBaseAddress.

Marking as regression accordingly.

Keywords: regression
Regressed by: 1733532

(Addendum: in fairness, there are a number of existing code paths (e.g., [1], [2]) in our hook that do the same thing. That they seem never to have triggered this crash is possibly relevant?)

NtMapViewOfSection doesn't write to its out-pointer arguments if it
returns an error. Mimic its behavior a little more closely by ensuring
that whatever values were present are restored if we unmap and abort.

Avoid taking any pointers to the stack while doing so.

Assignee: nobody → rkraesig
Status: NEW → ASSIGNED

(In reply to Ray Kraesig [:rkraesig] from comment #4)

NtMapViewOfSection itself probably never does that, but we're hooking it -- and bug 1733532 / patch D182982, which was also added in 117 and uplifted to 115.0.2esr, added a return path to our hook which returns ERROR_ACCESS_DENIED without zeroing *aBaseAddress.

Good catch! This is definitely what is happening here, patched_NtMapViewOfSection intended to block the mapping, but mView is still pointing the view that we just unmapped.

(In reply to Ray Kraesig [:rkraesig] from comment #5)

(Addendum: in fairness, there are a number of existing code paths (e.g., [1], [2]) in our hook that do the same thing. That they seem never to have triggered this crash is possibly relevant?)

The patch you attached should prevent the crash but it is indeed concerning that we would be blocking our own intentional call to NtMapViewOfSection here and I'm not sure how we would react to that. It would be nice to understand why we would decide to block here. The new logic implemented in patched_NtMapViewOfSection with bug 1842088 may be too restrictive and need an adjustment.

Most likely, as you pointed out, we take the following new failure path:

  PUBLIC_OBJECT_BASIC_INFORMATION obi;
  NTSTATUS ntStatus = ::NtQueryObject(aSection, ObjectBasicInformation, &obi,
                                      sizeof(obi), nullptr);
  if (!NT_SUCCESS(ntStatus)) {
    ::NtUnmapViewOfSection(aProcess, *aBaseAddress);
    return STATUS_ACCESS_DENIED;
  }

I wonder if we may be observing a race condition where, due to interaction with third-party products, we somehow end up having another thread close the section handle between the call to NtMapViewOfSection and the call to NtQueryObject? Could we sometimes have another thread reach SharedSection::Reset while we are in SharedSection::EnsureWriteCopyViewOnce?

Edit: In most crashes the main thread is in mozilla::CreateAndStorePreXULSkeletonUI, which is called right after mozilla::freestanding::gSharedSection.ConvertToReadOnly(); and takes some time to complete, so I would suspect the following. On most computers, we only have one thread when we reach the code that converts the shared section to read-only, so no risk of a race condition. But third-party products can start a thread in our process before we reach that point, in which case they make the race condition possible.

I will try to see if I can artificially force the race condition to occur with artificial delays, and see if I get the same crash by doing that.

Flags: needinfo?(yjuglaret)

On most computers, we only have one thread when we reach the code that converts the shared section to read-only, so no risk of a race condition. But third-party products can start a thread in our process before we reach that point, in which case they make the race condition possible.

I will try to see if I can artificially force the race condition to occur with artificial delays, and see if I get the same crash by doing that.

Hm. Even if you can't force it that way, that does sound like a plausible- and worrying-enough scenario that it might be a good idea to just go ahead and protect sSectionHandle (etc.?) with a SRWLock or the like, instead of (in addition to?) the current RtlRunOnce setup.

Severity: -- → S3
Priority: -- → P2
Keywords: leave-open
Pushed by rkraesig@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ae60c822514a
Save off and restore values in patched_NtMapViewOfSection's arguments  r=gstoll

I think the sequence of events below leads to the NtQueryObject failure and the follow-up crash:

MainThread                               |  Third-Party Thread
----------                               |  ------------------
main()                                   |
                                         |  LoadLibrary("thirdparty.dll")
                                         |    patched_NtMapViewOfSection()
                                         |      stub_NtMapViewOfSection()
                                         |    AfterMapViewOfExecutableSection()
                                         |      SharedSection::GetKernel32Exports()
                                         |        SharedSection::EnsureWriteCopyView()
SharedSection::ConvertToReadOnly()       |
  SharedSection::Reset()                 |
    RtlRunOnceInitialize(&sEnsureOnce)   |
                                         |        RtlRunOnceExecuteOnce(&sEnsureOnce)
                                         |          SharedSection::EnsureWriteCopyViewOnce()
                                         |            nt::AutoMappedView(sSectionHandle)
                                         |              patched_NtMapViewOfSection(aSection)
                                         |                stub_NtMapViewOfSection(aSection)
    CloseHandle(sSectionHandle);         |
                                         |                NtQueryObject(aSection)
                                         v

By forcing events to occur in this order with the attached patch (where I call SharedSection::GetKernel32Exports directly and not through LoadLibrary), I do get a very similar crash (note: this is before D187737):

1:005> r
rax=0000000000000005 rbx=0000000000000000 rcx=69b7c8805b380000
rdx=0000000000000000 rsi=00007ff6d3e646e8 rdi=000001fb889e0000
rip=00007ff6d3e1841f rsp=000000ccea1ffad0 rbp=00007ff6d3e18380
 r8=0000000000000005  r9=0000000000000061 r10=0000000000000004
r11=00000000000002c0 r12=0000000000000000 r13=0000000000000000
r14=0000000000000001 r15=0000000000000000
iopl=0         nv up ei pl nz na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010206
firefox!mozilla::freestanding::SharedSection::Layout::Resolve [inlined in firefox!mozilla::freestanding::SharedSection::EnsureWriteCopyViewOnce+0x9f]:
00007ff6`d3e1841f 8b07            mov     eax,dword ptr [rdi] ds:000001fb`889e0000=????????
1:005> k
 # Child-SP          RetAddr               Call Site
00 (Inline Function) --------`--------     firefox!mozilla::freestanding::SharedSection::Layout::Resolve [C:\mozilla-source-alt\mozilla-unified\browser\app\winlauncher\freestanding\SharedSection.cpp @ 303]
01 000000cc`ea1ffad0 00007ffb`1ed823ea     firefox!mozilla::freestanding::SharedSection::EnsureWriteCopyViewOnce+0x9f [C:\mozilla-source-alt\mozilla-unified\browser\app\winlauncher\freestanding\SharedSection.cpp @ 285]
02 000000cc`ea1ffb50 00007ff6`d3e1670b     ntdll!RtlRunOnceExecuteOnce+0x9a
03 (Inline Function) --------`--------     firefox!mozilla::freestanding::SharedSection::EnsureWriteCopyView+0x46 [C:\mozilla-source-alt\mozilla-unified\browser\app\winlauncher\freestanding\SharedSection.cpp @ 291]
04 000000cc`ea1ffb90 00007ff6`d3e019a0     firefox!mozilla::freestanding::SharedSection::GetKernel32Exports+0x4b [C:\mozilla-source-alt\mozilla-unified\browser\app\winlauncher\freestanding\SharedSection.cpp @ 365]
05 000000cc`ea1ffbc0 00007ffb`1cd3257d     firefox!RacyThreadFunc+0x10 [C:\mozilla-source-alt\mozilla-unified\browser\app\nsBrowserApp.cpp @ 278]
06 000000cc`ea1ffbf0 00007ffa`c50576d4     KERNEL32!BaseThreadInitThunk+0x1d
07 (Inline Function) --------`--------     mozglue!mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared>,void (*)(int, void *, void *)>::operator()+0x15 [C:\mozilla-source-alt\mozilla-unified\toolkit\xre\dllservices\mozglue\nsWindowsDllInterceptor.h @ 150]
08 000000cc`ea1ffc20 00007ffb`1edaaa68     mozglue!patched_BaseThreadInitThunk+0x94 [C:\mozilla-source-alt\mozilla-unified\toolkit\xre\dllservices\mozglue\WindowsDllBlocklist.cpp @ 566]
09 000000cc`ea1ffc90 00000000`00000000     ntdll!RtlUserThreadStart+0x28

A non-crashing variant of the race condition can also occur with stub_NtMapViewOfSection failing if it received the handle that was just closed on the main thread by SharedSection::Reset. This variant was present before bug 1842088 and is still present today. I think the consequence could be that we would silently fail to load and use the blocklist? I will propose a patch to fix this.

Regressed by: 1842088
No longer regressed by: 1733532

Third-party products can start threads in our main process, which can load DLLs before the main thread has gone past SharedSection::ConvertToReadOnly. This patch therefore protects the use of the shared section within patched_NtMapViewOfSection, to guarantee that there is no race condition where the shared section could get converted while being used.

:RyanVM has asked me if this should be picked up for uplift to release. I'm tentatively going to say the first patch can be, but the second shouldn't yet:

  • The first patch should avert the crash, and has little risk of inducing new obvious issues. It would leave the DLL blocklist silently unapplied when the crash would have occurred; but that's probably preferable, especially since it's a) very unlikely and b) temporary (until the second patch comes through).
  • The second patch should fix the failure-to-apply, but it involves adding a new mutex, which I fear (perhaps more out of paranoia than out of reason) could induce hangs or deadlock at startup. I'd feel more comfortable letting it bake in Nightly for a week.

However, there's an obvious possible bias there, seeing as how I wrote the first patch and not the second. 🙃 I could probably be talked out of either half of my position if :yannis or :gstoll have an argument on the matter. ni?‍ing them for comment or abstention.

Flags: needinfo?(yjuglaret)
Flags: needinfo?(gstoll)

It already missed that deadline. We'd be targeting a Beta uplift for 118 at this point.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #15)

It already missed that deadline. We'd be targeting a Beta uplift for 118 at this point.

Ah, well. Cancelling ni?‍s; sorry for the spam.

Flags: needinfo?(yjuglaret)
Flags: needinfo?(gstoll)
Pushed by yjuglaret@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d65c86dc2295
Protect patched_NtMapViewOfSection against SharedSection::ConvertToReadOnly. r=gstoll

Adding other signatures for this crash, including one with quite high volume.

Crash Signature: [@ mozilla::freestanding::SharedSection::Layout::Resolve | mozilla::freestanding::SharedSection::EnsureWriteCopyViewOnce] → [@ mozilla::freestanding::SharedSection::Layout::Resolve | mozilla::freestanding::SharedSection::EnsureWriteCopyViewOnce] [@ mozilla::freestanding::SharedSection::Layout::Resolve | mozilla::freestanding::SharedSection::EnsureWriteCopyViewOnce | RtlRunOnc…

[Tracking Requested - why for this release]:

Comment on attachment 9352130 [details]
Bug 1850969 - Save off and restore values in patched_NtMapViewOfSection's arguments r?yjuglaret,gstoll

Beta/Release Uplift Approval Request

  • User impact if declined: Transient crashes at startup when initializing the blocklist while a thread injected by third-party software is loading a DLL.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Patch simply saves and restores some variables, hopefully preventing a crash on failure.

Actually preventing the failure is the other patch's job; it'll be considered for uplift after some time has passed for testing.

  • String changes made/needed:
  • Is Android affected?: Yes
Attachment #9352130 - Flags: approval-mozilla-esr115?
Attachment #9352130 - Flags: approval-mozilla-beta?

Talking to Ray, it sounds like we might want to take the mitigation patch soon but let Yannis' patch bake more before trying to uplift. If that's the case, it would probably be better to split this into two different bugs to make it easier to keep track of what's landed where and when. It gets messy otherwise.

Going ahead and requesting the first patch for uplift, since it's low-risk on its own.

Also re-ni?‍ing :yannis for the same reason as last time: do you think this calls for an accelerated uplift of the second patch? (Sorry: I misunderstood where we were in the cycle, and was too hasty in cancelling it yesterday.)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #22)

[...] it would probably be better to split this into two different bugs to make it easier to keep track of what's landed where and when.

Now that I look again, that might be tricky: the second patch has already landed in-tree with this bug number, so I'm not sure we can reasonably remove it from this bug. Having a separate bug specifically for uplift consideration should be no trouble, though.

Flags: needinfo?(yjuglaret)
Flags: needinfo?(pascalc)
Blocks: 1852841

(In reply to Ray Kraesig [:rkraesig] from comment #23)

Also re-ni?‍ing :yannis for the same reason as last time: do you think this calls for an accelerated uplift of the second patch? (Sorry: I misunderstood where we were in the cycle, and was too hasty in cancelling it yesterday.)

Taking in the first patch without the second should bring us back to a similar situation as before bug 1842088, where the race condition would still exist but with mostly benign consequences. For people who would have crashed here, I think we would now instead be failing to honor the dynamic DLL blocklist (the static blocklist should still work), because sWriteCopyView would stay nullptr in the main process.

I am a bit worried that there are scenarios where the race condition could theoretically, I think, lead to different crashes (the second patch already fixes these scenarios as well). However, if these variations could realistically occur timing-wise, I think we should already be seeing them, and I don't find crashes that would match. Here is what they would look like:

MainThread                               |  Third-Party Thread
----------                               |  ------------------
main()                                   |
                                         |  LoadLibrary("thirdparty.dll")
                                         |    patched_NtMapViewOfSection()
                                         |      stub_NtMapViewOfSection()
                                         |    AfterMapViewOfExecutableSection()
                                         |      SharedSection::GetKernel32Exports()
                                         |        return &writeCopyView->mK32Exports
  SharedSection::ConvertToReadOnly()     |
    SharedSection::Reset()               |
      nt::~AutoMappedView                |
        nt::AutoMappedView::Unmap        |
                                         |      ... in AfterMapViewOfExecutableSection(),
                                         |          k32Exports points to unmapped memory ...
                                         v
MainThread                               |  Third-Party Thread
----------                               |  ------------------
main()                                   |
                                         |  LoadLibrary("thirdparty.dll")
                                         |    patched_NtMapViewOfSection()
                                         |      stub_NtMapViewOfSection()
                                         |    AfterMapViewOfExecutableSection()
                                         |      SharedSection::SearchBlocklist()
                                         |        writeCopyView->SearchBlocklist()
  SharedSection::ConvertToReadOnly()     |
    SharedSection::Reset()               |
      nt::~AutoMappedView                |
        nt::AutoMappedView::Unmap        |
                                         |        ... in SharedSection::Layout::SearchBlocklist(),
                                         |            the this pointer points to unmapped memory,
                                         |            and/or we return a pointer to unmapped memory ... 
                                         v

Overall, I hope both patches ultimately make it to 118 release, but waiting for a week before asking the second patch's uplift sounds fine to me.

Flags: needinfo?(yjuglaret)

The bug is marked as tracked for firefox118 (beta) and tracked for firefox119 (nightly). However, the bug still has low severity.

:gcp, could you please increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(gpascutto)
Severity: S3 → S2
Flags: needinfo?(gpascutto)

Comment on attachment 9352130 [details]
Bug 1850969 - Save off and restore values in patched_NtMapViewOfSection's arguments r?yjuglaret,gstoll

Approved for 118.0b9 (last beta) thanks.

Flags: needinfo?(pascalc)
Attachment #9352130 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

(marking as fixed for 118 for tracking, even though this is just a mitigation)

Comment on attachment 9352130 [details]
Bug 1850969 - Save off and restore values in patched_NtMapViewOfSection's arguments r?yjuglaret,gstoll

Removing stray esr115? flag.

:pascalc/:RyanVM -- would you prefer to keep ESR and mainline in sync here, or just uplift both patches to ESR in one shot later?

Attachment #9352130 - Flags: approval-mozilla-esr115?

i am not seeing significant crashes on ESR115 (a couple) and this bug was marked as regressed by a patch that landed in 117, is ESR really affected?
Nevermind, I just saw that we did uplift the regressor to the 115 branch. In that case, yes we probably want to fix ESR115 at the same time as 118, the low volume may just be that we don't have migrated yet our 102 population to the new ESR.

Comment on attachment 9352130 [details]
Bug 1850969 - Save off and restore values in patched_NtMapViewOfSection's arguments r?yjuglaret,gstoll

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Per beta uplift request.
  • User impact if declined: Per beta uplift request.
  • Fix Landed on Version: 118
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Per beta uplift request.
Attachment #9352130 - Flags: approval-mozilla-esr115?

Comment on attachment 9352130 [details]
Bug 1850969 - Save off and restore values in patched_NtMapViewOfSection's arguments r?yjuglaret,gstoll

Approved for ESR 115.3, thanks.

Attachment #9352130 - Flags: approval-mozilla-esr115? → approval-mozilla-esr115+

Should this bug be closed or are there more patches pending?

Flags: needinfo?(rkraesig)

There are no more patches pending; this bug can be closed, assuming that it's possible to uplift :yannis' patch from the other bug... which I'm not seeing a way to do, now that I look at it. :RyanVM, since you suggested it back in comment 22: what did you have in mind?

Flags: needinfo?(rkraesig) → needinfo?(ryanvm)

Yeah, this is pretty messy since Yannis' patch already landed in this bug :(. Let's close this bug out since nothing else is going to happen here. I also resolved bug 1852841 since we already did fix that bug on m-c as well by way of this one. But if we can get a patch attached to that bug for uplift nominations, that would be good I guess.

Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Flags: needinfo?(ryanvm)
Resolution: --- → FIXED
Target Milestone: --- → 119 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: