Closed Bug 1604008 Opened 5 years ago Closed 5 years ago

Crash in [@ mozilla::interceptor::ReadOnlyTargetBytes<T>::EnsureLimit]

Categories

(Core :: mozglue, defect, P1)

Unspecified
Windows 7
defect

Tracking

()

RESOLVED WONTFIX
Tracking Status
firefox-esr68 --- unaffected
firefox72 --- unaffected
firefox73 + wontfix
firefox74 + disabled
firefox75 --- disabled

People

(Reporter: jseward, Assigned: toshi)

References

(Regression)

Details

(Keywords: crash, regression, topcrash, Whiteboard: zeropatch)

Crash Data

Attachments

(3 files, 1 obsolete file)

This bug is for crash report bp-1148a11f-339e-4d5c-a38f-0c56b0191214.

This crash appears in two different installations of the Windows nightly
20191212095326. There were 13 crashes for these 2 installations.

Top 10 frames of crashing thread:

0 firefox.exe void mozilla::interceptor::ReadOnlyTargetBytes<mozilla::interceptor::MMPolicyOutOfProcess>::EnsureLimit mozglue/misc/interceptor/TargetFunction.h:519
1 firefox.exe static class mozilla::interceptor::ReadOnlyTargetFunction<mozilla::interceptor::MMPolicyOutOfProcess> mozilla::interceptor::WindowsDllPatcherBase<mozilla::interceptor::VMSharingPolicyUnique<mozilla::interceptor::MMPolicyOutOfProcess> >::ResolveRedirectedAddress mozglue/misc/interceptor/PatcherBase.h:30
2 firefox.exe bool mozilla::interceptor::WindowsDllDetourPatcher<mozilla::interceptor::VMSharingPolicyUnique<mozilla::interceptor::MMPolicyOutOfProcess> >::AddHook mozglue/misc/interceptor/PatcherDetour.h:352
3 firefox.exe bool mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyUnique<mozilla::interceptor::MMPolicyOutOfProcess> >::AddDetour mozglue/misc/nsWindowsDllInterceptor.h:438
4 firefox.exe bool mozilla::interceptor::FuncHookCrossProcess<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyUnique<mozilla::interceptor::MMPolicyOutOfProcess> >, long  mozglue/misc/nsWindowsDllInterceptor.h:229
5 firefox.exe mozilla::InitializeDllBlocklistOOP browser/app/winlauncher/DllBlocklistInit.cpp:50
6 xul.dll mozilla::SandboxBroker::LaunchApp security/sandbox/win/src/sandboxbroker/sandboxBroker.cpp:339
7 xul.dll class RefPtr<mozilla::MozPromise<void*, mozilla::ipc::LaunchError, 0> > mozilla::ipc::WindowsProcessLauncher::DoLaunch ipc/glue/GeckoChildProcessHost.cpp:1490
8 xul.dll class RefPtr<mozilla::MozPromise<mozilla::ipc::LaunchResults, mozilla::ipc::LaunchError, 0> > mozilla::ipc::BaseProcessLauncher::PerformAsyncLaunch ipc/glue/GeckoChildProcessHost.cpp:957
9 xul.dll nsresult mozilla::detail::ProxyRunnable<mozilla::MozPromise<mozilla::ipc::LaunchResults, mozilla::ipc::LaunchError, 0>, RefPtr<mozilla::MozPromise<mozilla::ipc::LaunchResults, mozilla::ipc::LaunchError, 0> >  xpcom/threads/MozPromise.h:1343

Flags: needinfo?(aklotz)

I seem to get this crash myself on Windows 10. Nightly only, not stable. Started a few weeks ago. Have no clue what causes it. Matching crash symbols. Does not happen on Linux install.

Assignee: nobody → tkikuchi

Let's remove the release asserts as a quick patch to start, then investigate the failures after.

Flags: needinfo?(aklotz)

I looked into a few of these crashes and it seems like the one consistent element is 0PatchLoaderX64.dll. I tried installing 0patch in a VM and even with no patches enabled Nightly crashes on startup.

Assignee: tkikuchi → nobody

Sorry, messed up in mid-air

Assignee: nobody → tkikuchi

I removed 0patch and the startup crash stopped.

0Patch support said the issue lays in something writing to xul.dll. He attached a screenshot in the above link. Turns out the issue is a Firefox problem triggered by 0Patch. Other programs theoretically can cause it, too.

Thanks for reporting it to 0patch, Michael. I haven't been able to access the request on zendesk, though, even after creating an account.

Attached image Firefox_Nightly_Crash.png —

Above is attached some information the tech found. Should help patch the issue.

Hi there, Mitja Kolsek of 0patch here. I'll be happy to help reproduce this issue if needed. Also, if there's any suspicion that 0patch might be responsible for at least a part of this problem, I'd like to coordinate our actions. Other than that, kudos to the Firefox team for building and maintaining my favorite browser! ;)

I've got a consistent local repro with 0patch. Let me start debugging and root-causing.

Ok, the root cause is clear now. 0Patch injects 0PatchLoaderX64.dll into firefox by hooking ntdll!NtTestAlert via kernel callback of process creation.

In the browser process, 0PatchLoaderX64.dll replaces the export table entry for LdrLoadDll with a trampoline when it's loaded. After that, when the browser process spawns a tab process and tries to hook LdrLoadDll, it actually tries to hook the browser process's trampoline, which does not exist in the child process. That's why the browser process hits this MOZ_RELEASE_ASSERT.

0PatchLoaderX64.dll is loaded in the launcher process, but somehow it does not modify the export table.

The fix would be either 1) or 2), or both. Let me think how we can implement it.

  1. to get a real address of LdrLoadDll even though the export table is tampered
  2. safely to fail to detour if a target address is not readable in a target process

Since they have a representative in this bug, maybe we can just ask them to not do that?

Thanks for your analysis, Toshihito!

0patch Agent indeed injects a DLL (0PatchLoaderX64.dll for 64bit processes) into every newly created process via kernel callback, and into all already-running processes upon 0patch Service startup. Both are implemented by the MadCodeHook library (http://www.madshi.net/madCodeHookDescription.htm, used by several products out there) which we have licensed and integrated into our product.

Once our DLL is loaded, it hooks a single Windows API functions, LdrLoadDll, using MadCodeHook's hooking mechanism. We absolutely need to do this at this point because it allows us to detect when a process loads a new executable module that we should apply a micropatch to. We are in the early stages of development of a new micropatching engine which will neither use hooks nor inject a DLL at all but this is where we're at right now.

Obviously hooking is a problem when two actors want to hook the same function. MadCodeHook's hooking is making sure that multiple "hookers" (pun intended) can add and delete hooks in arbitrary order but only if they all use the same method of hooking. This is probably not the case here though.

That said, we unfortunately can't stop hooking LdrLoadDll for now, but I'm curious about two things:

  1. With 0patch installed, my Firefox Nightly crashes only sometimes, which I assume is a case of a race condition (if we hook before Firefox does => crash; if Firefox hooks first => no crash, but probably also 0patch doesn't work properly)

  2. With 0patch removed, I still get occasional crashes (see my comment and screenshot of WinDbg above) - are these two different crashes? (Sorry, I'm not very good at reading Firefox crash reports and attaching WinDbg to firefox.exe always gives me the same access violation on a "mov dword ptr [0], ecx instruction", whether 0patch Agent is installed or not.)

Regressed by: 1522830
Has Regression Range: --- → yes

I tried the approach 2). It worked, but I needed to change multiple places. Probably it's not a good approach..

Thank you for the information! So this injection library tries to hook ntdll!LdrLoadDll first, and if it's already hooked by someone, it goes to the export table and replaces the entry. I think that's why EnsureLimit hits the assert only in the browser.

(In reply to Mitja Kolsek from comment #17)

  1. With 0patch installed, my Firefox Nightly crashes only sometimes, which I assume is a case of a race condition (if we hook before Firefox does => crash; if Firefox hooks first => no crash, but probably also 0patch doesn't work properly)

  2. With 0patch removed, I still get occasional crashes (see my comment and screenshot of WinDbg above) - are these two different crashes? (Sorry, I'm not very good at reading Firefox crash reports and attaching WinDbg to firefox.exe always gives me the same access violation on a "mov dword ptr [0], ecx instruction", whether 0patch Agent is installed or not.)

I think you're hitting a different issue. The crash at EnsureLimit 100% happens and there is no timing issue. It's actually very strange to hit a crash at mov dword ptr [0], ecx because that instruction always causes AV and it should not be in the code. Somebody tried to modified the code, but it was not finished..?

In the attached Firefox_Nightly_Crash.png, debugger did not load symbols. Can you configure your debugger to use Mozilla's symbol server? Then we will see actual functions names on debugger.

Okay, this is a WinDbg screenshot of the crash I'm getting everytime I launch Nightly with WinDbg attached on Windows 7 64bit (fully updated).

When WindowsDllInterceptor detours a function in a remote process, it calculates
a target address via GetProcAddress in the caller's process first, and detours
that address in the target process. If the caller's export table was modified, the
target address might be invalid in the target process.

With this patch, WindowsDllInterceptor uses the target process's export table to
calculate a target function address.

(In reply to Mitja Kolsek from comment #19)

Created attachment 9117514 [details]
Firefox_Nightly_Crash-with_symbols.png

Okay, this is a WinDbg screenshot of the crash I'm getting everytime I launch Nightly with WinDbg attached on Windows 7 64bit (fully updated).

Thank you for providing a new screenshot. Looks like you're hitting this assert. I searched the crash reports, but I didn't find the same one (If there is data, the signature would be this).

We can conclude the occasional crash is a different issue from of this bug (crash at EnsureLimit). We can discuss it in a new bug, but it would be difficult to investigate without a repro.

Whiteboard: zeropatch

The priority flag is not set for this bug.
:glandium, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(mh+mozilla)
Priority: -- → P3

Getting crash on startup after update of Nightly. Tried clean profile, safe mode, and downloading latest nightly as of today.
I checked the Windows process list on this Windows 10 VM where I have admin access and I do not see the word 0patch.

Admittedly Windows is not my primary OS so I could be looking in the wrong place.

[Tracking Requested - why for this release]:
can we paper over the issue with the approach from comment #2 in beta at least? in the first hours of 73.0b1 this signature is accounting for a third of all browser crashes - though admittedly it hasn't gotten much usage and this figure isn't reliable yet.

Changing the priority to p1 as the bug is tracked by a release manager for the current beta.
See What Do You Triage for more information

Priority: P3 → P1
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(jmathies)

(In reply to Myron from comment #26)

Just found and followed this bug report and the same problem exists in the beta 1 & 2 release of Firefox 73. I've had to revert back to Forefox 72 public release.

Forgot to mention I'm on Windows 10 Pro 1909.

.... also realised forgot to report same was happening on a second computer where 0patch has never been installed.

Yeah, I'm almost positive my VM here at work does not have 0patch unless it hides from me pretty well.

I had to go back to stable. It's a particularly sucky bug since it crashes on startup on clean profiles, so the only way to update is https://ftp.mozilla.org and blowing away the existing nightly folder. But ofc no point in updating until it is fixed.

Based on the comment above, I checked my registry and there is no 0patch entry under HKLM → Software but things are still crashing. I guess it could be some other agent that does the same thing.

(In reply to [:philipp] from comment #24)

[Tracking Requested - why for this release]:
can we paper over the issue with the approach from comment #2 in beta at least? in the first hours of 73.0b1 this signature is accounting for a third of all browser crashes - though admittedly it hasn't gotten much usage and this figure isn't reliable yet.

This is one of the top crashes showing up in the 73 betas at this point. Can we proceed with a wallpaper for now to stop the bleeding?

Flags: needinfo?(tkikuchi)

Aaron, sorry for pinging you again. Do you need more time to review my patch? If so, I'll prepare a different patch just to remove the release assert once I verify it stops the crash.

Flags: needinfo?(tkikuchi) → needinfo?(aklotz)

I'd like to note that Firefox 72.0.1 is affected as well.

(In reply to Mitja Kolsek from comment #34)

I'd like to note that Firefox 72.0.1 is affected as well.

we only see this particular crash signature in firefox 73 and upwards. can you take a look in about:crashes and provide a crash id generated from firefox 72, if it's affected as well? thank you

Flags: needinfo?(mitja.kolsek)

My review will be done later today.

Flags: needinfo?(aklotz)

Thanks, Aaron.

This issue happens only when the launcher process is enabled. Disabling the launcher process via registry would be a temporary workaround (It's not automatically disabled like a launcher failure because of the crash).

Removing MOZ_RELEASE_ASSERT did not mitigate the problem. With that, the crash is gone, but all tab processes fail to start.

[removed incorrect comment about stable - jumped to conclusions after firefox did not launch first couple of times I doubleclicked the icon]

Flags: needinfo?(jmathies)
Pushed by ccoroiu@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f0890a32d6bb Use a target process's export table to cross-process detour. r=aklotz
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla74

Please nominate this for Beta approval when you get a chance.

EDIT: Though we may want to hold off on uplifting until bug 1608645 has been assessed.

Flags: needinfo?(tkikuchi)
Regressions: 1608645

I think this is ready for a Beta approval request now alongside bug 1608645.

Comment on attachment 9117893 [details]
Bug 1604008 - Use a target process's export table to cross-process detour. r=aklotz

Beta/Release Uplift Approval Request

  • User impact if declined: Firefox fails to start if 0Patch is installed.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: Bug 1608645
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The fix is to implement a part of GetProcAddress on our own, which is fully covered by a unittest. This caused a regression Bug 1608645 when an antivirus software modifies memory in an unexpected way. Given that the fix for Bug 1608645 was landed and we don't have any other regression reports from Nighly in the last 10 days, we can uplift this fix with a low risk.
  • String changes made/needed: None
Flags: needinfo?(tkikuchi)
Attachment #9117893 - Flags: approval-mozilla-beta?

I added a request, but can we hold up until tomorrow? I want to wait one more night to make sure the fix for Bug 1608645 does not cause a new problem.

Comment on attachment 9117893 [details]
Bug 1604008 - Use a target process's export table to cross-process detour. r=aklotz

Topcrash fix. Approved for 73.0b9.

Flags: needinfo?(mitja.kolsek)
Attachment #9117893 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9117893 - Flags: approval-mozilla-beta+ → approval-mozilla-beta?

Comment on attachment 9117893 [details]
Bug 1604008 - Use a target process's export table to cross-process detour. r=aklotz

I'm not crazy about shipping this issue in 73, but I'm also worried about these patches going out in an RC release given the regression-prone nature of this area of code. Let's leave this on the radar for a possible 73 dot release ride-along in case this explodes post-release, but otherwise we'll let it ride Fx74 to release.

Attachment #9117893 - Flags: approval-mozilla-beta? → approval-mozilla-release?

73rc1 has crashed in the same way as the betas with 0patch installed. When I implement 0patch's workaround to exclude module firefox.exe FF 73rc1 starts.

Confirmed. "stable" firefox 73 is now crashing on all the machines at work. Unfortunately, I can't use the 0patch workaround because there's
no hint of 0Patch in the registry. Going to guess some similar tool is crashing everything.
What sucks about this, once it is fixed, is that AFAIK the only solution is to tell everyone to redownload and reinstall firefox. It's not possible to use firefox autoupdate to fix this.

Hm. We do have something called Cylance though - perhaps it is doing something similar to 0patch and GData

Backout by csabou@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b71aa86c8109 Backed out changeset f0890a32d6bb as requested by tkikuchi (toshi).

For those following along from home, comment 53 was because the patch in bug 1610790 obsoletes what landed here. We'll be closely monitoring how it does on Nightly and Beta and go from there when making a decision about what to do about 73 at this point. At this point, more me-too reports probably aren't going to help move anything along, however.

Setting the bug flags to hopefully encapsulate what was said in comment 54 (that all the work related for this set of related crashes is moving to bug 1610790, not that we've stopped trying to implement a fix for this on the Firefox side of things).

Resolution: FIXED → WONTFIX
Target Milestone: mozilla74 → ---
Crash Signature: [@ mozilla::interceptor::ReadOnlyTargetBytes<T>::EnsureLimit] → [@ mozilla::interceptor::ReadOnlyTargetBytes<T>::EnsureLimit] [@ firefox.exe@0xf636]
Attachment #9117893 - Flags: approval-mozilla-release?
Attachment #9117893 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: