Closed Bug 1599015 Opened 2 months ago Closed 1 month ago

Crash in [@ mozilla::interceptor::WindowsDllDetourPatcher<T>::CreateTrampoline]

Categories

(Core :: mozglue, defect, P2)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla73
Tracking Status
firefox-esr68 --- wontfix
firefox71 --- wontfix
firefox72 --- wontfix
firefox73 --- fixed

People

(Reporter: gsvelto, Assigned: toshi)

References

(Blocks 1 open bug)

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-88cbeeff-3c98-4306-9411-825750191124.

Top 10 frames of crashing thread:

0 mozglue.dll mozilla::interceptor::WindowsDllDetourPatcher&lt;mozilla::interceptor::VMSharingPolicyShared&lt;mozilla::interceptor::MMPolicyInProcess, 1> >::CreateTrampoline mozglue/misc/interceptor/PatcherDetour.h:725
1 mozglue.dll mozilla::interceptor::WindowsDllDetourPatcher&lt;mozilla::interceptor::VMSharingPolicyShared&lt;mozilla::interceptor::MMPolicyInProcess, 1> >::AddHook mozglue/misc/interceptor/PatcherDetour.h:379
2 mozglue.dll static int mozilla::interceptor::FuncHook&lt;mozilla::interceptor::WindowsDllInterceptor&lt;mozilla::interceptor::VMSharingPolicyShared&lt;mozilla::interceptor::MMPolicyInProcess, 1> >, long  mozglue/misc/nsWindowsDllInterceptor.h
3 ntdll.dll RtlRunOnceExecuteOnce 
4 kernelbase.dll kernelbase.dll@0x6207a 
5 mozglue.dll ceil 
6 mozglue.dll mozilla::baseprofiler::InitializeWin64ProfilerHooks mozglue/baseprofiler/core/platform-win32.cpp:324
7 xul.dll static void locked_profiler_start tools/profiler/core/platform.cpp:4015
8 xul.dll static void profiler_start tools/profiler/core/platform.cpp:4157
9 xul.dll class mozilla::ipc::IPCResult mozilla::ProfilerChild::RecvStart tools/profiler/gecko/ProfilerChild.cpp:30

Just two crashes from buildid 20191123094742, CC'ing :toshi because he's the last one who touched this code in bug 1596930. Note that both crashes are coming from the rdd process (remote data decoder).

Actually there were older crashes which I hadn't noticed, and they're all in the rdd process, maybe we should move this to the Audio/Video component?

Component: Launcher Process → Audio/Video
Product: Firefox → Core

I'm still analyzing a couple of raw dumps. This happened here, where our detour tried to put a value to a trampoline area to hook LdrUnloadDll for the profiler, but the memory region was PAGE_EXECUTE_READ, not writable, for some unknown reason, resulting in AV.

I don't see anything rdd-process-specific yet. Let me spend more time on figuring out what happened.

Priority: -- → P2

Okay, I've got a consistent repro of this crash. If we start Gecko Profiler when the rdd process is running (e.g. opening youtube.com is enough), this crash 100% occurs.

The reason why the trampoline area was not writable was this call to VirtualProtect failed. And the reason was ProcessDynamicCodePolicy is enabled in the rdd process when it started. The purpose of ProcessDynamicCodePolicy is to precisely prevent this kind of detour behavior.

At least out detour should just fail without crash if VirtualProtect failed for whatever reason. I think it's not a difficult fix.

Bob, I think you're an expert of this sandbox code. Do you have any suggestion/idea about how we should deal with a profiler in a sandbox process?

Flags: needinfo?(bobowencode)

(In reply to Toshihito Kikuchi [:toshi] from comment #3)
...

At least out detour should just fail without crash if VirtualProtect failed for whatever reason. I think it's not a difficult fix.

I agree that we probably shouldn't crash, but aklotz is probably the best person to answer on this.

Bob, I think you're an expert of this sandbox code. Do you have any suggestion/idea about how we should deal with a profiler in a sandbox process?

It looks like the hooks are there to prevent a possible deadlock during stack unwinding, I guess the profiler guys are best able to answer whether or not ignoring the failures is likely to often cause problems.

Over a solution to the hook failure, perhaps we can set the hooks from a different process.
In theory they could be set before the sandbox is lowered, but you'd need to know you wanted to profile at RDD process start-up, so probably not all that workable.

Flags: needinfo?(bobowencode)

Thanks, Bob. I agree that we need opinions from the profiler side.

Gerald, our detour cannot work in a sandbox process because of ProcessDynamicCodePolicy. This means the hooks in InitializeWin64ProfilerHooks fail to be planted in the rdd process (and other sandbox processes if that policy is on).

Currently our detour cannot handle this situation, resulting in crash. From detour side, we can make our hook API just fail without crash. On the profiler side, can you check the profiler will not cause any unwanted problems even if InitializeWin64ProfilerHooks fails to hook the functions?

Flags: needinfo?(gsquelart)

I'm afraid that if InitializeWin64ProfilerHooks fails, Firefox would likely crash later on when the profiler is sampling a stack while a DLL is being loaded.
But I'm far from being the expert on this...
David and/or Markus, what do you think?

Notes: This looks like this crash is in the Base Profiler's InitializeWin64ProfilerHooks that lives in mozglue. I'm guessing this issue didn't happen with the same code in libxul(?)
Since bug 1492121 (landed in Firefox 69), MOZ_PROFILER_BASE is enabled by default on Windows, and so Firefox only uses the mozglue InitializeWin64ProfilerHooks. Now, this is only really needed when doing startup profiling, so one workaround would be to only call mozglue's InitializeWin64ProfilerHooks in this case, so at least it wouldn't affect as many users.

Flags: needinfo?(mstange)
Flags: needinfo?(dmajor)
Flags: needinfo?(gsquelart)

(In reply to Gerald Squelart [:gerald] (he/him) from comment #6)

Since bug 1492121 (landed in Firefox 69), MOZ_PROFILER_BASE is enabled by default on Windows, and so Firefox only uses the mozglue InitializeWin64ProfilerHooks. Now, this is only really needed when doing startup profiling, so one workaround would be to only call mozglue's InitializeWin64ProfilerHooks in this case, so at least it wouldn't affect as many users.

If InitializeWin64ProfilerHooks is needed only for startup profiling, maybe we can skip ProcessDynamicCodePolicy if startup profiling is enabled because it's not a normal user scenario and we may not have to enable full protection?

I'm going to defer to David Major or Aaron Klotz on this.

I think the DLL loading interceptor is mostly only useful in the parent process - that's where much of the startup time is spent in DLL loading. For other process it may be less useful.

I'm not a fan of weakening the sandbox when startup profiling is enabled. We want to profile realistic work loads, and if this particular feature of the sandbox creates a performance issue of some kind, we want to be able to catch it.

Flags: needinfo?(mstange)

(In reply to Toshihito Kikuchi [:toshi] from comment #7)

(In reply to Gerald Squelart [:gerald] (he/him) from comment #6)

Since bug 1492121 (landed in Firefox 69), MOZ_PROFILER_BASE is enabled by default on Windows, and so Firefox only uses the mozglue InitializeWin64ProfilerHooks. Now, this is only really needed when doing startup profiling, so one workaround would be to only call mozglue's InitializeWin64ProfilerHooks in this case, so at least it wouldn't affect as many users.

If InitializeWin64ProfilerHooks is needed only for startup profiling, maybe we can skip ProcessDynamicCodePolicy if startup profiling is enabled because it's not a normal user scenario and we may not have to enable full protection?

If this is important only for startup profiling then we might be OK, because it looks like if MOZ_PROFILER_STARTUP env var is set, it initializes really early, before we lower the sandbox (I think but needs testing).
So, in that case I think the hooks might work.
If profiling is enabled once the process is already running (and most of the DLL loading has happened) maybe we can ignore the failure without too much risk.

(In reply to Bob Owen (:bobowen) from comment #9)

If this is important only for startup profiling then we might be OK, because it looks like if MOZ_PROFILER_STARTUP env var is set, it initializes really early, before we lower the sandbox (I think but needs testing).
So, in that case I think the hooks might work.
If profiling is enabled once the process is already running (and most of the DLL loading has happened) maybe we can ignore the failure without too much risk.

That's a good point. Actually you're right. When MOZ_PROFILER_STARTUP=1 is set, I confirmed profiler was initialized first from here, and then ProcessDynamicCodePolicy was set from here. The hooks succeeded in the rdd process. So the workaround Gerald suggested above sounds like a good solution.

(In reply to Toshihito Kikuchi [:toshi] from comment #10)

When MOZ_PROFILER_STARTUP=1 is set, I confirmed profiler was initialized first from here

FYI, the Base Profiler is currently controlled by MOZ_BASE_PROFILER_STARTUP=1 (notice the extra "_BASE"), in which case the profiler is initialized from here in naBrowserApp.cpp's main function, and must call mozglue's InitializeWin64ProfilerHooks. It's even earlier -- good or bad?

Now, as an extra kink: In WIP bug 1586939 I'm looking at combining MOZ_PROFILER_STARTUP and MOZ_BASE_PROFILER_STARTUP. In particular this will mean that even when the user starts the profiler manually (i.e., not during Firefox startup), if afterwards a new process is created (content or otherwise) it will effectively run the Base Profiler initialization there and call mozglue's InitializeWin64ProfilerHooks.

When MOZ_BASE_PROFILER_STARTUP=1 is set, I can see InitializeWin64ProfilerHooks is called twice in the rdd process, but both happened before ProcessDynamicCodePolicy is set. No crash was observed.

Here are the two callstacks.

00 00000073`019ff368 00007fff`b7e3236b mozglue!mozilla::baseprofiler::InitializeWin64ProfilerHooks
01 00000073`019ff370 00007fff`b7e30e4f mozglue!mozilla::baseprofiler::locked_profiler_start+0x82b
02 00000073`019ff450 00007ff7`cd14133f mozglue!mozilla::baseprofiler::profiler_init+0x61f
03 (Inline Function) --------`-------- firefox!mozilla::baseprofiler::AutoProfilerInit::AutoProfilerInit+0x6
04 00000073`019ff530 00007ff7`cd141232 firefox!NS_internal_main+0x3f
05 00000073`019ff700 00007ff7`cd190188 firefox!wmain+0x232
06 (Inline Function) --------`-------- firefox!invoke_main+0x22
07 00000073`019ff7c0 00007fff`cf527bd4 firefox!__scrt_common_main_seh+0x10c
08 00000073`019ff800 00007fff`d0e0ced1 KERNEL32!BaseThreadInitThunk+0x14
09 00000073`019ff830 00000000`00000000 ntdll!RtlUserThreadStart+0x21

00 00000073`019ff0f8 00007fff`7fca456c mozglue!mozilla::baseprofiler::InitializeWin64ProfilerHooks
01 00000073`019ff100 00007fff`7fca361e xul!locked_profiler_start+0x9c
02 00000073`019ff1f0 00007fff`7ff2b1a4 xul!profiler_init+0x7ee
03 (Inline Function) --------`-------- xul!mozilla::AutoProfilerInit::AutoProfilerInit+0xa
04 00000073`019ff2e0 00007ff7`cd1415d0 xul!XRE_InitChildProcess+0xf4
05 (Inline Function) --------`-------- firefox!content_process_main+0x8a
06 00000073`019ff530 00007ff7`cd141232 firefox!NS_internal_main+0x2d0
07 00000073`019ff700 00007ff7`cd190188 firefox!wmain+0x232
08 (Inline Function) --------`-------- firefox!invoke_main+0x22
09 00000073`019ff7c0 00007fff`cf527bd4 firefox!__scrt_common_main_seh+0x10c
0a 00000073`019ff800 00007fff`d0e0ced1 KERNEL32!BaseThreadInitThunk+0x14
0b 00000073`019ff830 00000000`00000000 ntdll!RtlUserThreadStart+0x21

As a historical note, the Win64 profiler hooks are there to prevent deadlocks that can happen when the profiler suspends various threads for sampling. Prior to the hooks, the Windows profiler used to have a bad reputation for causing trouble for performance engineers, often blocking their work. I would be extremely reluctant to bypass the hooks, even if things appear fine at first, it's often only later that the subtle issues get reported. In fact I might even go so far as to say that if setting those hooks fails, profiling should be disabled for that process.

Flags: needinfo?(dmajor)

My current patches to enable more decoders in rdd process appear to permafail on this issue on try. Eg: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=279555549&repo=try&lineNumber=7866

Is there something that I've done to make the issue worse? And if so, can I mitigate it?

:toshi, based on comment 13 and others, I think ideally we should disable profiling if setting the hook fails... So, would you be able to "fail graciously" instead of crashing, and return the success/error status from InitializeWin64ProfilerHooks()?

Flags: needinfo?(tkikuchi)

Sure thing. Let me assign this bug to myself..

Flags: needinfo?(tkikuchi)
Assignee: nobody → tkikuchi
Component: Audio/Video → mozglue
OS: Windows 10 → Windows
OS: Windows → Windows 10

I noticed this crash does no longer happen on the latest Nightly. I believe it's because Bug 1522830 changed our hook behavior in a child process.
More specifically, we don't hook kernel32!BaseThreadInitThunk in a child process anymore, which means we never have a trampoline region even before we turn on the dynamic code policy. In that case, it seems that our detour can handle an error correctly.

Gerald, I'll still plan to prepare a patch of this kind of crash as a DiD fix, but it will not block profiler side's work. Now two hook APIs in InitializeWin64ProfilerHooks return false in the rdd process when Gecko Profiler starts.

Our detour allocates a trampoline with PAGE_EXECUTE_READ first, and then makes
it writable before use. If the dynamic code policy is enabled after allocation,
we fail to change the attribute, and crash the process because we try to write
data into a readonly page. We need to check the validity of a trampoline before
writing data.

Pushed by cbrindusan@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a39663ca4c56
Graciously return a failure if we fail to change the attribute of a trampoline.  r=handyman,dmajor
Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla73
You need to log in before you can comment on or make changes to this bug.