Crash in [@ mozilla::interceptor::WindowsDllDetourPatcher<T>::CreateTrampoline]
Categories
(Core :: mozglue, defect, P2)
Tracking
()
People
(Reporter: gsvelto, Assigned: toshi)
References
Details
(Keywords: crash)
Crash Data
Attachments
(1 file)
This bug is for crash report bp-88cbeeff-3c98-4306-9411-825750191124.
Top 10 frames of crashing thread:
0 mozglue.dll mozilla::interceptor::WindowsDllDetourPatcher<mozilla::interceptor::VMSharingPolicyShared<mozilla::interceptor::MMPolicyInProcess, 1> >::CreateTrampoline mozglue/misc/interceptor/PatcherDetour.h:725
1 mozglue.dll mozilla::interceptor::WindowsDllDetourPatcher<mozilla::interceptor::VMSharingPolicyShared<mozilla::interceptor::MMPolicyInProcess, 1> >::AddHook mozglue/misc/interceptor/PatcherDetour.h:379
2 mozglue.dll static int mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared<mozilla::interceptor::MMPolicyInProcess, 1> >, long mozglue/misc/nsWindowsDllInterceptor.h
3 ntdll.dll RtlRunOnceExecuteOnce
4 kernelbase.dll kernelbase.dll@0x6207a
5 mozglue.dll ceil
6 mozglue.dll mozilla::baseprofiler::InitializeWin64ProfilerHooks mozglue/baseprofiler/core/platform-win32.cpp:324
7 xul.dll static void locked_profiler_start tools/profiler/core/platform.cpp:4015
8 xul.dll static void profiler_start tools/profiler/core/platform.cpp:4157
9 xul.dll class mozilla::ipc::IPCResult mozilla::ProfilerChild::RecvStart tools/profiler/gecko/ProfilerChild.cpp:30
Just two crashes from buildid 20191123094742, CC'ing :toshi because he's the last one who touched this code in bug 1596930. Note that both crashes are coming from the rdd
process (remote data decoder).
Reporter | ||
Comment 1•5 years ago
|
||
Actually there were older crashes which I hadn't noticed, and they're all in the rdd
process, maybe we should move this to the Audio/Video component?
Assignee | ||
Comment 2•5 years ago
|
||
I'm still analyzing a couple of raw dumps. This happened here, where our detour tried to put a value to a trampoline area to hook LdrUnloadDll
for the profiler, but the memory region was PAGE_EXECUTE_READ
, not writable, for some unknown reason, resulting in AV.
I don't see anything rdd-process-specific yet. Let me spend more time on figuring out what happened.
Updated•5 years ago
|
Assignee | ||
Comment 3•5 years ago
•
|
||
Okay, I've got a consistent repro of this crash. If we start Gecko Profiler when the rdd process is running (e.g. opening youtube.com is enough), this crash 100% occurs.
The reason why the trampoline area was not writable was this call to VirtualProtect
failed. And the reason was ProcessDynamicCodePolicy
is enabled in the rdd process when it started. The purpose of ProcessDynamicCodePolicy
is to precisely prevent this kind of detour behavior.
At least out detour should just fail without crash if VirtualProtect
failed for whatever reason. I think it's not a difficult fix.
Bob, I think you're an expert of this sandbox code. Do you have any suggestion/idea about how we should deal with a profiler in a sandbox process?
Comment 4•5 years ago
|
||
(In reply to Toshihito Kikuchi [:toshi] from comment #3)
...
At least out detour should just fail without crash if
VirtualProtect
failed for whatever reason. I think it's not a difficult fix.
I agree that we probably shouldn't crash, but aklotz is probably the best person to answer on this.
Bob, I think you're an expert of this sandbox code. Do you have any suggestion/idea about how we should deal with a profiler in a sandbox process?
It looks like the hooks are there to prevent a possible deadlock during stack unwinding, I guess the profiler guys are best able to answer whether or not ignoring the failures is likely to often cause problems.
Over a solution to the hook failure, perhaps we can set the hooks from a different process.
In theory they could be set before the sandbox is lowered, but you'd need to know you wanted to profile at RDD process start-up, so probably not all that workable.
Assignee | ||
Comment 5•5 years ago
|
||
Thanks, Bob. I agree that we need opinions from the profiler side.
Gerald, our detour cannot work in a sandbox process because of ProcessDynamicCodePolicy
. This means the hooks in InitializeWin64ProfilerHooks
fail to be planted in the rdd process (and other sandbox processes if that policy is on).
Currently our detour cannot handle this situation, resulting in crash. From detour side, we can make our hook API just fail without crash. On the profiler side, can you check the profiler will not cause any unwanted problems even if InitializeWin64ProfilerHooks
fails to hook the functions?
I'm afraid that if InitializeWin64ProfilerHooks
fails, Firefox would likely crash later on when the profiler is sampling a stack while a DLL is being loaded.
But I'm far from being the expert on this...
David and/or Markus, what do you think?
Notes: This looks like this crash is in the Base Profiler's InitializeWin64ProfilerHooks
that lives in mozglue. I'm guessing this issue didn't happen with the same code in libxul(?)
Since bug 1492121 (landed in Firefox 69), MOZ_PROFILER_BASE
is enabled by default on Windows, and so Firefox only uses the mozglue InitializeWin64ProfilerHooks
. Now, this is only really needed when doing startup profiling, so one workaround would be to only call mozglue's InitializeWin64ProfilerHooks
in this case, so at least it wouldn't affect as many users.
Assignee | ||
Comment 7•5 years ago
|
||
(In reply to Gerald Squelart [:gerald] (he/him) from comment #6)
Since bug 1492121 (landed in Firefox 69),
MOZ_PROFILER_BASE
is enabled by default on Windows, and so Firefox only uses the mozglueInitializeWin64ProfilerHooks
. Now, this is only really needed when doing startup profiling, so one workaround would be to only call mozglue'sInitializeWin64ProfilerHooks
in this case, so at least it wouldn't affect as many users.
If InitializeWin64ProfilerHooks
is needed only for startup profiling, maybe we can skip ProcessDynamicCodePolicy
if startup profiling is enabled because it's not a normal user scenario and we may not have to enable full protection?
Comment 8•5 years ago
|
||
I'm going to defer to David Major or Aaron Klotz on this.
I think the DLL loading interceptor is mostly only useful in the parent process - that's where much of the startup time is spent in DLL loading. For other process it may be less useful.
I'm not a fan of weakening the sandbox when startup profiling is enabled. We want to profile realistic work loads, and if this particular feature of the sandbox creates a performance issue of some kind, we want to be able to catch it.
Comment 9•5 years ago
|
||
(In reply to Toshihito Kikuchi [:toshi] from comment #7)
(In reply to Gerald Squelart [:gerald] (he/him) from comment #6)
Since bug 1492121 (landed in Firefox 69),
MOZ_PROFILER_BASE
is enabled by default on Windows, and so Firefox only uses the mozglueInitializeWin64ProfilerHooks
. Now, this is only really needed when doing startup profiling, so one workaround would be to only call mozglue'sInitializeWin64ProfilerHooks
in this case, so at least it wouldn't affect as many users.If
InitializeWin64ProfilerHooks
is needed only for startup profiling, maybe we can skipProcessDynamicCodePolicy
if startup profiling is enabled because it's not a normal user scenario and we may not have to enable full protection?
If this is important only for startup profiling then we might be OK, because it looks like if MOZ_PROFILER_STARTUP env var is set, it initializes really early, before we lower the sandbox (I think but needs testing).
So, in that case I think the hooks might work.
If profiling is enabled once the process is already running (and most of the DLL loading has happened) maybe we can ignore the failure without too much risk.
Assignee | ||
Comment 10•5 years ago
|
||
(In reply to Bob Owen (:bobowen) from comment #9)
If this is important only for startup profiling then we might be OK, because it looks like if MOZ_PROFILER_STARTUP env var is set, it initializes really early, before we lower the sandbox (I think but needs testing).
So, in that case I think the hooks might work.
If profiling is enabled once the process is already running (and most of the DLL loading has happened) maybe we can ignore the failure without too much risk.
That's a good point. Actually you're right. When MOZ_PROFILER_STARTUP=1
is set, I confirmed profiler was initialized first from here, and then ProcessDynamicCodePolicy
was set from here. The hooks succeeded in the rdd process. So the workaround Gerald suggested above sounds like a good solution.
(In reply to Toshihito Kikuchi [:toshi] from comment #10)
When
MOZ_PROFILER_STARTUP=1
is set, I confirmed profiler was initialized first from here
FYI, the Base Profiler is currently controlled by MOZ_BASE_PROFILER_STARTUP=1
(notice the extra "_BASE"), in which case the profiler is initialized from here in naBrowserApp.cpp's main
function, and must call mozglue's InitializeWin64ProfilerHooks
. It's even earlier -- good or bad?
Now, as an extra kink: In WIP bug 1586939 I'm looking at combining MOZ_PROFILER_STARTUP
and MOZ_BASE_PROFILER_STARTUP
. In particular this will mean that even when the user starts the profiler manually (i.e., not during Firefox startup), if afterwards a new process is created (content or otherwise) it will effectively run the Base Profiler initialization there and call mozglue's InitializeWin64ProfilerHooks
.
Assignee | ||
Comment 12•5 years ago
|
||
When MOZ_BASE_PROFILER_STARTUP=1
is set, I can see InitializeWin64ProfilerHooks
is called twice in the rdd process, but both happened before ProcessDynamicCodePolicy is set. No crash was observed.
Here are the two callstacks.
00 00000073`019ff368 00007fff`b7e3236b mozglue!mozilla::baseprofiler::InitializeWin64ProfilerHooks
01 00000073`019ff370 00007fff`b7e30e4f mozglue!mozilla::baseprofiler::locked_profiler_start+0x82b
02 00000073`019ff450 00007ff7`cd14133f mozglue!mozilla::baseprofiler::profiler_init+0x61f
03 (Inline Function) --------`-------- firefox!mozilla::baseprofiler::AutoProfilerInit::AutoProfilerInit+0x6
04 00000073`019ff530 00007ff7`cd141232 firefox!NS_internal_main+0x3f
05 00000073`019ff700 00007ff7`cd190188 firefox!wmain+0x232
06 (Inline Function) --------`-------- firefox!invoke_main+0x22
07 00000073`019ff7c0 00007fff`cf527bd4 firefox!__scrt_common_main_seh+0x10c
08 00000073`019ff800 00007fff`d0e0ced1 KERNEL32!BaseThreadInitThunk+0x14
09 00000073`019ff830 00000000`00000000 ntdll!RtlUserThreadStart+0x21
00 00000073`019ff0f8 00007fff`7fca456c mozglue!mozilla::baseprofiler::InitializeWin64ProfilerHooks
01 00000073`019ff100 00007fff`7fca361e xul!locked_profiler_start+0x9c
02 00000073`019ff1f0 00007fff`7ff2b1a4 xul!profiler_init+0x7ee
03 (Inline Function) --------`-------- xul!mozilla::AutoProfilerInit::AutoProfilerInit+0xa
04 00000073`019ff2e0 00007ff7`cd1415d0 xul!XRE_InitChildProcess+0xf4
05 (Inline Function) --------`-------- firefox!content_process_main+0x8a
06 00000073`019ff530 00007ff7`cd141232 firefox!NS_internal_main+0x2d0
07 00000073`019ff700 00007ff7`cd190188 firefox!wmain+0x232
08 (Inline Function) --------`-------- firefox!invoke_main+0x22
09 00000073`019ff7c0 00007fff`cf527bd4 firefox!__scrt_common_main_seh+0x10c
0a 00000073`019ff800 00007fff`d0e0ced1 KERNEL32!BaseThreadInitThunk+0x14
0b 00000073`019ff830 00000000`00000000 ntdll!RtlUserThreadStart+0x21
Comment 13•5 years ago
|
||
As a historical note, the Win64 profiler hooks are there to prevent deadlocks that can happen when the profiler suspends various threads for sampling. Prior to the hooks, the Windows profiler used to have a bad reputation for causing trouble for performance engineers, often blocking their work. I would be extremely reluctant to bypass the hooks, even if things appear fine at first, it's often only later that the subtle issues get reported. In fact I might even go so far as to say that if setting those hooks fails, profiling should be disabled for that process.
Comment 14•5 years ago
|
||
My current patches to enable more decoders in rdd process appear to permafail on this issue on try. Eg: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=279555549&repo=try&lineNumber=7866
Is there something that I've done to make the issue worse? And if so, can I mitigate it?
:toshi, based on comment 13 and others, I think ideally we should disable profiling if setting the hook fails... So, would you be able to "fail graciously" instead of crashing, and return the success/error status from InitializeWin64ProfilerHooks()
?
Assignee | ||
Comment 16•5 years ago
|
||
Sure thing. Let me assign this bug to myself..
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 17•5 years ago
|
||
I noticed this crash does no longer happen on the latest Nightly. I believe it's because Bug 1522830 changed our hook behavior in a child process.
More specifically, we don't hook kernel32!BaseThreadInitThunk in a child process anymore, which means we never have a trampoline region even before we turn on the dynamic code policy. In that case, it seems that our detour can handle an error correctly.
Gerald, I'll still plan to prepare a patch of this kind of crash as a DiD fix, but it will not block profiler side's work. Now two hook APIs in InitializeWin64ProfilerHooks
return false
in the rdd process when Gecko Profiler starts.
Assignee | ||
Comment 18•5 years ago
|
||
Our detour allocates a trampoline with PAGE_EXECUTE_READ
first, and then makes
it writable before use. If the dynamic code policy is enabled after allocation,
we fail to change the attribute, and crash the process because we try to write
data into a readonly page. We need to check the validity of a trampoline before
writing data.
Comment 19•5 years ago
|
||
Pushed by cbrindusan@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a39663ca4c56 Graciously return a failure if we fail to change the attribute of a trampoline. r=handyman,dmajor
Comment 20•5 years ago
|
||
bugherder |
Updated•5 years ago
|
Description
•