Assertions in mozilla::mscom::ProcessRuntime::InitInsideApartment due to injected DLLs
Categories
(External Software Affecting Firefox :: Other, defect)
Tracking
(firefox-esr78 unaffected, firefox-esr91 wontfix, firefox90 unaffected, firefox91 wontfix, firefox92 wontfix, firefox97 wontfix, firefox98 wontfix, firefox99 wontfix, firefox100 wontfix, firefox101- wontfix, firefox102+ wontfix)
People
(Reporter: bugzilla, Unassigned)
References
(Regression)
Details
(Keywords: crash, regression)
Crash Data
If we look at https://crash-stats.mozilla.org/report/index/75baf12e-7112-471a-b745-5d8030210624, we see that something has caused COM initialization before our own mozilla::mscom::ProcessRuntime
object got to it.
In that report, I see libzdtp64.dll
and SafeWrapper.dll
injected into our process, both signed by Beijing Qihu Technology Co., Ltd.
.
We should take a look at the third-party modules ping to see if we can determine the injection vector for this.
(Note that I have no plans on removing the assertion in question, as it is a diagnostic assert to catch more serious problems, and is only affecting a small number of installations.)
Comment 1•3 years ago
|
||
(In reply to Aaron Klotz [:aklotz] from comment #0)
(Note that I have no plans on removing the assertion in question, as it is a diagnostic assert to catch more serious problems, and is only affecting a small number of installations.)
As I mentioned in this comment, our crash reporter does not catch this assert in the main process, and it might prevent Firefox from being auto-updated.
Comment 2•3 years ago
|
||
Could Fission make this crash more likely?
- 94% (758 out of 805) of these Nightly crash reports have Fission enabled, while only ~65% of Nightly users have Fission enabled.
- 86% (38 out of 44) of these Dev Edition crash reports have Fission enabled, while only ~15% of Dev Edition users have Fission enabled (from our Fission Beta/Dev Edition experiment).
(We don't have any crash reports from Beta users because MOZ_DIAGNOSTIC_ASSERT is not defined in Beta or Release builds.)
Comment 3•3 years ago
|
||
Could Fission make this crash more likely?
Perhaps Fission users are just more likely to hit this crash because Fission launches more content processes, not due to anything about Fission code?
Comment 4•3 years ago
|
||
This is probably not a Fission bug. We're probably seeing more crash reports with Fission enabled because Fission launches more content processes.
Can we block libzdtp64.dll and SafeWrapper.dll?
Bug 1706031 is another crash related to Qiho software. That bug's crash reports also contain libzdtp64.dll and SafeWrapper.dll.
Reporter | ||
Comment 5•3 years ago
•
|
||
(In reply to Toshihito Kikuchi [:toshi] from comment #1)
(In reply to Aaron Klotz [:aklotz] from comment #0)
(Note that I have no plans on removing the assertion in question, as it is a diagnostic assert to catch more serious problems, and is only affecting a small number of installations.)
As I mentioned in this comment, our crash reporter does not catch this assert in the main process, and it might prevent Firefox from being auto-updated.
I held an "MSCOM brain dump" meeting on August 9. We discussed this bug among other issues, and I wanted to leave some notes here for posterity:
By default, for compatibility reasons, the COM runtime wraps itself in a catch-all SEH handler that suppresses those exceptions. This can manifest itself in many ways. One of the most significant is when our process passes in an interface pointer to be used as a callback and the COM runtime later invokes our interface. If our implementation of that interface crashes, that crash will be suppressed by COM. At this risk of pointing out the obvious, this leaves the process in an undefined state. Any crashes within the COM runtime itself will also be suppressed by the catch-all SEH handler, also leaving the process in an undefined state.
Fortunately COM allows us to override this default behaviour to ensure that exceptions propagate as expected. This is done via the IGlobalOptions
interface. Our use of mscom::ProcessRuntime
to start up COM ensures that this happens. HOWEVER: IGlobalOptions
must be set before any interface (de)marshaling is performed by the COM runtime. Given the various hooks and other opportunities for marshaling to occur transitively, we must therefore set this option before creating any windows, pumping any messages, or invoking any other APIs that might trigger COM under the hood.
Further complicating the situation is that any calls to CoInitializeSecurity
also have the same restriction, with the added dependency that it should only be called once per process, and it must be called before instantiating CLSID_GlobalOptions
. Since CoInitializeSecurity
and CLSID_GlobalOptions
both have the same restrictions re: running prior to any marshaling, if CoInitializeSecurity
fails with RPC_E_TOO_LATE
, then any subsequent attempt to do anything with IGlobalOptions
is also likely to fail.
(If you read the ProcessRuntime
source code, you see that we still try to instantiate CLSID_GlobalOptions
anyway, but this is more of a last-ditch, "I don't expect this to work but I'm going to try anyway out of desparation" maneuver. As I understand things, if it is too late for CoInitializeSecurity
, then it is too late for CLSID_GlobalOptions
as well.)
Analyzing these dependencies, this means that we end up with the following order of operations:
CoInitializeSecurity
;- Instantiate
CLSID_GlobalOptions
and disable the catch-all SEH handler; - Create UI, pump messages, marshal COM interfaces, etc...
Note that these warnings are not purely theoretical: when I first enabled ProcessRuntime
during startup, crashes that were previously suppressed by COM became revealed in at least two locations! It is therefore critically important that we ensure that our developers are not inadvertently breaking ProcessRuntime
when making changes to the Firefox startup sequence. Early and immediate detection is key, so I strongly believe that we detect such conditions at build time and have the sheriffs back out the culprit.
Having said that, I do acknowledge Toshi's concerns in comment 1. Rather than completely removing the diagnostic assert, I would strongly suggest that we make it "smarter" instead. Whether that means only asserting when MOZ_AUTOMATION
is set, or we only assert if we're not running in updater mode, or we set a flag that avoids asserting until later in the startup sequence, once the crash reporter is up... I'll leave that to my successors to decide what to do. I think if we're careful about how we do this, we can still catch self-induced errors without harming the user experience, even if it's only on pre-release channels.
Updated•2 years ago
|
Comment 6•2 years ago
•
|
||
There was a spike in the build with:
https://hg.mozilla.org/mozilla-central/rev/c467b50beb6f
and it dropped in the build with the backout.
Comment 7•2 years ago
|
||
The big spike in buildid 20220118095036 was caused by win32k lockdown being enabled by default on Nightly.
This caused an issue in COM initialisation, which we are investigating.
Comment 8•2 years ago
|
||
There was a spike in the build with:
https://hg.mozilla.org/mozilla-central/rev/c467b50beb6f
As pointed out in the bug for that revision, it's not really related to this one:
https://bugzilla.mozilla.org/show_bug.cgi?id=1381019#c10
Comment 9•2 years ago
|
||
Setting S3 as this is a diagnostics assert.
Comment 10•2 years ago
|
||
Setting as blocker as we had to stop nightly updates.
Updated•2 years ago
|
Comment 11•2 years ago
|
||
Changing the priority to p2 as the bug is tracked by a release manager for the current nightly.
See What Do You Triage for more information
Comment 12•2 years ago
|
||
:tjr, since you are the author of the regressor, bug 1759167, could you take a look?
For more information, please visit auto_nag documentation.
Updated•2 years ago
|
Comment 13•2 years ago
|
||
Not sure if this is being handled here or Bug 1759167; but Bob will take a look.
Comment 14•2 years ago
|
||
We confirmed that Avast and "Crypto Pro" (whatever that is) are the main "security" software that's breaking the sandbox.
Updated•2 years ago
|
Updated•2 years ago
|
Comment 15•2 years ago
|
||
(In reply to Tom Ritter [:tjr] (ni? for response to CVE/sec-approval/advisories/etc) from comment #13)
Not sure if this is being handled here or Bug 1759167; but Bob will take a look.
I'll try and fix these recent win32k lockdown cases in bug 1759167.
Updated•2 years ago
|
Comment 16•2 years ago
|
||
The severity field for this bug is set to S3. However, the bug has the topcrash
keyword.
:haik, could you consider increasing the severity of this top-crash bug? If the crash isn't "top" anymore, could you drop the topcrash
keyword?
For more information, please visit auto_nag documentation.
Comment 17•2 years ago
•
|
||
This was set as a topcrash, because win32k lockdown caused a spike in earlier windows.
Updated•2 years ago
|
Comment 18•2 years ago
|
||
Back as a top crash on Nightly
Updated•2 years ago
|
Comment 19•2 years ago
|
||
The new ones all appear to be the utility process.
I guess we need to exclude win32k lockdown from earlier versions, same as for content process:
https://searchfox.org/mozilla-central/rev/8fe6930c0832009b3162bebee7d4ede1a4c8c9a8/toolkit/xre/nsAppRunner.cpp#718
Probably best done in a separate bug though.
Updated•2 years ago
|
Updated•2 years ago
|
Comment 20•2 years ago
|
||
From https://hg.mozilla.org/mozilla-central/rev/a534d0739a48 of bug 1769992 this landed on 20220518214245. From crash-stats, it seems the newest buildid still generating crashes as of today is 20220518094725, so I think we can safely assume the issue on Utility is fixed.
Comment 21•2 years ago
|
||
Removing topcrash keyword now this has dropped back down to the background level.
Updated•2 years ago
|
Comment 22•2 years ago
|
||
Since the crash volume is low (less than 5 per week), the severity is downgraded to S3
. Feel free to change it back if you think the bug is still critical.
For more information, please visit auto_nag documentation.
Comment 23•1 year ago
|
||
The bug is linked to a topcrash signature, which matches the following criteria:
- Top 20 desktop browser crashes on beta
- Top 10 content process crashes on beta
:haik, could you consider increasing the severity of this top-crash bug?
For more information, please visit auto_nag documentation.
Comment 24•1 year ago
|
||
There was a spike around November 18th, but other than that this crash continues to be <10 crashes per day typically 1 or 2.
Comment 25•1 year ago
|
||
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit auto_nag documentation.
Comment 26•10 months ago
|
||
Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criteria:
- Top 20 desktop browser crashes on beta
- Top 10 content process crashes on beta
For more information, please visit BugBot documentation.
Comment 27•8 months ago
|
||
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Comment 28•8 months ago
|
||
Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criteria:
- Top 20 desktop browser crashes on beta
- Top 10 content process crashes on beta
For more information, please visit BugBot documentation.
Comment 29•7 months ago
|
||
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Description
•