Open Bug 1719212 Opened 3 years ago Updated 7 months ago

Assertions in mozilla::mscom::ProcessRuntime::InitInsideApartment due to injected DLLs

Categories

(External Software Affecting Firefox :: Other, defect)


Tracking

(firefox-esr78 unaffected, firefox-esr91 wontfix, firefox90 unaffected, firefox91 wontfix, firefox92 wontfix, firefox97 wontfix, firefox98 wontfix, firefox99 wontfix, firefox100 wontfix, firefox101- wontfix, firefox102+ wontfix)


People

(Reporter: bugzilla, Unassigned)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

If we look at https://crash-stats.mozilla.org/report/index/75baf12e-7112-471a-b745-5d8030210624, we see that something initialized COM before our own mozilla::mscom::ProcessRuntime object got to it.

In that report, I see libzdtp64.dll and SafeWrapper.dll injected into our process, both signed by Beijing Qihu Technology Co., Ltd.

We should take a look at the third-party modules ping to see if we can determine the injection vector for this.

(Note that I have no plans on removing the assertion in question, as it is a diagnostic assert to catch more serious problems, and is only affecting a small number of installations.)

(In reply to Aaron Klotz [:aklotz] from comment #0)

(Note that I have no plans on removing the assertion in question, as it is a diagnostic assert to catch more serious problems, and is only affecting a small number of installations.)

As I mentioned in this comment, our crash reporter does not catch this assert in the main process, and it might prevent Firefox from being auto-updated.

Could Fission make this crash more likely?

  • 94% (758 out of 805) of these Nightly crash reports have Fission enabled, while only ~65% of Nightly users have Fission enabled.
  • 86% (38 out of 44) of these Dev Edition crash reports have Fission enabled, while only ~15% of Dev Edition users have Fission enabled (from our Fission Beta/Dev Edition experiment).

(We don't have any crash reports from Beta users because MOZ_DIAGNOSTIC_ASSERT is not defined in Beta or Release builds.)

Fission Milestone: --- → ?
See Also: → 1707954

Could Fission make this crash more likely?

Perhaps Fission users are just more likely to hit this crash because Fission launches more content processes, not due to anything about Fission code?

Depends on: 1723837

This is probably not a Fission bug. We're probably seeing more crash reports with Fission enabled because Fission launches more content processes.

Can we block libzdtp64.dll and SafeWrapper.dll?

Bug 1706031 is another crash related to Qihu software. That bug's crash reports also contain libzdtp64.dll and SafeWrapper.dll.

Fission Milestone: ? → ---
See Also: → 1706031

(In reply to Toshihito Kikuchi [:toshi] from comment #1)

(In reply to Aaron Klotz [:aklotz] from comment #0)

(Note that I have no plans on removing the assertion in question, as it is a diagnostic assert to catch more serious problems, and is only affecting a small number of installations.)

As I mentioned in this comment, our crash reporter does not catch this assert in the main process, and it might prevent Firefox from being auto-updated.

I held an "MSCOM brain dump" meeting on August 9. We discussed this bug among other issues, and I wanted to leave some notes here for posterity:

By default, for compatibility reasons, the COM runtime wraps its calls in a catch-all SEH handler that suppresses exceptions raised during those calls. This can manifest itself in many ways. One of the most significant is when our process passes in an interface pointer to be used as a callback and the COM runtime later invokes our interface. If our implementation of that interface crashes, that crash will be suppressed by COM. At the risk of pointing out the obvious, this leaves the process in an undefined state. Any crashes within the COM runtime itself will also be suppressed by the catch-all SEH handler, also leaving the process in an undefined state.

Fortunately COM allows us to override this default behaviour to ensure that exceptions propagate as expected. This is done via the IGlobalOptions interface. Our use of mscom::ProcessRuntime to start up COM ensures that this happens. HOWEVER: IGlobalOptions must be set before any interface (de)marshaling is performed by the COM runtime. Given the various hooks and other opportunities for marshaling to occur transitively, we must therefore set this option before creating any windows, pumping any messages, or invoking any other APIs that might trigger COM under the hood.
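For reference, here is a minimal sketch of disabling COM's catch-all exception handling via IGlobalOptions. This is plain Win32/COM with a made-up helper name, not the actual ProcessRuntime code:

  #include <objbase.h>
  #include <objidl.h>  // IGlobalOptions, CLSID_GlobalOptions

  // Sketch: tell the COM runtime not to swallow exceptions raised by callees,
  // so crashes in our COM callbacks reach the crash reporter. This must run
  // before any interface (de)marshaling has occurred in the process.
  static HRESULT DisableComExceptionSuppression() {
    IGlobalOptions* globalOpts = nullptr;
    HRESULT hr = ::CoCreateInstance(CLSID_GlobalOptions, nullptr,
                                    CLSCTX_INPROC_SERVER,
                                    IID_PPV_ARGS(&globalOpts));
    if (FAILED(hr)) {
      return hr;
    }
    hr = globalOpts->Set(COMGLB_EXCEPTION_HANDLING,
                         COMGLB_EXCEPTION_DONOT_HANDLE_ANY);
    globalOpts->Release();
    return hr;
  }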

Further complicating the situation is that any calls to CoInitializeSecurity also have the same restriction, with the added dependency that it should only be called once per process, and it must be called before instantiating CLSID_GlobalOptions. Since CoInitializeSecurity and CLSID_GlobalOptions both have the same restrictions re: running prior to any marshaling, if CoInitializeSecurity fails with RPC_E_TOO_LATE, then any subsequent attempt to do anything with IGlobalOptions is also likely to fail.

(If you read the ProcessRuntime source code, you will see that we still try to instantiate CLSID_GlobalOptions anyway, but this is more of a last-ditch, "I don't expect this to work but I'm going to try anyway out of desperation" maneuver. As I understand things, if it is too late for CoInitializeSecurity, then it is too late for CLSID_GlobalOptions as well.)

Given these dependencies, we end up with the following order of operations (a simplified sketch follows the list):

  1. CoInitializeSecurity;
  2. Instantiate CLSID_GlobalOptions and disable the catch-all SEH handler;
  3. Create UI, pump messages, marshal COM interfaces, etc...
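Putting those steps together, a simplified sketch of the intended startup ordering might look like the following. This is not the actual ProcessRuntime implementation; the security parameters are illustrative placeholders, and DisableComExceptionSuppression refers to the earlier sketch:

  #include <windows.h>
  #include <objbase.h>

  // Simplified sketch of the required ordering; real error handling and the
  // actual ProcessRuntime logic are omitted.
  static bool InitComEarly() {
    // Enter the apartment before doing anything else COM-related.
    if (FAILED(::CoInitializeEx(nullptr, COINIT_APARTMENTTHREADED))) {
      return false;
    }

    // 1. CoInitializeSecurity: once per process, and before any marshaling.
    HRESULT hr = ::CoInitializeSecurity(
        nullptr, -1, nullptr, nullptr, RPC_C_AUTHN_LEVEL_DEFAULT,
        RPC_C_IMP_LEVEL_IDENTIFY, nullptr, EOAC_NONE, nullptr);
    if (hr == RPC_E_TOO_LATE) {
      // Something (e.g. an injected DLL) beat us to it. We still attempt the
      // IGlobalOptions step below, but it is unlikely to succeed.
    }

    // 2. Instantiate CLSID_GlobalOptions and disable the catch-all SEH handler.
    if (FAILED(DisableComExceptionSuppression())) {
      return false;
    }

    // 3. Only now is it safe to create UI, pump messages, and marshal COM
    //    interfaces.
    return true;
  }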

Note that these warnings are not purely theoretical: when I first enabled ProcessRuntime during startup, crashes that were previously suppressed by COM were revealed in at least two locations! It is therefore critically important that we ensure that our developers are not inadvertently breaking ProcessRuntime when making changes to the Firefox startup sequence. Early and immediate detection is key, so I strongly believe that we should detect such conditions at build time and have the sheriffs back out the culprit.

Having said that, I do acknowledge Toshi's concerns in comment 1. Rather than completely removing the diagnostic assert, I would strongly suggest that we make it "smarter" instead. Whether that means asserting only when MOZ_AUTOMATION is set, asserting only when we're not running in updater mode, or setting a flag that defers the assert until later in the startup sequence, once the crash reporter is up... I'll leave that to my successors to decide. I think if we're careful about how we do this, we can still catch self-induced errors without harming the user experience, even if it's only on pre-release channels.
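To make one of those options concrete, here is a hypothetical sketch of gating the assert on MOZ_AUTOMATION (assuming it is visible as an environment variable at runtime); the helper and its argument are made-up names, not the real ProcessRuntime condition:

  #include <cstdlib>               // getenv
  #include "mozilla/Assertions.h"  // MOZ_DIAGNOSTIC_ASSERT

  // Hypothetical sketch: only fire the diagnostic assert in automation, where
  // it is useful for catching regressions, and stay quiet on user machines
  // where it could interfere with auto-updates.
  static void AssertComInitSucceeded(bool aInitSucceeded) {
    if (std::getenv("MOZ_AUTOMATION")) {
      MOZ_DIAGNOSTIC_ASSERT(aInitSucceeded,
                            "COM was initialized before ProcessRuntime");
    }
  }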

There was a spike in the build with:
https://hg.mozilla.org/mozilla-central/rev/c467b50beb6f
and it dropped in the build with the backout.

The big spike in buildid 20220118095036 was caused by win32k lockdown being enabled by default on Nightly.
This caused an issue in COM initialisation, which we are investigating.

There was a spike in the build with:
https://hg.mozilla.org/mozilla-central/rev/c467b50beb6f

As pointed out in the bug for that revision, it's not really related to this one:
https://bugzilla.mozilla.org/show_bug.cgi?id=1381019#c10

See Also: → 1751367

Setting S3 as this is a diagnostic assert.

Severity: -- → S3
Priority: -- → P3
See Also: 1706031

Setting as a blocker, as we had to stop Nightly updates.

Changing the priority to P2 as the bug is tracked by a release manager for the current Nightly.
See What Do You Triage for more information

Priority: P3 → P2

:tjr, since you are the author of the regressor, bug 1759167, could you take a look?
For more information, please visit auto_nag documentation.

Flags: needinfo?(tom)

Not sure if this is being handled here or in Bug 1759167, but Bob will take a look.

Flags: needinfo?(tom) → needinfo?(bobowencode)

We confirmed that Avast and "Crypto Pro" (whatever that is) are the main "security" software that's breaking the sandbox.

(In reply to Tom Ritter [:tjr] (ni? for response to CVE/sec-approval/advisories/etc) from comment #13)

Not sure if this is being handled here or in Bug 1759167, but Bob will take a look.

I'll try and fix these recent win32k lockdown cases in bug 1759167.

Flags: needinfo?(bobowencode)

The severity field for this bug is set to S3. However, the bug has the topcrash keyword.
:haik, could you consider increasing the severity of this top-crash bug? If the crash isn't "top" anymore, could you drop the topcrash keyword?

For more information, please visit auto_nag documentation.

Flags: needinfo?(haftandilian)

This was set as a topcrash because win32k lockdown caused a spike on earlier versions of Windows.

Flags: needinfo?(haftandilian)
Keywords: topcrash

Back as a top crash on Nightly

Severity: S3 → S1
Keywords: topcrash
Priority: P2 → P1

The new crashes all appear to be in the utility process.
I guess we need to exclude win32k lockdown on earlier Windows versions, the same as for the content process:
https://searchfox.org/mozilla-central/rev/8fe6930c0832009b3162bebee7d4ede1a4c8c9a8/toolkit/xre/nsAppRunner.cpp#718

Probably best done in a separate bug though.
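For what it's worth, here is a rough, hypothetical sketch of that kind of gating: query the real OS build number and only permit win32k lockdown on sufficiently new builds. The threshold (build 17763) is a placeholder, not the value the content-process check actually uses:

  #include <windows.h>

  // Hypothetical sketch: ask ntdll's RtlGetVersion for the real build number
  // (it is not affected by manifest-based version lying) and gate win32k
  // lockdown on it.
  static bool IsWin32kLockdownSupported() {
    using RtlGetVersionFn = LONG(WINAPI*)(OSVERSIONINFOW*);
    auto rtlGetVersion = reinterpret_cast<RtlGetVersionFn>(
        ::GetProcAddress(::GetModuleHandleW(L"ntdll.dll"), "RtlGetVersion"));
    if (!rtlGetVersion) {
      return false;
    }
    OSVERSIONINFOW info = {};
    info.dwOSVersionInfoSize = sizeof(info);
    if (rtlGetVersion(&info) != 0) {  // 0 == STATUS_SUCCESS
      return false;
    }
    // Placeholder threshold: Windows 10 build 17763 (version 1809).
    return info.dwMajorVersion > 10 ||
           (info.dwMajorVersion == 10 && info.dwBuildNumber >= 17763);
  }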

Flags: needinfo?(lissyx+mozillians)
Flags: needinfo?(lissyx+mozillians)
Depends on: 1769992

According to https://hg.mozilla.org/mozilla-central/rev/a534d0739a48, the fix from bug 1769992 landed in build 20220518214245. From crash-stats, it seems the newest buildid still generating crashes as of today is 20220518094725, so I think we can safely assume the issue in the utility process is fixed.

Removing the topcrash keyword now that this has dropped back down to the background level.

Keywords: topcrash

Since the crash volume is low (less than 5 per week), the severity is downgraded to S3. Feel free to change it back if you think the bug is still critical.

For more information, please visit auto_nag documentation.

Severity: S1 → S3
Priority: P1 → --

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on beta
  • Top 10 content process crashes on beta

:haik, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(haftandilian)
Keywords: topcrash

There was a spike around November 18th, but other than that this crash continues at fewer than 10 crashes per day, typically 1 or 2.

Flags: needinfo?(haftandilian)

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash
Blocks: 1825290
See Also: 1759167

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on beta
  • Top 10 content process crashes on beta

For more information, please visit BugBot documentation.

Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on beta
  • Top 10 content process crashes on beta

For more information, please visit BugBot documentation.

Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash