Open Bug 1719212 Opened 3 years ago Updated 7 months ago

Assertions in mozilla::mscom::ProcessRuntime::InitInsideApartment due to injected DLLs

Categories

(External Software Affecting Firefox :: Other, defect)


Tracking

(firefox-esr78 unaffected, firefox-esr91 wontfix, firefox90 unaffected, firefox91 wontfix, firefox92 wontfix, firefox97 wontfix, firefox98 wontfix, firefox99 wontfix, firefox100 wontfix, firefox101- wontfix, firefox102+ wontfix)


People

(Reporter: bugzilla, Unassigned)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

If we look at https://crash-stats.mozilla.org/report/index/75baf12e-7112-471a-b745-5d8030210624, we see that something initialized COM before our own mozilla::mscom::ProcessRuntime object got to it.

In that report, I see libzdtp64.dll and SafeWrapper.dll injected into our process, both signed by Beijing Qihu Technology Co., Ltd.

We should take a look at the third-party modules ping to see if we can determine the injection vector for this.

(Note that I have no plans on removing the assertion in question, as it is a diagnostic assert to catch more serious problems, and is only affecting a small number of installations.)

(In reply to Aaron Klotz [:aklotz] from comment #0)

(Note that I have no plans on removing the assertion in question, as it is a diagnostic assert to catch more serious problems, and is only affecting a small number of installations.)

As I mentioned in this comment, our crash reporter does not catch this assert in the main process, and it might prevent Firefox from being auto-updated.

Could Fission make this crash more likely?

  • 94% (758 out of 805) of these Nightly crash reports have Fission enabled, while only ~65% of Nightly users have Fission enabled.
  • 86% (38 out of 44) of these Dev Edition crash reports have Fission enabled, while only ~15% of Dev Edition users have Fission enabled (from our Fission Beta/Dev Edition experiment).

(We don't have any crash reports from Beta users because MOZ_DIAGNOSTIC_ASSERT is not defined in Beta or Release builds.)

Fission Milestone: --- → ?
See Also: → 1707954

Could Fission make this crash more likely?

Perhaps Fission users are just more likely to hit this crash because Fission launches more content processes, not due to anything about Fission code?

Depends on: 1723837

This is probably not a Fission bug. We're probably seeing more crash reports with Fission enabled because Fission launches more content processes.

Can we block libzdtp64.dll and SafeWrapper.dll?

Bug 1706031 is another crash related to Qihu software. That bug's crash reports also contain libzdtp64.dll and SafeWrapper.dll.

Fission Milestone: ? → ---
See Also: → 1706031

(In reply to Toshihito Kikuchi [:toshi] from comment #1)

(In reply to Aaron Klotz [:aklotz] from comment #0)

(Note that I have no plans on removing the assertion in question, as it is a diagnostic assert to catch more serious problems, and is only affecting a small number of installations.)

As I mentioned in this comment, our crash reporter does not catch this assert in the main process, and it might prevent Firefox from being auto-updated.

I held an "MSCOM brain dump" meeting on August 9. We discussed this bug among other issues, and I wanted to leave some notes here for posterity:

By default, for compatibility reasons, the COM runtime wraps its calls in a catch-all SEH handler that suppresses exceptions raised during those calls. This can manifest itself in many ways. One of the most significant is when our process passes in an interface pointer to be used as a callback and the COM runtime later invokes our interface. If our implementation of that interface crashes, that crash will be suppressed by COM. At the risk of pointing out the obvious, this leaves the process in an undefined state. Any crashes within the COM runtime itself will also be suppressed by the catch-all SEH handler, also leaving the process in an undefined state.

Fortunately COM allows us to override this default behaviour to ensure that exceptions propagate as expected. This is done via the IGlobalOptions interface. Our use of mscom::ProcessRuntime to start up COM ensures that this happens. HOWEVER: IGlobalOptions must be set before any interface (de)marshaling is performed by the COM runtime. Given the various hooks and other opportunities for marshaling to occur transitively, we must therefore set this option before creating any windows, pumping any messages, or invoking any other APIs that might trigger COM under the hood.
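For reference, here is a minimal sketch of disabling COM's catch-all exception handling via IGlobalOptions. This is plain Win32/COM with a made-up helper name, not the actual ProcessRuntime code:

  #include <objbase.h>
  #include <objidl.h>  // IGlobalOptions, CLSID_GlobalOptions

  // Sketch: tell the COM runtime not to swallow exceptions raised by callees,
  // so crashes in our COM callbacks reach the crash reporter. This must run
  // before any interface (de)marshaling has occurred in the process.
  static HRESULT DisableComExceptionSuppression() {
    IGlobalOptions* globalOpts = nullptr;
    HRESULT hr = ::CoCreateInstance(CLSID_GlobalOptions, nullptr,
                                    CLSCTX_INPROC_SERVER,
                                    IID_PPV_ARGS(&globalOpts));
    if (FAILED(hr)) {
      return hr;
    }
    hr = globalOpts->Set(COMGLB_EXCEPTION_HANDLING,
                         COMGLB_EXCEPTION_DONOT_HANDLE_ANY);
    globalOpts->Release();
    return hr;
  }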

Further complicating the situation is that any calls to CoInitializeSecurity also have the same restriction, with the added dependency that it should only be called once per process, and it must be called before instantiating CLSID_GlobalOptions. Since CoInitializeSecurity and CLSID_GlobalOptions both have the same restrictions re: running prior to any marshaling, if CoInitializeSecurity fails with RPC_E_TOO_LATE, then any subsequent attempt to do anything with IGlobalOptions is also likely to fail.

(If you read the ProcessRuntime source code, you will see that we still try to instantiate CLSID_GlobalOptions anyway, but this is more of a last-ditch, "I don't expect this to work but I'm going to try anyway out of desperation" maneuver. As I understand things, if it is too late for CoInitializeSecurity, then it is too late for CLSID_GlobalOptions as well.)

Given these dependencies, we end up with the following order of operations (a simplified sketch follows the list):

  1. CoInitializeSecurity;
  2. Instantiate CLSID_GlobalOptions and disable the catch-all SEH handler;
  3. Create UI, pump messages, marshal COM interfaces, etc...
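Putting those steps together, a simplified sketch of the intended startup ordering might look like the following. This is not the actual ProcessRuntime implementation; the security parameters are illustrative placeholders, and DisableComExceptionSuppression refers to the earlier sketch:

  #include <windows.h>
  #include <objbase.h>

  // Simplified sketch of the required ordering; real error handling and the
  // actual ProcessRuntime logic are omitted.
  static bool InitComEarly() {
    // Enter the apartment before doing anything else COM-related.
    if (FAILED(::CoInitializeEx(nullptr, COINIT_APARTMENTTHREADED))) {
      return false;
    }

    // 1. CoInitializeSecurity: once per process, and before any marshaling.
    HRESULT hr = ::CoInitializeSecurity(
        nullptr, -1, nullptr, nullptr, RPC_C_AUTHN_LEVEL_DEFAULT,
        RPC_C_IMP_LEVEL_IDENTIFY, nullptr, EOAC_NONE, nullptr);
    if (hr == RPC_E_TOO_LATE) {
      // Something (e.g. an injected DLL) beat us to it. We still attempt the
      // IGlobalOptions step below, but it is unlikely to succeed.
    }

    // 2. Instantiate CLSID_GlobalOptions and disable the catch-all SEH handler.
    if (FAILED(DisableComExceptionSuppression())) {
      return false;
    }

    // 3. Only now is it safe to create UI, pump messages, and marshal COM
    //    interfaces.
    return true;
  }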

Note that these warnings are not purely theoretical: when I first enabled ProcessRuntime during startup, crashes that were previously suppressed by COM were revealed in at least two locations! It is therefore critically important that we ensure that our developers are not inadvertently breaking ProcessRuntime when making changes to the Firefox startup sequence. Early and immediate detection is key, so I strongly believe that we should detect such conditions at build time and have the sheriffs back out the culprit.

Having said that, I do acknowledge Toshi's concerns in comment 1. Rather than completely removing the diagnostic assert, I would strongly suggest that we make it "smarter" instead. Whether that means asserting only when MOZ_AUTOMATION is set, asserting only when we're not running in updater mode, or setting a flag that defers the assert until later in the startup sequence, once the crash reporter is up... I'll leave that to my successors to decide. I think if we're careful about how we do this, we can still catch self-induced errors without harming the user experience, even if it's only on pre-release channels.
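To make one of those options concrete, here is a hypothetical sketch of gating the assert on MOZ_AUTOMATION (assuming it is visible as an environment variable at runtime); the helper and its argument are made-up names, not the real ProcessRuntime condition:

  #include <cstdlib>               // getenv
  #include "mozilla/Assertions.h"  // MOZ_DIAGNOSTIC_ASSERT

  // Hypothetical sketch: only fire the diagnostic assert in automation, where
  // it is useful for catching regressions, and stay quiet on user machines
  // where it could interfere with auto-updates.
  static void AssertComInitSucceeded(bool aInitSucceeded) {
    if (std::getenv("MOZ_AUTOMATION")) {
      MOZ_DIAGNOSTIC_ASSERT(aInitSucceeded,
                            "COM was initialized before ProcessRuntime");
    }
  }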

There was a spike in the build with:
https://hg.mozilla.org/mozilla-central/rev/c467b50beb6f
and it dropped in the build with the backout.

The big spike in buildid 20220118095036 was caused by win32k lockdown being enabled by default on Nightly.
This caused an issue in COM initialisation, which we are investigating.

There was a spike in the build with:
https://hg.mozilla.org/mozilla-central/rev/c467b50beb6f

As pointed out in the bug for that revision, it's not really related to this one:
https://bugzilla.mozilla.org/show_bug.cgi?id=1381019#c10

See Also: → 1751367

Setting S3 as this is a diagnostic assert.

Severity: -- → S3
Priority: -- → P3
See Also: 1706031

Setting as a blocker, as we had to stop Nightly updates.

Changing the priority to P2 as the bug is tracked by a release manager for the current Nightly.
See What Do You Triage for more information

Priority: P3 → P2

:tjr, since you are the author of the regressor, bug 1759167, could you take a look?
For more information, please visit auto_nag documentation.

Flags: needinfo?(tom)

Not sure if this is being handled here or in Bug 1759167, but Bob will take a look.

Flags: needinfo?(tom) → needinfo?(bobowencode)

We confirmed that Avast and "Crypto Pro" (whatever that is) are the main "security" software that's breaking the sandbox.

(In reply to Tom Ritter [:tjr] (ni? for response to CVE/sec-approval/advisories/etc) from comment #13)

Not sure if this is being handled here or in Bug 1759167, but Bob will take a look.

I'll try and fix these recent win32k lockdown cases in bug 1759167.

Flags: needinfo?(bobowencode)

The severity field for this bug is set to S3. However, the bug has the topcrash keyword.
:haik, could you consider increasing the severity of this top-crash bug? If the crash isn't "top" anymore, could you drop the topcrash keyword?

For more information, please visit auto_nag documentation.

Flags: needinfo?(haftandilian)

This was set as a topcrash because win32k lockdown caused a spike on earlier versions of Windows.

Flags: needinfo?(haftandilian)
Keywords: topcrash

Back as a top crash on Nightly

Severity: S3 → S1
Keywords: topcrash
Priority: P2 → P1

The new crashes all appear to be in the utility process.
I guess we need to exclude win32k lockdown on earlier Windows versions, the same as for the content process:
https://searchfox.org/mozilla-central/rev/8fe6930c0832009b3162bebee7d4ede1a4c8c9a8/toolkit/xre/nsAppRunner.cpp#718

Probably best done in a separate bug though.
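For what it's worth, here is a rough, hypothetical sketch of that kind of gating: query the real OS build number and only permit win32k lockdown on sufficiently new builds. The threshold (build 17763) is a placeholder, not the value the content-process check actually uses:

  #include <windows.h>

  // Hypothetical sketch: ask ntdll's RtlGetVersion for the real build number
  // (it is not affected by manifest-based version lying) and gate win32k
  // lockdown on it.
  static bool IsWin32kLockdownSupported() {
    using RtlGetVersionFn = LONG(WINAPI*)(OSVERSIONINFOW*);
    auto rtlGetVersion = reinterpret_cast<RtlGetVersionFn>(
        ::GetProcAddress(::GetModuleHandleW(L"ntdll.dll"), "RtlGetVersion"));
    if (!rtlGetVersion) {
      return false;
    }
    OSVERSIONINFOW info = {};
    info.dwOSVersionInfoSize = sizeof(info);
    if (rtlGetVersion(&info) != 0) {  // 0 == STATUS_SUCCESS
      return false;
    }
    // Placeholder threshold: Windows 10 build 17763 (version 1809).
    return info.dwMajorVersion > 10 ||
           (info.dwMajorVersion == 10 && info.dwBuildNumber >= 17763);
  }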

Flags: needinfo?(lissyx+mozillians)
Flags: needinfo?(lissyx+mozillians)
Depends on: 1769992

According to https://hg.mozilla.org/mozilla-central/rev/a534d0739a48, the fix from bug 1769992 landed in build 20220518214245. From crash-stats, it seems the newest buildid still generating crashes as of today is 20220518094725, so I think we can safely assume the issue in the utility process is fixed.

Removing the topcrash keyword now that this has dropped back down to the background level.

Keywords: topcrash

Since the crash volume is low (less than 5 per week), the severity is downgraded to S3. Feel free to change it back if you think the bug is still critical.

For more information, please visit auto_nag documentation.

Severity: S1 → S3
Priority: P1 → --

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on beta
  • Top 10 content process crashes on beta

:haik, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(haftandilian)
Keywords: topcrash

There was a spike around November 18th, but other than that this crash continues at fewer than 10 crashes per day, typically 1 or 2.

Flags: needinfo?(haftandilian)

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash
Blocks: 1825290
See Also: 1759167

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on beta
  • Top 10 content process crashes on beta

For more information, please visit BugBot documentation.

Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on beta
  • Top 10 content process crashes on beta

For more information, please visit BugBot documentation.

Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash