Closed Bug 1489041 Opened Last year Closed 7 months ago

Crash in scdetour.dll@0x2dd77 ("Quick Heal" security software)

Categories

(External Software Affecting Firefox :: Other, defect, P1, critical)

x86
Windows
defect

Tracking

(relnote-firefox 63+, firefox-esr60 wontfix, firefox62 wontfix, firefox63+ wontfix, firefox64+ wontfix, firefox65+ wontfix, firefox66+ wontfix, firefox67 wontfix, firefox68 fixed)

RESOLVED DUPLICATE of bug 1503538
Tracking Status
relnote-firefox --- 63+
firefox-esr60 --- wontfix
firefox62 --- wontfix
firefox63 + wontfix
firefox64 + wontfix
firefox65 + wontfix
firefox66 + wontfix
firefox67 --- wontfix
firefox68 --- fixed

People

(Reporter: philipp, Assigned: aklotz)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, Whiteboard: [AV:Quick Heal][regressed sept 6th][dll version is 3.0.1.*][slack: quickheal])

Attachments

(2 files)

This bug was filed from the Socorro interface and is
report bp-5763534e-459b-4809-b5cb-2ef4e0180906.
=============================================================

Top 4 frames of crashing thread:

0 scdetour.dll scdetour.dll@0x2dd77 
1 scdetour.dll scdetour.dll@0x8f72 
2 scdetour.dll scdetour.dll@0xa0c0 
3 scdetour.dll scdetour.dll@0x19bf0 

=============================================================

This signature is showing up as a fairly frequent content crash in very early stability data from the beta population on version 63, on 32-bit Windows builds.
So far the signature accounts for a quarter of all tab crashes, but there are ~4 reports per affected installation, so the raw numbers overstate the real impact.

We were already in contact with the vendor in bug 1347867; maybe we can reach out again.
Blocks: 1435797
See Also: → 1322219
Whiteboard: [AV:Quick Heal]
chris, in bug 1347867 you said you were in contact with this external vendor. could we try to leverage those contacts again to look into this recent crash spike?
thank you
Flags: needinfo?(cpeterson)
(In reply to [:philipp] from comment #1)
> chris, in bug 1347867 you said you were in contact with this external
> vendor. could we try to leverage those contacts again to look into this
> recent crash spike?

Sure. I will email them.

I spot checked a dozen of these crashes and they all had scdetour.dll version 3.0.1.24 or 3.0.1.25. The crash type is EXCEPTION_STACK_OVERFLOW and appears to only affect 32-bit Windows, all versions.
Flags: needinfo?(cpeterson)
My Quick Heal contact replied and said their engineering team will investigate.
Chris, did you get any further update from Quick Heal?
Flags: needinfo?(cpeterson)
(In reply to Pascal Chevrel:pascalc from comment #4)
> Chris, did you get any further update from Quick Heal?

Not yet. I will ping them again.
Crash Signature: [@ scdetour.dll@0x2dd77] → [@ scdetour.dll@0x2dd27] [@ scdetour.dll@0x2dd77]
Flags: needinfo?(cpeterson)
Aaron, do you know of any Windows changes in Firefox 63 that might have caused crashes in Quick Heal internet security software? Maybe AddHook bug 1460002, revised DLL interceptor interface bug 1460022, or cross-process DLL interception bug 1473371?
Flags: needinfo?(aklotz)
I don't think it could have been any of those bugs, as those were more about interface than changes at the binary level. And cross process interception isn't actually used by default yet.

I am very suspicious about the fact that they are crashing exclusively on 32-bit CPUs. This makes me wonder whether or not they are using Microsoft Detours 2.0 in their scdetour.dll library, as that has a known bug: http://dblohm7.ca/blog/2016/01/11/bugs-from-hell-injected-third-party-code-plus-detours-equals-a-bad-time/
Flags: needinfo?(aklotz)
Our Quick Heal contact confirmed that they have reproduced the crash. They're working on a fix, but did not share an ETA. They are aware of the Firefox 63 release date (2018-10-23).
Chris, could you ask your Quick Heal contact if they made any progress and if they plan to have a fix before we ship? Thanks
Flags: needinfo?(cpeterson)
I'll ping them again. I still see scdetour.dll crashes in 63 Beta, which will be released in less than three weeks (October 23).
Flags: needinfo?(cpeterson)
I think we're building with clang starting in 63, which might explain this.
bug 1052582 might impact this but seems unlikely.
Whiteboard: [AV:Quick Heal] → [AV:Quick Heal][regressed sept 6th]
This started showing up:

63.0a1 - 2018-09-11 (June Nightly)
64.0a1 - 2018-09-12 

Looks like this is caused by an incompatibility in QuickHeal with the 63+ code base. I wonder if they shipped an update around that time?
Whiteboard: [AV:Quick Heal][regressed sept 6th] → [AV:Quick Heal][regressed sept 6th][dll version is 3.0.1.25]
Whiteboard: [AV:Quick Heal][regressed sept 6th][dll version is 3.0.1.25] → [AV:Quick Heal][regressed sept 6th][dll version is 3.0.1.*]
According to the site the last major update of their windows line of av products was October 15th which meshes pretty well with the bustage we're seeing. Maybe they pushed updates out about three weeks after a new release?
Crash Signature: [@ scdetour.dll@0x2dd27] [@ scdetour.dll@0x2dd77] → [@ scdetour.dll@0x2dd27] [@ scdetour.dll@0x2dd77] [@ scdetour.dll@0x2ab39]
I've been trying to reproduce on Windows 7 64-bit with Firefox 64-bit 62 and 63. The feature that injects this dll is called the Browser Sandbox, which is disabled by default in the Quick Heal Anti Virus Pro Trial edition. 

There's an option in the same area of the application's Browser Sandbox config that adds a green border around browsers it loads into. I've reproduced this and confirmed that scdetour.dll loads in IE. However, I can't get this feature to work with either version of Firefox I've tried.

Clearly our users are getting this turned on somehow, but I'm at a loss as to what the right setup is.
Whiteboard: [AV:Quick Heal][regressed sept 6th][dll version is 3.0.1.*] → [AV:Quick Heal][regressed sept 6th][dll version is 3.0.1.*][slack: quickheal]
(In reply to Jim Mathies [:jimm] from comment #15)
> I've been trying to reproduce on Windows 7 64-bit with Firefox 64-bit 62 and
> 63.
these crash reports are exclusively from 32bit installations
(In reply to [:philipp] from comment #16)
> (In reply to Jim Mathies [:jimm] from comment #15)
> > I've been trying to reproduce on Windows 7 64-bit with Firefox 64-bit 62 and
> > 63.
> these crash reports are exclusively from 32bit installations

Confirming, I've reproduced in 32-bit Firefox.
when doing some manual bisection 63.0a1 build 20180708220048 appears to be the first nightly with this crash pattern, with the following changelog: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=6c4096&tochange=ffb7b5
when I retest the workaround from comment #19 today, it's no longer valid, so the connection to brotli was likely a red herring. Sorry for that! Also, a try build that ryanvm produced with bug 1474122 backed out exhibited the same behaviour.
No longer blocks: 1474122
Release Note Request (optional, but appreciated)

[Why is this notable]:
CRASH REGRESSION in Firefox 63. Quick Heal engineers have been notified and are working on a fix, but we don't have an ETA for their fix yet. We don't know yet what code change in Firefox 63 caused this regression.

[Affects Firefox for Android]:
No.

[Suggested wording]:
Quick Heal internet security software might crash 32-bit Firefox on Windows.

[Links (documentation, blog post, etc)]:
:philipp and I are creating a SUMO article about this crash. WIP here: https://docs.google.com/document/d/16gDCX70_sp-KTSpWc7IGodRZzVJ2Qhy9HO0CvAqbgK4/edit
relnote-firefox: --- → ?
Added to 63 release notes, I will check the link to add on Monday.
I have updated the note with this wording:
Quick Heal internet security software might crash 32-bit Firefox on Windows. A workaround is documented in [this support article][1] until a fixed version of Quick Heal is available.

[1]: https://support.mozilla.org/en-US/kb/firefox-tab-crashes-quick-heal-security-software
Over to Carl to spend some time investigating the injection mechanism.
Assignee: nobody → ccorcoran
Status: NEW → ASSIGNED
Priority: -- → P1
Here are some observations I found:

I wasn't able to reproduce the crash on x64 Windows 10 (even using x86 Firefox). I had to use an x86 Windows 7 VM to reproduce it. Even then I found that it's timing-sensitive. I tested on build 20181018182531.

They are using ObRegisterCallbacks from a kernel mode driver to inject scdetour.dll as if it were a static dependency (seems this is very common in AV software). The stack trace at the time of load is:
> ntdll!ZwMapViewOfSection+0xc
> ntdll!LdrpMapViewOfSection+0xc7
> ntdll!LdrpFindOrMapDll+0x313
> ntdll!LdrpLoadImportModule+0x296
> ntdll!LdrpHandleOneOldFormatImportDescriptor+0x8b
> ntdll!LdrpHandleOldFormatImportDescriptors+0x1f
> ntdll!LdrpProcessStaticImports+0x25c
> ntdll!LdrpInitializeProcess+0x10eb
> ntdll!_LdrpInitialize+0x78
> ntdll!LdrInitializeThunk+0x10

They install three drivers that hook into process creation at the kernel level, via either PsSetCreateProcessNotifyRoutine, PsSetLoadImageNotifyRoutine, or ObRegisterCallbacks:
  - emltdi.sys ("Mail Protection Driver")
  - bdsflt.sys ("BDS Driver")
  - ggc.sys ("Generic Driver")

Some more info about the crash context:

The crashing thread is overflowing the stack via an alloca() call, requesting about 13 MB of stack space. EBP is about 6 KB into stack space. The stack at the crash site is:
>  # ChildEBP RetAddr  
> 00 0060e214 742c8f73 Scdetour!DllRegisterServer+0x2a7d7
> 01 0060e254 742ca0c1 Scdetour!DllRegisterServer+0x59d3
> 02 0060e2e4 742d9bf1 Scdetour!DllRegisterServer+0x6b21
> 03 0060e36c 00000000 Scdetour!DllRegisterServer+0x16651

So the previous calls are obfuscated, but I'm seeing "api-ms-win-appmodel-runtime-l1-1-2.dll" in the used stack space.

This appears to be happening in a detoured call to ntdll!ZwQueryAttributesFile. Indeed this is one of the APIs being hooked, which calls down into the top stack frame:

> 0:000> u ZwQueryAttributesFile
> ntdll!ZwQueryAttributesFile:
> 77265a60 e9cb3d07fd      jmp     Scdetour!DllRegisterServer+0x16290 (742d9830)
> 77265a65 ba0003fe7f      mov     edx,offset SharedUserData!SystemCallStub (7ffe0300)
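
The `jmp` above can be checked by hand: an `e9` opcode is followed by a little-endian 32-bit displacement measured from the end of the 5-byte instruction, so 0x77265a60 + 5 + 0xfd073dcb wraps around to 0x742d9830, the address the debugger shows inside Scdetour. A minimal sketch of that decoding follows; `DecodeNearJmp` is a hypothetical helper written for illustration, not something taken from Firefox or from the debugger.

```cpp
#include <cassert>
#include <cstdint>

// Decodes a 5-byte x86 near jump (opcode 0xE9 followed by a little-endian
// rel32). Returns true and writes the destination to *target when the bytes
// at `code` are such a jump; returns false otherwise.
// `address` is the virtual address of the first byte of `code`.
// Hypothetical helper for illustration only.
bool DecodeNearJmp(const uint8_t* code, uint32_t address, uint32_t* target) {
  assert(code != nullptr && target != nullptr);
  if (code[0] != 0xE9) {
    return false;
  }
  // Assemble rel32 byte by byte so the result does not depend on the
  // endianness of the machine running this check.
  uint32_t rel = uint32_t(code[1]) | (uint32_t(code[2]) << 8) |
                 (uint32_t(code[3]) << 16) | (uint32_t(code[4]) << 24);
  // The displacement is relative to the end of the instruction. Unsigned
  // 32-bit arithmetic wraps the same way the CPU does for negative offsets.
  *target = address + 5u + rel;
  return true;
}
```

Feeding it the five bytes `e9 cb 3d 07 fd` at address 0x77265a60 yields 0x742d9830, while the following `mov edx, ...` instruction (opcode `ba`) is rejected. A function prologue that starts with such a jump into a non-system module is a typical sign of an inline hook.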

Meanwhile on the main thread, this is the stack:

> kernel32!WaitForSingleObject+0x12
> ...
> nss3!PR_Wait+0x84 [z:\build\build\src\nsprpub\pr\src\threads\prmon.c @ 297] 
> ...
> xul!NS_NewNamedThread<14>+0x3c [z:\build\build\src\obj-firefox\dist\include\nsThreadUtils.h @ 73] 
> xul!mozilla::net::nsSocketTransportService::Init+0x70 [z:\build\build\src\netwerk\base\nsSocketTransportService2.cpp @ 633] 
> xul!nsSocketTransportServiceConstructor+0x50 [z:\build\build\src\netwerk\build\nsNetModule.cpp @ 75] 
> xul!mozilla::GenericFactory::CreateInstance+0x12 [z:\build\build\src\xpcom\components\GenericFactory.cpp @ 17] 
> xul!nsComponentManagerImpl::CreateInstanceByContractID+0x84 [z:\build\build\src\xpcom\components\nsComponentManager.cpp @ 1161] 
> xul!nsComponentManagerImpl::GetServiceByContractID+0x2fb [z:\build\build\src\xpcom\components\nsComponentManager.cpp @ 1523] 
> xul!mozilla::net::nsIOService::InitializeSocketTransportService+0x3a [z:\build\build\src\netwerk\base\nsIOService.cpp @ 299] 
> xul!mozilla::net::nsIOService::SetOffline+0x10f [z:\build\build\src\netwerk\base\nsIOService.cpp @ 1075] 
> xul!mozilla::net::nsIOService::Init+0x1e1 [z:\build\build\src\netwerk\base\nsIOService.cpp @ 263] 
> xul!mozilla::net::nsIOService::GetInstance+0x38 [z:\build\build\src\netwerk\base\nsIOService.cpp @ 365] 
> xul!nsIOServiceConstructor+0x2a [z:\build\build\src\netwerk\build\nsNetModule.cpp @ 57] 
> xul!mozilla::GenericFactory::CreateInstance+0x12 [z:\build\build\src\xpcom\components\GenericFactory.cpp @ 17] 
> xul!nsComponentManagerImpl::CreateInstanceByContractID+0x84 [z:\build\build\src\xpcom\components\nsComponentManager.cpp @ 1161] 
> xul!nsComponentManagerImpl::GetServiceByContractID+0x2fb [z:\build\build\src\xpcom\components\nsComponentManager.cpp @ 1523] 
> xul!nsScriptSecurityManager::Init+0x3a [z:\build\build\src\caps\nsScriptSecurityManager.cpp @ 1390] 
> xul!nsScriptSecurityManager::InitStatics+0x4b [z:\build\build\src\caps\nsScriptSecurityManager.cpp @ 1459] 
> xul!nsXPConnect::InitStatics+0x26 [z:\build\build\src\js\xpconnect\src\nsXPConnect.cpp @ 143] 
> xul!xpcModuleCtor+0x8 [z:\build\build\src\js\xpconnect\src\XPCModule.cpp @ 15] 
> xul!Initialize+0x26 [z:\build\build\src\layout\build\nsLayoutModule.cpp @ 263] 
> xul!nsFactoryEntry::GetFactory+0x2da [z:\build\build\src\xpcom\components\nsComponentManager.cpp @ 1859] 
> xul!nsComponentManagerImpl::CreateInstanceByContractID+0x6c [z:\build\build\src\xpcom\components\nsComponentManager.cpp @ 1158] 
> xul!nsCreateInstanceByContractID::operator()+0x20 [z:\build\build\src\xpcom\components\nsComponentManagerUtils.cpp @ 198] 
> xul!nsCOMPtr_base::assign_from_helper+0x25 [z:\build\build\src\xpcom\base\nsCOMPtr.cpp @ 128] 
> xul!LogMessageWithContext+0xb4 [z:\build\build\src\xpcom\components\ManifestParser.cpp @ 152] 
> ...
> xul!nsComponentManagerImpl::RereadChromeManifests+0x44 [z:\build\build\src\xpcom\components\nsComponentManager.cpp @ 796] 
> xul!nsComponentManagerImpl::Init+0x22e [z:\build\build\src\xpcom\components\nsComponentManager.cpp @ 415] 
> xul!NS_InitXPCOM2+0x3e3 [z:\build\build\src\xpcom\build\XPCOMInit.cpp @ 693] 
> ...
> xul!mozilla::dom::ContentProcess::Init+0x4fa [z:\build\build\src\dom\ipc\ContentProcess.cpp @ 302] 
> xul!XRE_InitChildProcess+0x612 [z:\build\build\src\toolkit\xre\nsEmbedFunctions.cpp @ 744] 
> xul!mozilla::BootstrapImpl::XRE_InitChildProcess+0x11 [z:\build\build\src\toolkit\xre\Bootstrap.cpp @ 69] 
> firefox!content_process_main+0x61 [z:\build\build\src\ipc\contentproc\plugin-container.cpp @ 50] 
> ...
> kernel32!BaseThreadInitThunk+0xe

I wasn't able to catch the code that invokes the detoured call. Attempts I made at this ended up not reproing; it appears to be timing-related. The crashing alloca() call is in a critsec owned by scdetour.dll.
(In reply to Carl Corcoran [:ccorcoran] from comment #27)
> They are using ObRegisterCallbacks from a kernel mode driver to inject
> scdetour.dll as if it were a static dependency (seems this is very common in
> AV software). The stack trace at the time of load is:
> > ntdll!ZwMapViewOfSection+0xc
> > ntdll!LdrpMapViewOfSection+0xc7
> > ntdll!LdrpFindOrMapDll+0x313
> > ntdll!LdrpLoadImportModule+0x296
> > ntdll!LdrpHandleOneOldFormatImportDescriptor+0x8b
> > ntdll!LdrpHandleOldFormatImportDescriptors+0x1f
> > ntdll!LdrpProcessStaticImports+0x25c
> > ntdll!LdrpInitializeProcess+0x10eb
> > ntdll!_LdrpInitialize+0x78
> > ntdll!LdrInitializeThunk+0x10
> 

Which DLL are they set up as a static dependency of? When is the static dependency set up? ie if we CREATE_SUSPENDED a process, would it be possible to then poke at the affected DLL's import descriptors to remove scdetour.dll?
(In reply to Aaron Klotz [:aklotz] from comment #28)

> Which DLL are they set up as a static dependency of?

They patch the import table of firefox.exe to include scdetour!DllRegisterServer

> When is the static
> dependency set up? ie if we CREATE_SUSPENDED a process, would it be possible
> to then poke at the affected DLL's import descriptors to remove scdetour.dll?

Yes with the launcher process I think we have an opportunity to do exactly that. We'd need to experiment a bit I think. When the process is launched in suspended state, scdetour.dll will be mapped but its entry point will not have run yet. So a few hacky ideas off the top of my head we could do at that moment:

- Remove scdetour from the PEB's loaded module lists, and unmap scdetour.dll.
- Change flags in the PEB so the Windows loader thinks scdetour's entry point has already been called.
- Replace scdetour's entry point with { return 0; }. I don't think it would be necessary to NOP out anything else.

Side-note: A while ago I had experimented with a few other techniques that didn't work out cleanly (bug 1380335), but that was before the launcher process.
See Also: → 1380335
Any sense of how this code operates once the dll is loaded? Do they create a thread that's running in their code or maybe react to events on the firefox main thread?

Also does our neutering of BaseThreadInitThunk threads help us here?
See Also: → 1503538
This DLL performs all of its initialization during its entry point, and I didn't notice any worker threads, though hopefully that's not relevant if we manage to just block it outright.

The experiment of trying to manually scrape out the injected static dependency is working so far, so I have created bug 1503538 to explore that particular effort.

The code basically just does the following:

- Launches firefox.exe with the CREATE_SUSPENDED flag
- Finds the base address of firefox.exe in the child process, finds the import address table
- Walks through import entries looking for the target
- Rearranges entries to remove the target
- Resumes the child process

And it's looking clean.
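
The "rearrange entries" step can be sketched on a local copy of the table. This is not the attached proof of concept: `ImportDescriptor` is a hypothetical stand-in that mirrors the five 32-bit fields of `IMAGE_IMPORT_DESCRIPTOR`, and the sketch assumes the descriptors have already been read out of the suspended child (the real flow would read and write the child's memory rather than a local array).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical mirror of IMAGE_IMPORT_DESCRIPTOR so the sketch builds on any
// platform. In a real PE image, `name` and the thunk fields are RVAs.
struct ImportDescriptor {
  uint32_t originalFirstThunk;
  uint32_t timeDateStamp;
  uint32_t forwarderChain;
  uint32_t name;
  uint32_t firstThunk;
};

// The import directory ends with an all-zero descriptor.
static bool IsTerminator(const ImportDescriptor& d) {
  return d.originalFirstThunk == 0 && d.timeDateStamp == 0 &&
         d.forwarderChain == 0 && d.name == 0 && d.firstThunk == 0;
}

// Returns the number of descriptors before the terminator.
size_t CountImports(const ImportDescriptor* table) {
  assert(table != nullptr);
  size_t count = 0;
  while (!IsTerminator(table[count])) {
    ++count;
  }
  return count;
}

// Removes the descriptor at `index` by shifting every later descriptor,
// including the terminator, down by one slot. Returns false if `index` is
// out of range.
bool RemoveImportAt(ImportDescriptor* table, size_t index) {
  size_t count = CountImports(table);
  if (index >= count) {
    return false;
  }
  // (count - index) descriptors follow the removed one once the terminator
  // at table[count] is included.
  std::memmove(&table[index], &table[index + 1],
               (count - index) * sizeof(ImportDescriptor));
  return true;
}
```

Because the terminator moves down with the remaining entries, the loader simply sees a table that is one entry shorter; nothing else in the image has to change for this step.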
Attached file RemoveImportEntry.cpp
Attaching a proof of concept for this technique. The C++ console app here (please excuse the crude code style)

- Launches Firefox in suspended state
- Walks its import table
- Removes the target item by doing an out-of-process (OOP) memmove
- Resumes the process

For this particular bug, you'll still see the crash happen, because this POC will only remove the DLL for the main process, which was not crashing anyway. But this technique can theoretically be extended to any child processes.
Another note about Quick Heal: The injected dependency references a full path to the DLL. So in the import table, they literally refer to the DLL as:

> C:\Windows\system32\Scdetour.dll

Other DLLs here are always referenced by module name alone, never by a full path. Presumably the full path is there to ensure the correct "Scdetour.dll" is picked up by the loader. If we do have the opportunity to scan the import table for unwanted DLLs, seeing a full path is a huge red flag.
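
A minimal sketch of that red-flag check, assuming the import names have already been read out of the table as strings. `LooksLikeFullPath` is a hypothetical helper for illustration, not an existing Firefox function.

```cpp
#include <cassert>
#include <string>

// Returns true when an import name looks like a path rather than a bare
// module name: it contains a path separator or starts with a drive letter
// prefix such as "C:". Hypothetical helper for illustration only.
bool LooksLikeFullPath(const std::string& importName) {
  if (importName.find('\\') != std::string::npos ||
      importName.find('/') != std::string::npos) {
    return true;
  }
  return importName.size() >= 2 && importName[1] == ':';
}
```

With this check, `C:\Windows\system32\Scdetour.dll` is flagged, while ordinary entries such as `kernel32.dll` or `api-ms-win-appmodel-runtime-l1-1-2.dll` are not.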
This scdetour.dll crash is Firefox 63.0.1 topcrash #11 for content processes and #26 among all processes, but the volume is pretty low compared to other crashes.

I haven't heard any news from our Quick Heal contact since the Firefox 63.0 release.
Crash Signature: [@ scdetour.dll@0x2dd27] [@ scdetour.dll@0x2dd77] [@ scdetour.dll@0x2ab39] → [@ scdetour.dll@0x2dd27] [@ scdetour.dll@0x2dd77] [@ scdetour.dll@0x2ab39] [@ scdetour.dll@0x7233]
Another note about scdetour: In order to ensure that they have enough space to inject the dependency, they move the import directory to a new location that's outside the mapped module image.

This gives us another potential red flag to check for, and a potential fallback way to correct it by just restoring the pointer to the original import table.
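
A rough sketch of how that second red flag could be tested, assuming the relevant PE header values have already been read from the child process. `ImageInfo` is a hypothetical struct for illustration; its comments name the header fields the values would come from.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified view of the PE header values this check needs.
struct ImageInfo {
  uint32_t sizeOfImage;    // OptionalHeader.SizeOfImage
  uint32_t importDirRva;   // import data directory VirtualAddress
  uint32_t importDirSize;  // import data directory Size
};

// Returns true when the whole import directory lies inside the mapped image.
// A directory that starts or ends past SizeOfImage has been relocated
// outside the module, which is the pattern described above.
bool ImportDirectoryInsideImage(const ImageInfo& info) {
  // 64-bit arithmetic so that rva + size cannot wrap around.
  uint64_t end = uint64_t(info.importDirRva) + info.importDirSize;
  return end <= info.sizeOfImage;
}
```

A launcher could treat a `false` result as a signal to inspect the table more closely or, as suggested above, to fall back to the original import directory if its location is known.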
Wontfix for 63 as the volume on the release channel went down and it's unlikely that we have a fix + a dot release before we ship 64. Tracking for 64 as the volume of crashes on beta is very high.
Some updates: The mechanism to block this (bug 1503538) successfully prevents these content process crashes and the browser is usable. Plugin-container.exe causes a problem though. From local observation,

- If I don't block scdetour from plugin-container.exe, then the whole system freezes when it's invoked (requiring a reboot).
- If I do block it, then I get a bluescreen.

It may be that Quick Heal is making assumptions about the presence of scdetour based on the process tree. There should be ways of fudging that as well, in hopes that Quick Heal will stop making these assumptions, but I hesitate to engineer a solution like that.

Or is there some other acceptable action we should consider? For example starting in safe mode when Quick Heal is detected.
Jim what do you recommend as a course of action with respect to plugin-container.exe? If we don't fix the plugin-container.exe system-wide-hangs, then users are faced with a minefield of system-wide hangs which is arguably worse than the original behavior. My instinct is that attempting to fix the hang will be too much work: the hang appears to be instigated by their kernel-mode driver which we may just not be able to control in any reasonable way.

I'm thinking of a solution like: if we block scdetour.dll from the browser, then disallow the use of plugins as a compatibility compromise. From what I can tell, blocking plugins should render a stable browser. We did make some similar compatibility compromise with WRusr.dll a while back (bug 1361410), though it didn't cause any significant user-facing behavior differences like this would. If we did this, we may need to think deeper about how to express this behavior change to users.
Flags: needinfo?(jmathies)
I requested PI resources to help us understand better:

* The extent of the original crash, so we know on which Windows versions & architectures to apply our fix.
* How our fix affects browser stability, basic functionality, and Flash plugin functionality.

My local testing has been inconsistent and unreliable and PI can increase our confidence about the fix.

Regarding plugin-container.exe, I wasn't able to reproduce the hangs or blue screens this afternoon. Again, PI should help understand whether this was just an anomaly on my local environment.
Flags: needinfo?(jmathies)
PI is getting results already; I am going to try and capture semi-structured results here: https://docs.google.com/spreadsheets/d/1esNWcSzJX4HtmakQsO7yrRya3oqhJmCX0gF8R1BUe3A/edit?usp=sharing
As we see results, we're also refining the test plan to answer the right questions. I am trying to keep my PI request document up-to-date: https://docs.google.com/document/d/1Zo7Akhb3CpHh2rBf-jrBE1suPuF-hI07mKfQrETJHjM
PI has concluded this round of testing. The main conclusions:

1. The original content crash is isolated to 32-bit Firefox builds, regardless of OS version or CPU architecture. So any blocking we do should be restricted to 32-bit builds.

2. When blocking scdetour.dll in main + content processes, Firefox is relatively stable, except for blue screen crashes when launching the Flash plugin. PI testing suggests that this is isolated to when running in a virtual machine. VirtualBox crashed the most, almost 100% of the time, while VMware crashed slightly less often. Of the 4 physical machines tested, none encountered blue screens.

The physical machines have quite similar configurations, so this isn't enough to be 100% confident that blue screens are isolated to virtual machines. But it's enough to get started on rolling this out and measuring the results.

I am building a suitable query on STMO that we can use to approximately track blue screen occurrences for these users, so we can track the effectiveness of the fix as we roll it out.
Crash Signature: [@ scdetour.dll@0x2dd27] [@ scdetour.dll@0x2dd77] [@ scdetour.dll@0x2ab39] [@ scdetour.dll@0x7233] → [@ scdetour.dll@0x2dd27] [@ scdetour.dll@0x2dd77] [@ scdetour.dll@0x2ab39] [@ scdetour.dll@0x7233] [@ scdetour.dll@0x226bd] [@ scdetour.dll@0x27949]
Quick Heal's browser sandbox causes content process crashes. This addition to
our DLL block list serves to both:

1. Directly prevent the crashes that Quick Heal's browser sandbox causes.
2. Serve as the first DLL that uses the STATIC_DEPENDENCY_ONLY flag, introduced
   with bug 1503538.

Per the bug notes, restricting to NIGHTLY_BUILD as we track the effect of this
block before opening it up to general release.

Depends on D13200
Have you checked how many users on Nightly have this DLL? I wonder if landing on Nightly will actually validate that there are no blue screens in the wild.
Flags: needinfo?(ccorcoran)
I found 59 unique Nightly client_ids using Quick Heal software (source: https://sql.telemetry.mozilla.org/queries/60622 ), but 0 scdetour.dll* crashes submitted on Nightly (source: crash-stats aggregations, and https://sql.telemetry.mozilla.org/queries/60640/source?p_Channel_60640=nightly ). It may be that all these users have the browser sandbox turned off, which is the default setting for Quick Heal.

So releasing to Nightly would not reveal any trends.

Instead of restricting to Nightly, another idea is to restrict to a deterministic 5% general sample, and then aim for beta uplift. We get between ~100-300 scdetour.dll crashes on beta daily, so it's enough to see activity.

The trick will be to select a sample that we can both:
- Select in the launcher process (I don't believe client_id is currently accessible at this point)
- Isolate in redash
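
One way a deterministic 5% sample could work is to hash some stable per-installation string and compare the result against a threshold. The sketch below uses 32-bit FNV-1a purely as an example hash. Which identifier is actually available in the launcher process is an open question in this bug, so `stableId` is a hypothetical input, and both function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// 32-bit FNV-1a hash. Chosen only because it is small and deterministic
// across runs and machines; any stable hash would do.
uint32_t Fnv1a(const std::string& s) {
  uint32_t h = 2166136261u;  // FNV offset basis
  for (unsigned char c : s) {
    h ^= c;
    h *= 16777619u;  // FNV prime
  }
  return h;
}

// Returns true when `stableId` falls into the first `percent` of 100 hash
// buckets. The same id always gets the same answer, so the sample can be
// reproduced later by anything that applies the same hash to the same id.
bool InSample(const std::string& stableId, uint32_t percent) {
  assert(percent <= 100);
  return Fnv1a(stableId) % 100u < percent;
}
```

Calling `InSample(id, 5)` selects roughly 5% of ids if the hash spreads them evenly. Isolating the same population in redash would additionally require that the chosen id, or the computed bucket, is reported in telemetry.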
Flags: needinfo?(ccorcoran)
Crash Signature: [@ scdetour.dll@0x2dd27] [@ scdetour.dll@0x2dd77] [@ scdetour.dll@0x2ab39] [@ scdetour.dll@0x7233] [@ scdetour.dll@0x226bd] [@ scdetour.dll@0x27949] → [@ scdetour.dll@0x2dd27] [@ scdetour.dll@0x2dd77] [@ scdetour.dll@0x2ab39] [@ scdetour.dll@0x7233] [@ scdetour.dll@0x226bd] [@ scdetour.dll@0x27949] [@ scdetour.dll]
I wanted to give an update on this. The plan remains the same: Land this restricted to EARLY_BETA_OR_EARLIER, follow somewhere on the tails of the launcher process into beta, and monitor telemetry to see how it performs.

I have dug into the bluescreens / instabilities again in the past couple days and want to share my findings.

First, the bluescreens are not reproducing for me anymore on any of my VMs. This was a 100% repro before, and 0% now. So I believe it was due to an app update from QH.

I had also seen system freezes before but I hadn't looked deeper into this. These were hard freezes; cursor not moving etc.

Today though I am getting a consistent user-mode crash in plugin-container.exe. The crash happens because the crash reporter attempts to connect to the firefox.exe process, but CreateFile() on the named pipe is returning ERROR_FILE_NOT_FOUND. I verified that the client / server pipes are correct, and the code path is similar to normal operation.

I believe QH's browser sandbox is preventing the IPC connection. I checked to see if they're hooking system APIs to do this. They are not hooking CreateFileW, nor are they hooking nt!*NtCreateNamedPipeFile or nt!*CreateFile. Side note: I even found that calls to OutputDebugString() from plugin-container.exe were not succeeding.

This IPC failure also happens even if QH's browser sandbox is disabled, or when QH is turned off completely (though I notice that when it's disabled, they still don't actually close any of their services or processes and scdetour still attempts to inject).

This being a user crash is relieving, but I want to ask PI if they still have a VM to confirm that the bluescreens are no longer happening.
I did some more manual testing today to get a clearer picture overall.

A couple main takeaways:
- Original crash is actually not reproing anymore on Win7 or Win10. On Win8.1 the original crash remains, and blocking scdetour allows Flash to operate fine.
- No bluescreens across 10 configurations which were previously a 100% repro.

So while things appear to keep shifting around, blocking scdetour does continue to increase stability overall. I propose we continue forward with releasing the block for 32-bit Windows and EARLY_BETA_OR_EARLIER, and watching closely.

I captured the raw test results here: https://docs.google.com/spreadsheets/d/1esNWcSzJX4HtmakQsO7yrRya3oqhJmCX0gF8R1BUe3A/edit#gid=913032545
Here's a redash query which shows daily scdetour crashes by release / windows version: https://sql.telemetry.mozilla.org/queries/60640/source?p_Channel_60640=release#156488
Depends on: 1503538

(In reply to Carl Corcoran [:ccorcoran] from comment #47)

> So while things appear to keep shifting around, blocking scdetour does
> continue to increase stability overall. I propose we continue forward with
> releasing the block for 32-bit Windows and EARLY_BETA_OR_EARLIER, and
> watching closely.

Did we end up shipping this block?

Flags: needinfo?(ccorcoran)

(In reply to Andrew Overholt [:overholt] from comment #49)

> (In reply to Carl Corcoran [:ccorcoran] from comment #47)
>
> > So while things appear to keep shifting around, blocking scdetour does
> > continue to increase stability overall. I propose we continue forward with
> > releasing the block for 32-bit Windows and EARLY_BETA_OR_EARLIER, and
> > watching closely.
>
> Did we end up shipping this block?

Not yet.

Carl, EARLY_BETA_OR_EARLIER will end with next week's beta 7 build so FYI if you still plan to land a change for that.

Flags: needinfo?(ccorcoran) → needinfo?(aklotz)

I can pick this up soon, but I have some higher-priority work atm.

Assignee: ccorcoran → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(aklotz)
Priority: P1 → P2

I found another signature in beta 66 that is 100% correlated to Quick Heal - https://bit.ly/2X9sEBI.

Aaron, anything you can try here? Crash volume is still very high on beta 66.

Or, Chris, would you mind contacting quickheal to ask about the remaining problem on Windows 8.1?

Flags: needinfo?(cpeterson)
Flags: needinfo?(aklotz)

(In reply to Liz Henry (:lizzard) (use needinfo) from comment #54)

> Or, Chris, would you mind contacting quickheal to ask about the remaining problem on Windows 8.1?

What is the Windows 8.1 problem? Is that a different QuickHeal crash than the one tracked in this bug?

Flags: needinfo?(cpeterson)

Carl mentioned in comment 47 that the remaining crashes are in 8.1.

Actually, checking back on the redash query I'm not sure that's true - most of the crashes are Win 7.

Comment 47 says "On Win8.1 the original crash remains, and blocking scdetour allows Flash to operate fine", so I am assuming we don't need to contact QuickHeal about Windows 8.1. Just let me know if I should.

AFAIK we have never received any kind of response from QuickHeal.

We have to land the dependent bug for this, and it's on my radar, but I also have other high-priority bugs that are in flight. This is currently #3 on my list.

Flags: needinfo?(aklotz)
See Also: → 1529020
Assignee: nobody → aklotz
Status: NEW → ASSIGNED
Priority: P2 → P1

Aaron, is this on track to be fixed in 67?

Flags: needinfo?(aklotz)

No, this is too risky to fix in 67.

Flags: needinfo?(aklotz)

Duping this over to bug 1503538, since that's where the work is going.

Status: ASSIGNED → RESOLVED
Closed: 7 months ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1503538