Closed Bug 1294903 Opened 8 years ago Closed 8 years ago

Regression in Firefox 50: crashes in COIDTable::ThreadCleanup with reasons EXCEPTION_{ACCESS_VIOLATION_{EXEC,READ,WRITE},PRIV_INSTRUCTION,ILLEGAL_INSTRUCTION}

Categories

(Core :: Widget: Win32, defect)

50 Branch
x86
Windows 7
defect
Not set
critical

Tracking


RESOLVED FIXED
mozilla53
Tracking Status
platform-rel --- +
firefox48 --- unaffected
firefox49 --- unaffected
firefox-esr45 --- unaffected
firefox50 --- fixed
firefox51 + fixed
firefox52 --- fixed
firefox53 --- fixed

People

(Reporter: marcia, Assigned: bugzilla)

References

Details

(Keywords: crash, regression, topcrash, Whiteboard: [platform-rel-Microsoft][tbird topcrash])

Crash Data

Attachments

(3 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-60ca0bff-53ed-47fc-9f2d-6bd2f2160812.
=============================================================

Seen while looking at 50 crash data: http://bit.ly/2bpzIXz. Currently sits at #17 top crash in the last 7 days.

http://stackoverflow.com/questions/21579498/crashes-in-ole32coidtablethreadcleanup-networkitemfactoryfdbackgroundthr seems to indicate something external is causing the crash.
Hi Andrew, Anthony, this is an Aurora crash that is trending up and is now at #7. Can you please help find an owner? Thanks!
Flags: needinfo?(overholt)
Flags: needinfo?(avaughn)
This feels like a Jim or maybe Aaron situation.
Flags: needinfo?(overholt)
Flags: needinfo?(jmathies)
Flags: needinfo?(aklotz)
Looks like some internal COM data structure was corrupted and is crashing when fundisc.dll tries to CoUninitialize() it.

Short of reporting this to MSFT, I'm not sure what else we can do here...
Flags: needinfo?(aklotz)
According to https://msdn.microsoft.com/en-us/library/windows/desktop/aa814070(v=vs.85).aspx the DLL in question provides hardware discovery APIs.
Crash volume for signature 'COIDTable::ThreadCleanup':
 - nightly (version 51): 19 crashes from 2016-08-01.
 - aurora  (version 50): 402 crashes from 2016-08-01.
 - beta    (version 49): 0 crashes from 2016-08-02.
 - release (version 48): 13 crashes from 2016-07-25.
 - esr     (version 45): 712 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly       4       4       6
 - aurora      160     109      37
 - beta          0       0       0
 - release       2       3       3
 - esr          54      51     100

Affected platform: Windows

Crash rank on the last 7 days:
           Browser     Content   Plugin
 - nightly #103
 - aurora  #9
 - beta
 - release #2902
 - esr     #104
From https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&date=%3E2016-01-01&signature=COIDTable%3A%3AThreadCleanup&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-date&page=1

we can see that the distribution on the nightly channel by build ID is:

Rank 	Build id 	Count 	%
5 	20160725030248 	1 	4.00 %
6 	20160727030230 	1 	4.00 %
7 	20160731030203 	1 	4.00 %
8 	20160801030227 	1 	4.00 %
9 	20160801074053 	1 	4.00 %
10 	20160802030437 	1 	4.00 %
11 	20160803030226 	1 	4.00 %
1 	20160804030441 	3 	12.00 %

i.e., that the first appearance of the crash on nightly was in the July 25 build.


The same search for the Aurora channel shows:

Rank 	Build id 	Count 	%
5 	20160804004004 	27 	5.57 %
13 	20160805004022 	20 	4.12 %
23 	20160806004002 	3 	0.62 %
18 	20160807004000 	12 	2.47 %
10 	20160808004011 	21 	4.33 %

which means it started with the August 4 builds, matching when Aurora updates were re-enabled for 50.0 (August 4 at 11am PDT).


So this certainly seems like it was triggered by a change in our code.

Perhaps it's worth looking through:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=d224fc999cb6accb208af0a105f14433375e2e77&tochange=7c669d5d63efceb12696cd65cfa72c296013dafb
for candidates?
Flags: needinfo?(avaughn) → needinfo?(aklotz)
The crashes on release and esr channels are an entirely different bug, with reason EXCEPTION_STACK_BUFFER_OVERRUN.  These crashes are mostly EXCEPTION_ACCESS_VIOLATION_EXEC with a few EXCEPTION_PRIV_INSTRUCTION, EXCEPTION_ACCESS_VIOLATION_WRITE, EXCEPTION_ACCESS_VIOLATION_READ, and EXCEPTION_ILLEGAL_INSTRUCTION mixed in.
Summary: Crash in COIDTable::ThreadCleanup → Regression in Firefox 50: crashes in COIDTable::ThreadCleanup with reasons EXCEPTION_{ACCESS_VIOLATION_{EXEC,READ,WRITE},PRIV_INSTRUCTION,ILLEGAL_INSTRUCTION}
One interesting item in that changelog is the landing of bug 1277075.

I'm not sure whether CoUninitialize() uses the catch-all SEH, but if it does, then these crashes would have previously been suppressed by that and are now being surfaced.
Flags: needinfo?(aklotz)
platform-rel: --- → ?
Whiteboard: [platform-rel-Microsoft]
This is a Windows 7-only crash; it looks like it may have something to do with network discovery. Although the main crash-stats report doesn't indicate a correlation, I see NetworkExplorer.dll and related DLLs loaded into our process space in every report.

There's a Windows hotfix that addresses something similar in Explorer associated with these files:

https://support.microsoft.com/en-us/kb/2494427
Flags: needinfo?(jmathies)
100% of crashes are with platform_version = 6.1.7601 Service Pack 1.
the [@ DisconnectSwitch] signature has a similar pattern.
Crash Signature: [@ COIDTable::ThreadCleanup] → [@ COIDTable::ThreadCleanup] [@ DisconnectSwitch]
Keywords: regression
Regarding "DisconnectSwitch":
(93.71% in signature vs 07.48% overall) reason = EXCEPTION_ACCESS_VIOLATION_EXEC
(100.0% in signature vs 59.00% overall) platform_pretty_version = Windows 7
(76.65% in signature vs 39.74% overall) platform_version = 6.1.7601 Service Pack 1
(57.49% in signature vs 30.62% overall) dom_ipc_enabled = 1
(89.52% in signature vs 63.13% overall) adapter_vendor_id = 0x8086
(99.40% in signature vs 80.18% overall) useragent_locale = en-US

By looking at the URLs, I've noticed a lot of Indian websites (.in), several
URLs that contain the substring 'upload' and several 'form'.

Regarding "COIDTable::ThreadCleanup":
(94.52% in signature vs 07.48% overall) reason = EXCEPTION_ACCESS_VIOLATION_EXEC
(100.0% in signature vs 39.74% overall) platform_version = 6.1.7601 Service Pack 1
(100.0% in signature vs 59.00% overall) platform_pretty_version = Windows 7
(30.14% in signature vs 06.67% overall) bios_manufacturer = Hewlett-Packard
(57.53% in signature vs 80.18% overall) useragent_locale = en-US
(100.0% in signature vs 81.40% overall) startup = false
(21.46% in signature vs 04.39% overall) cpu_info = GenuineIntel family 6 model 60 stepping 3 | 4

In both cases I've noticed a recurring URL is https://www.wetransfer.com/.
IIRC we recently fixed a bug affecting that website, but IIRC it was a graphics-related
crash, which this one doesn't seem to be.
Can we safely disable the CoInitializeSecurity code for a beta build to see?
Flags: needinfo?(aklotz)
We can do that for one beta release, yes. The most important use case for this code is catching our own crashes during development, so I'd be okay with doing it for one cycle, as long as we revert that change again. ;-)
Flags: needinfo?(aklotz)
Here's the patch if you want to go ahead with that...
Attachment #8795071 - Flags: review?(jmathies)
Whoops, that first one wasn't setting mInitResult correctly.
Attachment #8795071 - Attachment is obsolete: true
Attachment #8795071 - Flags: review?(jmathies)
Attachment #8795073 - Flags: review?(jmathies)
(In reply to Marco Castelluccio [:marco] from comment #12)
> In both cases I've noticed a recurring URL is https://www.wetransfer.com/.
> IIRC we recently fixed a bug affecting that website, but IIRC it was a
> graphics-related
> crash, which this one doesn't seem to be.

Bug 1272123, it does seem unrelated.
(In reply to Jim Mathies [:jimm] from comment #9)
> This is a Windows 7 only crash, looks like it may have something to do with
> network discovery. Although the main crashstats report doesn't indicate a
> correlation I see NetworkExplorer.dll and related dlls loaded into our
> process space in every report.
> 
> There's a Windows hotfix that addresses something similar in Explorer
> associated with these files -
> 
> https://support.microsoft.com/en-us/kb/2494427

If this is in fact the same issue, we might benefit from Microsoft's announced plans to start issuing monthly patch roll-ups for Win7 and Win8.1 as well starting in October. Would be interesting to see what happens with crash rates after October 11.
https://blogs.technet.microsoft.com/windowsitpro/2016/08/15/further-simplifying-servicing-model-for-windows-7-and-windows-8-1/
Comment on attachment 8795073 [details] [diff] [review]
Temporarily re-enable COM's catch-all exception handler (r2)

This is the first time we have something we think might be related to our COM security changes. I'd like to get this change into beta to see if it reduces the crash rate here.
Attachment #8795073 - Flags: review?(jmathies) → review+
Flags: needinfo?(aklotz)
Comment on attachment 8795073 [details] [diff] [review]
Temporarily re-enable COM's catch-all exception handler (r2)

Approval Request Comment
[Feature/regressing bug #]: Initialization of Microsoft COM runtime
[User impact if declined]: We will not be able to determine whether this bug was surfaced due to the disabling of the COM catch-all exception handler.
[Describe test coverage new/current, TreeHerder]: This reverts our COM startup code back to the way it has been done on all versions prior to 50.
[Risks and why]: Low, we've already been releasing with COM working this way since the beginning.
[String/UUID change made/needed]: None
Flags: needinfo?(aklotz)
Attachment #8795073 - Flags: approval-mozilla-beta?
Comment on attachment 8795073 [details] [diff] [review]
Temporarily re-enable COM's catch-all exception handler (r2)

This is a top crasher on 50, and this is effectively a patch backout. Beta50+
Attachment #8795073 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/releases/mozilla-beta/rev/8a1afb4cbebb

Setting the status to fixed so that it's off the needs-uplift radar for now. Feel free to change it back to affected when the time comes.
platform-rel: ? → +
OK, so this tells us that the reason why we have not seen these crashes until now is that the catch-all exception handler was suppressing the problem.

We don't actually know how long this has been an issue.

I would like to point out that leaving in this temporary patch is not the solution. As I wrote back in July:

> Recently it came to my attention that Microsoft COM by default calls back into COM objects from
> within a catch-all exception handler. This exception handler catches any and all exceptions and
> suppresses them.
> 
> Unfortunately this means that any crashes that occur within COM dispatch will have been completely
> missed by the crash reporter. Even worse, when crashes aren't surfaced, it may lead to undefined
> behavior within Firefox due to corrupted state.
> 
> Bug 1277075 overrides this default behavior and forcibly shuts off this exception handler. Once
> this patch merges, crashes that had previously been suppressed will now be caught by the crash
> reporter or by our automated testing.
> 
> When this bug merges, it will be very critical to remember that correlation is not causation: if
> the crash rate spikes, it is not because this bug caused the crashes, but rather because this bug
> has revealed crashes that were previously being hidden. If such a spike occurs, the correct
> response is to find the true cause of the crashes, as opposed to backing this patch out.
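For context, the catch-all suppression referenced above can be disabled through COM's documented `IGlobalOptions` interface. The sketch below is an illustrative reconstruction of that technique, not the actual bug 1277075 patch; it compiles only on Windows, and error handling is minimal.

```cpp
// Disable COM's catch-all exception handler so that crashes occurring
// inside COM dispatch propagate to the crash reporter instead of being
// silently swallowed. Illustrative sketch only.
#include <objbase.h>
#include <objidl.h>

static HRESULT DisableComCatchAllExceptionHandler() {
  IGlobalOptions* globalOpts = nullptr;
  HRESULT hr = ::CoCreateInstance(CLSID_GlobalOptions, nullptr,
                                  CLSCTX_INPROC_SERVER, IID_IGlobalOptions,
                                  reinterpret_cast<void**>(&globalOpts));
  if (FAILED(hr)) {
    return hr;
  }
  // COMGLB_EXCEPTION_DONOT_HANDLE_ANY (available on Windows 7 and later)
  // tells the COM runtime to let all exceptions propagate.
  hr = globalOpts->Set(COMGLB_EXCEPTION_HANDLING,
                       COMGLB_EXCEPTION_DONOT_HANDLE_ANY);
  globalOpts->Release();
  return hr;
}
```

This must run after `CoInitialize(Ex)` and before any COM calls whose exceptions you want surfaced; the process-wide setting is exactly why previously hidden crashes like this one started appearing once the patch landed.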
Comparing the module+version correlations with the binaries included in that hotfix mentioned above, 100% of the crashes are with a version of NetworkItemFactory.dll that is replaced by the hotfix.
Is there a bug to undo the temporary change, or are you planning to do that here as well?
Flags: needinfo?(aklotz)
Whiteboard: [platform-rel-Microsoft] → [platform-rel-Microsoft][tbird crash]
I'll be doing that here.
Flags: needinfo?(aklotz)
Ritu, could I please have approval for backing out https://hg.mozilla.org/releases/mozilla-beta/rev/8a1afb4cbebb ? 

That was a temporary change to determine whether the catch-all exception handler was suppressing these crashes. I'd like to revert it.
Flags: needinfo?(rkothari)
(In reply to Aaron Klotz [:aklotz] from comment #28)
> Ritu, could I please have approval for backing out
> https://hg.mozilla.org/releases/mozilla-beta/rev/8a1afb4cbebb ? 
> 
> That was a temporary change to determine whether the catch-all exception
> handler was suppressing these crashes. I'd like to revert it.

Hi Aaron, I understand the catch-all is not ideal, but please note that Beta50 crash rates are a bit worse than Beta49's. Do we have a solution to the actual problem? I don't think so, based on my communication with the Windows 7 (Microsoft) contact. They need crash dumps from us, which isn't something we share with external folks. Is leaving the catch-all in masking other, worse bugs? Or is it better to leave it in until we have an actual fix from Microsoft?
Flags: needinfo?(rkothari)
(In reply to Ritu Kothari (:ritu) from comment #29)
> Hi Aaron, I understand the catch-all is not ideal but please note that
> Beta50 crash rates are a bit worse than Beta49. Do we have a solution to the
> actual problem?

We do not. We suspect that a Microsoft patch *might* fix this, but without talking to an affected user and having them try that, we don't know for sure.

> Is leaving the catch-all masking other worse
> bugs? Or is it better to leave it in until we have an actual fix from
> Microsoft?

Without knowing what the catch-all could possibly be masking on the release channel, it's really impossible to say. Certainly on beta, this crash is the only one that we know about that has specifically been unmasked.

Of course, the caveat is that, by suppressing this crash, there is some state inside the Microsoft COM runtime that becomes corrupted. We don't know how that corruption manifests itself later on in the lifetime of the process.
Whiteboard: [platform-rel-Microsoft][tbird crash] → [platform-rel-Microsoft][tbird topcrash]
This crash has become the #2 top crash.
Hi :aklotz,
This crash is still the #2 top crash. We are about to go to 51 beta; if there is no complete solution at the moment, do you think we can use the same approach in 51 that we used in 50 beta?
Flags: needinfo?(aklotz)
No.
Flags: needinfo?(aklotz)
Track 51+ as this is #2 top crash.
Hi :jimm,
can you help find someone who can work on this?
Flags: needinfo?(jmathies)
I have reached out to the top 3 affected users who included email addresses in their crash reports in order to determine whether or not the Windows hotfix helps them. Our course of action will depend on whether they respond, and if so, what their responses are. I do believe that the top affected user is quite motivated to help given the comments they made in their crash reports.
Flags: needinfo?(jmathies)
This is a speculative fix. It is clear from some debugging and the crash stacks that the COM multithreaded apartment is being shut down when this crash hits. If we leave our own MTA thread running in the background, that will increment the reference count on the MTA and should prevent MTA teardown.

Hopefully that resolves whatever race condition triggers these crashes.

Here's the stack that triggers the loading of fundisc.dll:

KERNELBASE!LoadLibraryExW+0x138
combase!LoadLibraryWithLogging+0x1b
combase!CClassCache::CDllPathEntry::LoadDll+0xc3
combase!CClassCache::CDllPathEntry::Create+0x3c
combase!CClassCache::CClassEntry::CreateDllClassEntry+0xeb
combase!CClassCache::GetClassObjectActivator+0x85b
combase!CClassCache::GetClassObject+0x30
combase!CCGetClassObject+0x4c
combase!CServerContextActivator::CreateInstance+0x12c
combase!ActivationPropertiesIn::DelegateCreateInstance+0x93
combase!CApartmentActivator::CreateInstance+0xa2
combase!CProcessActivator::CCICallback+0x65
combase!CProcessActivator::AttemptActivation+0x40
combase!CProcessActivator::ActivateByContext+0x77
combase!CProcessActivator::CreateInstance+0x67
combase!ActivationPropertiesIn::DelegateCreateInstance+0xba
combase!CClientContextActivator::CreateInstance+0xfd
combase!ActivationPropertiesIn::DelegateCreateInstance+0x9a
combase!ICoCreateInstanceEx+0xc46
combase!CComActivator::DoCreateInstance+0x149
combase!CoCreateInstanceEx+0x105
combase!CoCreateInstance+0x13a
NetworkItemFactory!CNetworkItemFactory::s_StartFDInMTA+0x3e
NetworkItemFactory!FDBackgroundThreadHandler+0x1b
shcore!CallerIdentity::GetCallingProcessHandle+0x153
KERNEL32!BaseThreadInitThunk+0x24
ntdll!__RtlUserThreadStart+0x2f
ntdll!_RtlUserThreadStart+0x1b

We can see that NetworkItemFactory is clearly loading that library in the MTA based on the symbol name NetworkItemFactory!CNetworkItemFactory::s_StartFDInMTA.
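The "leave our own MTA thread running" approach described above can be sketched as follows. This is a minimal, illustrative reconstruction of the idea (not the actual `mscom::EnsureMTA` implementation); the thread and function names are hypothetical, and it compiles only on Windows.

```cpp
// Pin the COM multithreaded apartment (MTA) by parking a dedicated
// background thread inside it. Entering the MTA increments its reference
// count, so as long as this thread lives, CoUninitialize() on other
// threads cannot trigger full MTA teardown -- and thus cannot reach the
// COIDTable::ThreadCleanup race. Illustrative sketch only.
#include <objbase.h>
#include <windows.h>

static DWORD WINAPI MTAPinThreadProc(LPVOID) {
  HRESULT hr = ::CoInitializeEx(nullptr, COINIT_MULTITHREADED);
  if (FAILED(hr)) {
    return 1;
  }
  // Park forever; the MTA refcount stays elevated for the process lifetime.
  ::Sleep(INFINITE);
  ::CoUninitialize();  // Unreachable; kept for symmetry.
  return 0;
}

static void StartMTAPinThread() {
  HANDLE thread =
      ::CreateThread(nullptr, 0, &MTAPinThreadProc, nullptr, 0, nullptr);
  if (thread) {
    ::CloseHandle(thread);  // Fire-and-forget; the thread owns itself.
  }
}
```

Note that on Windows 8 and later the documented `CoIncrementMTAUsage` API achieves the same pinning without a dedicated thread, but since this crash is Windows 7-only, a background thread is the available mechanism there.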
Also it is clear from debugging that the affected dlls are all dependencies of explorerframe.dll which is loaded whenever we instantiate CLSID_FileOpenDialog. I'm not sure whether there are other instances of us calling shell APIs that might inadvertently import Explorer goop; I hope that the file pickers are the main ones.
Comment on attachment 8810948 [details]
Bug 1294903: Add a default constructor to mscom::EnsureMTA that ensures that the MTA thread is running;

https://reviewboard.mozilla.org/r/93212/#review93608
Attachment #8810948 - Flags: review?(jmathies) → review+
Comment on attachment 8810949 [details]
Bug 1294903: Modify file and folder pickers to ensure that the background MTA thread is running if we're on Windows 7;

https://reviewboard.mozilla.org/r/93214/#review93610
Attachment #8810949 - Flags: review?(jmathies) → review+
Pushed by aklotz@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/be0fb2d8d036
Add a default constructor to mscom::EnsureMTA that ensures that the MTA thread is running; r=jimm
https://hg.mozilla.org/integration/autoland/rev/eae1056ae373
Modify file and folder pickers to ensure that the background MTA thread is running if we're on Windows 7; r=jimm
Rank: 2
The patch seems to have successfully fixed the issue: there are no more crashes with these signatures recorded on Nightly after build 20161117030212.
Seems like we should get the fix backported.
Assignee: nobody → aklotz
Status: NEW → RESOLVED
Closed: 8 years ago
Component: General → Widget: Win32
Keywords: leave-open, topcrash
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
Hello Jim, as Aaron is on PTO, could you please nominate this fix for uplift to Aurora52/Beta51? Thanks!
Flags: needinfo?(jmathies)
Comment on attachment 8810948 [details]
Bug 1294903: Add a default constructor to mscom::EnsureMTA that ensures that the MTA thread is running;

Approval Request Comment
[Feature/Bug causing the regression]: Windows 7 bug, revealed by bug 1277075
[User impact if declined]: Crashes after using file pickers
[Is this code covered by automated tests?]: Yes, various automated tests launch file pickers, which would cover this code.
[Has the fix been verified in Nightly?]: Yes.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: Yes, see the other patch in this bug.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: This is a pretty trivial fix with well-tested code.
[String changes made/needed]: None.
Flags: needinfo?(jmathies)
Attachment #8810948 - Flags: approval-mozilla-beta?
Attachment #8810948 - Flags: approval-mozilla-aurora?
Comment on attachment 8810949 [details]
Bug 1294903: Modify file and folder pickers to ensure that the background MTA thread is running if we're on Windows 7;

Approval Request Comment
[Feature/Bug causing the regression]: Windows 7 bug, revealed by bug 1277075
[User impact if declined]: Crashes after using file pickers
[Is this code covered by automated tests?]: Yes, various automated tests launch file pickers, which would cover this code.
[Has the fix been verified in Nightly?]: Yes.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: Yes, see the other patch in this bug.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: This is a pretty trivial fix with well-tested code.
[String changes made/needed]: None.
Attachment #8810949 - Flags: approval-mozilla-beta?
Attachment #8810949 - Flags: approval-mozilla-aurora?
Comment on attachment 8810948 [details]
Bug 1294903: Add a default constructor to mscom::EnsureMTA that ensures that the MTA thread is running;

Fix a topcrash. Beta51+ and Aurora52+. Should be in 51 beta 5.
Attachment #8810948 - Flags: approval-mozilla-beta?
Attachment #8810948 - Flags: approval-mozilla-beta+
Attachment #8810948 - Flags: approval-mozilla-aurora?
Attachment #8810948 - Flags: approval-mozilla-aurora+
Attachment #8810949 - Flags: approval-mozilla-beta?
Attachment #8810949 - Flags: approval-mozilla-beta+
Attachment #8810949 - Flags: approval-mozilla-aurora?
Attachment #8810949 - Flags: approval-mozilla-aurora+