Dev Edition startup crashes in nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x...

RESOLVED WORKSFORME

Status

()

Core
Graphics
--
critical
RESOLVED WORKSFORME
a year ago
a year ago

People

(Reporter: Robert Kaiser, Assigned: milan, NeedInfo)

Tracking

(4 keywords)

Trunk
x86
Windows NT
crash, leave-open, regression, topcrash
Points:
---

Firefox Tracking Flags

(firefox43+ fixed)

Details

(Whiteboard: [gfx-noted], crash signature)

Attachments

(1 attachment)

(Reporter)

Description

a year ago
This bug was filed from the Socorro interface and is 
report bp-0eda266a-6c2f-4721-9322-82cac2151025.
=============================================================

I'm only filing this in GFX because of nvd3d9wrap.dll being involved in those crashes.

Those crashes started to appear and spike heavily on 2015-10-24 on Dev Edition, in both e10s-activated and e10s-deactivated builds (even more visible in crash reports right now with the experiment running, see metadata tabs).

The stacks vary a lot; the only thing in common seems to be that frame 0 is nsCOMPtr_base::~nsCOMPtr_base, frame 1 is in nvd3d9wrap.dll, and frame 2 is in PLDHashTable.

About 80% of those crashes are within the first 60s of uptime, so we consider them startup crashes.
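
For anyone who wants to redo the startup classification on their own export of crash reports, here is a minimal sketch. It assumes you already have the per-crash uptime in seconds; the function name and the sample values are made up for illustration and are not real crash data.

```python
def startup_fraction(uptimes_seconds, threshold=60):
    """Return the share of crashes whose uptime is below the threshold."""
    if not uptimes_seconds:
        return 0.0
    startup = sum(1 for uptime in uptimes_seconds if uptime < threshold)
    return startup / len(uptimes_seconds)


# Hypothetical uptimes in seconds, chosen only to illustrate the calculation.
sample_uptimes = [3, 12, 45, 59, 7, 30, 22, 51, 600, 4000]

share = startup_fraction(sample_uptimes)
print(f"{share:.0%} of the sample crashes happened in the first 60s")
```

The 60s cutoff is just the one used in this comment; pass a different threshold if you want to see how sensitive the number is.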

This is by far the largest crash issue on Dev Edition 43 at this point.
(Reporter)

Comment 1

a year ago
[Tracking Requested - why for this release]:
status-firefox43: --- → affected
status-firefox44: affected → ---
tracking-firefox43: --- → ?
Summary: De Edition startup crashes in nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x... → Dev Edition startup crashes in nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x...
Tracking for 43 since it is a new (top) crash.
tracking-firefox43: ? → +
Keywords: regressionwindow-wanted, topcrash
This crash looks exclusive to Intel adapters. There may be a more specific correlation. Driver versions:

 10.18.10.4276   57%    (0.25% of Telemetry sessions)
 9.17.10.4229    23%    (1.5% of Telemetry sessions)
 10.18.10.4252   11%    (1.0% of Telemetry sessions)

Devices:
 0x0166          69%       (4.9% of Telemetry sessions)
 0x0116          19%       (3.8% of Telemetry sessions)

Most of those Telemetry numbers are on the high end, meaning they're more prevalent given the large distribution of devices and drivers. But,

0x0166 on driver version 10.18.10.4276 appears in 0.09% of Telemetry sessions.
0x0166 on driver version 10.18.10.4252 appears in 0.58% of Telemetry sessions.
0x0116 on driver version 9.17.10.4229 appears in 0.6% of Telemetry sessions.

So it might be that a very small population of one specific adapter and driver is causing the majority of these crashes. It doesn't look like crash-stats can do multi-key aggregations though.
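
If the raw crash records can be exported, the multi-key aggregation is easy to do offline. A rough sketch, assuming each record is a dict with "device_id" and "driver_version" keys (those field names and the sample records are my own invention, not the Socorro schema):

```python
from collections import Counter


def aggregate_by_pair(crashes):
    """Count crashes per (device_id, driver_version) pair."""
    return Counter((c["device_id"], c["driver_version"]) for c in crashes)


def pair_shares(crashes):
    """Return each pair's share of all crashes, most frequent first."""
    counts = aggregate_by_pair(crashes)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.most_common()}


def _records(device_id, driver_version, n):
    # Helper that builds n identical hypothetical crash records.
    return [{"device_id": device_id, "driver_version": driver_version}
            for _ in range(n)]


# Hypothetical sample, not taken from crash-stats.
sample_crashes = (
    _records("0x0166", "10.18.10.4276", 6)
    + _records("0x0116", "9.17.10.4229", 2)
    + _records("0x0166", "10.18.10.4252", 1)
    + _records("0x0116", "10.18.10.4252", 1)
)

for (device, driver), share in pair_shares(sample_crashes).items():
    print(f"{device} on {driver}: {share:.0%} of crashes")
```

Comparing each pair's crash share against its share of Telemetry sessions would then show whether one adapter and driver combination is over-represented.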
(Assignee)

Comment 4

a year ago
Could probably use some layout opinions.
Flags: needinfo?(matt.woodrow)
(Assignee)

Comment 5

a year ago
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #0)
> This bug was filed from the Socorro interface and is 
> report bp-0eda266a-6c2f-4721-9322-82cac2151025.
> =============================================================
> 
> 
> Those crashes started to appear and spike heavily on 2015-10-24 on Dev
> Edition, ...

What are the patches that landed for this?
(Assignee)

Comment 6

a year ago
Need more big brains on this; anything we can glean from these crashes? (No offense to big brains I didn't needinfo :)
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(bas)
(Assignee)

Comment 7

a year ago
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #0)
> ...
> Those crashes started to appear and spike heavily on 2015-10-24 on Dev
> Edition...

Hopefully, not a result of disabling async plugin init.

When was the previous Dev Edition built?  I'd like to take a look at all the patches that were new on the 24th.
Flags: needinfo?(kairo)
(Assignee)

Comment 8

a year ago
Shooting in the dark - bug 1210444? Maybe that SharedSurfaceTextureClient extra AddFlags should just have been mFlags |= mSurf->GetTextureFlags() instead?
Flags: needinfo?(jnicol)
I don't know the IPC code well enough to know if this could cause this. But yeah, if it is responsible, either Milan's suggestion or just adding it to the parent constructor call would be fine (not sure why I didn't do the latter, to be honest). I didn't intend for the AddFlags to have any side effect other than adding a flag.
Flags: needinfo?(jnicol)
(Assignee)

Comment 10

a year ago
Jamie, let's put together a patch to make that change, and uplift.  Even if it has nothing to do with this issue, it's probably safer, and it seems adequate.  We should check that it still fixes the original Android problem, of course.
Created attachment 8679726 [details] [diff] [review]
Avoid calling AddFlags from SharedSurfaceTextureClient constructor

Confirmed that this still fixes bug 1210444.
Attachment #8679726 - Flags: review?(milan)
Assignee: nobody → jnicol
Status: NEW → ASSIGNED
Didn't mean for "hg bzexport" to assign me.
Assignee: jnicol → nobody
Status: ASSIGNED → NEW
(Assignee)

Updated

a year ago
Attachment #8679726 - Flags: review?(milan) → review+
(Assignee)

Comment 13

a year ago
The patch is speculative, so we don't want to close the bug when we land it, but it won't hurt to remove one unknown.
Keywords: leave-open
Do we have a regression range for this?
Flags: needinfo?(matt.woodrow)
(Assignee)

Comment 15

a year ago
So far all we have is the "started to appear and spike heavily on 2015-10-24 on Dev Edition" from comment 0, but I don't know when the previous Dev Edition build was.
I can't figure out exactly how the above mentioned patch would cause this problem. I strongly suspect this bug has absolutely nothing to do with graphics. The D3D DLL probably just gets in there by accident. Having said that, I think this patch improves the quality of the code, so let's still do this. Let's just not assume it fixes anything :-).
Flags: needinfo?(bas)
(Reporter)

Comment 17

a year ago
Dev Edition has daily builds, so the regression range would be a day before that. The first build ID affected there is 20151024004043. Looking into archive.m.o, I find https://archive.mozilla.org/pub/firefox/nightly/2015/10/2015-10-24-00-40-43-mozilla-aurora/firefox-43.0a2.en-US.win32.json with "moz_source_stamp": "68f165595f3a1081999a4b11edd5e5e7127f901d", and for the build before I find https://archive.mozilla.org/pub/firefox/nightly/2015/10/2015-10-23-00-40-26-mozilla-aurora/firefox-43.0a2.en-US.win32.json with "moz_source_stamp": "6c1457027394d198932267c646135a9d8ab3876c", so the regression range would be https://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=6c1457027394d198932267c646135a9d8ab3876c&tochange=68f165595f3a1081999a4b11edd5e5e7127f901d
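
For reference, the steps in this comment (read the build ID as a timestamp, take the two source stamps, build the pushlog link) can be sketched roughly like this. The function names are hypothetical; the pushloghtml URL, the fromchange/tochange parameters, and the two revisions are the ones quoted in this comment.

```python
from datetime import datetime
from urllib.parse import urlencode

# Pushlog endpoint for the aurora repository, as used in this comment.
PUSHLOG = "https://hg.mozilla.org/releases/mozilla-aurora/pushloghtml"


def parse_build_id(build_id):
    """Interpret a 14-digit build ID as a YYYYMMDDHHMMSS timestamp."""
    return datetime.strptime(build_id, "%Y%m%d%H%M%S")


def pushlog_url(last_good_rev, first_bad_rev, base=PUSHLOG):
    """Build a pushlog link covering the range between two revisions."""
    query = urlencode({"fromchange": last_good_rev, "tochange": first_bad_rev})
    return f"{base}?{query}"


first_bad_build = parse_build_id("20151024004043")
print("First affected build:", first_bad_build.isoformat())

print(pushlog_url(
    "6c1457027394d198932267c646135a9d8ab3876c",  # build from 2015-10-23
    "68f165595f3a1081999a4b11edd5e5e7127f901d",  # build from 2015-10-24
))
```

The source stamps still have to be read from the build's .json file on archive.mozilla.org by hand or by a separate fetch; this sketch only covers the part after you have them.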
Crash Signature: [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2b36] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2cb6] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2b86] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2b16] [@ nsCOMPtr_… → [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2b36] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2cb6] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2b86] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2b16] [@ nsCOMPtr_…
Flags: needinfo?(kairo)
(Reporter)

Comment 18

a year ago
All that said, it looks like those signatures are very flaky: I actually see them appearing with the builds from the 24th and the 27th but not with builds in between, which makes me think that something else might be wrong there.
(Reporter)

Comment 19

a year ago
FWIW, this could even be connected to the startup crashes on beta in bug 1218473
Try run for above patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8c7f03aa91d7

Requesting checkin please, and leave open since we don't think this will fix the problem.
Keywords: checkin-needed
Comment on attachment 8679726 [details] [diff] [review]
Avoid calling AddFlags from SharedSurfaceTextureClient constructor

And uplift this to aurora since it's low risk and might fix it.

Approval Request Comment
[Feature/regressing bug #]: possibly bug 1210444
[User impact if declined]: possibly this startup crash
[Describe test coverage new/current, TreeHerder]: Try run on aurora: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c8cbecc0150f I am assuming the Android 4.3 API11+ debug failures are unrelated.
[Risks and why]: Very low. Very small change that does the same thing but removes potentially risky code.
[String/UUID change made/needed]: None
Attachment #8679726 - Flags: approval-mozilla-aurora?
Assignee: nobody → jnicol
Whiteboard: [gfx-noted]
(Assignee)

Comment 22

a year ago
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #19)
> FWIW, this could even be connected to the startup crashes on beta in bug
> 1218473

My guess in IRC, mostly based on the flakiness of signatures and the same hardware Intel/Nvidia combinations.
Comment on attachment 8679726 [details] [diff] [review]
Avoid calling AddFlags from SharedSurfaceTextureClient constructor

OK, let's uplift this speculative fix. Since Bas thinks it improves the code anyway, sounds like a reasonable idea.
Attachment #8679726 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Comment 24

a year ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/7e2286a4141c
Keywords: checkin-needed
https://hg.mozilla.org/releases/mozilla-aurora/rev/9470fd1a1c89
status-firefox43: affected → fixed

Comment 26

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/7e2286a4141c

Comment 27

a year ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-b2g44_v2_5/rev/7e2286a4141c
status-b2g-v2.5: --- → fixed
removing the b2g 2.5 flag since this commit has been reverted due to an incorrect merge, sorry for the confusion
status-b2g-v2.5: fixed → ---
Jamie, this signature has jumped up to #4 spot in DevEd44 crash stats (last 3 days of data). I know we uplifted a fix a few weeks back. Just wanted to bring this to your attention in case we have better diagnostics in place and want to review the recent crash dumps to root cause the problem.
Flags: needinfo?(jnicol)
I doubt I'm the best person for this to be assigned to. I don't have the hardware or expertise. Milan, could you reassign to somebody better suited? (Or if you'd prefer, I'll take a look.)
Assignee: jnicol → milan
Flags: needinfo?(jnicol)
(Assignee)

Comment 31

a year ago
Right - are these all Optimus bugs, dual GPU configurations?
Flags: needinfo?(anthony.s.hughes)
(In reply to Milan Sreckovic [:milan] from comment #31)
> Right - are these all Optimus bugs, dual GPU configurations?

100% of these are on dual-GPU systems with the following NVIDIA chipsets:
* 70% with Kepler
** 45% with Kepler GK107GLM
** 15% with Kepler GK106GLM
** 10% with Kepler GK107
* 30% with Fermi
** 15% with Fermi GF106GLM
** 10% with Fermi GF108M
**  5% with Fermi GF108M

Here are the Intel chipsets which show up as GPU#1 on these systems:
69% Ivybridge
25% Haswell
 6% Sandybridge
Flags: needinfo?(anthony.s.hughes)
(Assignee)

Comment 33

a year ago
How about the existence of _etoured.dll?
(In reply to Milan Sreckovic [:milan] from comment #33)
> How about the existence of _etoured.dll?

Module correlations don't work for these crashes (not sure why) so the only way to know for sure would be to check every crash report manually. I manually checked 50 reports at random and they all have _etoured.dll as the first module. It's probably safe to assume most if not all are in the same boat.
(Assignee)

Comment 35

a year ago
That's making me think this could be related, if not a duplicate - see bug 1218473 comment 44.
So one thing that's somewhat interesting here is that the regression window on nightly appears to be *newer* than the regression window on aurora, at least assuming I'm using Socorro correctly, and it stores old data reliably.

https://crash-stats.mozilla.com/search/?product=Firefox&release_channel=aurora&signature=%24nsCOMPtr_base%3A%3A~nsCOMPtr_base+|+nvd3d9wrap.dll%40&date=%3E2015-07-01&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
is a search for crashes on the aurora channel.  Those, as stated in comment 0, appear to have started on 2015-10-24 (you have to look at the Aggregations tab of each separate crash).

The same search for the nightly channel is:
https://crash-stats.mozilla.com/search/?product=Firefox&release_channel=nightly&signature=%24nsCOMPtr_base%3A%3A~nsCOMPtr_base+|+nvd3d9wrap.dll%40&date=%3E2015-07-01&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
On nightly, the crashes appear to have started on 2015-11-07.

It might be worth trying to figure out if the crashes are from a small number of users -- and whether it's possible that the trigger for the crashes starting was some software update not-from-us, or whether it's possible instead that we just happened to hit the right users on nightly later than we hit them on aurora.
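
A rough sketch of how that check could look on exported crash records, assuming each record carries "channel", "build_id" and some per-install identifier. The "client_id" field is purely hypothetical (I don't know that crash reports expose anything like it), and every sample value except build ID 20151024004043 is made up.

```python
from collections import defaultdict


def summarize_by_channel(crashes):
    """Per channel: earliest affected build ID, crash count, distinct clients."""
    acc = defaultdict(lambda: {"first_build": None, "clients": set(), "crashes": 0})
    for crash in crashes:
        entry = acc[crash["channel"]]
        entry["crashes"] += 1
        entry["clients"].add(crash["client_id"])
        # 14-digit build IDs sort chronologically when compared as strings.
        if entry["first_build"] is None or crash["build_id"] < entry["first_build"]:
            entry["first_build"] = crash["build_id"]
    return {
        channel: {
            "first_build": entry["first_build"],
            "crashes": entry["crashes"],
            "distinct_clients": len(entry["clients"]),
        }
        for channel, entry in acc.items()
    }


# Hypothetical records for illustration only.
sample_crashes = [
    {"channel": "aurora", "build_id": "20151027004020", "client_id": "a"},
    {"channel": "aurora", "build_id": "20151024004043", "client_id": "a"},
    {"channel": "aurora", "build_id": "20151024004043", "client_id": "b"},
    {"channel": "nightly", "build_id": "20151107030206", "client_id": "c"},
    {"channel": "nightly", "build_id": "20151108030403", "client_id": "c"},
]

for channel, info in summarize_by_channel(sample_crashes).items():
    print(channel, info)
```

A channel with many crashes but very few distinct clients would point toward "we just happened to hit the right users" rather than a change on our side.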
Crash Signature: [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2b36] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2cb6] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2b86] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2b16] [@ nsCOMPtr_… → [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2b36] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2cb6] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2b86] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2b16] [@ nsCOMPtr_…
Hi,

Based on Socorro reports for all the provided crash signatures, the latest Aurora (46.0a2) has not been affected in the last 28 days. The most recent Aurora version that appears under these signatures in that period is 45.0a2, and most of the signatures have no crash reports at all. Can someone confirm that this issue is fixed on the latest Aurora, 46.0a2?

Removing the "regressionwindow-wanted" keyword since the regression range was provided in comment 17.

Thanks,
Cosmin.
Keywords: regressionwindow-wanted → regression
I'm closing this bug report as there have only been three reports of this crash recently, all on versions of Firefox we no longer support.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → WORKSFORME