Closed Bug 1218395 Opened 9 years ago Closed 8 years ago

Dev Edition startup crashes in nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x...

Categories

(Core :: Graphics, defect)

x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox43 + fixed

People

(Reporter: kairo, Assigned: milan, NeedInfo)

Details

(Keywords: crash, regression, topcrash, Whiteboard: [gfx-noted])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-0eda266a-6c2f-4721-9322-82cac2151025.
=============================================================

I'm only filing this in GFX because of nvd3d9wrap.dll being involved in those crashes.

Those crashes started to appear and spike heavily on 2015-10-24 on Dev Edition, in both e10s-activated and e10s-deactivated builds (even more visible in crash reports right now with the experiment running, see metadata tabs).

The stacks vary a lot; the only thing they seem to have in common is that frame 0 is nsCOMPtr_base::~nsCOMPtr_base, frame 1 is in nvd3d9wrap.dll and frame 2 is in PLDHashTable.

About 80% of those crashes are within the first 60s of uptime, so we consider them startup crashes.

This is by far the largest crash issue on Dev Edition 43 at this point.
[Tracking Requested - why for this release]:
Summary: De Edition startup crashes in nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x... → Dev Edition startup crashes in nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x...
Tracking for 43 since it is a new (top) crash.
This crash looks exclusive to Intel adapters. There may be a more specific correlation. Driver versions:

 10.18.10.4276   57%    (0.25% of Telemetry sessions)
 9.17.10.4229    23%    (1.5% of Telemetry sessions)
 10.18.10.4252   11%    (1.0% of Telemetry sessions)

Devices:
 0x0166          69%       (4.9% of Telemetry sessions)
 0x0116          19%       (3.8% of Telemetry sessions)

Most of those Telemetry numbers are on the high end, meaning they're more prevalent given the large distribution of devices and drivers. But,

0x0166 on driver version 10.18.10.4276 appears in 0.09% of Telemetry sessions.
0x0166 on driver version 10.18.10.4252 appears in 0.58% of Telemetry sessions.
0x0116 on driver version 9.17.10.4229 appears in 0.6% of Telemetry sessions.

So it might be that a very small population of one specific adapter and driver is causing the majority of these crashes. It doesn't look like crash-stats can do multi-key aggregations though.
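For illustration, the multi-key aggregation that crash-stats does not seem to offer could be approximated offline on exported reports. The sketch below is a minimal example, assuming each report has been reduced to a dict with the hypothetical keys `adapter_device_id` and `adapter_driver_version`; the sample records and their counts are invented placeholders, not real crash data.

```python
from collections import Counter


def pair_shares(reports):
    """Return {(device_id, driver_version): fraction of reports} per pair."""
    counts = Counter(
        (r["adapter_device_id"], r["adapter_driver_version"]) for r in reports
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


# Invented placeholder records (hypothetical field names, made-up counts).
sample = [
    {"adapter_device_id": "0x0166", "adapter_driver_version": "10.18.10.4276"},
    {"adapter_device_id": "0x0166", "adapter_driver_version": "10.18.10.4276"},
    {"adapter_device_id": "0x0166", "adapter_driver_version": "10.18.10.4252"},
    {"adapter_device_id": "0x0116", "adapter_driver_version": "9.17.10.4229"},
]

# Print the pairs from most to least frequent.
for pair, share in sorted(pair_shares(sample).items(), key=lambda kv: -kv[1]):
    print(pair, f"{share:.0%}")
```

Comparing each pair's share of crashes with its share of Telemetry sessions would then show whether one device and driver combination is over-represented.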
Could probably use some layout opinions.
Flags: needinfo?(matt.woodrow)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #0)
> This bug was filed from the Socorro interface and is 
> report bp-0eda266a-6c2f-4721-9322-82cac2151025.
> =============================================================
> 
> 
> Those crashes started to appear and spike heavily on 2015-10-24 on Dev
> Edition, ...

What are the patches that landed for this?
Need more big brains on this; anything we can glean from these crashes? (No offense to big brains I didn't needinfo :)
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(bas)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #0)
> ...
> Those crashes started to appear and spike heavily on 2015-10-24 on Dev
> Edition...

Hopefully, not a result of disabling async plugin init.

When was the previous Dev Edition built?  I'd like to take a look at all the patches that were new on the 24th.
Flags: needinfo?(kairo)
Shooting in the dark - bug 1210444? Maybe that SharedSurfaceTextureClient extra AddFlags should just have been mFlags |= mSurf->GetTextureFlags() instead?
Flags: needinfo?(jnicol)
I don't know the IPC code well enough to know if this could cause this. But yeah if it is responsible Milan's suggestion or just adding it to the parent constructor call would be fine (not sure why I didn't do the latter to be honest). I didn't intend for the AddFlags to have any side effect other than adding a flag.
Flags: needinfo?(jnicol)
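To make the two options discussed above concrete, here is a toy model in Python (the real code is C++ in Gecko and is not reproduced here). All class and method names are hypothetical, and the "notification" side effect inside `add_flags` is an assumption made purely for illustration; the thread does not establish that the real AddFlags does anything beyond setting bits.

```python
class TextureClient:
    """Toy stand-in for a base class that owns a flags bitmask."""

    def __init__(self, flags=0):
        self.flags = flags
        self.notifications = []  # records side effects, for demonstration only

    def add_flags(self, flags):
        # Hypothetical behaviour: besides OR-ing the bits, notify a peer.
        self.flags |= flags
        self.notifications.append(self.flags)


class ViaAddFlags(TextureClient):
    """Sets the extra flags by calling add_flags from the constructor."""

    def __init__(self, base_flags, surface_flags):
        super().__init__(base_flags)
        self.add_flags(surface_flags)  # runs add_flags' extra work too


class ViaParentConstructor(TextureClient):
    """Passes the combined flags straight to the parent constructor."""

    def __init__(self, base_flags, surface_flags):
        super().__init__(base_flags | surface_flags)  # plain bitwise OR only
```

Both variants end up with the same flags; only the first one runs whatever else the flag-adding method happens to do while the object is still being constructed, which is the unknown the proposed patch removes.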
Jamie, let's put together a patch to make that change, and uplift.  Even if it has nothing to do with this issue, it's probably safer, and it seems adequate.  We should check that it still fixes the original Android problem, of course.
Confirmed that this still fixes bug 1210444.
Attachment #8679726 - Flags: review?(milan)
Assignee: nobody → jnicol
Status: NEW → ASSIGNED
didn't mean for "hg bzexport" to assign me
Assignee: jnicol → nobody
Status: ASSIGNED → NEW
Attachment #8679726 - Flags: review?(milan) → review+
The patch is speculative, so we don't want to close the bug when we land it, but it won't hurt to remove one unknown.
Keywords: leave-open
Do we have a regression range for this?
Flags: needinfo?(matt.woodrow)
So far all we have is "started to appear and spike heavily on 2015-10-24 on Dev Edition", but I don't know when the previous Dev Edition was.
I can't figure out exactly how the above-mentioned patch would cause this problem. I strongly suspect this bug has absolutely nothing to do with graphics. The D3D DLL probably just gets in there by accident. Having said that, I think this patch improves the quality of the code, so let's still do this. Let's just not assume it fixes anything :-).
Flags: needinfo?(bas)
Dev Edition has daily builds, so the regression range would be a day before that. The first build ID affected there is 20151024004043. Looking into archive.m.o, I find https://archive.mozilla.org/pub/firefox/nightly/2015/10/2015-10-24-00-40-43-mozilla-aurora/firefox-43.0a2.en-US.win32.json with "moz_source_stamp": "68f165595f3a1081999a4b11edd5e5e7127f901d", and for the build before I find https://archive.mozilla.org/pub/firefox/nightly/2015/10/2015-10-23-00-40-26-mozilla-aurora/firefox-43.0a2.en-US.win32.json with "moz_source_stamp": "6c1457027394d198932267c646135a9d8ab3876c", so the regression range would be https://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=6c1457027394d198932267c646135a9d8ab3876c&tochange=68f165595f3a1081999a4b11edd5e5e7127f901d
Crash Signature: nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2b66] → nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2b66] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x28f6] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2a16]
Flags: needinfo?(kairo)
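The regression-range derivation above can be sketched as a small helper. This is a minimal sketch, assuming the text of the two build JSON files has already been fetched; the function name `pushlog_url` is hypothetical, and the only field relied on is the "moz_source_stamp" key quoted in the comment.

```python
import json
from urllib.parse import urlencode

# Pushlog base for the aurora repository, as used in the comment above.
PUSHLOG = "https://hg.mozilla.org/releases/mozilla-aurora/pushloghtml"


def pushlog_url(last_good_json, first_bad_json):
    """Build a pushlog URL spanning the two builds' source stamps."""
    good = json.loads(last_good_json)["moz_source_stamp"]
    bad = json.loads(first_bad_json)["moz_source_stamp"]
    return PUSHLOG + "?" + urlencode({"fromchange": good, "tochange": bad})


# Source stamps quoted in the comment for the 2015-10-23 and 2015-10-24 builds.
build_1023 = '{"moz_source_stamp": "6c1457027394d198932267c646135a9d8ab3876c"}'
build_1024 = '{"moz_source_stamp": "68f165595f3a1081999a4b11edd5e5e7127f901d"}'
print(pushlog_url(build_1023, build_1024))
```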
All that said, it looks like those signatures are very flaky, I actually see them appearing with the build from the 24th and the 27th but not for builds in between, which makes me think that something else might be wrong there.
FWIW, this could even be connected to the startup crashes on beta in bug 1218473
Try run for above patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8c7f03aa91d7

Requesting checkin please, and leave open since we don't think this will fix the problem.
Keywords: checkin-needed
Comment on attachment 8679726 [details] [diff] [review]
Avoid calling AddFlags from SharedSurfaceTextureClient constructor

And uplift this to aurora since it's low risk and might fix it.

Approval Request Comment
[Feature/regressing bug #]: possibly bug 1210444
[User impact if declined]: possibly this startup crash
[Describe test coverage new/current, TreeHerder]: Try run on aurora: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c8cbecc0150f I am assuming the Android 4.3 API11+ debug failures are unrelated.
[Risks and why]: Very low. It is a very small change which does the same thing but removes potentially risky code.
[String/UUID change made/needed]: None
Attachment #8679726 - Flags: approval-mozilla-aurora?
Assignee: nobody → jnicol
Whiteboard: [gfx-noted]
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #19)
> FWIW, this could even be connected to the startup crashes on beta in bug
> 1218473

My guess in IRC, mostly based on the flakiness of signatures and the same hardware Intel/Nvidia combinations.
Comment on attachment 8679726 [details] [diff] [review]
Avoid calling AddFlags from SharedSurfaceTextureClient constructor

OK, let's uplift this speculative fix. Since Bas thinks it improves the code anyway, sounds like a reasonable idea.
Attachment #8679726 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
removing the b2g 2.5 flag since this commit has been reverted due to an incorrect merge, sorry for the confusion
Jamie, this signature has jumped up to #4 spot in DevEd44 crash stats (last 3 days of data). I know we uplifted a fix a few weeks back. Just wanted to bring this to your attention in case we have better diagnostics in place and want to review the recent crash dumps to root cause the problem.
Flags: needinfo?(jnicol)
I doubt I'm the best person for this to be assigned to. I don't have the hardware or expertise. Milan, could you reassign to somebody better suited? (Or if you I'll take a look.)
Assignee: jnicol → milan
Flags: needinfo?(jnicol)
Right - are these all Optimus bugs, dual GPU configurations?
Flags: needinfo?(anthony.s.hughes)
(In reply to Milan Sreckovic [:milan] from comment #31)
> Right - are these all Optimus bugs, dual GPU configurations?

100% of these are on dual-GPU systems with the following NVIDIA chipsets:
* 70% with Kepler
** 45% with Kepler GK107GLM
** 15% with Kepler GK106GLM
** 10% with Kepler GK107
* 30% with Fermi
** 15% with Fermi GF106GLM
** 10% with Fermi GF108M
**  5% with Fermi GF108M

Here are the Intel chipsets which show up as GPU#1 on these systems:
69% Ivybridge
25% Haswell
 6% Sandybridge
Flags: needinfo?(anthony.s.hughes)
How about the existence of _etoured.dll?
(In reply to Milan Sreckovic [:milan] from comment #33)
> How about the existence of _etoured.dll?

Module correlations don't work for these crashes (not sure why) so the only way to know for sure would be to check every crash report manually. I manually checked 50 reports at random and they all have _etoured.dll as the first module. It's probably safe to assume most if not all are in the same boat.
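The manual spot check described above could be scripted roughly as follows. This is only a sketch: it assumes each crash report is available as a dict with a hypothetical `modules` list in load order, which is not a documented Socorro format, and the helper name is invented.

```python
import random


def first_module_share(reports, module_name, k=50, seed=None):
    """Sample up to k reports and return the fraction whose first module matches."""
    rng = random.Random(seed)
    picked = rng.sample(reports, min(k, len(reports)))
    if not picked:
        return 0.0
    wanted = module_name.lower()
    hits = sum(
        1 for r in picked
        if r["modules"] and r["modules"][0].lower() == wanted
    )
    return hits / len(picked)
```

A share close to 1.0 on a random sample supports the assumption that most reports have the module first, but it remains an estimate rather than a full correlation.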
That's making me think this could be related, if not a duplicate - see bug 1218473 comment 44.
So one thing that's somewhat interesting here is that the regression window on nightly appears to be *newer* than the regression window on aurora, at least assuming I'm using Socorro correctly, and it stores old data reliably.

https://crash-stats.mozilla.com/search/?product=Firefox&release_channel=aurora&signature=%24nsCOMPtr_base%3A%3A~nsCOMPtr_base+|+nvd3d9wrap.dll%40&date=%3E2015-07-01&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
is a search for crashes on the aurora channel.  Those, as stated in comment 0, appear to have started on 2015-10-24 (you have to look at the Aggregations tab of each separate crash).

The same search for the nightly channel is:
https://crash-stats.mozilla.com/search/?product=Firefox&release_channel=nightly&signature=%24nsCOMPtr_base%3A%3A~nsCOMPtr_base+|+nvd3d9wrap.dll%40&date=%3E2015-07-01&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
On nightly, the crashes appear to have started on 2015-11-07.

It might be worth trying to figure out if the crashes are from a small number of users -- and whether it's possible that the trigger for the crashes starting was some software update not-from-us, or whether it's possible instead that we just happened to hit the right users on nightly later than we hit them on aurora.
Crash Signature: nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2c56] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2c36] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2a26] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x29f6] [@ nsCOMPtr_base::~ns… → nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2c56] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2c36] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2c76] [@ nsCOMPtr_base::~nsCOMPtr_base | nvd3d9wrap.dll@0x2a26] [@ nsCOMPtr_base::~ns…
Hi,

Based on Socorro reports, all provided crash signatures show that, in the last 28 days, the latest Aurora (46.0a2) is not affected. The latest Aurora version present in the crash signatures in the last 28 days was 45.0a2. Most of the signatures do not have any crash reports. Can someone confirm that this issue was fixed in the latest Aurora, 46.0a2?

Removing the "regressionwindow-wanted" keyword since the regression range was provided in comment 17.

Thanks,
Cosmin.
I'm closing this bug report as there have only been three reports of this crash recently, all on versions of Firefox we no longer support.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open