Closed Bug 865701 Opened 8 years ago Closed 7 years ago

crash in nsFrameManager::ReResolveStyleContext with AMD Radeon 6310/6320

Categories

(Core :: Layout, defect)

21 Branch
x86
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED INCOMPLETE
Tracking Status
firefox21 + affected

People

(Reporter: scoobidiver, Assigned: benjamin)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

It's already the #1 top crasher in 21.0b4, which is not yet released.

Signature 	mozilla::dom::DocumentBinding::CreateInterfaceObjects(JSContext*, JSObject*, JSObject**)
UUID	1dcbafd5-842f-4903-8323-24c0e2130425
Date Processed	2013-04-25 14:16:30
Uptime	2643
Last Crash	44.6 minutes before submission
Install Age	16.8 hours since version was first installed.
Install Time	2013-04-24 21:26:50
Product	Firefox
Version	21.0
Build ID	20130423212553
Release Channel	beta
OS	Windows NT
OS Version	6.1.7601 Service Pack 1
Build Architecture	x86
Build Architecture Info	AuthenticAMD family 20 model 2 stepping 0
Crash Reason	EXCEPTION_ACCESS_VIOLATION_WRITE
Crash Address	0x1d18
App Notes 	
AdapterVendorID: 0x1002, AdapterDeviceID: 0x9806, AdapterSubsysID: 00000000, AdapterDriverVersion: 6.1.7600.16385
D3D10 Layers? D3D10 Layers- D3D9 Layers? D3D9 Layers- 
Processor Notes 	sp-processor06.phx1.mozilla.com_22392:2012
EMCheckCompatibility	True
Adapter Vendor ID	0x1002
Adapter Device ID	0x9806
Total Virtual Memory	2147352576
Available Virtual Memory	1619656704
System Memory Use Percentage	47
Available Page File	2430001152
Available Physical Memory	903467008

Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::dom::DocumentBinding::CreateInterfaceObjects 	obj-firefox/dom/bindings/DocumentBinding.cpp:7308
1 	xul.dll 	NS_NewStyleContext 	layout/style/nsStyleContext.cpp:723
2 	xul.dll 	nsStyleSet::GetContext 	layout/style/nsStyleSet.cpp:776
3 	xul.dll 	nsFrameManager::ReResolveStyleContext 	layout/base/nsFrameManager.cpp:1219
4 	xul.dll 	nsFrameManager::ReResolveStyleContext 	layout/base/nsFrameManager.cpp:1604
5 	xul.dll 	nsFrameManager::ComputeStyleChangeFor 	layout/base/nsFrameManager.cpp:1697
6 	xul.dll 	nsCSSFrameConstructor::RestyleElement 	layout/base/nsCSSFrameConstructor.cpp:8442
7 	xul.dll 	mozilla::css::RestyleTracker::DoProcessRestyles 	layout/base/RestyleTracker.cpp:209
8 	xul.dll 	PresShell::FlushPendingNotifications 	layout/base/nsPresShell.cpp:3880
9 	xul.dll 	nsRefreshDriver::Tick 	layout/base/nsRefreshDriver.cpp:898
10 	xul.dll 	mozilla::RefreshDriverTimer::Tick 	layout/base/nsRefreshDriver.cpp:156
11 	xul.dll 	nsTimerImpl::Fire 	xpcom/threads/nsTimerImpl.cpp:498
12 	nspr4.dll 	nspr4.dll@0x8d70 	
13 	xul.dll 	nsTimerEvent::Run 	xpcom/threads/nsTimerImpl.cpp:589
14 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:627
15 	ntdll.dll 	EtwEventEnabled 	
16 	nspr4.dll 	PR_Lock 	nsprpub/pr/src/threads/combined/prulock.c:201
17 	nspr4.dll 	PR_Unlock 	nsprpub/pr/src/threads/combined/prulock.c:315
18 	xul.dll 	mozilla::Mutex::Unlock 	obj-firefox/dist/include/mozilla/Mutex.h:83
19 	xul.dll 	NS_ProcessNextEvent_P 	obj-firefox/xpcom/build/nsThreadUtils.cpp:238
20 	xul.dll 	mozilla::ipc::MessagePump::Run 	ipc/glue/MessagePump.cpp:117
21 	xul.dll 	MessageLoop::RunHandler 	ipc/chromium/src/base/message_loop.cc:208
22 	xul.dll 	_SEH_epilog4 	
23 	xul.dll 	MessageLoop::Run 	ipc/chromium/src/base/message_loop.cc:182
24 	xul.dll 	nsBaseAppShell::Run 	widget/xpwidgets/nsBaseAppShell.cpp:163
25 	xul.dll 	nsAppShell::Run 	widget/windows/nsAppShell.cpp:154
26 	xul.dll 	XREMain::XRE_mainRun 	toolkit/xre/nsAppRunner.cpp:3871
27 	mozalloc.dll 	mozalloc.dll@0x10a0 	
28 		@0x17024e0 	

More reports at:
https://crash-stats.mozilla.com/report/list?signature=mozilla%3A%3Adom%3A%3ADocumentBinding%3A%3ACreateInterfaceObjects%28JSContext*%2C+JSObject*%2C+JSObject**%29
Keywords: qawanted
Setting QA Contact to Juan since he has a netbook with the AMD 6310 GPU.
QA Contact: jbecerra
I have not been able to reproduce this crash exercising functionality in live.com and yahoo.com accounts or by visiting some of the other URLs associated with this crash and browsing around those sites.

You can access the machine through VNC in the MV office at 10.250.6.86 if you want to give it a try.
I'm wondering if there is any difference between the AMD E350/450 and Radeon 6310/6320 on a desktop platform vs. a laptop platform. I know that sometimes the same CPU/GPU will have slight technical differences depending on the intended platform. Would anyone be able to comment on this theory? Maybe someone from AMD?
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #3)
> I'm wondering if there is any difference between the AMD E350/450 and Radeon
> 6310/6320 on a Desktop platform vs a Laptop platform. I know that sometimes
> the same CPU/GPU will have slight technical differences depending on the
> intended platform. Would anyone be able to comment to this theory? Maybe
> someone from AMD?

The reason I ask is to decide whether it's worth the time and money to invest in a desktop computer using this hardware, given that Juan has been unable to reproduce the crash so far on a netbook platform.
Crash Signature: [@ mozilla::dom::DocumentBinding::CreateInterfaceObjects(JSContext*, JSObject*, JSObject**)] → [@ mozilla::dom::DocumentBinding::CreateInterfaceObjects(JSContext*, JSObject*, JSObject**)] [@ JSCompartment::getNewType(JSContext*, js::Class*, js::TaggedProto, JSFunction*) ] [@ JS_GetCompartmentPrincipals(JSCompartment*) ] [@ nsStyleSet::ReparentSt…
21.0b1 seems as crashy as 19.0 was (see bug 830531).
Crash Signature: , JSFunction*) ] [@ JS_GetCompartmentPrincipals(JSCompartment*) ] [@ nsStyleSet::ReparentStyleContext(nsStyleContext*, nsStyleContext*, mozilla::dom::Element*) ] → , JSFunction*) ] [@ JS_GetCompartmentPrincipals(JSCompartment*) ] [@ nsStyleSet::ReparentStyleContext(nsStyleContext*, nsStyleContext*, mozilla::dom::Element*) ] [@ nsFrameManager::ReResolveStyleContext(nsPresContext*, nsIFrame*, nsIContent*, nsStyleCh…
> 21.0b1 seems [...]
I meant 21.0b4.
Crash Signature: , unsigned int) ] [@ nsStyleSet::ResolveAnonymousBoxStyle(nsIAtom*, nsStyleContext*) ] → , unsigned int) ] [@ nsStyleSet::ResolveAnonymousBoxStyle(nsIAtom*, nsStyleContext*) ] [@ nsCSSFrameConstructor::AddFrameConstructionItems(nsFrameConstructorState&, nsIContent*, bool, nsIFrame*, nsCSSFrameConstructor::FrameConstructionItemList&) ] [@ n…
Checking the latest URLs from the crash reports, I see https://www.facebook.com/ as a top hit among the pages on which users seem to crash.

Benjamin, please let us know if there are any more interesting correlations in the data yet that can help QA here.
I have a full dump of this crash (signature of JSCompartment::getNewType) from Juan's QA computer. I will be examining it thoroughly to check the memory corruption.
dbaron or anyone else, could you construct a web page that calls NS_NewStyleContext in as tight a loop as we can manage? I'm trying to make this as reproducible as possible.

Juan or anyone with access to that machine, could you play around with running other programs at the same time as Firefox to see if any other programs cause this crash to happen more regularly? Graphics-intensive programs in particular may make this easier to reproduce.
Flags: needinfo?(jbecerra)
Flags: needinfo?(dbaron)
Let's use the first frame shared by every stack trace in the summary.
Crash Signature: , nsCSSFrameConstructor::FrameConstructionItemList&) ] [@ nsStyleContext::nsStyleContext(nsStyleContext*, nsIAtom*, nsCSSPseudoElements::Type, nsRuleNode*) ] → , nsCSSFrameConstructor::FrameConstructionItemList&) ] [@ nsStyleContext::nsStyleContext(nsStyleContext*, nsIAtom*, nsCSSPseudoElements::Type, nsRuleNode*) ] [@ nsStyleSet::ProbePseudoElementStyle(mozilla::dom::Element*, nsCSSPseudoElements::Type, nsSty…
Summary: crash in mozilla::dom::DocumentBinding::CreateInterfaceObjects with AMD Radeon 6310/6320 → crash in nsFrameManager::ReResolveStyleContext with AMD Radeon 6310/6320
This crash seems to have spiked as of today. Adding needinfo on Kairo to get some data on the number of unique affected users.

CCing Ted and Tracy in case they can help out earlier here. The basic query to be used here is similar to https://bugzilla.mozilla.org/show_bug.cgi?id=830531#c25, taking the right signature into account.
Flags: needinfo?(kairo)
I haven't seen these crashes or equivalent ones in 17.0b5 not yet released.
(In reply to Scoobidiver from comment #13)
> I haven't seen these crashes or equivalent ones in 17.0b5 not yet released.

I think you mean 21.0b5 here ?
(In reply to bhavana bajaj [:bajaj] from comment #14)
> (In reply to Scoobidiver from comment #13)
> > I haven't seen these crashes or equivalent ones in 17.0b5 not yet released.
> I think you mean 21.0b5 here ?
Yes. My bad.
Using 21.0b5 I let the machine run over the weekend with the same sorts of tabs and video playlists playing, but it didn't crash this weekend. Before, it used to crash a few hours into the video playlist.

I will now go back to using 21.0b4 to address the request in comment #10.
Flags: needinfo?(jbecerra)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #10)
> dbaron or anyone else, could you construct a web page that calls
> NS_NewStyleContext in as tight a loop as we can manage? I'm trying to make
> this as reproduceable as possible.
> 
> Juan or anyone with access to that machine, could you play around with
> running other programs at the same time as Firefox to see if any other
> programs cause this crash to happen more regularly? Graphics-intensive
> programs in particular may make this easier to reproduce.

Running multiple programs at the same time seems to make the crash happen sooner. I'll keep trying a few more times, but this last time it took maybe a half hour or so for it to crash on 21.0b4 with several applications open.
(In reply to bhavana bajaj [:bajaj] from comment #12)
> This crash seems to have spiked as of today, Adding needsinfo on Kairo to
> get some data around # of unique affected users.

As the vast majority of the crashes are from one signature, I'll give you the installations overview of that one (this is for yesterday only):

breakpad=> SELECT version,COUNT(*) as crashes,COUNT(DISTINCT client_crash_date - install_age  * interval '1 second') as installations FROM reports WHERE product='Firefox' AND signature LIKE 'mozilla::dom::DocumentBinding::CreateInterfaceObjects%' AND utc_day_is(date_processed, '2013-04-28') GROUP BY version;
 version | crashes | installations 
---------+---------+---------------
 21.0    |   18377 |          7460
(1 row)
Flags: needinfo?(kairo)
For multiple days the ratio of crashes/installation becomes even higher.

breakpad=> SELECT version,COUNT(*) as crashes,COUNT(DISTINCT client_crash_date - install_age  * interval '1 second') as installations FROM reports WHERE product='Firefox' AND signature LIKE 'mozilla::dom::DocumentBinding::CreateInterfaceObjects%' AND date_processed BETWEEN '2013-04-24' AND '2013-04-29' GROUP BY version;
 version | crashes | installations 
---------+---------+---------------
 21.0    |   44123 |         12495
(1 row)
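The crashes-per-installation ratio can be checked directly from the two result sets above. This is only a minimal sketch: the helper name `crashesPerInstallation` is made up for illustration and is not part of any Mozilla tooling; the counts are copied from the query output.

```javascript
// Hypothetical helper, named here for illustration only.
function crashesPerInstallation(crashes, installations) {
  return crashes / installations;
}

// Counts copied from the two query results above.
const oneDayRatio = crashesPerInstallation(18377, 7460);    // single-day query
const multiDayRatio = crashesPerInstallation(44123, 12495); // multi-day query

console.log(oneDayRatio.toFixed(2), multiDayRatio.toFixed(2));
```

A ratio well above 1 means the same installations are crashing repeatedly, which matches the "Last Crash" intervals seen in individual reports.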
Unfortunately I haven't been able to reproduce the problem reliably and within a short period of time. This morning I was able to crash within the first five minutes, but after that I wasn't able to get it to reproduce within a short time.

Whenever I have been able to reproduce the problem, it has been while playing a long playlist of YouTube videos and trying to send email using outlook.com. Sometimes it just takes time for this to happen while the videos are playing. In addition, I get a system dialog saying the plugin container has stopped working.
Because of the nature of the memory corruption here, it's going to exist in a bunch of different signatures:

40335 (3.1): mozilla::dom::DocumentBinding::CreateInterfaceObjects(JSContext*, JSObject*, JSObject**)
10543 (8.0): JSCompartment::getNewType(JSContext*, js::Class*, js::TaggedProto, JSFunction*)
3623 (2.0): nsStyleSet::ReparentStyleContext(nsStyleContext*, nsStyleContext*, mozilla::dom::Element*)
3196 (1.0): nsFrameManager::ReResolveStyleContext(nsPresContext*, nsIFrame*, nsIContent*, nsStyleChangeList*, nsChangeHint, nsChangeHint, nsRestyleHint, mozilla::css::RestyleTracker&, nsFrameManager::DesiredA11yNotifications, nsTArray<nsIContent*>&, TreeMatchConte...
2454 (4.9): JS_GetCompartmentPrincipals(JSCompartment*)
1968 (2.1): nsStyleSet::ResolveStyleFor(mozilla::dom::Element*, nsStyleContext*, TreeMatchContext&)
822 (7.9): js::detail::HashTable<js::AtomStateEntry const, js::HashSet<js::AtomStateEntry, js::AtomHasher, js::SystemAllocPolicy>::SetOps, js::SystemAllocPolicy>::lookup(js::AtomHasher::Lookup const&, unsigned int, unsigned int)
521 (3.5): nsStyleContext::AddChild(nsStyleContext*)
494 (2.0): nsStyleSet::ResolveAnonymousBoxStyle(nsIAtom*, nsStyleContext*)
124 (5.3): js::NewObjectWithGivenProto(JSContext*, js::Class*, js::TaggedProto, JSObject*, js::gc::AllocKind, js::NewObjectKind)
80 (3.1): nsStyleContext::nsStyleContext(nsStyleContext*, nsIAtom*, nsCSSPseudoElements::Type, nsRuleNode*)
71 (2.5): nsStyleSet::ProbePseudoElementStyle(mozilla::dom::Element*, nsCSSPseudoElements::Type, nsStyleContext*, TreeMatchContext&)
63 (2.3): nsStyleSet::GetContext(nsStyleContext*, nsRuleNode*, nsRuleNode*, nsIAtom*, nsCSSPseudoElements::Type, mozilla::dom::Element*, unsigned int)
42 (5.2): mozilla::Preferences::AddBoolVarCache(bool*, char const*, bool)
39 (5.3): SelectorMatches
30 (1.8): firefox.exe@0x10203
29 (1.0): nsStyleContext::CalcStyleDifference(nsStyleContext*, nsChangeHint)
28 (1.8): firefox.exe@0x60203
27 (4.0): RuleHash::EnumerateAllRules(mozilla::dom::Element*, ElementDependentRuleProcessorData*, NodeMatchContext&)
27 (2.2): nsStyleSet::ResolveStyleByAddingRules(nsStyleContext*, nsCOMArray<nsIStyleRule> const&)
25 (3.9): nsRuleNode::WalkRuleTree(nsStyleStructID, nsStyleContext*)

The (number) is the average depth of ReResolveStyleContext in the stack.
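For anyone who wants to aggregate across these signatures, the list format above is easy to parse. The sketch below is an assumption about how one might do it, not the script used to produce the list; `parseSignatureLine` is a hypothetical name, and only the first three lines are included as sample input.

```javascript
// A few lines copied from the list above, in "count (depth): signature" form.
const sample = [
  '40335 (3.1): mozilla::dom::DocumentBinding::CreateInterfaceObjects(JSContext*, JSObject*, JSObject**)',
  '10543 (8.0): JSCompartment::getNewType(JSContext*, js::Class*, js::TaggedProto, JSFunction*)',
  '3623 (2.0): nsStyleSet::ReparentStyleContext(nsStyleContext*, nsStyleContext*, mozilla::dom::Element*)',
];

// Hypothetical parser for that line format.
function parseSignatureLine(line) {
  const m = /^(\d+) \(([\d.]+)\): (.+)$/.exec(line);
  if (!m) throw new Error('unrecognized line: ' + line);
  return { count: Number(m[1]), depth: Number(m[2]), signature: m[3] };
}

const rows = sample.map(parseSignatureLine);
const sampleTotal = rows.reduce((sum, r) => sum + r.count, 0);

// Print each signature's share of the sample alongside its average depth.
for (const r of rows) {
  const share = (100 * r.count / sampleTotal).toFixed(1);
  console.log(share + '% of sample, depth ' + r.depth + ': ' + r.signature.split('(')[0]);
}
```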

I'm working on some scripts to scan the stack memory to see whether these different crashes all share the call stack ReResolveStyleContext -> NS_NewStyleContext -> nsStyleContext::nsStyleContext -> nsStyleContext::AddChild -> weeds. I ran a small sample of the minidumps through a memory checker and I haven't encountered any corrupted .text memory yet.

On IRC we were speculating that perhaps an interrupt handler is clobbering some register by accident, so I'm going to focus my investigation on whether there is a particular register clobber which might produce the varied effects seen in this bug.
Assignee: nobody → benjamin
probably something like:

  <div id="d">
    <div></div>
    <!-- repeat to get a decent number of children -->
  </div>


<script>

var d = document.getElementById("d");
var cs = getComputedStyle(d, "");
while (true) { /* or break it up to avoid the slow script dialog */
  d.style.transform = 'translate(1px)';
  cs.color; /* flush style */
  d.style.transform = 'translate(2px)';
  cs.color; /* flush style */
}

</script>
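One possible way to "break it up", as the comment in the loop suggests, is to run the same two style flushes in bounded batches and yield to the event loop between them. This is a sketch only: the function names, the batch size of 1000, and the use of `setTimeout` are assumptions made for illustration, not something specified in this bug.

```javascript
// Run the same restyle/flush pair as the loop above, a fixed number of times.
function runRestyleBatch(el, computedStyle, iterations) {
  for (let i = 0; i < iterations; i++) {
    el.style.transform = 'translate(1px)';
    void computedStyle.color; // flush style
    el.style.transform = 'translate(2px)';
    void computedStyle.color; // flush style
  }
}

// Run one batch, then ask `schedule` to run the next one later, so the page
// stays responsive and the slow-script dialog is not triggered.
function startRestyleLoop(el, computedStyle, batchSize, schedule) {
  (function tick() {
    runRestyleBatch(el, computedStyle, batchSize);
    schedule(tick);
  })();
}

// Only start automatically when running inside a page with the markup above.
if (typeof document !== 'undefined') {
  const d = document.getElementById('d');
  startRestyleLoop(d, getComputedStyle(d, ''), 1000, (fn) => setTimeout(fn, 0));
}
```

Passing the scheduler in as a parameter keeps the batch logic separate from the timer, so `requestAnimationFrame` or another callback source can be substituted without touching the loop body.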
Flags: needinfo?(dbaron)
(In reply to comment #21)
> On IRC we were speculating that perhaps an interrupt handler is clobbering some
> register by accident, so I'm going focus my investigation on whether there is a
> particular register clobber which might produce the varied effects seen in this
> bug.

I think the most likely explanation is something corrupting the heap, causing the second mov instruction in the assembly code you showed me the other day to read a zero value into edx.
I have trouble believing that heap corruption could explain why this crash hits just this function-tree and not others and has the other characteristics (varying volume per beta even on the same cset).

More data from running dumplookup on a sample of crashes with "ReResolveStyleContext" somewhere near the top of the stack:

nsStyleContext::nsStyleContext is present as a return address on the stack in 98% of the crashes, and I think the others are unrelated to this.

Of the crashes which had nsStyleContext::nsStyleContext as a return location, most of them are returning to http://hg.mozilla.org/releases/mozilla-beta/annotate/04aba2e6927f/layout/style/nsStyleContext.cpp#l70 which means that they called nsStyleContext::AddChild and crashed there.

https://crash-stats.mozilla.com/report/index/f46d376f-43e4-428a-96fd-0f15a2130430

In this case, AddChild has returned successfully and we have executed another two instructions:

http://pastebin.mozilla.org/2379752 (crash is at line 78 dereferencing eax+0x24 which should be this->mRuleNode). Registers: $EAX=0xa000c7de $EBX=0x5bc7d765 $ECX=0x18e3dc61

Also:
* $EBX is supposed to be `this`, but it's pointing at code within nsStyleContext::AddChild and cannot be a valid heap address.
* $EAX is the correct value of *($EBX+0xc).
* $ECX is a heap address, and we just finished a MOV ECX,EBX above, but it's an odd number.
* According to my read of the stack memory, the actual value of `this` should be 0x18e3dc78, i.e. *(return address + 4).
* http://pastebin.mozilla.org/2379837 is the disassembly of nsStyleContext::AddChild. It only modifies EAX and EDX and never changes EBX or ECX or makes any calls.
* So $EBX should really be identical to $ECX when we hit this crash.
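The "odd number" observation can be made concrete with a quick alignment check on the reported values. This is a sketch under a stated assumption: that object pointers in this build are at least 4-byte aligned. The helper names are hypothetical; the register values and the expected `this` are copied from the comment above.

```javascript
// Register values copied from the crash report discussed above.
const regs = { eax: 0xa000c7de, ebx: 0x5bc7d765, ecx: 0x18e3dc61 };
// Value of `this` recovered from stack memory in the same comment.
const expectedThis = 0x18e3dc78;

// Assumption for illustration: object pointers are at least 4-byte aligned.
function isAligned(addr, alignment = 4) {
  return addr % alignment === 0;
}

// Absolute difference between two addresses, as a hex string.
function hexDelta(a, b) {
  return '0x' + Math.abs(a - b).toString(16);
}

console.log('ECX aligned:', isAligned(regs.ecx));
console.log('expected this aligned:', isAligned(expectedThis));
console.log('expected this - ECX:', hexDelta(expectedThis, regs.ecx));
```

Under that assumption, ECX fails the alignment check while the value recovered from the stack passes it, and the two differ only in the low byte.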
More details from https://crash-stats.mozilla.com/report/index/c3abc1c8-fd11-4db6-940e-dba802130425 which is an EXCEPTION_ILLEGAL_INSTRUCTION:

Stack memory shows:
*ESP: nsStyleContext::AddChild + 2 bytes
*(ESP + 4): nsStyleContext::nsStyleContext[70] (returning from AddChild)
*(ESP + 8): EDI saved by nsStyleContext::nsStyleContext
*(ESP + 12): ESI saved by nsStyleContext::nsStyleContext
*(ESP + 16): EBX saved by nsStyleContext::nsStyleContext
*(ESP + 20): ECX saved by nsStyleContext::nsStyleContext
*(ESP + 24): return to NS_NewStyleContext from nsStyleContext::nsStyleContext

The minidump can't show memory corruption in nsStyleContext::AddChild, but I'm still betting that the first two bytes of that function are being corrupted into a 2-byte jump instruction which happens to end up in CreateInterfaceObjects.

The exact offset may vary.

Ehsan, does this sound like a reasonable guess? If so, I'm going to have Juan's machine shipped to me and try to add memory watchpoints in a kernel debugger.
(In reply to comment #25)
> Ehsan, does this sound like a reasonable guess? If so, I'm going to have Juan's
> machine shipped to me and try to add memory watchpoints in a kernel debugger.

You've definitely convinced me of the likelihood of the register corruption!  Thanks for the detailed analysis!
Removing the topcrash keyword because:
1. it no longer happens in 21.0b5 and above
2. bug 772330 comment 19
Keywords: topcrash
Since this is no longer a top crash I'm removing qawanted. QA will continue to dogfood on our AMD netbooks periodically with Beta and RCs.
Keywords: qawanted
(In reply to Robert Kaiser (:kairo@mozilla.com) [away until early June] from comment #18)
> (In reply to bhavana bajaj [:bajaj] from comment #12)
> > This crash seems to have spiked as of today, Adding needsinfo on Kairo to
> > get some data around # of unique affected users.
> 
> As the vast majority of the crashes is from one signature, I'll give you the
> installations overview of that one (this is for yesterday only):
> 
> breakpad=> SELECT version,COUNT(*) as crashes,COUNT(DISTINCT
> client_crash_date - install_age  * interval '1 second') as installations
> FROM reports WHERE product='Firefox' AND signature LIKE
> 'mozilla::dom::DocumentBinding::CreateInterfaceObjects%' AND
> utc_day_is(date_processed, '2013-04-28') GROUP BY version;
>  version | crashes | installations 
> ---------+---------+---------------
>  21.0    |   18377 |          7460
> (1 row)

For comparison, we had 1,583,848 ADI on 21.0b4 on that day.


(In reply to Robert Kaiser (:kairo@mozilla.com) [away until early June] from comment #19)
> For multiple days the ratio of crashes/installation becomes even higher.
> 
> breakpad=> SELECT version,COUNT(*) as crashes,COUNT(DISTINCT
> client_crash_date - install_age  * interval '1 second') as installations
> FROM reports WHERE product='Firefox' AND signature LIKE
> 'mozilla::dom::DocumentBinding::CreateInterfaceObjects%' AND date_processed
> BETWEEN '2013-04-24' AND '2013-04-29' GROUP BY version;
>  version | crashes | installations 
> ---------+---------+---------------
>  21.0    |   44123 |         12495
> (1 row)

We had 5,873,863 ADI pings on 21.0b4 over those days.
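To put the two comparisons side by side, the crash counts can be expressed per 100 ADI. A rough sketch only: `crashesPer100Adi` is a hypothetical helper, and the result is a crash-report rate rather than a share of affected users, since one installation can submit many reports and the queries were grouped by version 21.0 rather than by beta.

```javascript
// Hypothetical helper: crash reports per 100 ADI.
function crashesPer100Adi(crashes, adi) {
  return (100 * crashes) / adi;
}

// Crash counts from the quoted queries, ADI figures from this comment.
const singleDayRate = crashesPer100Adi(18377, 1583848);
const windowRate = crashesPer100Adi(44123, 5873863);

console.log(singleDayRate.toFixed(2), windowRate.toFixed(2));
```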
ping?
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #26)
> (In reply to comment #25)
> > Ehsan, does this sound like a reasonable guess? If so, I'm going to have Juan's
> > machine shipped to me and try to add memory watchpoints in a kernel debugger.
> 
> You've definitely convinced me of the likelihood of the register corruption!
> Thanks for the detailed analysis!

Any news? This (and possibly also bug 830531) is causing us to regenerate multiple release builds as a "workaround", so it is a priority for us.
Flags: needinfo?(benjamin)
No news yet.
Flags: needinfo?(benjamin)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE