Closed Bug 1124740 Opened 9 years ago Closed 9 years ago

Frequent Windows 7 opt/PGO dromaeo_dom | application crashed [@ mozalloc_abort(char const * const)]

Categories

(Testing :: Talos, defect)

x86
Windows 7
defect
Not set
normal

Tracking

(e10s+)

RESOLVED WONTFIX
Tracking Status
e10s + ---

People

(Reporter: RyanVM, Unassigned)

References

Details

(Keywords: intermittent-failure)

+++ This bug was initially created as a clone of Bug #872788 +++

Spun off from the recent spike in bug 872788. I'm working on backfilling some data to figure out what regressed it this time. It's near perma-fail on PGO and rarely failing (about 1 in 25 runs) on regular opt builds. Early suspect within the regression range is yesterday's Talos update.

I wanted to pin this on some recent GC changes landed by Jon, but there were still PGO runs after his push that were solidly green vs. the near-permafail we've got now.
Confirmed to be from the Talos update.
Blocks: 1123852
OK, we should back out the Talos update; we can test on try to narrow down the culprit.
Looking into this more on the try server: we fail on the first page of dromaeo dom. In fact, dromaeo css runs completely; it is in DOM that we fail (http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jmaher@mozilla.com-8b573f27ae2d/try-win32/try_win7-ix_test-dromaeojs-bm110-tests1-windows-build768.txt.gz):

19:36:24     INFO -  RSS: Main: 115236864
19:36:24     INFO -  Cycle 1(1): loaded http://localhost/page_load_test/dromaeo/dom-attr.html (next: http://localhost/page_load_test/dromaeo/dom-modify.html)
19:36:24     INFO -  RSS: Main: 152506368
19:36:24     INFO -  PROCESS-CRASH | dromaeo_dom | application crashed [@ mozalloc_abort(char const * const)]
19:36:24     INFO -  Crash dump filename: c:\users\cltbld\appdata\local\temp\tmpt7pu7m\profile\minidumps\0f945a4a-a2af-4131-8084-0ef7ed2dad75.dmp
19:36:24     INFO -  Operating system: Windows NT
19:36:24     INFO -                    6.1.7601 Service Pack 1
19:36:24     INFO -  CPU: x86
19:36:24     INFO -       GenuineIntel family 6 model 30 stepping 5
19:36:24     INFO -       8 CPUs
19:36:24     INFO -  Crash reason:  EXCEPTION_BREAKPOINT
19:36:24     INFO -  Crash address: 0x743e1abe
19:36:24     INFO -  Thread 0 (crashed)
19:36:24     INFO -   0  mozalloc.dll!mozalloc_abort(char const * const) [mozalloc_abort.cpp:8b573f27ae2d : 37 + 0x0]
19:36:24     INFO -      eip = 0x743e1abe   esp = 0x001fd648   ebp = 0x001fd648   ebx = 0x00000000
19:36:24     INFO -      esi = 0x743e22c6   edi = 0x001fd692   eax = 0x00000000   ecx = 0x68e20a3e
19:36:24     INFO -      edx = 0x00000003   efl = 0x00000206
19:36:24     INFO -      Found by: given as instruction pointer in context
19:36:24     INFO -   1  mozalloc.dll!mozalloc_handle_oom(unsigned int) [mozalloc_oom.cpp:8b573f27ae2d : 50 + 0x8]
19:36:24     INFO -      eip = 0x743e1b3e   esp = 0x001fd650   ebp = 0x001fd698
19:36:24     INFO -      Found by: call frame info
19:36:24     INFO -   2  mozalloc.dll!moz_xmalloc [mozalloc.cpp:8b573f27ae2d : 54 + 0x5]
19:36:24     INFO -      eip = 0x743e1022   esp = 0x001fd6a0   ebp = 0x001fd6a8
19:36:24     INFO -      Found by: call frame info
19:36:24     INFO -   3  xul.dll!nsTextNode::CloneDataNode(mozilla::dom::NodeInfo *,bool) [nsTextNode.cpp:8b573f27ae2d : 118 + 0x7]
19:36:24     INFO -      eip = 0x6303387e   esp = 0x001fd6b0   ebp = 0x001fd6c0
19:36:24     INFO -      Found by: call frame info
19:36:24     INFO -   4  xul.dll!nsGenericDOMDataNode::Clone(mozilla::dom::NodeInfo *,nsINode * *) [nsGenericDOMDataNode.h:8b573f27ae2d : 179 + 0xc]
19:36:24     INFO -      eip = 0x6302fe28   esp = 0x001fd6c8   ebp = 0x001fd6d0
19:36:24     INFO -      Found by: call frame info
19:36:24     INFO -   5  xul.dll!nsNodeUtils::CloneAndAdopt(nsINode *,bool,bool,nsNodeInfoManager *,JS::Handle<JSObject *>,nsCOMArray<nsINode> &,nsINode *,nsINode * *) [nsNodeUtils.cpp:8b573f27ae2d : 364 + 0x20]
19:36:24     INFO -      eip = 0x62defc81   esp = 0x001fd6d8   ebp = 0x001fd818
19:36:24     INFO -      Found by: call frame info
19:36:24     INFO -   6  xul.dll!nsNodeUtils::CloneAndAdopt(nsINode *,bool,bool,nsNodeInfoManager *,JS::Handle<JSObject *>,nsCOMArray<nsINode> &,nsINode *,nsINode * *) [nsNodeUtils.cpp:8b573f27ae2d : 497 + 0x26]
19:36:24     INFO -      eip = 0x62defdd5   esp = 0x001fd820   ebp = 0x001fd970
19:36:24     INFO -      Found by: call frame info
19:36:24     INFO -   7  xul.dll!nsNodeUtils::CloneAndAdopt(nsINode *,bool,bool,nsNodeInfoManager *,JS::Handle<JSObject *>,nsCOMArray<nsINode> &,nsINode *,nsINode * *) [nsNodeUtils.cpp:8b573f27ae2d : 497 + 0x26]
19:36:24     INFO -      eip = 0x62defdd5   esp = 0x001fd978   ebp = 0x001fdac8
19:36:24     INFO -      Found by: call frame info
19:36:24     INFO -   8  xul.dll!nsNodeUtils::CloneAndAdopt(nsINode *,bool,bool,nsNodeInfoManager *,JS::Handle<JSObject *>,nsCOMArray<nsINode> &,nsINode *,nsINode * *) [nsNodeUtils.cpp:8b573f27ae2d : 497 + 0x26]
19:36:24     INFO -      eip = 0x62defdd5   esp = 0x001fdad0   ebp = 0x001fdc20
19:36:24     INFO -      Found by: call frame info
19:36:24     INFO -   9  xul.dll!nsNodeUtils::CloneNodeImpl(nsINode *,bool,nsINode * *) [nsNodeUtils.cpp:8b573f27ae2d : 300 + 0x15]
19:36:24     INFO -      eip = 0x63237ed8   esp = 0x001fdc28   ebp = 0x001fdc5c
19:36:24     INFO -      Found by: call frame info
19:36:24     INFO -  10  xul.dll!nsINode::CloneNode(bool,mozilla::ErrorResult &) [nsINode.cpp:8b573f27ae2d : 2704 + 0x5]
19:36:24     INFO -      eip = 0x62deec60   esp = 0x001fdc64   ebp = 0x001fdc74
19:36:24     INFO -      Found by: call frame info
19:36:24     INFO -  11  xul.dll!mozilla::dom::NodeBinding::cloneNode [NodeBinding.cpp:8b573f27ae2d : 779 + 0x5]
19:36:24     INFO -      eip = 0x631554a5   esp = 0x001fdc7c   ebp = 0x001fdcf0
19:36:24     INFO -      Found by: call frame info
19:36:24     INFO -  12  mscms.dll + 0x1a03f
19:36:24     INFO -      eip = 0x72f7a040   esp = 0x001fdcf4   ebp = 0x0da98cc0
19:36:24     INFO -      Found by: call frame info


Is it possible we have an OOM?  We run dromaeo dom on windows 7 pgo for e10s, so I know this works.  The difference here is that we are using the messagemanager to communicate from chrome->content without browser.tabs.remote=True.

bz- do you have thoughts on anything to try so we could get dromaeo dom running on windows 7 pgo builds?
Flags: needinfo?(bzbarsky)
Not offhand.  This seems like a problem fundamentally similar to the one in https://bugs.webkit.org/show_bug.cgi?id=95376 except with a different test.

I assume we're crashing on the dom-modify test, and in particular on the cloneNode test? That test looks like this:

	test( "cloneNode", function(){
		for ( var i = 0; i < elems.length; i++ ) {
			ret = elems[i].cloneNode(false);
			ret = elems[i].cloneNode(true);
			ret = elems[i].cloneNode(true);
		}
	});

If we fail to GC the no-longer-referenced nodes expeditiously here, we could in fact OOM.  The faster we run compared to the available memory, the more likely we are to OOM.

So the first obvious things to try are slower machines or more memory (though we may also be running out of address space rather than RAM).
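
To make the hypothesis above concrete, here is a toy model in JavaScript. It is only a sketch: `peakUncollectedBytes` and all of its parameters (clone sizes, collection interval, run count) are made up for illustration and say nothing about how Gecko actually schedules GC or CC. It just shows that the peak amount of not-yet-reclaimed clone memory grows with the number of loop iterations that complete between collections, which is consistent with a faster build being more likely to OOM.

```javascript
// Toy model, NOT Gecko's allocator or GC. All names and numbers are assumptions.
// Each iteration mirrors the test body: one shallow clone and two deep clones.
function peakUncollectedBytes({ elems, shallowBytes, deepBytes, iterationsBetweenGCs, runs }) {
  const bytesPerIteration = shallowBytes + 2 * deepBytes;
  let live = 0;     // bytes allocated and not yet reclaimed
  let peak = 0;     // highest value `live` ever reached
  let sinceGC = 0;  // iterations completed since the last (simulated) collection

  for (let r = 0; r < runs; r++) {
    for (let i = 0; i < elems; i++) {
      // `ret` is overwritten each time, so the clones are garbage immediately,
      // but they keep occupying memory until a collection happens.
      live += bytesPerIteration;
      sinceGC++;
      if (live > peak) peak = live;

      if (sinceGC >= iterationsBetweenGCs) {
        live = 0;   // assume a collection reclaims everything unreferenced
        sinceGC = 0;
      }
    }
  }
  return peak;
}

// Frequent collections keep the peak small.
const frequent = peakUncollectedBytes({
  elems: 100, shallowBytes: 100, deepBytes: 1000, iterationsBetweenGCs: 10, runs: 1,
}); // 21000

// If no collection happens during 5 runs of the test, garbage accumulates linearly.
const never = peakUncollectedBytes({
  elems: 100, shallowBytes: 100, deepBytes: 1000, iterationsBetweenGCs: 1000, runs: 5,
}); // 1050000
```

In this model the only thing that changes between the two calls is how many iterations finish before a collection runs; the work done per iteration is identical.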
Flags: needinfo?(bzbarsky)
Looking at Dromaeo DOM, we fail almost 100% of the time when run in e10s mode, and about 20% of the time when running the e10s code but not in e10s mode.  In addition, we still fail about 5% of the time without any e10s code path changes.

As it stands we don't have the ability to change our machines, but all the failures indicate we are running low on memory.

So the question becomes- how can we solve this problem?
* disable dromaeo_dom for win7?
* reduce the scope of dromaeo_dom in general?
Flags: needinfo?(bzbarsky)
Have we verified that my hypothesis from comment 17 is correct?

If it is, then maybe we should look into why it is that we're not triggering GC (or CC?  Or the actual deletions?) expeditiously enough here...
Flags: needinfo?(bzbarsky)
DeferredRelease or SnowWhite could keep the objects alive, even if CC/GC run.
I wonder where we could release stuff.
I ran talos a few times on low memory (1GB, 512MB) win7 vm (note: this isn't 100% like production).  Both times it passed just fine.  I will continue to try a few things- open to more direction.
I did a handful more runs on my vm, all with success.

A few things:
* when running in --e10s mode, nothing displays in the browser, but we do load all the pages and collect results
* I tried between 512MB and 2GB of RAM
* I tried 1 and 2 cpu cores

In every case, I saw success.  While watching the task manager (there is a cpu/memory real time graph), I see that we spike the cpu with 1 core all the time, and hang out between 50% and 60% with the two cores.  For the memory, we seem to keep it <1GB at all times.

This is the hardware profile of the machines which run Talos:
https://wiki.mozilla.org/Buildbot/Talos/Misc#Hardware_Profile_of_machines_used_in_automation

I assume that with 8GB of RAM for 4 nodes, we have 2GB of RAM for a given machine.
I have a loaner machine and have run the exact steps that we have from an official job.  After 5 runs, I continue to get complete runs and real results.

The only difference between my setup and what we run in production is that there is no buildbot script kicking off the mozharness script.  There might be slight machine differences, as the machine's permissions model was adjusted slightly to allow me into it.

As it stands, I am out of options for debugging this.  I would be open to other ideas.
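
One way to read "5 green runs" is to ask how likely that outcome would be if the failure were still present. The sketch below assumes runs are independent, which may not hold if failures cluster by build or machine, and plugs in the approximate failure rates quoted earlier in this bug (about 20% and about 5%). The function names are hypothetical helpers, not part of Talos or mozharness.

```javascript
// Probability that `runs` runs all pass, given a per-run failure rate.
// Assumes independent runs (an assumption, not something this bug establishes).
function probAllPass(failureRate, runs) {
  return Math.pow(1 - failureRate, runs);
}

// Smallest number of consecutive passes needed so that seeing them all pass
// would have probability <= alpha if the true failure rate were `failureRate`.
function runsNeeded(failureRate, alpha) {
  if (!(failureRate > 0 && failureRate < 1) || !(alpha > 0 && alpha < 1)) {
    throw new RangeError("failureRate and alpha must be strictly between 0 and 1");
  }
  return Math.ceil(Math.log(alpha) / Math.log(1 - failureRate));
}

const fiveGreenAt20 = probAllPass(0.20, 5); // about 0.33
const fiveGreenAt5 = probAllPass(0.05, 5);  // about 0.77
const needAt20 = runsNeeded(0.20, 0.05);    // 14
const needAt5 = runsNeeded(0.05, 0.05);     // 59
```

Under these assumptions, five passing runs are not strong evidence that the machine cannot reproduce a 5% or 20% intermittent; only a near-100% failure rate would be clearly ruled out by so few runs.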
bug 1094645 appears to be a dupe: different test, same crash.
Depends on: 1094645
what is odd here is that we go for days where win7 dromaeo_dom pgo fails, then we go for days where it works.  That looks to me like it is an issue with how we do pgo and we just keep changing our library sizes, etc.  No idea though, just a thought.
here we are again with dromaeo_dom issues- we have disabled this on osx 10.6, winxp, now we face the issues on win7.

bz- any thoughts here?  I would prefer to disable dromaeo_dom on win7 so we don't have to hide the job from treeherder.
Flags: needinfo?(bzbarsky)
Where would that leave the test enabled?
Flags: needinfo?(bzbarsky)
we would have it on:
linux32
linux64
osx10.10
windows 8 x64

I don't know of any frequent issues on those platforms.
Depends on: 1153191
(In reply to Joel Maher (:jmaher) from comment #97)
> here we are again with dromaeo_dom issues- we have disabled this on osx
> 10.6, winxp, now we face the issues on win7.
> 
> bz- any thoughts here?  I would prefer to disable dromaeo_dom on win7 so we
> don't have to hide the job from treeherder.

I filed a parent bug on the dom crash which I'll try to get some activity on. If we don't get traction on that in the next week or so, let's just disable.
thanks for filing bug 1153191.
Blocks: 1154404
disabled the test, we won't see this until it is live again.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
err, fixed in bug 1191952
Resolution: FIXED → WONTFIX