798849 - Linux32 opt±pgo browser-chrome nearly completely broken

Reporter

Description

•

12 years ago

We had an episode of this last June/July, and we're very much in the middle of another episode of the same old thing.

The first stage is when all of the browser_tilt_ browser-chrome tests start hitting OOM crashes. Once we give in and disable all of them, the second stage is bug 761049, when browser_bug666317.js gets a NS_ERROR_NOT_AVAILABLE. Once we give in and disable it, the third stage is a diaspora: bug 766826 (which I expanded to a few more this time around) plus anything that thinks using nsIZipReader.open will go well for it, a rather large class including all the many extension tests, and some less obvious things like the js crash in bug 761488.

On inbound right now, five of the last six failed, which is pretty typical (and that's *after* the first two stages of pretending we can disable our way to victory). I'd try to stand it, because hey, it's just browser-chrome and just Linux being destroyed, but now we're into the part of the cycle where so many tests fail that tbpl hits several unresponsive script warnings trying to open the comment form, and puts the submit button off screen because it has so many suggested bugs.

Phil Ringnalda (:philor)

Reporter

Updated

•

12 years ago

Blocks: 761488

Phil Ringnalda (:philor)

Reporter

Updated

•

12 years ago

Blocks: 763285

Phil Ringnalda (:philor)

Reporter

Updated

•

12 years ago

Blocks: 762893

Phil Ringnalda (:philor)

Reporter

Updated

•

12 years ago

Blocks: 761049, 766826, 759157

(not currently active) Ted Mielczarek

Comment 1

•

12 years ago

Did we introduce some kind of horrible memory corruption that we only hit on browser-chrome tests? Perhaps we should try a Valgrind run of them on Linux32?

Phil Ringnalda (:philor)

Reporter

Updated

•

12 years ago

Blocks: 798372

Phil Ringnalda (:philor)

Reporter

Updated

•

12 years ago

Blocks: 764002

Julian Seward [:jseward]

Comment 2

•

12 years ago

On Linux64, I managed to get this, with 
content/browser/browser/devtools/tilt/test/browser_tilt_02_notifications-seq.js
although to tell the truth, I am not convinced this isn't a Valgrind
false positive (it isn't so hot on 16-bit comparisons).  Anyway:

TEST-INFO | chrome://mochitests/content/browser/browser/devtools/tilt/test/browser_tilt_02_notifications-seq.js | Tilt was opened.
TEST-INFO | chrome://mochitests/content/browser/browser/devtools/tilt/test/browser_tilt_02_notifications-seq.js | Handling the INITIALIZING notification.

Conditional jump or move depends on uninitialised value(s)
   at 0x6EC9F98: bool mozilla::WebGLElementArrayCache::Validate<unsigned short>(unsigned short, unsigned long, unsigned long) (WebGLElementArrayCache.cpp:488)
   by 0x6ECA217: mozilla::WebGLElementArrayCache::Validate(unsigned int, unsigned int, unsigned long, unsigned long) (WebGLElementArrayCache.cpp:526)
   by 0x6EB47B4: mozilla::WebGLContext::DrawElements(unsigned int, int, unsigned int, long) (WebGLContext.h:1563)
   by 0x7855D65: mozilla::dom::WebGLRenderingContextBinding::drawElements(JSContext*, JS::Handle<JSObject*>, mozilla::WebGLContext*, unsigned int, JS::Value*) (WebGLRen\
deringContextBinding.cpp:4363)
   by 0x78461C9: mozilla::dom::WebGLRenderingContextBinding::genericMethod(JSContext*, unsigned int, JS::Value*) (WebGLRenderingContextBinding.cpp:8853)
   by 0x7E04510: js::InvokeKernel(JSContext*, JS::CallArgs, js::MaybeConstruct) (jscntxtinlines.h:364)
   by 0x7E04C4C: js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value*, JS::Value*) (jsinterp.h:109)
   by 0x7E3FE59: js::IndirectProxyHandler::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jsproxy.cpp:450)
   by 0x7EAAB47: js::DirectWrapper::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jswrapper.cpp:390)
   by 0x7EAC9F6: js::CrossCompartmentWrapper::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jswrapper.cpp:722)
   by 0x7E458D6: js::Proxy::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jsproxy.cpp:2462)
   by 0x7E4590A: proxy_Call(JSContext*, unsigned int, JS::Value*) (jsproxy.cpp:3002)
   by 0x7E0472B: js::InvokeKernel(JSContext*, JS::CallArgs, js::MaybeConstruct) (jscntxtinlines.h:364)
   by 0x7DFF221: js::Interpret(JSContext*, js::StackFrame*, js::InterpMode) (jsinterp.cpp:2461)
   by 0x7E0411E: js::RunScript(JSContext*, JS::Handle<JSScript*>, js::StackFrame*) (jsinterp.cpp:324)
   by 0x7E04678: js::InvokeKernel(JSContext*, JS::CallArgs, js::MaybeConstruct) (jsinterp.cpp:378)
   by 0x7E04C4C: js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value*, JS::Value*) (jsinterp.h:109)
   by 0x7E3FE59: js::IndirectProxyHandler::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jsproxy.cpp:450)
   by 0x7EAAB47: js::DirectWrapper::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jswrapper.cpp:390)
   by 0x7EAC9F6: js::CrossCompartmentWrapper::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jswrapper.cpp:722)

 Uninitialised value was created by a heap allocation
   at 0x4C2C62F: malloc (vg_replace_malloc.c:270)
   by 0x4032025: moz_xmalloc (mozalloc.cpp:57)
   by 0x6A1C1CF: nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity(unsigned int, unsigned int) (nsTArray.h:60)
   by 0x6A7B283: nsTArray_base<nsTArrayDefaultAllocator>::InsertSlotsAt(unsigned int, unsigned int, unsigned int, unsigned long) (nsTArray-inl.h:256)
   by 0x6ECA145: bool mozilla::WebGLElementArrayCache::Validate<unsigned short>(unsigned short, unsigned long, unsigned long) (nsTArray.h:1073)
   by 0x6ECA217: mozilla::WebGLElementArrayCache::Validate(unsigned int, unsigned int, unsigned long, unsigned long) (WebGLElementArrayCache.cpp:526)
   by 0x6EB47B4: mozilla::WebGLContext::DrawElements(unsigned int, int, unsigned int, long) (WebGLContext.h:1563)
   by 0x7855D65: mozilla::dom::WebGLRenderingContextBinding::drawElements(JSContext*, JS::Handle<JSObject*>, mozilla::WebGLContext*, unsigned int, JS::Value*) (WebGLRen\
deringContextBinding.cpp:4363)
   by 0x78461C9: mozilla::dom::WebGLRenderingContextBinding::genericMethod(JSContext*, unsigned int, JS::Value*) (WebGLRenderingContextBinding.cpp:8853)
   by 0x7E04510: js::InvokeKernel(JSContext*, JS::CallArgs, js::MaybeConstruct) (jscntxtinlines.h:364)
   by 0x7E04C4C: js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value*, JS::Value*) (jsinterp.h:109)
   by 0x7E3FE59: js::IndirectProxyHandler::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jsproxy.cpp:450)
   by 0x7EAAB47: js::DirectWrapper::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jswrapper.cpp:390)
   by 0x7EAC9F6: js::CrossCompartmentWrapper::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jswrapper.cpp:722)
   by 0x7E458D6: js::Proxy::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jsproxy.cpp:2462)
   by 0x7E4590A: proxy_Call(JSContext*, unsigned int, JS::Value*) (jsproxy.cpp:3002)
   by 0x7E0472B: js::InvokeKernel(JSContext*, JS::CallArgs, js::MaybeConstruct) (jscntxtinlines.h:364)
   by 0x7DFF221: js::Interpret(JSContext*, js::StackFrame*, js::InterpMode) (jsinterp.cpp:2461)
   by 0x7E0411E: js::RunScript(JSContext*, JS::Handle<JSScript*>, js::StackFrame*) (jsinterp.cpp:324)
   by 0x7E04678: js::InvokeKernel(JSContext*, JS::CallArgs, js::MaybeConstruct) (jsinterp.cpp:378)

TEST-INFO | chrome://mochitests/content/browser/browser/devtools/tilt/test/browser_tilt_02_notifications-seq.js | Longer timeout required, waiting longer...  Remaining timeouts: \
9

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Comment 3

•

12 years ago

Since the uninitialised value was created by a heap allocation near WebGL, cc'ing Jeff Gilbert.

Kelsey Gilbert [:jgilbert]

Comment 4

•

12 years ago

This is plausible. This is the new element array cache tree from bjacob, cc'ing.

Kelsey Gilbert [:jgilbert]

Updated

•

12 years ago

Component: General → Canvas: WebGL

Benoit Jacob [:bjacob] (mostly away)

Comment 5

•

12 years ago

Many thanks for the report. Thankfully, the Valgrind error reported by Julian reproduces on the compiled-code unit test that we added for this feature (in content/canvas/test/compiled) (Yes, we knew that this would be bug prone, so we added compiled code tests just for it).

Here's what I get in a debug build (Julian's is visibly optimized):

==27763== Conditional jump or move depends on uninitialised value(s)
==27763==    at 0x405752: unsigned char const& NS_MAX<unsigned char>(unsigned char const&, unsigned char const&) (nsAlgorithm.h:47)
==27763==    by 0x4046D1: mozilla::WebGLElementArrayCacheTree<unsigned char>::Update() (WebGLElementArrayCache.cpp:426)
==27763==    by 0x402F87: bool mozilla::WebGLElementArrayCache::Validate<unsigned char>(unsigned char, unsigned long, unsigned long) (WebGLElementArrayCache.cpp:484)
==27763==    by 0x402700: mozilla::WebGLElementArrayCache::Validate(unsigned int, unsigned int, unsigned long, unsigned long) (WebGLElementArrayCache.cpp:524)
==27763==    by 0x403450: void CheckValidateOneType<unsigned char>(mozilla::WebGLElementArrayCache&, unsigned long, unsigned long) (TestWebGLElementArrayCache.cpp:72)
==27763==    by 0x4028D8: CheckValidate(mozilla::WebGLElementArrayCache&, unsigned long, unsigned long) (TestWebGLElementArrayCache.cpp:83)
==27763==    by 0x4029E2: main (TestWebGLElementArrayCache.cpp:139)
==27763==  Uninitialised value was created by a heap allocation
==27763==    at 0x402C787: realloc (vg_replace_malloc.c:663)
==27763==    by 0x406215A: moz_xrealloc (mozalloc.cpp:89)
==27763==    by 0x402422: nsTArrayInfallibleAllocator::Realloc(void*, unsigned long) (nsTArray.h:60)
==27763==    by 0x406041: nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity(unsigned int, unsigned int) (nsTArray-inl.h:167)
==27763==    by 0x405A6A: nsTArray_base<nsTArrayDefaultAllocator>::InsertSlotsAt(unsigned int, unsigned int, unsigned int, unsigned long) (nsTArray-inl.h:256)
==27763==    by 0x405132: nsTArray<unsigned char, nsTArrayDefaultAllocator>::InsertElementsAt(unsigned int, unsigned int) (nsTArray.h:1062)
==27763==    by 0x403360: nsTArray<unsigned char, nsTArrayDefaultAllocator>::SetLength(unsigned int) (nsTArray.h:1022)
==27763==    by 0x402CB7: mozilla::WebGLElementArrayCacheTree<unsigned char>::ResizeToParentSize() (WebGLElementArrayCache.cpp:293)
==27763==    by 0x40254A: mozilla::WebGLElementArrayCache::BufferData(void const*, unsigned long) (WebGLElementArrayCache.cpp:443)
==27763==    by 0x4029CA: main (TestWebGLElementArrayCache.cpp:138)

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Comment 6

•

12 years ago

Benoit, you might want to run Valgrind with --track-origins=yes.

Benoit Jacob [:bjacob] (mostly away)

Comment 7

•

12 years ago

Comment 5 did that, look, it does give the origin of the uninitialized value.

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Comment 8

•

12 years ago

(In reply to Benoit Jacob [:bjacob] from comment #5)
> Here's what I get in a debug build (Julian's is visibly optimized):

> ==27763==  Uninitialised value was created by a heap allocation

Oh yes, you're right, sorry I misread.

Benoit Jacob [:bjacob] (mostly away)

Comment 9

•

12 years ago

So I'll try to debug this tomorrow --- but don't set your hopes too high that this will fix the browser-chrome issues. So far the WebGL bug is just a conditional-jump-based-on-uninitialized-value bug. There is still no theory about how it could result in the browser-chrome issues i.e. it may still be a red herring here.

Benoit Jacob [:bjacob] (mostly away)

Comment 10

•

12 years ago

Attached patch fix uninitialized values in WebGL element array cache — Details — Splinter Review

So there was a real bug there, although the fact that the very extensive unit test passes suggests that perhaps it didn't actually manifest itself for some reason. Not sure.

The bug was that when we allocate the storage for the element-array-cache tree, we did not initialize it. We relied on the tree Update() method to initialize just the parts that are needed given current invalidation. The problem with that approach was that in Update() it was not possible (AFAICS now) to distinguish between an existing tree entry that didn't need updating, and an uninitialized tree entry.

This patch fixes it by memset()-ing the tree data everytime it's resized (bufferData is expensive anyways). It also causes the tree to be completely invalidated on resize.

Runs valgrind-clean here.

Please ignore the part of this patch that changes TOOLS_DIRS to TEST_DIRS. Needed this to get the tests to compile locally, but then again this won't compile on Android. Will figure this out in a separate bug/patch.

Attachment #670013 - Flags: review?(jgilbert)

Kelsey Gilbert [:jgilbert]

Updated

•

12 years ago

Attachment #670013 - Flags: review?(jgilbert) → review+

Benoit Jacob [:bjacob] (mostly away)

Comment 11

•

12 years ago

Pushed to try. Expecting same Android build error as last time I had tried TEST_DIRS for compiled code tests there, but wanted to check if the issue was still present.
https://tbpl.mozilla.org/?tree=Try&rev=8d05762ff23e

Benoit Jacob [:bjacob] (mostly away)

Comment 12

•

12 years ago

Failed in the expected way; new try that is expected to build:

https://tbpl.mozilla.org/?tree=Try&rev=ae56c5539ec6

Benoit Jacob [:bjacob] (mostly away)

Comment 13

•

12 years ago

Erm, _this_ should work:
https://tbpl.mozilla.org/?tree=Try&rev=a9a10a1b87e2

Ed Morley [:emorley]

Updated

•

12 years ago

Assignee: nobody → bjacob

status-firefox18: --- → affected

tracking-firefox18: --- → ?

Masatoshi Kimura [:emk]

Comment 14

•

12 years ago

Didn't expect RAND_MAX to be so low on MSVC? :)

Benoit Jacob [:bjacob] (mostly away)

Comment 15

•

12 years ago

Oh hey, that is really funny. This is a static assertion in the compiled unit test. That means that on MSVC RAND_MAX is < 1e24. OK, will adapt to that dire reality.

Benoit Jacob [:bjacob] (mostly away)

Comment 16

•

12 years ago

https://tbpl.mozilla.org/?tree=Try&rev=0ddae7132e8d

Benoit Jacob [:bjacob] (mostly away)

Comment 17

•

12 years ago

How #$%^&, http://msdn.microsoft.com/en-us/library/2dfe3bzd.aspx says that RAND_MAX is 0x7fff. That is the minimum value allowed by the spec. OK, new try push coming.

Benoit Jacob [:bjacob] (mostly away)

Comment 18

•

12 years ago

https://tbpl.mozilla.org/?tree=Try&rev=67d8c0f64394

Masatoshi Kimura [:emk]

Comment 19

•

12 years ago

>+  enum { bitsToIgnore = 7 };
>   MOZ_STATIC_ASSERT((unsigned int)(RAND_MAX) >> (8 + bitsToIgnore),
0x7fff >> 15 would be zero. This assertion is broken.
Is this assertion needed in the first place? RAND_MAX is guaranteed to be greater than or equal to 0x7fff by the spec.

Benoit Jacob [:bjacob] (mostly away)

Comment 20

•

12 years ago

(In reply to Masatoshi Kimura [:emk] from comment #19)
> >+  enum { bitsToIgnore = 7 };
> >   MOZ_STATIC_ASSERT((unsigned int)(RAND_MAX) >> (8 + bitsToIgnore),
> 0x7fff >> 15 would be zero. This assertion is broken.

You're right, I wanted MOZ_STATIC_ASSERT((RAND_MAX >> bitsToIgnore) >= 255),

> Is this assertion needed in the first place? RAND_MAX is guaranteed to be
> greater than or equal to 0x7fff by the spec.

http://msdn.microsoft.com/en-us/library/2dfe3bzd.aspx says that in MSVC, it is 0x7fff. We have to build on MSVC.

Masatoshi Kimura [:emk]

Comment 21

•

12 years ago

(In reply to Benoit Jacob [:bjacob] from comment #20)
> http://msdn.microsoft.com/en-us/library/2dfe3bzd.aspx says that in MSVC, it
> is 0x7fff. We have to build on MSVC.
Yeah, I said greater than *or equal to* 0x7fff, and it is equal to 0x7fff on MSVC.
>     a[i] = static_cast<uint8_t>((unsigned int)(rand()) >> bitsToIgnore);
Even if RAND_MAX is 0x7fff, a[i] would be uniformly distributed between 0 to 255.

Benoit Jacob [:bjacob] (mostly away)

Comment 22

•

12 years ago

OK, right -- sorry I misread.
Anyway, the assertion is useless now that we don't need anything more than what the spec requires. Removed it. New try:
https://tbpl.mozilla.org/?tree=Try&rev=5e2e6cfe338b

Benoit Jacob [:bjacob] (mostly away)

Comment 23

•

12 years ago

Got the same issue about xpcom.lib not being found, that originally prompted the use of TOOLS_DIRS instead of TEST_DIRS --- which doesn't work anymore. Let's not block on this, instead, let's temporarily do content/canvas/test/compiled only on Linux where it works, and we'll then solve this problem on other platforms in a separate bug.

Benoit Jacob [:bjacob] (mostly away)

Comment 24

•

12 years ago

Try: https://tbpl.mozilla.org/?tree=Try&rev=a9b26a5a0326

bhavana bajaj [:bajaj]

Updated

•

12 years ago

tracking-firefox18: ? → +

Benoit Jacob [:bjacob] (mostly away)

Comment 25

•

12 years ago

Failed again, but this time the log doesn't seem to give a clue why. Now I have simply disabled content/canvas/test/compiled by omitting it in the Makefile. Let's see if that builds:
https://tbpl.mozilla.org/?tree=Try&rev=c4716762f837

Benoit Jacob [:bjacob] (mostly away)

Comment 26

•

12 years ago

Inbound:
http://hg.mozilla.org/integration/mozilla-inbound/rev/4476b1668b33

Now... here's the commit message:

> This fixes Valgrind errors while running TestWebGLElementArrayCache.
> 
> This also fixes the way we do content/canvas/test in the Makefiles, which had been broken by my attempt to enable compiled-code tests there (bug 732660).
> 
> Since I still don't know how to do compiled-code tests there, I've disabled them. At least we have the rest of content/canvas/test back.
> 
> Let's hope that no regression happened. 

I'm very sorry to have accidentally disabled content/canvas mochitests for 3 weeks. Let's hope that no regression slipped in.

I'll also file a separate bug about correctly enabling the compiled-code tests.

Anyway, does this fix the issue originally discussed here?

(no longer active)

Comment 27

•

12 years ago

Benoit, have you asked the build system folks about the proper way of doing compiled tests?

Masatoshi Kimura [:emk]

Comment 28

•

12 years ago

Is TOOLS_DIRS a typo of TOOL_DIRS?
https://mxr.mozilla.org/mozilla-central/search?string=TOOLS_DIRS
https://mxr.mozilla.org/mozilla-central/search?string=TOOL_DIRS

Benoit Jacob [:bjacob] (mostly away)

Comment 29

•

12 years ago

(In reply to Masatoshi Kimura [:emk] from comment #28)
> Is TOOLS_DIRS a typo of TOOL_DIRS?
> https://mxr.mozilla.org/mozilla-central/search?string=TOOLS_DIRS
> https://mxr.mozilla.org/mozilla-central/search?string=TOOL_DIRS

Ouch. Seems like it!

(In reply to Ehsan Akhgari [:ehsan] from comment #27)
> Benoit, have you asked the build system folks about the proper way of doing
> compiled tests?

Yes. Answer was "dunno, try TOOLS_DIRS" (with the S).

Benoit Jacob [:bjacob] (mostly away)

Comment 30

•

12 years ago

Filed bug 800612 to continue this conversation.

Try with TOOL_DIRS:
https://tbpl.mozilla.org/?tree=Try&rev=22f197df1767

Benoit Jacob [:bjacob] (mostly away)

Comment 31

•

12 years ago

FWIW there was one regression in WebGL (bug 800657) and one in Canvas2D (bug 800658). Nothing too terrible.

Phil Ringnalda (:philor)

Reporter

Comment 32

•

12 years ago

(In reply to Benoit Jacob [:bjacob] from comment #26)
> Anyway, does this fix the issue originally discussed here?

Nope. We seem to have skipped running Linux opt moth most of the time since your push, but the clearest example of unfixed so far is https://tbpl.mozilla.org/php/getParsedLog.php?id=16032121&tree=Mozilla-Inbound#error3

Ed Morley [:emorley]

Comment 33

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/a0297358abc1
https://hg.mozilla.org/mozilla-central/rev/022c39b4aae6

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla19

Benoit Jacob [:bjacob] (mostly away)

Comment 34

•

12 years ago

Should have put [leave open].

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Benoit Jacob [:bjacob] (mostly away)

Comment 35

•

12 years ago

Unassigning from myself based on comment 32.

Assignee: bjacob → nobody

Ed Morley [:emorley]

Comment 36

•

12 years ago

I'll unfortunately have to be hiding Linux32 m-oth soon, it's pretty much perma-orange & the failures cause the TBPL UI to hang (pending bug 795460 and a bunch of other TBPL cleanup I'm doing).

Benoit Jacob [:bjacob] (mostly away)

Comment 37

•

12 years ago

Un-categorizing from WebGL, as there is no reason anymore to think that that has anything to do with WebGL.

Component: Canvas: WebGL → Untriaged

Product: Core → Firefox

Target Milestone: mozilla19 → ---

Ed Morley [:emorley]

Updated

•

12 years ago

Component: Untriaged → General

Product: Firefox → Core

Karl Tomlinson (:karlt)

Comment 38

•

12 years ago

Is there a regression window?

Karl Tomlinson (:karlt)

Updated

•

12 years ago

Keywords: regression, regressionwindow-wanted

Ed Morley [:emorley]

Comment 39

•

12 years ago

(In reply to Karl Tomlinson (:karlt) from comment #38)
> Is there a regression window?

These are really the best we have:

http://brasstacks.mozilla.com/orangefactor/?display=Bug&entireHistory=true&tree=trunk&bugid=759157
http://brasstacks.mozilla.com/orangefactor/?display=Bug&entireHistory=true&tree=trunk&bugid=761049
http://brasstacks.mozilla.com/orangefactor/?display=Bug&entireHistory=true&tree=trunk&bugid=761488
http://brasstacks.mozilla.com/orangefactor/?display=Bug&entireHistory=true&tree=trunk&bugid=762893
http://brasstacks.mozilla.com/orangefactor/?display=Bug&entireHistory=true&tree=trunk&bugid=763285
http://brasstacks.mozilla.com/orangefactor/?display=Bug&entireHistory=true&tree=trunk&bugid=764002
http://brasstacks.mozilla.com/orangefactor/?display=Bug&entireHistory=true&tree=trunk&bugid=766826
http://brasstacks.mozilla.com/orangefactor/?display=Bug&entireHistory=true&tree=trunk&bugid=798372

Although bear in mind comment 0 (ie: some of the above spike when we disable tests prior, so I'd pick the earliest dates overall).

Phil Ringnalda (:philor)

Reporter

Comment 40

•

12 years ago

Yeah, depending on which regression you mean, the first or the second, "around May 28th" and "September 11th when Ionmonkey merged to mozilla-central" are about as close as you'll get (bug 759157 is probably the best leading indicator, and notice how we starred it a few times on the ionmonkey branch, before the multi-branch explosion on the 12th after the merge).

The window I'd most like to know about, but don't know if we can pinpoint even to a week much less a day, is what landed in mid-June which caused the first episode to fade away? Did we, say, stop optimizing something in tracemonkey because we were hitting a gcc optimization bug, but either that same thing or some other thing is still being optimized and broken in ionmonkey?

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Comment 41

•

12 years ago

> Did we, say, stop optimizing something in
> tracemonkey because we were hitting a gcc optimization bug

(Tracemonkey was removed in bug 698201 some time ago last year, so it is unlikely to be the cause)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 42

•

12 years ago

My suspicion right now is that we're just running out of memory. I don't know why that would happen, since the tests shouldn't keep around any long-lived objects. I'm investigating the memory usage on Linux right now.

Phil Ringnalda (:philor)

Reporter

Comment 43

•

12 years ago

I have trouble keeping my monkey species straight, what I actually meant was "in the js engine we had last June, the one which predated the one we got on September 11th."

Karl Tomlinson (:karlt)

Updated

•

12 years ago

Summary: Linux32 browser-chrome nearly completely broken → Linux32 opt±pgo browser-chrome nearly completely broken

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 44

•

12 years ago

Attached file list of running threads during browser-chrome tests — Details

I found something that might be causing the problem. I stopped the browser-chrome tests partway through and looked at what threads were running. There are tons of them. Way, way too many. The worst offender seems to be "mozStorage Thread", of which we have 190. There are also 20 "SteamTrans" threads and 55 "Analysis Helper" threads. When I started the test there were "only" 77 threads running, and when I stopped it (before the end) there were 331, so they seem to accumulate over time. I've attached a complete list.

I think this is a possible explanation for why we're running out of memory. Each thread appears to allocate 8MB of address space for the stack. With 331 threads, that's over 2.6 GB of address space just for stacks. When you add in everything else, it wouldn't surprise me that we're running out of address space. That could easily be causing all sorts of random failures since we don't deal with OOMs very well.

Hopefully some people on the CC list will be able to sort of what these threads are for. I'm pretty sure we're not even using the Analysis Helper ones right now. Also, I think that was added along with IonMonkey, so that would explain a spike in failures around that time. However, the mozStorage threads should probably be our biggest target.

Mike Hommey [:glandium]

Comment 45

•

12 years ago

(In reply to Bill McCloskey (:billm) from comment #44)
> I think this is a possible explanation for why we're running out of memory.
> Each thread appears to allocate 8MB of address space for the stack. With 331
> threads, that's over 2.6 GB of address space just for stacks.

Relatedly, we should probably set a lower limit for stack space for some if not all of these threads. We already do for some classes of threads, iirc.

Masatoshi Kimura [:emk]

Comment 46

•

12 years ago

Are too many threads observed only on the linux-opt? If so, why?

Mike Hommey [:glandium]

Comment 47

•

12 years ago

(In reply to Masatoshi Kimura [:emk] from comment #46)
> Are too many threads observed only on the linux-opt? If so, why?

My guess is that other platforms see the same number of threads, but it's not a problem for them: windows and mac, iirc, have a smaller stack space for new threads, and they probably don't exhaust the 32-bits address space. On Linux 64-bits, there's more address space, so it doesn't reach exhaustion either.

Ed Morley [:emorley]

Comment 48

•

12 years ago

Thank you Bill, good spot :-)

Masatoshi Kimura [:emk]

Comment 49

•

12 years ago

(In reply to Mike Hommey [:glandium] from comment #47)
> My guess is that other platforms see the same number of threads, but it's
> not a problem for them: windows and mac, iirc, have a smaller stack space
> for new threads, and they probably don't exhaust the 32-bits address space.
> On Linux 64-bits, there's more address space, so it doesn't reach exhaustion
> either.
Unlike linux-opt, does linux-debug (not linux64-debug) actually run on 64-bit machine?
https://tbpl.mozilla.org/?tree=Try&jobname=Rev3%20Fedora%2012%20try%20debug%20test%20mochitest-other

Nicholas Nethercote [inactive]

Comment 50

•

12 years ago

See bug 664341 for an example of where we reduced the size of stacks used for video threads -- there's an internal API for it, apparently.

Bill, this makes sense w.r.t. our earlier conversation, because about:memory doesn't measure thread stacks.  Perhaps it should... is there a way to iterate over all existing threads?

Karl Tomlinson (:karlt)

Comment 51

•

12 years ago

speculation: linux-debug may run slower and so some threads may complete and join before others start.

Component: General → Storage

Product: Core → Toolkit

Mike Hommey [:glandium]

Comment 52

•

12 years ago

(In reply to Nicholas Nethercote [:njn] from comment #50)
> See bug 664341 for an example of where we reduced the size of stacks used
> for video threads -- there's an internal API for it, apparently.
> 
> Bill, this makes sense w.r.t. our earlier conversation, because about:memory
> doesn't measure thread stacks.  Perhaps it should... is there a way to
> iterate over all existing threads?

The nsThreadManager keeps a list of all nsThreads. That probably won't cover threads initiated from ipc or webrtc code... which makes me think it may be worth adding a thin layer to those so that all thread activity we have do go through the thread manager.

Andrew McCreight [:mccr8]

Comment 53

•

12 years ago

Luke filed bug 801961 for disabling the unused IM compilation threads.

Depends on: 801961

Justin Lebar (not reading bugmail)

Comment 54

•

12 years ago

(In reply to Masatoshi Kimura [:emk] from comment #49)
> https://tbpl.mozilla.org/
> ?tree=Try&jobname=Rev3%20Fedora%2012%20try%20debug%20test%20mochitest-other

It doesn't matter -- a 32-bit binary can only touch 4GB of address space, because its pointers are only 32 bits long.  That's true whether the binary is running on a 32-bit OS or a 64-bit OS.

Mike Hommey [:glandium]

Comment 55

•

12 years ago

(In reply to Justin Lebar [:jlebar] from comment #54)
> (In reply to Masatoshi Kimura [:emk] from comment #49)
> > https://tbpl.mozilla.org/
> > ?tree=Try&jobname=Rev3%20Fedora%2012%20try%20debug%20test%20mochitest-other
> 
> It doesn't matter -- a 32-bit binary can only touch 4GB of address space,
> because its pointers are only 32 bits long.  That's true whether the binary
> is running on a 32-bit OS or a 64-bit OS.

The address space for 32-bit programs running on a 64-bit OS is wider than when running on a 32-bit OS.

Justin Lebar (not reading bugmail)

Comment 56

•

12 years ago

> The address space for 32-bit programs running on a 64-bit OS is wider than when running on a 32-bit 
> OS.

You mean, there's more address space available for use on 64-bit OS'es?  I hadn't thought about that, but that's of course true.  But on Linux, the difference is a few hundred megs, right?  (In contrast, on Windows, it's 1 or 2GB.)

Julian Seward [:jseward]

Comment 57

•

12 years ago

32-bit process on 32-bit linux generally (not always) has 3GB address
space available.  32 bit process on 64-bit linux generally has a full
4GB available.

Mike Hommey [:glandium]

Comment 58

•

12 years ago

(In reply to Julian Seward from comment #57)
> 32-bit process on 32-bit linux generally (not always) has 3GB address
> space available.  32 bit process on 64-bit linux generally has a full
> 4GB available.

Depends if PAE is enabled on the 32-bit linux kernel.

Boris Zbarsky [:bzbarsky]

Comment 59

•

12 years ago

"StreamTrans" would be the stream transport service's I/O threadpool.  I'm pretty sure we can reduce the default stack size for those threads to way below 8GB.  Furthermore, this threadpool should be capped to 4 threads by default.  It looks like initialization of the dom FileService raises that to 5 threads, and I don't see anything else that would change that limit offhand.  So how did we get to 20?  Bill, do you still have this in a debugger?  I wonder what the thread limit is set to there during a test run...  Though note that afaict the list of threads attached only shows one StreamTrans thread.

The "mozStorage" threads are of course used by the storage stuff.  It looks like mozStorage uses a thread per storage connection, not an overall threadpool.  So as long as storage connection objects are hanging out (which they could be because chrome stashed them in some data structure or because we haven't gced in a bit or whatever), their threads will be too.  This seems bad; I wonder why we don't use a threadpool here.

Boris Zbarsky [:bzbarsky]

Comment 60

•

12 years ago

One thing I don't get: where would the mozStorage threads get destroyed?  I don't see any Shutdown() calls on the nsThread in that code...

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 61

•

12 years ago

IIRC we depend on the Connection object being destroyed.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 62

•

12 years ago

I guess there's only one StreamTrans thread. It was named #20, but maybe that's because 19 others were created and destroyed before it.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 63

•

12 years ago

It looks like we have 5 JSRuntimes going, which means that there are 5 GC helper threads and 5 JS source compression threads. Also, each runtime starts (#CPUs)-1 threads for Ion compilation, and I have 12 cores, so that's where the 55 analysis threads come from.

Clearly we need a way to say whether a JS runtime is the "main" one or not.

Boris Zbarsky [:bzbarsky]

Comment 64

•

12 years ago

> IIRC we depend on the Connection object being destroyed.

Yes, but even when that happens, I don't see how it actually joins the thread!

> It was named #20, but maybe that's because 19 others were created and destroyed before it.

Yes, this is very likely.

> It looks like we have 5 JSRuntimes going

I'd guess workers?  Wonder why they're still alive, though.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 65

•

12 years ago

Here are the JSRuntimes:
* XPConnect (the main one)
* Three DOM worker threads seem to be around at all times
* There's a thread to process the proxy auto-config files

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 66

•

12 years ago

(In reply to Boris Zbarsky (:bz) from comment #64)
> > IIRC we depend on the Connection object being destroyed.
> 
> Yes, but even when that happens, I don't see how it actually joins the
> thread!

When the connection dies it will release its mAsyncExecutionThread. When that refcount finally goes to 0 it will automatically call Shutdown on the thread.

> > It looks like we have 5 JSRuntimes going
> 
> I'd guess workers?  Wonder why they're still alive, though.

There's at least one in chrome now: http://mxr.mozilla.org/mozilla-central/source/browser/components/thumbnails/PageThumbsWorker.js

Maybe others. Each one gets a ton of threads at the moment due to some JS engine fun (bug 801961, bug 714050, maybe others).

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 67

•

12 years ago

(In reply to Bill McCloskey (:billm) from comment #63)
> It looks like we have 5 JSRuntimes going, which means that there are 5 GC
> helper threads and 5 JS source compression threads. Also, each runtime
> starts (#CPUs)-1 threads for Ion compilation, and I have 12 cores, so that's
> where the 55 analysis threads come from.

Bill, any way you can twist someone's arm into fixing bug 801961 or bug 714050?

Ed Morley [:emorley]

Updated

•

12 years ago

Depends on: 714050

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 68

•

12 years ago

(In reply to ben turner [:bent] from comment #66)
> When the connection dies it will release its mAsyncExecutionThread. When
> that refcount finally goes to 0 it will automatically call Shutdown on the
> thread.

How does that happen? The nsThread destructor doesn't do anything.

> Bill, any way you can twist someone's arm into fixing bug 801961 or bug 714050?

I'll take a look at these today.

Boris Zbarsky [:bzbarsky]

Comment 69

•

12 years ago

> How does that happen? The nsThread destructor doesn't do anything.

Yes, that is _exactly_ my question.

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 70

•

12 years ago

(In reply to Boris Zbarsky (:bz) from comment #69)
> > How does that happen? The nsThread destructor doesn't do anything.
> 
> Yes, that is _exactly_ my question.

Um... It doesn't, and I guess it never did. That's crazy, I would have bet a large chunk of money that this behavior existed.

Looks like only thread pools auto-shutdown on destruction. Regular threads can't because a reference is kept in thread-local data and on the stack of the executing thread.

Boris Zbarsky [:bzbarsky]

Comment 71

•

12 years ago

So my current hypothesis is that mozStorageConnection never shuts down its thread, ever.  I would love someone to produce data to the contrary....

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 72

•

12 years ago

Yep, looks like the threads aren't shut down until nsThreadManager::Shutdown.  That's insane.

Boris Zbarsky [:bzbarsky]

Comment 73

•

12 years ago

I filed bug 802239.  Now all's we need is an owner....

Depends on: 802239

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 74

•

12 years ago

Looks like this regressed in bug 496019, see comment 71.

Mike Hommey [:glandium]

Comment 75

•

12 years ago

(In reply to ben turner [:bent] from comment #74)
> Looks like this regressed in bug 496019, see comment 71.

So I guess applying the interdiff between v4.4 and v4.5 of the patch from bug 496019 would solve it.

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 76

•

12 years ago

(In reply to Mike Hommey [:glandium] from comment #75)
> So I guess applying the interdiff between v4.4 and v4.5 of the patch from
> bug 496019 would solve it.

Hard to say... There have been lots of changes to that code in the last 2 years...

Masatoshi Kimura [:emk]

Comment 77

•

12 years ago

Can we reenable browser_tilt_* and browser_bug666317 on linux opt?

Phil Ringnalda (:philor)

Reporter

Comment 78

•

12 years ago

Yep, done.

Status: REOPENED → RESOLVED

Closed: 12 years ago → 12 years ago

Resolution: --- → FIXED

bhavana bajaj [:bajaj]

Comment 79

•

12 years ago

(In reply to Benoit Jacob [:bjacob] from comment #26)
> Inbound:
> http://hg.mozilla.org/integration/mozilla-inbound/rev/4476b1668b33

Can you please uplift this on aurora(considering the original issue in this bug is fixed and is still a problem on aurora) if its ready?

> 
> Now... here's the commit message:
> 
> > This fixes Valgrind errors while running TestWebGLElementArrayCache.
> > 
> > This also fixes the way we do content/canvas/test in the Makefiles, which had been broken by my attempt to enable compiled-code tests there (bug 732660).
> > 
> > Since I still don't know how to do compiled-code tests there, I've disabled them. At least we have the rest of content/canvas/test back.
> > 
> > Let's hope that no regression happened. 
> 
> I'm very sorry to have accidentally disabled content/canvas mochitests for 3
> weeks. Let's hope that no regression slipped in.
> 
> I'll also file a separate bug about correctly enabling the compiled-code
> tests.
> 
> Anyway, does this fix the issue originally discussed here?

Benoit Jacob [:bjacob] (mostly away)

Comment 80

•

12 years ago

(In reply to bhavana bajaj [:bajaj] from comment #79)
> (In reply to Benoit Jacob [:bjacob] from comment #26)
> > Inbound:
> > http://hg.mozilla.org/integration/mozilla-inbound/rev/4476b1668b33
> 
> Can you please uplift this on aurora(considering the original issue in this
> bug is fixed and is still a problem on aurora) if its ready?

For all we know, this valgrind uninitialized value was a red herring as far as the present bug is concerned. The fix is nice to have, but it's not related to the real bug discussed here.

Ed Morley [:emorley]

Comment 81

•

12 years ago

Un-tracking since the tests are now unhidden again on aurora; so whilst the patch attached in this bug wasn't backported, enough of the dependent bugs were that this is no longer an issue.

status-firefox18: affected → wontfix

tracking-firefox18: + → ---

Manuela Muntean [Away]

Updated

•

12 years ago

Whiteboard: [qa-]

Ryan VanderMeulen [:RyanVM]

Updated

•

9 years ago

Keywords: regressionwindow-wanted

fix uninitialized values in WebGL element array cache 12 years ago Benoit Jacob [:bjacob] (mostly away) 5.45 KB, patch	jgilbert : review+	Details \| Diff \| Splinter Review
list of running threads during browser-chrome tests 12 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 44.11 KB, text/plain		Details