Closed Bug 798849 Opened 12 years ago Closed 12 years ago

Linux32 opt±pgo browser-chrome nearly completely broken

Categories

(Toolkit :: Storage, defect)

x86
Linux
defect
Not set
major

Tracking

()

RESOLVED FIXED
Tracking Status
firefox18 --- wontfix

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: regression, Whiteboard: [qa-])

Attachments

(2 files)

We had an episode of this last June/July, and we're very much in the middle of another episode of the same old thing.

The first stage is when all of the browser_tilt_ browser-chrome tests start hitting OOM crashes. Once we give in and disable all of them, the second stage is bug 761049, when browser_bug666317.js gets a NS_ERROR_NOT_AVAILABLE. Once we give in and disable it, the third stage is a diaspora: bug 766826 (which I expanded to a few more this time around) plus anything that thinks using nsIZipReader.open will go well for it, a rather large class including all the many extension tests, and some less obvious things like the js crash in bug 761488.

On inbound right now, five of the last six failed, which is pretty typical (and that's *after* the first two stages of pretending we can disable our way to victory). I'd try to stand it, because hey, it's just browser-chrome and just Linux being destroyed, but now we're into the part of the cycle where so many tests fail that tbpl hits several unresponsive script warnings trying to open the comment form, and puts the submit button off screen because it has so many suggested bugs.
Blocks: 761488
Blocks: 763285
Blocks: 762893
Blocks: 761049, 766826, 759157
Did we introduce some kind of horrible memory corruption that we only hit on browser-chrome tests? Perhaps we should try a Valgrind run of them on Linux32?
Blocks: 798372
Blocks: 764002
On Linux64, I managed to get this, with 
content/browser/browser/devtools/tilt/test/browser_tilt_02_notifications-seq.js
although to tell the truth, I am not convinced this isn't a Valgrind
false positive (it isn't so hot on 16-bit comparisons).  Anyway:

TEST-INFO | chrome://mochitests/content/browser/browser/devtools/tilt/test/browser_tilt_02_notifications-seq.js | Tilt was opened.
TEST-INFO | chrome://mochitests/content/browser/browser/devtools/tilt/test/browser_tilt_02_notifications-seq.js | Handling the INITIALIZING notification.

Conditional jump or move depends on uninitialised value(s)
   at 0x6EC9F98: bool mozilla::WebGLElementArrayCache::Validate<unsigned short>(unsigned short, unsigned long, unsigned long) (WebGLElementArrayCache.cpp:488)
   by 0x6ECA217: mozilla::WebGLElementArrayCache::Validate(unsigned int, unsigned int, unsigned long, unsigned long) (WebGLElementArrayCache.cpp:526)
   by 0x6EB47B4: mozilla::WebGLContext::DrawElements(unsigned int, int, unsigned int, long) (WebGLContext.h:1563)
   by 0x7855D65: mozilla::dom::WebGLRenderingContextBinding::drawElements(JSContext*, JS::Handle<JSObject*>, mozilla::WebGLContext*, unsigned int, JS::Value*) (WebGLRen\
deringContextBinding.cpp:4363)
   by 0x78461C9: mozilla::dom::WebGLRenderingContextBinding::genericMethod(JSContext*, unsigned int, JS::Value*) (WebGLRenderingContextBinding.cpp:8853)
   by 0x7E04510: js::InvokeKernel(JSContext*, JS::CallArgs, js::MaybeConstruct) (jscntxtinlines.h:364)
   by 0x7E04C4C: js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value*, JS::Value*) (jsinterp.h:109)
   by 0x7E3FE59: js::IndirectProxyHandler::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jsproxy.cpp:450)
   by 0x7EAAB47: js::DirectWrapper::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jswrapper.cpp:390)
   by 0x7EAC9F6: js::CrossCompartmentWrapper::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jswrapper.cpp:722)
   by 0x7E458D6: js::Proxy::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jsproxy.cpp:2462)
   by 0x7E4590A: proxy_Call(JSContext*, unsigned int, JS::Value*) (jsproxy.cpp:3002)
   by 0x7E0472B: js::InvokeKernel(JSContext*, JS::CallArgs, js::MaybeConstruct) (jscntxtinlines.h:364)
   by 0x7DFF221: js::Interpret(JSContext*, js::StackFrame*, js::InterpMode) (jsinterp.cpp:2461)
   by 0x7E0411E: js::RunScript(JSContext*, JS::Handle<JSScript*>, js::StackFrame*) (jsinterp.cpp:324)
   by 0x7E04678: js::InvokeKernel(JSContext*, JS::CallArgs, js::MaybeConstruct) (jsinterp.cpp:378)
   by 0x7E04C4C: js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value*, JS::Value*) (jsinterp.h:109)
   by 0x7E3FE59: js::IndirectProxyHandler::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jsproxy.cpp:450)
   by 0x7EAAB47: js::DirectWrapper::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jswrapper.cpp:390)
   by 0x7EAC9F6: js::CrossCompartmentWrapper::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jswrapper.cpp:722)

 Uninitialised value was created by a heap allocation
   at 0x4C2C62F: malloc (vg_replace_malloc.c:270)
   by 0x4032025: moz_xmalloc (mozalloc.cpp:57)
   by 0x6A1C1CF: nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity(unsigned int, unsigned int) (nsTArray.h:60)
   by 0x6A7B283: nsTArray_base<nsTArrayDefaultAllocator>::InsertSlotsAt(unsigned int, unsigned int, unsigned int, unsigned long) (nsTArray-inl.h:256)
   by 0x6ECA145: bool mozilla::WebGLElementArrayCache::Validate<unsigned short>(unsigned short, unsigned long, unsigned long) (nsTArray.h:1073)
   by 0x6ECA217: mozilla::WebGLElementArrayCache::Validate(unsigned int, unsigned int, unsigned long, unsigned long) (WebGLElementArrayCache.cpp:526)
   by 0x6EB47B4: mozilla::WebGLContext::DrawElements(unsigned int, int, unsigned int, long) (WebGLContext.h:1563)
   by 0x7855D65: mozilla::dom::WebGLRenderingContextBinding::drawElements(JSContext*, JS::Handle<JSObject*>, mozilla::WebGLContext*, unsigned int, JS::Value*) (WebGLRen\
deringContextBinding.cpp:4363)
   by 0x78461C9: mozilla::dom::WebGLRenderingContextBinding::genericMethod(JSContext*, unsigned int, JS::Value*) (WebGLRenderingContextBinding.cpp:8853)
   by 0x7E04510: js::InvokeKernel(JSContext*, JS::CallArgs, js::MaybeConstruct) (jscntxtinlines.h:364)
   by 0x7E04C4C: js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value*, JS::Value*) (jsinterp.h:109)
   by 0x7E3FE59: js::IndirectProxyHandler::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jsproxy.cpp:450)
   by 0x7EAAB47: js::DirectWrapper::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jswrapper.cpp:390)
   by 0x7EAC9F6: js::CrossCompartmentWrapper::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jswrapper.cpp:722)
   by 0x7E458D6: js::Proxy::call(JSContext*, JSObject*, unsigned int, JS::Value*) (jsproxy.cpp:2462)
   by 0x7E4590A: proxy_Call(JSContext*, unsigned int, JS::Value*) (jsproxy.cpp:3002)
   by 0x7E0472B: js::InvokeKernel(JSContext*, JS::CallArgs, js::MaybeConstruct) (jscntxtinlines.h:364)
   by 0x7DFF221: js::Interpret(JSContext*, js::StackFrame*, js::InterpMode) (jsinterp.cpp:2461)
   by 0x7E0411E: js::RunScript(JSContext*, JS::Handle<JSScript*>, js::StackFrame*) (jsinterp.cpp:324)
   by 0x7E04678: js::InvokeKernel(JSContext*, JS::CallArgs, js::MaybeConstruct) (jsinterp.cpp:378)

TEST-INFO | chrome://mochitests/content/browser/browser/devtools/tilt/test/browser_tilt_02_notifications-seq.js | Longer timeout required, waiting longer...  Remaining timeouts: \
9
Since the uninitialised value was created by a heap allocation near WebGL, cc'ing Jeff Gilbert.
This is plausible. This is the new element array cache tree from bjacob, cc'ing.
Component: General → Canvas: WebGL
Many thanks for the report. Thankfully, the Valgrind error reported by Julian reproduces on the compiled-code unit test that we added for this feature (in content/canvas/test/compiled) (Yes, we knew that this would be bug prone, so we added compiled code tests just for it).

Here's what I get in a debug build (Julian's is visibly optimized):

==27763== Conditional jump or move depends on uninitialised value(s)
==27763==    at 0x405752: unsigned char const& NS_MAX<unsigned char>(unsigned char const&, unsigned char const&) (nsAlgorithm.h:47)
==27763==    by 0x4046D1: mozilla::WebGLElementArrayCacheTree<unsigned char>::Update() (WebGLElementArrayCache.cpp:426)
==27763==    by 0x402F87: bool mozilla::WebGLElementArrayCache::Validate<unsigned char>(unsigned char, unsigned long, unsigned long) (WebGLElementArrayCache.cpp:484)
==27763==    by 0x402700: mozilla::WebGLElementArrayCache::Validate(unsigned int, unsigned int, unsigned long, unsigned long) (WebGLElementArrayCache.cpp:524)
==27763==    by 0x403450: void CheckValidateOneType<unsigned char>(mozilla::WebGLElementArrayCache&, unsigned long, unsigned long) (TestWebGLElementArrayCache.cpp:72)
==27763==    by 0x4028D8: CheckValidate(mozilla::WebGLElementArrayCache&, unsigned long, unsigned long) (TestWebGLElementArrayCache.cpp:83)
==27763==    by 0x4029E2: main (TestWebGLElementArrayCache.cpp:139)
==27763==  Uninitialised value was created by a heap allocation
==27763==    at 0x402C787: realloc (vg_replace_malloc.c:663)
==27763==    by 0x406215A: moz_xrealloc (mozalloc.cpp:89)
==27763==    by 0x402422: nsTArrayInfallibleAllocator::Realloc(void*, unsigned long) (nsTArray.h:60)
==27763==    by 0x406041: nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity(unsigned int, unsigned int) (nsTArray-inl.h:167)
==27763==    by 0x405A6A: nsTArray_base<nsTArrayDefaultAllocator>::InsertSlotsAt(unsigned int, unsigned int, unsigned int, unsigned long) (nsTArray-inl.h:256)
==27763==    by 0x405132: nsTArray<unsigned char, nsTArrayDefaultAllocator>::InsertElementsAt(unsigned int, unsigned int) (nsTArray.h:1062)
==27763==    by 0x403360: nsTArray<unsigned char, nsTArrayDefaultAllocator>::SetLength(unsigned int) (nsTArray.h:1022)
==27763==    by 0x402CB7: mozilla::WebGLElementArrayCacheTree<unsigned char>::ResizeToParentSize() (WebGLElementArrayCache.cpp:293)
==27763==    by 0x40254A: mozilla::WebGLElementArrayCache::BufferData(void const*, unsigned long) (WebGLElementArrayCache.cpp:443)
==27763==    by 0x4029CA: main (TestWebGLElementArrayCache.cpp:138)
Benoit, you might want to run Valgrind with --track-origins=yes.
Comment 5 did that, look, it does give the origin of the uninitialized value.
(In reply to Benoit Jacob [:bjacob] from comment #5)
> Here's what I get in a debug build (Julian's is visibly optimized):

> ==27763==  Uninitialised value was created by a heap allocation

Oh yes, you're right, sorry I misread.
So I'll try to debug this tomorrow --- but don't set your hopes too high that this will fix the browser-chrome issues. So far the WebGL bug is just a conditional-jump-based-on-uninitialized-value bug. There is still no theory about how it could result in the browser-chrome issues i.e. it may still be a red herring here.
So there was a real bug there, although the fact that the very extensive unit test passes suggests that perhaps it didn't actually manifest itself for some reason. Not sure.

The bug was that when we allocate the storage for the element-array-cache tree, we did not initialize it. We relied on the tree Update() method to initialize just the parts that are needed given current invalidation. The problem with that approach was that in Update() it was not possible (AFAICS now) to distinguish between an existing tree entry that didn't need updating, and an uninitialized tree entry.

This patch fixes it by memset()-ing the tree data everytime it's resized (bufferData is expensive anyways). It also causes the tree to be completely invalidated on resize.

Runs valgrind-clean here.

Please ignore the part of this patch that changes TOOLS_DIRS to TEST_DIRS. Needed this to get the tests to compile locally, but then again this won't compile on Android. Will figure this out in a separate bug/patch.
Attachment #670013 - Flags: review?(jgilbert)
Attachment #670013 - Flags: review?(jgilbert) → review+
Pushed to try. Expecting same Android build error as last time I had tried TEST_DIRS for compiled code tests there, but wanted to check if the issue was still present.
https://tbpl.mozilla.org/?tree=Try&rev=8d05762ff23e
Failed in the expected way; new try that is expected to build:

https://tbpl.mozilla.org/?tree=Try&rev=ae56c5539ec6
Assignee: nobody → bjacob
Didn't expect RAND_MAX to be so low on MSVC? :)
Oh hey, that is really funny. This is a static assertion in the compiled unit test. That means that on MSVC RAND_MAX is < 1e24. OK, will adapt to that dire reality.
How #$%^&, http://msdn.microsoft.com/en-us/library/2dfe3bzd.aspx says that RAND_MAX is 0x7fff. That is the minimum value allowed by the spec. OK, new try push coming.
>+  enum { bitsToIgnore = 7 };
>   MOZ_STATIC_ASSERT((unsigned int)(RAND_MAX) >> (8 + bitsToIgnore),
0x7fff >> 15 would be zero. This assertion is broken.
Is this assertion needed in the first place? RAND_MAX is guaranteed to be greater than or equal to 0x7fff by the spec.
(In reply to Masatoshi Kimura [:emk] from comment #19)
> >+  enum { bitsToIgnore = 7 };
> >   MOZ_STATIC_ASSERT((unsigned int)(RAND_MAX) >> (8 + bitsToIgnore),
> 0x7fff >> 15 would be zero. This assertion is broken.

You're right, I wanted MOZ_STATIC_ASSERT((RAND_MAX >> bitsToIgnore) >= 255),

> Is this assertion needed in the first place? RAND_MAX is guaranteed to be
> greater than or equal to 0x7fff by the spec.

http://msdn.microsoft.com/en-us/library/2dfe3bzd.aspx says that in MSVC, it is 0x7fff. We have to build on MSVC.
(In reply to Benoit Jacob [:bjacob] from comment #20)
> http://msdn.microsoft.com/en-us/library/2dfe3bzd.aspx says that in MSVC, it
> is 0x7fff. We have to build on MSVC.
Yeah, I said greater than *or equal to* 0x7fff, and it is equal to 0x7fff on MSVC.
>     a[i] = static_cast<uint8_t>((unsigned int)(rand()) >> bitsToIgnore);
Even if RAND_MAX is 0x7fff, a[i] would be uniformly distributed between 0 to 255.
OK, right -- sorry I misread.
Anyway, the assertion is useless now that we don't need anything more than what the spec requires. Removed it. New try:
https://tbpl.mozilla.org/?tree=Try&rev=5e2e6cfe338b
Got the same issue about xpcom.lib not being found, that originally prompted the use of TOOLS_DIRS instead of TEST_DIRS --- which doesn't work anymore. Let's not block on this, instead, let's temporarily do content/canvas/test/compiled only on Linux where it works, and we'll then solve this problem on other platforms in a separate bug.
Failed again, but this time the log doesn't seem to give a clue why. Now I have simply disabled content/canvas/test/compiled by omitting it in the Makefile. Let's see if that builds:
https://tbpl.mozilla.org/?tree=Try&rev=c4716762f837
Inbound:
http://hg.mozilla.org/integration/mozilla-inbound/rev/4476b1668b33

Now... here's the commit message:

> This fixes Valgrind errors while running TestWebGLElementArrayCache.
> 
> This also fixes the way we do content/canvas/test in the Makefiles, which had been broken by my attempt to enable compiled-code tests there (bug 732660).
> 
> Since I still don't know how to do compiled-code tests there, I've disabled them. At least we have the rest of content/canvas/test back.
> 
> Let's hope that no regression happened. 

I'm very sorry to have accidentally disabled content/canvas mochitests for 3 weeks. Let's hope that no regression slipped in.

I'll also file a separate bug about correctly enabling the compiled-code tests.

Anyway, does this fix the issue originally discussed here?
Benoit, have you asked the build system folks about the proper way of doing compiled tests?
(In reply to Masatoshi Kimura [:emk] from comment #28)
> Is TOOLS_DIRS a typo of TOOL_DIRS?
> https://mxr.mozilla.org/mozilla-central/search?string=TOOLS_DIRS
> https://mxr.mozilla.org/mozilla-central/search?string=TOOL_DIRS

Ouch. Seems like it!

(In reply to Ehsan Akhgari [:ehsan] from comment #27)
> Benoit, have you asked the build system folks about the proper way of doing
> compiled tests?

Yes. Answer was "dunno, try TOOLS_DIRS" (with the S).
Filed bug 800612 to continue this conversation.

Try with TOOL_DIRS:
https://tbpl.mozilla.org/?tree=Try&rev=22f197df1767
FWIW there was one regression in WebGL (bug 800657) and one in Canvas2D (bug 800658). Nothing too terrible.
(In reply to Benoit Jacob [:bjacob] from comment #26)
> Anyway, does this fix the issue originally discussed here?

Nope. We seem to have skipped running Linux opt moth most of the time since your push, but the clearest example of unfixed so far is https://tbpl.mozilla.org/php/getParsedLog.php?id=16032121&tree=Mozilla-Inbound#error3
https://hg.mozilla.org/mozilla-central/rev/a0297358abc1
https://hg.mozilla.org/mozilla-central/rev/022c39b4aae6
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
Should have put [leave open].
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Unassigning from myself based on comment 32.
Assignee: bjacob → nobody
I'll unfortunately have to be hiding Linux32 m-oth soon, it's pretty much perma-orange & the failures cause the TBPL UI to hang (pending bug 795460 and a bunch of other TBPL cleanup I'm doing).
Un-categorizing from WebGL, as there is no reason anymore to think that that has anything to do with WebGL.
Component: Canvas: WebGL → Untriaged
Product: Core → Firefox
Target Milestone: mozilla19 → ---
Component: Untriaged → General
Product: Firefox → Core
Is there a regression window?
Yeah, depending on which regression you mean, the first or the second, "around May 28th" and "September 11th when Ionmonkey merged to mozilla-central" are about as close as you'll get (bug 759157 is probably the best leading indicator, and notice how we starred it a few times on the ionmonkey branch, before the multi-branch explosion on the 12th after the merge).

The window I'd most like to know about, but don't know if we can pinpoint even to a week much less a day, is what landed in mid-June which caused the first episode to fade away? Did we, say, stop optimizing something in tracemonkey because we were hitting a gcc optimization bug, but either that same thing or some other thing is still being optimized and broken in ionmonkey?
> Did we, say, stop optimizing something in
> tracemonkey because we were hitting a gcc optimization bug

(Tracemonkey was removed in bug 698201 some time ago last year, so it is unlikely to be the cause)
My suspicion right now is that we're just running out of memory. I don't know why that would happen, since the tests shouldn't keep around any long-lived objects. I'm investigating the memory usage on Linux right now.
I have trouble keeping my monkey species straight, what I actually meant was "in the js engine we had last June, the one which predated the one we got on September 11th."
Summary: Linux32 browser-chrome nearly completely broken → Linux32 opt±pgo browser-chrome nearly completely broken
I found something that might be causing the problem. I stopped the browser-chrome tests partway through and looked at what threads were running. There are tons of them. Way, way too many. The worst offender seems to be "mozStorage Thread", of which we have 190. There are also 20 "SteamTrans" threads and 55 "Analysis Helper" threads. When I started the test there were "only" 77 threads running, and when I stopped it (before the end) there were 331, so they seem to accumulate over time. I've attached a complete list.

I think this is a possible explanation for why we're running out of memory. Each thread appears to allocate 8MB of address space for the stack. With 331 threads, that's over 2.6 GB of address space just for stacks. When you add in everything else, it wouldn't surprise me that we're running out of address space. That could easily be causing all sorts of random failures since we don't deal with OOMs very well.

Hopefully some people on the CC list will be able to sort of what these threads are for. I'm pretty sure we're not even using the Analysis Helper ones right now. Also, I think that was added along with IonMonkey, so that would explain a spike in failures around that time. However, the mozStorage threads should probably be our biggest target.
(In reply to Bill McCloskey (:billm) from comment #44)
> I think this is a possible explanation for why we're running out of memory.
> Each thread appears to allocate 8MB of address space for the stack. With 331
> threads, that's over 2.6 GB of address space just for stacks.

Relatedly, we should probably set a lower limit for stack space for some if not all of these threads. We already do for some classes of threads, iirc.
Are too many threads observed only on the linux-opt? If so, why?
(In reply to Masatoshi Kimura [:emk] from comment #46)
> Are too many threads observed only on the linux-opt? If so, why?

My guess is that other platforms see the same number of threads, but it's not a problem for them: windows and mac, iirc, have a smaller stack space for new threads, and they probably don't exhaust the 32-bits address space. On Linux 64-bits, there's more address space, so it doesn't reach exhaustion either.
Thank you Bill, good spot :-)
(In reply to Mike Hommey [:glandium] from comment #47)
> My guess is that other platforms see the same number of threads, but it's
> not a problem for them: windows and mac, iirc, have a smaller stack space
> for new threads, and they probably don't exhaust the 32-bits address space.
> On Linux 64-bits, there's more address space, so it doesn't reach exhaustion
> either.
Unlike linux-opt, does linux-debug (not linux64-debug) actually run on 64-bit machine?
https://tbpl.mozilla.org/?tree=Try&jobname=Rev3%20Fedora%2012%20try%20debug%20test%20mochitest-other
See bug 664341 for an example of where we reduced the size of stacks used for video threads -- there's an internal API for it, apparently.

Bill, this makes sense w.r.t. our earlier conversation, because about:memory doesn't measure thread stacks.  Perhaps it should... is there a way to iterate over all existing threads?
speculation: linux-debug may run slower and so some threads may complete and join before others start.
Component: General → Storage
Product: Core → Toolkit
(In reply to Nicholas Nethercote [:njn] from comment #50)
> See bug 664341 for an example of where we reduced the size of stacks used
> for video threads -- there's an internal API for it, apparently.
> 
> Bill, this makes sense w.r.t. our earlier conversation, because about:memory
> doesn't measure thread stacks.  Perhaps it should... is there a way to
> iterate over all existing threads?

The nsThreadManager keeps a list of all nsThreads. That probably won't cover threads initiated from ipc or webrtc code... which makes me think it may be worth adding a thin layer to those so that all thread activity we have do go through the thread manager.
Luke filed bug 801961 for disabling the unused IM compilation threads.
Depends on: 801961
(In reply to Masatoshi Kimura [:emk] from comment #49)
> https://tbpl.mozilla.org/
> ?tree=Try&jobname=Rev3%20Fedora%2012%20try%20debug%20test%20mochitest-other

It doesn't matter -- a 32-bit binary can only touch 4GB of address space, because its pointers are only 32 bits long.  That's true whether the binary is running on a 32-bit OS or a 64-bit OS.
(In reply to Justin Lebar [:jlebar] from comment #54)
> (In reply to Masatoshi Kimura [:emk] from comment #49)
> > https://tbpl.mozilla.org/
> > ?tree=Try&jobname=Rev3%20Fedora%2012%20try%20debug%20test%20mochitest-other
> 
> It doesn't matter -- a 32-bit binary can only touch 4GB of address space,
> because its pointers are only 32 bits long.  That's true whether the binary
> is running on a 32-bit OS or a 64-bit OS.

The address space for 32-bit programs running on a 64-bit OS is wider than when running on a 32-bit OS.
> The address space for 32-bit programs running on a 64-bit OS is wider than when running on a 32-bit 
> OS.

You mean, there's more address space available for use on 64-bit OS'es?  I hadn't thought about that, but that's of course true.  But on Linux, the difference is a few hundred megs, right?  (In contrast, on Windows, it's 1 or 2GB.)
32-bit process on 32-bit linux generally (not always) has 3GB address
space available.  32 bit process on 64-bit linux generally has a full
4GB available.
(In reply to Julian Seward from comment #57)
> 32-bit process on 32-bit linux generally (not always) has 3GB address
> space available.  32 bit process on 64-bit linux generally has a full
> 4GB available.

Depends if PAE is enabled on the 32-bit linux kernel.
"StreamTrans" would be the stream transport service's I/O threadpool.  I'm pretty sure we can reduce the default stack size for those threads to way below 8GB.  Furthermore, this threadpool should be capped to 4 threads by default.  It looks like initialization of the dom FileService raises that to 5 threads, and I don't see anything else that would change that limit offhand.  So how did we get to 20?  Bill, do you still have this in a debugger?  I wonder what the thread limit is set to there during a test run...  Though note that afaict the list of threads attached only shows one StreamTrans thread.

The "mozStorage" threads are of course used by the storage stuff.  It looks like mozStorage uses a thread per storage connection, not an overall threadpool.  So as long as storage connection objects are hanging out (which they could be because chrome stashed them in some data structure or because we haven't gced in a bit or whatever), their threads will be too.  This seems bad; I wonder why we don't use a threadpool here.
One thing I don't get: where would the mozStorage threads get destroyed?  I don't see any Shutdown() calls on the nsThread in that code...
IIRC we depend on the Connection object being destroyed.
I guess there's only one StreamTrans thread. It was named #20, but maybe that's because 19 others were created and destroyed before it.
It looks like we have 5 JSRuntimes going, which means that there are 5 GC helper threads and 5 JS source compression threads. Also, each runtime starts (#CPUs)-1 threads for Ion compilation, and I have 12 cores, so that's where the 55 analysis threads come from.

Clearly we need a way to say whether a JS runtime is the "main" one or not.
> IIRC we depend on the Connection object being destroyed.

Yes, but even when that happens, I don't see how it actually joins the thread!

> It was named #20, but maybe that's because 19 others were created and destroyed before it.

Yes, this is very likely.

> It looks like we have 5 JSRuntimes going

I'd guess workers?  Wonder why they're still alive, though.
Here are the JSRuntimes:
* XPConnect (the main one)
* Three DOM worker threads seem to be around at all times
* There's a thread to process the proxy auto-config files
(In reply to Boris Zbarsky (:bz) from comment #64)
> > IIRC we depend on the Connection object being destroyed.
> 
> Yes, but even when that happens, I don't see how it actually joins the
> thread!

When the connection dies it will release its mAsyncExecutionThread. When that refcount finally goes to 0 it will automatically call Shutdown on the thread.

> > It looks like we have 5 JSRuntimes going
> 
> I'd guess workers?  Wonder why they're still alive, though.

There's at least one in chrome now: http://mxr.mozilla.org/mozilla-central/source/browser/components/thumbnails/PageThumbsWorker.js

Maybe others. Each one gets a ton of threads at the moment due to some JS engine fun (bug 801961, bug 714050, maybe others).
(In reply to Bill McCloskey (:billm) from comment #63)
> It looks like we have 5 JSRuntimes going, which means that there are 5 GC
> helper threads and 5 JS source compression threads. Also, each runtime
> starts (#CPUs)-1 threads for Ion compilation, and I have 12 cores, so that's
> where the 55 analysis threads come from.

Bill, any way you can twist someone's arm into fixing bug 801961 or bug 714050?
Depends on: 714050
(In reply to ben turner [:bent] from comment #66)
> When the connection dies it will release its mAsyncExecutionThread. When
> that refcount finally goes to 0 it will automatically call Shutdown on the
> thread.

How does that happen? The nsThread destructor doesn't do anything.

> Bill, any way you can twist someone's arm into fixing bug 801961 or bug 714050?

I'll take a look at these today.
> How does that happen? The nsThread destructor doesn't do anything.

Yes, that is _exactly_ my question.
(In reply to Boris Zbarsky (:bz) from comment #69)
> > How does that happen? The nsThread destructor doesn't do anything.
> 
> Yes, that is _exactly_ my question.

Um... It doesn't, and I guess it never did. That's crazy, I would have bet a large chunk of money that this behavior existed.

Looks like only thread pools auto-shutdown on destruction. Regular threads can't because a reference is kept in thread-local data and on the stack of the executing thread.
So my current hypothesis is that mozStorageConnection never shuts down its thread, ever.  I would love someone to produce data to the contrary....
Yep, looks like the threads aren't shut down until nsThreadManager::Shutdown.  That's insane.
I filed bug 802239.  Now all's we need is an owner....
Depends on: 802239
(In reply to ben turner [:bent] from comment #74)
> Looks like this regressed in bug 496019, see comment 71.

So I guess applying the interdiff between v4.4 and v4.5 of the patch from bug 496019 would solve it.
(In reply to Mike Hommey [:glandium] from comment #75)
> So I guess applying the interdiff between v4.4 and v4.5 of the patch from
> bug 496019 would solve it.

Hard to say... There have been lots of changes to that code in the last 2 years...
Can we reenable browser_tilt_* and browser_bug666317 on linux opt?
Yep, done.
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → FIXED
(In reply to Benoit Jacob [:bjacob] from comment #26)
> Inbound:
> http://hg.mozilla.org/integration/mozilla-inbound/rev/4476b1668b33

Can you please uplift this on aurora(considering the original issue in this bug is fixed and is still a problem on aurora) if its ready?

> 
> Now... here's the commit message:
> 
> > This fixes Valgrind errors while running TestWebGLElementArrayCache.
> > 
> > This also fixes the way we do content/canvas/test in the Makefiles, which had been broken by my attempt to enable compiled-code tests there (bug 732660).
> > 
> > Since I still don't know how to do compiled-code tests there, I've disabled them. At least we have the rest of content/canvas/test back.
> > 
> > Let's hope that no regression happened. 
> 
> I'm very sorry to have accidentally disabled content/canvas mochitests for 3
> weeks. Let's hope that no regression slipped in.
> 
> I'll also file a separate bug about correctly enabling the compiled-code
> tests.
> 
> Anyway, does this fix the issue originally discussed here?
(In reply to bhavana bajaj [:bajaj] from comment #79)
> (In reply to Benoit Jacob [:bjacob] from comment #26)
> > Inbound:
> > http://hg.mozilla.org/integration/mozilla-inbound/rev/4476b1668b33
> 
> Can you please uplift this on aurora(considering the original issue in this
> bug is fixed and is still a problem on aurora) if its ready?

For all we know, this valgrind uninitialized value was a red herring as far as the present bug is concerned. The fix is nice to have, but it's not related to the real bug discussed here.
Un-tracking since the tests are now unhidden again on aurora; so whilst the patch attached in this bug wasn't backported, enough of the dependent bugs were that this is no longer an issue.
Whiteboard: [qa-]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: