Last Comment Bug 634711 - Firefox 4.0b11 Crash [@ avmplus::BitSet::grow(int) ]
: Firefox 4.0b11 Crash [@ avmplus::BitSet::grow(int) ]
Status: RESOLVED FIXED
fixed-in-nanojit
: crash
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: Trunk
: x86 Windows XP
: -- normal (vote)
: ---
Assigned To: Paul Biggar
:
: Jason Orendorff [:jorendorff]
Mentors:
: 552355 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-02-16 12:40 PST by Marcia Knous [:marcia - use ni]
Modified: 2011-08-11 17:00 PDT (History)
3 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Remove the second bitset implementation (9.22 KB, patch)
2011-05-16 10:57 PDT, Paul Biggar
n.nethercote: review+
Details | Diff | Splinter Review
patch to fix lifetime problems (2.63 KB, patch)
2011-05-23 22:43 PDT, Nicholas Nethercote [:njn]
paul.biggar: review+
Details | Diff | Splinter Review

Description Marcia Knous [:marcia - use ni] 2011-02-16 12:40:35 PST
Seen while reviewing crash stats. Low volume crash that is present in B11 as well as on trunk. http://tinyurl.com/4cy6nxa links to the crashes which are all Win XP. Only one comment, but it is in Spanish. 

First crash showed in up in crash stats using the 2010120300 build.



Frame 	Module 	Signature [Expand] 	Source
0 	mozjs.dll 	avmplus::BitSet::grow 	js/src/nanojit/avmplus.h:333
1 	mozjs.dll 	js::Oracle::Oracle 	js/src/jstracer.cpp:1286
2 	mozjs.dll 	js::InitJIT 	js/src/jstracer.cpp:7712
3 	mozjs.dll 	JSCompartment::init 	js/src/jscompartment.cpp:111
4 	mozjs.dll 	js::gc::NewCompartment 	js/src/jsgc.cpp:2862
5 	mozjs.dll 	JS_NewCompartmentAndGlobalObject 	js/src/jsapi.cpp:2979
6 	xul.dll 	CreateNewCompartment 	js/src/xpconnect/src/nsXPConnect.cpp:965
7 	xul.dll 	xpc_CreateGlobalObject 	js/src/xpconnect/src/nsXPConnect.cpp:1003
8 	xul.dll 	xpc_CreateSandboxObject 	js/src/xpconnect/src/xpccomponents.cpp:3219
9 	xul.dll 	nsXPCComponents_utils_Sandbox::CallOrConstruct 	js/src/xpconnect/src/xpccomponents.cpp:3415
10 	xul.dll 	nsXPCComponents_utils_Sandbox::Construct 	js/src/xpconnect/src/xpccomponents.cpp:3306
Comment 1 Andreas Gal :gal 2011-02-16 12:42:02 PST
This is fallout from moving the tracer into the compartment. We make a sandbox, which makes a compartment, which allocates a new tracer set, which OOMs, which we don't treat right. pbiggar wanted to look at this.
Comment 2 Marcia Knous [:marcia - use ni] 2011-04-26 09:39:41 PDT
A few more of these crashes showed up on the trunk today: http://tinyurl.com/64rjrup
Comment 3 Paul Biggar 2011-05-09 02:55:52 PDT
njn, what's the appropriate way to add OOM checks to nanojit? I can just make bitset functions return bool - would upstream accept that?
Comment 4 Nicholas Nethercote [:njn] 2011-05-09 17:54:53 PDT
pbiggar: you've stepped into a mess.  Various pertinent facts:

- In theory, Nanojit has infallible allocators.  This is implemented by pre-allocating some reserve space that we use if an OOM happens.  See the comment in jstracer.cpp above the definition of DataReserveSize.  Also see bug 624590 for more details and suggestions for a longjmp/setjmp alternative that wouldn't require reserve space.

- Also, all Nanojit allocation goes through its allocators, not via vanilla malloc/calloc/new.  See class Allocator in nanojit/Allocator.h, and also the sub-class VMAllocator in jstracer.h.

- So why does BitSet use calloc?  Well, nanojit/avmplus.h is actually not part of Nanojit per se.  It's a TM-specific file.  Tamarin has its own avmplus.h file which is in a different spot in the Tamarin tree.  (Actually, it has two different files called avmplus.h!  But that's irrelevant here.)  So we don't have to worry about upstreaming any changes to it.  I guess this explains how a call to calloc() crept in.

- Also, there are two BitSet implementations in TM;  the one in nanojit/avmplus.h and also the one in nanojit/Containers.cpp (which is a proper part of Nanojit).  They're almost the same, but the latter uses Nanojit's allocators instead of calloc.  Bug 552355 is open to remove the one in avmplus.h.

Confusing, eh?

So, the proper fix here is to get rid of the BitSet in avmplus.h which uses calloc and change TM to use the BitSet in nanojit/Containers.{h,cpp}.  Then the BitSet would use Nanojit's infallible allocators which have the reserve space as fallback.  That would fix this bug and bug 552355.  I'd love it if you could do this;  I don't think it would be hard.

The quick hacky fix would be to change BitSet in avmplus.h to return bool on operations such as grow() and set() that can fail, and check for failure in TM.  No need to upstream.
Comment 5 Paul Biggar 2011-05-13 04:27:45 PDT
Two questions: 

Back in bug 624878, why do parameters to js_new (now in rt->new_) use |const P1 &p1|? This is hitting me here.

Secondly, which vmallocator to use for the bitsets? An existing one (which) or a new one?
Comment 6 Nicholas Nethercote [:njn] 2011-05-13 06:43:06 PDT
(In reply to comment #5)
> Two questions: 
> 
> Back in bug 624878, why do parameters to js_new (now in rt->new_) use |const
> P1 &p1|? This is hitting me here.

I don't remember... how is it hitting you?

> Secondly, which vmallocator to use for the bitsets? An existing one (which)
> or a new one?

Use an existing one.  Their lifetimes are described in jscompartment.h, look for "There are 4 allocators here".
Comment 7 Paul Biggar 2011-05-15 09:30:19 PDT
(In reply to comment #6)
> (In reply to comment #5)
> > Two questions: 
> > 
> > Back in bug 624878, why do parameters to js_new (now in rt->new_) use |const
> > P1 &p1|? This is hitting me here.
> 
> I don't remember... how is it hitting you?

Consider:
   Oracle::Oracle(Allocator& a) { .. } // constructor
   cx->new_<Oracle> (allocator);


The const is added to allocator, and the compiler complains because allocator is not const (and should not be). (Actually, I'm surprised there aren't more subtle bugs where we think we're copying when we pass to a constructor, but actually pass by reference).


> Use an existing one.  Their lifetimes are described in jscompartment.h, look
> for "There are 4 allocators here".

What's the lifetime of the bitsets?
Comment 8 Nicholas Nethercote [:njn] 2011-05-15 16:49:47 PDT
(In reply to comment #7)
> 
> > Use an existing one.  Their lifetimes are described in jscompartment.h, look
> > for "There are 4 allocators here".
> 
> What's the lifetime of the bitsets?

IIRC they're in the Oracle, right?  The Oracle is in the TraceRecorder, alongside the TraceMonitor.  So I think you want 'dataAlloc'.
Comment 9 Paul Biggar 2011-05-15 19:45:08 PDT
That's what I thought, but it was causing problems, so I wanted to be clear on that. I'll have a closer look then.

Any thoughts on comment 7 part 1 (constness)?
Comment 10 Nicholas Nethercote [:njn] 2011-05-15 20:21:17 PDT
If you post what you have so far I can try to fix your compile error for you.
Comment 11 Paul Biggar 2011-05-16 10:57:28 PDT
Created attachment 532686 [details] [diff] [review]
Remove the second bitset implementation

Remove avmplus:bitset, and use nanojit::bitset instead. The semantics of the two are the same (the implementation is nearly identical). The only difference is the allocators.

It needed a small bit of consting because Oracle uses const properly.

I removed a call to Oracle::clear(). I'm not sure of the right thing to do here. It was segfaulting (valgrind said that dataAlloc->reset() was freeing the memory first). In truth, I dont understand why we flush() during initialization, or why we would be reset()ing allocators that just had things allocated out of them. I'll CC gal, he added this in bug 515852, if my archaeology is correct.
Comment 12 Paul Biggar 2011-05-16 10:58:11 PDT
(In reply to comment #10)
> If you post what you have so far I can try to fix your compile error for you.

Moved to bug 657384.
Comment 13 Paul Biggar 2011-05-20 02:59:02 PDT
*** Bug 552355 has been marked as a duplicate of this bug. ***
Comment 14 Paul Biggar 2011-05-23 09:11:16 PDT
gal, review ping?
Comment 15 Nicholas Nethercote [:njn] 2011-05-23 21:26:21 PDT
Comment on attachment 532686 [details] [diff] [review]
Remove the second bitset implementation

Review of attachment 532686 [details] [diff] [review]:
-----------------------------------------------------------------

Looks fine to me.
Comment 16 Nicholas Nethercote [:njn] 2011-05-23 21:28:20 PDT
(In reply to comment #4)
> 
> - So why does BitSet use calloc?  Well, nanojit/avmplus.h is actually not
> part of Nanojit per se.  It's a TM-specific file.  Tamarin has its own
> avmplus.h file which is in a different spot in the Tamarin tree.  (Actually,
> it has two different files called avmplus.h!  But that's irrelevant here.) 
> So we don't have to worry about upstreaming any changes to it.  I guess this
> explains how a call to calloc() crept in.

I'm wrong about that.  nanojit/avmplus.h is part of Nanojit.  However, Tamarin doesn't use it.  So even though it's part of Nanojit per se, it's TM-specific.  So amvplus.h and Containers.h changes have to land on Nanojit first.  I'll do that.
Comment 17 Nicholas Nethercote [:njn] 2011-05-23 22:42:38 PDT
I landed the nanojit part:

http://hg.mozilla.org/projects/nanojit-central/rev/8bf26319d7f1

I was about to merge it to TM and land the TM part, but Valgrind started complaining:

==16988==    at 0x82342CE: nanojit::BitSet::get(int) const (Containers.h:81)
==16988==    by 0x81F6AE9: js::Oracle::isStackSlotUndemotable(JSContext*, unsigned int, void const*) const (jstracer.cpp:1331)
==16988==    by 0x81F6B1C: js::Oracle::isStackSlotUndemotable(JSContext*, unsigned int) const (jstracer.cpp:1337)
==16988==    by 0x8239B99: js::CaptureTypesVisitor::visitStackSlots(js::Value*, int, js::StackFrame*) (jstracer.cpp:2066)
==16988==    by 0x82311C3: bool js::VisitFrameSlots<js::CaptureTypesVisitor>(js::CaptureTypesVisitor&, JSContext*, unsigned int, js::StackFrame*, js::StackFrame*) (jstracer.cpp:1838)
==16988==    by 0x822F2CE: bool js::VisitStackSlots<js::CaptureTypesVisitor>(js::CaptureTypesVisitor&, JSContext*, unsigned int) (jstracer.cpp:1849)
==16988==    by 0x822F7F3: void js::VisitSlots<js::CaptureTypesVisitor>(js::CaptureTypesVisitor&, JSContext*, JSObject*, unsigned int, unsigned int, unsigned short*) (jstracer.cpp:1879)
==16988==    by 0x822ED67: void js::VisitSlots<js::CaptureTypesVisitor>(js::CaptureTypesVisitor&, JSContext*, JSObject*, unsigned int, js::Queue<unsigned short> const&) (jstracer.cpp:1897)
==16988==    by 0x81F7CCF: js::TypeMap::captureTypes(JSContext*, JSObject*, js::Queue<unsigned short>&, unsigned int, bool) (jstracer.cpp:2111)
==16988==    by 0x81F7430: js::TreeFragment::initialize(JSContext*, js::Queue<unsigned short>*, bool) (jstracer.cpp:1633)
==16988==    by 0x8202745: js::RecordTree(JSContext*, js::TraceMonitor*, js::TreeFragment*, JSScript*, unsigned char*, unsigned int, js::Queue<unsigned short>*) (jstracer.cpp:5739)
==16988==    by 0x822C153: js::RecordTracePoint(JSContext*, js::TraceMonitor*, unsigned int&, bool*, bool) (jstracer.cpp:16806)
==16988==  Address 0x53b1530 is 264 bytes inside a block of size 2,008 free'd
==16988==    at 0x47E1B3A: free (vg_replace_malloc.c:366)
==16988==    by 0x8049D2E: js_free (jsutil.h:255)
==16988==    by 0x8058466: js::Foreground::free_(void*) (jsutil.h:497)
==16988==    by 0x81F3D84: nanojit::Allocator::freeChunk(void*) (jstracer.cpp:218)
==16988==    by 0x825185B: nanojit::Allocator::reset() (Allocator.cpp:62)
==16988==    by 0x81FB674: js::TraceMonitor::flush() (jstracer.cpp:2838)
==16988==    by 0x820A4F6: js::TraceMonitor::init(JSRuntime*) (jstracer.cpp:7719)
==16988==    by 0x80A6A62: JSCompartment::init() (jscompartment.cpp:134)
==16988==    by 0x80EBFF6: js::gc::NewCompartment(JSContext*, JSPrincipals*) (jsgc.cpp:2762)
==16988==    by 0x8071D72: JS_NewCompartmentAndGlobalObject (jsapi.cpp:3029)
==16988==    by 0x8057453: NewGlobalObject(JSContext*, CompartmentKind) (js.cpp:5810)
==16988==    by 0x8057757: Shell(JSContext*, int, char**, char**) (js.cpp:5862)

Lifetime problems!  The dataAlloc->reset() call in TraceMonitor::flush() frees all memory in dataAlloc, including the BitSets, but the Oracle doesn't realize this and keeps trying to access them.  You removed the oracle->clear() call, which used to zero all the BitSets.  Where that call was we now need code that reallocates and clears the BitSets.  Eg. add a BitSet::resetAndRealloc() function and call that for each BitSet within Oracle::clear().
Comment 18 Nicholas Nethercote [:njn] 2011-05-23 22:43:23 PDT
Created attachment 534682 [details] [diff] [review]
patch to fix lifetime problems

Patch on top of the previous patch.
Comment 19 Paul Biggar 2011-05-25 16:58:14 PDT
Comment on attachment 534682 [details] [diff] [review]
patch to fix lifetime problems

It's a bit weird that the Oracle immediately deallocates and reallocates itself, but that's not this bug.
Comment 20 Nicholas Nethercote [:njn] 2011-05-25 19:49:33 PDT
My follow-up fix:

http://hg.mozilla.org/projects/nanojit-central/rev/8eec2117b714
Comment 21 Nicholas Nethercote [:njn] 2011-05-25 19:49:50 PDT
(In reply to comment #19)
> 
> It's a bit weird that the Oracle immediately deallocates and reallocates
> itself, but that's not this bug.

Where does it do that?
Comment 22 Paul Biggar 2011-05-26 07:29:49 PDT
Oracle::Oracle() -> Oracle::clear() -> Oracle::clearDemotability() -> resetAndAlloc()

Also: TraceMonitor::init() -> TraceMonitor::flush() -> Oracle::clear() ...
Comment 23 Paul Biggar 2011-08-11 17:00:33 PDT
Am pretty sure we're done here.

Note You need to log in before you can comment on or make changes to this bug.