Closed
Bug 268535
Opened 21 years ago
Closed 21 years ago
crashes where rt->gcFreeList turns into a small int after last-ditch GCs
Categories
(Core :: JavaScript Engine, defect, P1)
Tracking
()
RESOLVED
FIXED
mozilla1.8alpha5
People
(Reporter: jband_mozilla, Assigned: brendan)
References
Details
(Keywords: fixed-aviary1.0.1, fixed1.7.6, js1.5)
Attachments
(16 files, 1 obsolete file)
6.18 KB,
text/plain
|
Details | |
15.94 KB,
text/plain
|
Details | |
16.44 KB,
text/plain
|
Details | |
10.11 KB,
text/plain
|
Details | |
12.29 KB,
text/plain
|
Details | |
51.02 KB,
text/plain
|
Details | |
4.89 KB,
text/plain
|
Details | |
173.29 KB,
text/plain
|
Details | |
1.25 KB,
text/plain
|
Details | |
5.12 KB,
text/plain
|
Details | |
2.99 KB,
text/plain
|
Details | |
1.85 KB,
text/plain
|
Details | |
844 bytes,
patch
|
Details | Diff | Splinter Review | |
2.33 KB,
text/plain
|
Details | |
7.44 KB,
text/plain
|
Details | |
1.80 KB,
patch
|
brendan
:
review+
asa
:
approval-aviary1.0.1+
mkaply
:
approval1.7.6+
|
Details | Diff | Splinter Review |
I created a multithreaded test app in an attempt reproduce atom related crashes
we've been seeing. It didn't expose crashes with the exact same stacks. But, it
does show some interesting problems.
I'll attach the source. I don't have a generic Makefile for it. But, it is only
one .c file that builds as a standalone executable that uses only spidermonkey
and nspr. It creates one JSRuntime and a bunch of threads that each create a
JSContext and loop through the evaluation of scripts each containing a bunch of
randomly generated identifiers (all within JS requests).
The scripts currently look like...
function FOO {var FOO_1; var FOO_2};
function BAR {var BAR_1; var BAR_2};
FOO();
BAR();
Of course, it creates a ton of those lines. And the identifier lengths are also
randomized.
I believe that my test code represents a valid use of the api and that all
scripts are themselves valid.
I've seen this test fail in a few ways:
1) Crash in js_AllocGCThing (jsgc.c:479) where rt->gcFreeList is equal to 1 (but
should be a valid pointer or NULL). This *seems* to happen when one or more
other threads are doing last ditch GCs. I've seen this a number of times.
2) A "ReferenceError" reported by the error reporter (e.g. "ReferenceError: FOO
is not defined"). The identifier in question in these cases in shown as text by
the error reporter. It is always actually a valid identifier that is in the
script (the test code guarantees that - and I've validated the script contents
in some of these cases to be sure). This error indicates that the interpreter
failed to find an atom at runtime that it absolutely should have found. I've
seen this a bunch of times. This *might* indicate a corrupted atom.
3) A crash in "QuoteString" (jsopcode.c:402) called by
"js_AtomToPrintableString" in the case where the interpreter is trying to build
the string to report a ReferenceError (as above). This looks like a case where
the atom that the interpreter is trying to lookup is not only not found as a
property, but that the atom used for the search actually represents a bogus
JSString while the interpreter is still trying to use it. I've seen this once so
far.
There are a bunch of #defines at the top for thread counts, GC frequency,
identifier length, etc. I've been twiddling this stuff a lot and seeing
different effects. This test is just a quick hack at this point.
I'm hopeful that this will help in figuring out problems we've been seeing. And,
that this might be a starting point for a multithreaded smoketest (at least).
Reporter | ||
Comment 1•21 years ago
|
||
Assignee | ||
Comment 2•21 years ago
|
||
jband: thanks for the bug! Can you attach any stacks and dumps of relevant
variables that you might have handy? Thanks again.
/be
Assignee | ||
Comment 3•21 years ago
|
||
The ReferenceError=>QuoteString path is odd. If only last-ditch GC's are nesting
within (one request) and interleaving with request, then there should be no atom
GC'ing -- the last-ditch call to js_GC from js_AllocGCThing passes
GC_KEEP_ATOMS. If a full GC is racing, then something must be running outside of
a request, or the request/GC interlock is broken somehow.
It seems much more likely that there's a hole in the model, a race or other hard
to hit window of vulnerability.
/be
This looks like one of the crashes that we keep hitting here and for which I
was unable to provide a nice testcase.
Attachment #165339 -
Attachment mime type: application/octet-stream → text/plain
Reporter | ||
Comment 8•21 years ago
|
||
Brendan,
Thanks for looking into this. I didn't bother to capture stacks before. They are
all shallow in this test with (I think) no big surprises. The js_AllocGCThing
stack I've seen is very much like the one timeless posted. I'll capture anything
interesting going forward.
The js_AllocGCThing crash is clearly correleated with last-ditch GCs. I've been
trying to trigger one of the other problems *without* a last-ditch GC - as a
half-assed way to see if they are dependent on the last-ditch GC or not. So far,
no luck.
So, I'm trying to figure out if there is some path whereby a stalled regular GC
(without the GC_KEEP_ATOMS flag) can get restarted by the last-ditch GC such
that the GC_KEEP_ATOMS are not in effect when they should be.
Reporter | ||
Comment 10•21 years ago
|
||
For instance, I notice that the rt->gcKeepAtoms check is above the restart:
label. Can there be a case where control might goto restart: and a compile has
started since the last time that rt->gcKeepAtoms check was done?
Assignee | ||
Comment 11•21 years ago
|
||
jband: I was hot on your heels!
Patch next.
/be
Assignee | ||
Comment 12•21 years ago
|
||
There may be another bug lurking, but this should be a huge help.
/be
Assignee | ||
Updated•21 years ago
|
Status: NEW → ASSIGNED
Priority: -- → P1
Target Milestone: --- → mozilla1.8alpha5
Assignee | ||
Comment 13•21 years ago
|
||
I want to track this for branch inclusion when the fix is proven.
/be
Flags: blocking1.7.x?
Flags: blocking-aviary1.0?
Comment 14•21 years ago
|
||
Comment 15•21 years ago
|
||
Comment 16•21 years ago
|
||
As an additional data point I was running a Solaris version of this test with
the following changes:
#define THREAD_COUNT 90
#define LOOP_COUNT 5000
#define GC_FREQUENCY 90
#define IDENTIFIER_COUNT 5000
#define MAX_IDENTIFIER_LEN 512
and I came across a core dump with:
1Assertion failure: scope->ownercx == NULL, at jslock.c:1084
Abort
unfortunately the core file was huge and truncated and as a result I can't get
the stack trace. In talking to jband he mentioned that he had seen this
assertion hit in his running of the test on windows although he also hadn't
captured the stack.
Assignee | ||
Comment 17•21 years ago
|
||
> Assertion failure: scope->ownercx == NULL, at jslock.c:1084
There's no such assertion in current code. I removed it in the patch for bug
229200, replacing it with defensive code that reassociates the scope with the
context that's unlocking it.
jband, is your code nesting object (scope) locks using two or more contexts? If
anyone gets the stack, please post it. You might also try the patch from bug
229200 and see whether it has any ill effects. I still don't like its leniency,
or lack of crispness.
/be
Reporter | ||
Comment 18•21 years ago
|
||
Brendan, No. the code is attached above and is pretty simple. Contexts are used
on one thread only and there is no request nesting (except that JS_GC is called
within a request if one wants to call that nesting).
I happened to see that same assertion botch once on Win32. But, it was early on
in 'crafting' the test. So, I discounted it as likely caused by something I was
doing wrong at that point. But, bodell's instance on Solaris suggests that my
instance might have been real. Sorry I didn't get the stack that time.
So, I tried a variation of your patch where I left the setting of the gcFlag
above the label in place and also added the same block below the label. I
included an assertion like:
restart:
/* If a suspended compile is running on another context, keep atoms. */
if (rt->gcKeepAtoms)
{
JS_ASSERT(gcflags & GC_KEEP_ATOMS);
gcflags |= GC_KEEP_ATOMS;
}
I have seen the js_AllocGCThing and the ReferenceError problem with this code in
place (and without botching the assertion). So, apparently this was not the
specific fix that will light up my life.
I also added some test code to verify that js_atom_sweeper finds all atoms to be
in a marked state if the jc_GC caller has (gcflags & GC_KEEP_ATOMS). That
assertion has not botched even though I've seen both the js_AllocGCThing and the
ReferenceError problem with that test code in place.
I did *not* manage to get any of these problems to occur when a last-ditch GC
had *not* 'recently' taken place. So, last-ditch GC very much seems to be a
component of the problem.
I added a printf just before calling the last-ditch GC. And also printfs before
and after the 'regular' calls to gc. The js_AllocGCThing crash tends to happen
amid a flurry of last-ditch gc attempts (most are apparently blocked and waiting
when the crash happens). I have *not* tended to see a printf indicating a
'regular' gc preceeding a js_AllocGCThing crash (though this text might be stuck
in the stream on it's way to the console). I *have* pretty consistently been
seeing a regular gc preceeding the ReferenceError problem. Like...
...
doing last-ditch gc
doing last-ditch gc
doing last-ditch gc
start gc on thread 0 at loop 35
Error: ReferenceError:
UEWTRLPDXTDUQZXILTDIWKMQPHXOHVAVXZTVGALVSKDXAEBIAQHIVENABXXIZYAJYETRHZRODBXLCYWXXSKMXNWHGPKSYKEHT
KGRYFDLYJVTIKEJWQNELMAVULWLNQFJKMCLZGFVUVIZTAZRZBDRLEWRLZHPZELNFDFHILSQIWLDKHWRNCGGLCPII
is not defined
finish gc on thread 0 at loop 35
start gc on thread 39 at loop 40
finish gc on thread 39 at loop 40
...
I haven't put in any reasonable way to tell the real interleaving. But, it looks
very possible that I am still seeing a problem with last-ditch and 'regular' gc
fighting it out in some why in the case where the ReferenceError occurs. I
certainly do *not* see a detectable problem for every last-ditch GC.
BTW, I have still only seen that "QuoteString" crash happen one time. While the
other two problems can generally be recreated within a minute or so.
Also, I should mention that in these stacks and one's we've seen on our real
product, there remains a guess that the reordering of the linked list within a
bucket that is done in JS_HashTableRawLookup might be playing a part in causing
heap corruption when doing a lookup in a table that has gotten itself into a bad
state. That might not be what we are looking at here. But, this is all in the
neighborhood of the JSRuntime's atom table and I thought I'd bring it up.
So, I don't have it figured out. But, I still think we are in the right area. I
hope this data helps.
Reporter | ||
Comment 19•21 years ago
|
||
I hit the "QuoteString" crash again. Here some data...
- it is running the top level script.
- the opcode is JSOP_NAME
- the atom vector looks reasonable. len == 680. It is choking on 626.
- the specific atom looks fine to me except that its JSString is garbage...
- entry.next = 0
- entry.keyHash = 2538396477
- entry.key = 0x03034331c
- entry.value = 0
- flags = 0
- number = 1285079 (this is high but typical for the atoms here)
- I did the bit twiddling to get at the JSString and it is garbage. The pointer
is into a reasonable place, but the contents their are not JSString-like. Nearby
atoms in the vector have reasonable JSStrings.
I'll include a dump that starts at the address of the garbage pointed to by the
JSString*. Maybe you'll see something I don't (like some other kind of
GCThing?). My guess that this is a reasonable place in the heap is based on the
fact that the jschar data of some of the ALL CAPS identifiers my test uses
follow closely after that address in the dump. Is that a valid assumption? Or
should the string data blocks be further away from the GCThing location? I forget.
So, I'm still left thinking that either...
a) The string of an atom *somehow* didn't get marked.
OR
b) Maybe the freelist usage was whacked and this same GCThing got handed to
someone else? Perhaps that is more consistent with js_AllocGCThing problem?
Reporter | ||
Comment 20•21 years ago
|
||
Reporter | ||
Comment 21•21 years ago
|
||
Assignee | ||
Comment 22•21 years ago
|
||
Comment on attachment 165352 [details] [diff] [review]
proposed fix per jband's latest comment
This can't help if the GC lock works, because no one can start compiling
outside of a request, and all requests must wait for the GC to finish, so if it
restarts, all requests are still waiting.
Clearly something's wrong with last-ditch MT correctness. Suspect multiple
racing last-ditches.
/be
Attachment #165352 -
Attachment is obsolete: true
Reporter | ||
Comment 23•21 years ago
|
||
I added an assertion in js_AllocGCThing:
retry:
thing = rt->gcFreeList;
if (thing) {
rt->gcFreeList = thing->next;
JS_ASSERT(rt->gcFreeList != (void*)1);
I have seen this botch a couple of times. This shows an GCThing item in the list
is in a 'bad' state rather than just rt->gcFreeList getting poked with '1'
somewhere. Interestingly, I've seen thing->flagp == 0x610a3b60 on more than one
run. I haven't puzzled out how that value is set and used at this point. So, I'm
not sure if this is reasonable or not. Maybe you can tell me if there is
evidence that this is an active GCThing or free one that has been whacked or what.
Also, I put the following check after the freelist rebuild in js_gc and have not
seen it botch.
{
JSGCThing *p;
for(p = rt->gcFreeList; p; p = p->next)
{
JS_ASSERT(p != (void*)1);
}
}
Reporter | ||
Comment 24•21 years ago
|
||
Here's a stack for the botch of JS_ASSERT(rt->gcFreeList != (void*)1); I added
to js_AllocGCThing.
Assignee | ||
Comment 25•21 years ago
|
||
Comment on attachment 165476 [details]
stack for botch of JS_ASSERT(rt->gcFreeList != (void*)1);
The 1 is a string length, which overlays the next pointer of JSGCThing. IOW,
the last thing on the freelist before this js_AllocGCThing tried pulling 1 off
the list was a live JSString.
This can happen when a last ditch GC nests in the interpreter at an unsafe
point. There were recent fixes (what rev of the engine are you using?) to
JSOP_IN and JSOP_ADD, but we're focused on the compiler here, and it seems the
interpreter is not implicated.
/be
Reporter | ||
Comment 26•21 years ago
|
||
I'm told we are using JS_150_RC6.
I'm thinking about the rest :)
Reporter | ||
Comment 27•21 years ago
|
||
A couple more data points...
- I verified that the same js_AllocGCThing crash occurs using the mozilla trunk.
- I verified that the same js_AllocGCThing crash occurs even when no 'regular'
gcs are done; i.e. when all gcs are last-ditch gcs.
Reporter | ||
Comment 28•21 years ago
|
||
In the interests of further narrowing the search...
- I confirmed that it can still hit the js_AllocGCThing crash when doing only
JS_CompileScript (rather than JS_EvaluateScript) AND also reducing the generated
script to only include minimal function declatations. So the script is of the form:
...
function FOO(){}
function BAR(){}
...
It tends to take longer to crash. And, the rt->gcFreeList value tends to be a
number larger than one (but less than my max identifer length for the generated
script). I changed my assertion to the below to veryify that otherwise things
look pretty much the same.
retry:
thing = rt->gcFreeList;
if (thing) {
rt->gcFreeList = thing->next;
JS_ASSERT(!rt->gcFreeList || (int)rt->gcFreeList > 1000);
This leaves a pretty short path through the compiler. I'll attach a stack.
Reporter | ||
Comment 29•21 years ago
|
||
Reporter | ||
Comment 30•21 years ago
|
||
So, this is not a multithreaded bug! The gcFreeList gets corrupted from
last-ditch gc even when this is run with only one thread ever using JS.
I added debug code to iterate and check the members of gcFreeList in
js_AllocGCThing and also after the list rebuild in js_GC. The aim is to catch
the corruption as soon as possible after it happens. It never botches the
assertion in js_GC. The botch in js_AllocGCThing seems to have a consistent
stack. And, the value of the number has been consistently 0x14. This is with
random generated identifier strings from run to run. So, it is not likely to be
the string length of the previous identifier.
I am not a great student of the parser. I would *love* it if someone would pick
this up and solve the problem!
I'll attach a diff to the trunk that adds my asserts, the stack (using my
slightly patched trunk build), and the current version of the test program that
I'm using. Note that it can take many thousand loop iterations before it botches
the assertion.
Reporter | ||
Comment 31•21 years ago
|
||
Reporter | ||
Comment 32•21 years ago
|
||
Reporter | ||
Comment 33•21 years ago
|
||
Assignee | ||
Comment 34•21 years ago
|
||
jband: thanks again -- I'm the loser in charge here, I'll noodle on this and
figure it out, ideally before the cantina tomorrow. Stop by!
/be
Updated•21 years ago
|
Flags: blocking-aviary1.0?
Reporter | ||
Comment 35•21 years ago
|
||
Hey Brendan. Any progress? Thanks, John.
Assignee | ||
Comment 36•21 years ago
|
||
Resummarizing. Still have no clue, but I saw this in trunk firefox when testing
the mark-stack-overflow torture test in bug 203278.
/be
Summary: crashes and atom problems with mutithreaded tests → crashes where rt->gcFreeList turns into a small int after last-ditch GCs
Comment 37•21 years ago
|
||
1.7.5 has shipped. Moving request to 1.7.6.
Flags: blocking1.7.5? → blocking1.7.6?
Assignee | ||
Comment 38•21 years ago
|
||
This is a crash fix that's good for the branches.
/be
Attachment #171823 -
Flags: review+
Attachment #171823 -
Flags: approval1.7.6?
Attachment #171823 -
Flags: approval-aviary1.0.1?
Comment 39•21 years ago
|
||
Comment on attachment 171823 [details] [diff] [review]
aviary branch patch from bug 278725 for js_NewObject
a=mkaply for 1.7.6
Attachment #171823 -
Flags: approval1.7.6? → approval1.7.6+
Assignee | ||
Comment 40•21 years ago
|
||
Checked into the 1.7 branch.
What about the AVIARY 1.0.1 branch? Asa, is there going to be a
blocking-aviary1.0.1 flag? I've set the approval-aviary1.0.1 flag on the patch
(which btw, was a patch against the AVIARY 1.0 branch, and so does not apply
as-is to the 1.7 branch, because 1.7 has taken fixes for 1.7.6 that are not on
the aviary 1.0.1 radar...).
Cc'ing a few other drivers.
/be
Flags: blocking1.7.6? → blocking1.7.6+
Comment 41•21 years ago
|
||
Comment on attachment 171823 [details] [diff] [review]
aviary branch patch from bug 278725 for js_NewObject
I haven't created the blocking flag because it initially seemed like it would
just collect a lot of non-security nominations and I assumed that any blockers
would have patches and so the approval flag would be sufficient.
Now that Bugzilla has the ability to restrict a flag to a group, I could
restrict it to drivers and security and that would probably work pretty well.
Should I go ahead with that?
a=asa for 1.0.1
Attachment #171823 -
Flags: approval-aviary1.0.1? → approval-aviary1.0.1+
Assignee | ||
Comment 42•21 years ago
|
||
Fixed on the AVIARY_1_0_1_20050124_BRANCH branch too.
/be
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Flags: blocking-aviary1.0.1+
Resolution: --- → FIXED
Updated•21 years ago
|
Keywords: fixed-aviary1.0.1,
fixed1.7.6
Updated•20 years ago
|
Flags: testcase-
You need to log in
before you can comment on or make changes to this bug.
Description
•