Closed Bug 268535 Opened 21 years ago Closed 21 years ago

crashes where rt->gcFreeList turns into a small int after last-ditch GCs

Categories

(Core :: JavaScript Engine, defect, P1)

x86
Windows XP
defect

Tracking


RESOLVED FIXED
mozilla1.8alpha5

People

(Reporter: jband_mozilla, Assigned: brendan)

References

Details

(Keywords: fixed-aviary1.0.1, fixed1.7.6, js1.5)

Attachments

(16 files, 1 obsolete file)

I created a multithreaded test app in an attempt to reproduce atom-related crashes we've been seeing. It didn't expose crashes with the exact same stacks, but it does show some interesting problems. I'll attach the source. I don't have a generic Makefile for it, but it is only one .c file that builds as a standalone executable using only SpiderMonkey and NSPR. It creates one JSRuntime and a bunch of threads that each create a JSContext and loop through the evaluation of scripts, each containing a bunch of randomly generated identifiers (all within JS requests). The scripts currently look like...

    function FOO() {var FOO_1; var FOO_2}; function BAR() {var BAR_1; var BAR_2}; FOO(); BAR();

Of course, it creates a ton of those lines, and the identifier lengths are also randomized. I believe that my test code represents a valid use of the API and that all scripts are themselves valid.

I've seen this test fail in a few ways:

1) Crash in js_AllocGCThing (jsgc.c:479) where rt->gcFreeList is equal to 1 (but should be a valid pointer or NULL). This *seems* to happen when one or more other threads are doing last-ditch GCs. I've seen this a number of times.

2) A "ReferenceError" reported by the error reporter (e.g. "ReferenceError: FOO is not defined"). The identifier in question in these cases is shown as text by the error reporter. It is always actually a valid identifier that is in the script (the test code guarantees that, and I've validated the script contents in some of these cases to be sure). This error indicates that the interpreter failed to find an atom at runtime that it absolutely should have found. I've seen this a bunch of times. This *might* indicate a corrupted atom.

3) A crash in "QuoteString" (jsopcode.c:402) called by "js_AtomToPrintableString" in the case where the interpreter is trying to build the string to report a ReferenceError (as above). This looks like a case where the atom that the interpreter is trying to look up is not only not found as a property, but the atom used for the search actually represents a bogus JSString while the interpreter is still trying to use it. I've seen this once so far.

There are a bunch of #defines at the top for thread counts, GC frequency, identifier length, etc. I've been twiddling this stuff a lot and seeing different effects. This test is just a quick hack at this point. I'm hopeful that this will help in figuring out the problems we've been seeing, and that this might be a starting point for a multithreaded smoketest (at least).
Attached file source for test
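For readers without the attachment handy, a minimal sketch of the harness shape described in comment 0 (this is not the actual attachment: make_script, the counts, and the global-class boilerplate are hypothetical stand-ins; it assumes a JS_THREADSAFE SpiderMonkey build with NSPR):

    #include <stdlib.h>
    #include <string.h>
    #include "jsapi.h"
    #include "prthread.h"

    #define THREAD_COUNT 20   /* stand-ins for the tunable #defines */
    #define LOOP_COUNT   1000

    static JSRuntime *rt;

    static JSClass global_class = {
        "global", 0,
        JS_PropertyStub, JS_PropertyStub, JS_PropertyStub, JS_PropertyStub,
        JS_EnumerateStub, JS_ResolveStub, JS_ConvertStub, JS_FinalizeStub
    };

    static char *make_script(void)
    {
        /* Stand-in for the attachment's random-identifier generator. */
        return strdup("function FOO() {var FOO_1; var FOO_2}; FOO();");
    }

    /* Each thread: one JSContext, then a loop evaluating generated
     * scripts, all inside a JS request. */
    static void worker(void *arg)
    {
        JSContext *cx = JS_NewContext(rt, 8192);
        JSObject *glob;
        jsval rval;
        int i;

        (void)arg;
        JS_BeginRequest(cx);
        glob = JS_NewObject(cx, &global_class, NULL, NULL);
        JS_InitStandardClasses(cx, glob);
        for (i = 0; i < LOOP_COUNT; i++) {
            char *src = make_script();
            JS_EvaluateScript(cx, glob, src, strlen(src), "gen.js", 1, &rval);
            free(src);
        }
        JS_EndRequest(cx);
        JS_DestroyContext(cx);
    }

    int main(void)
    {
        PRThread *threads[THREAD_COUNT];
        int i;

        rt = JS_NewRuntime(8L * 1024L * 1024L);  /* one shared JSRuntime */
        for (i = 0; i < THREAD_COUNT; i++) {
            threads[i] = PR_CreateThread(PR_USER_THREAD, worker, NULL,
                                         PR_PRIORITY_NORMAL, PR_GLOBAL_THREAD,
                                         PR_JOINABLE_THREAD, 0);
        }
        for (i = 0; i < THREAD_COUNT; i++)
            PR_JoinThread(threads[i]);
        JS_DestroyRuntime(rt);
        return 0;
    }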
Keywords: js1.5
jband: thanks for the bug! Can you attach any stacks and dumps of relevant variables that you might have handy? Thanks again. /be
The ReferenceError=>QuoteString path is odd. If only last-ditch GCs are nesting within one request and interleaving with other requests, then there should be no atom GC'ing -- the last-ditch call to js_GC from js_AllocGCThing passes GC_KEEP_ATOMS. If a full GC is racing, then something must be running outside of a request, or the request/GC interlock is broken somehow. It seems much more likely that there's a hole in the model, a race or other hard-to-hit window of vulnerability. /be
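For context, a simplified self-contained model of the last-ditch path being discussed (shortened names, not the real jsgc.c; the one detail grounded in the comment above is that the last-ditch js_GC call from js_AllocGCThing passes GC_KEEP_ATOMS):

    #include <stddef.h>

    #define GC_KEEP_ATOMS 0x1

    typedef struct GCThing { struct GCThing *next; } GCThing;
    typedef struct Runtime { GCThing *gcFreeList; } Runtime;

    static void gc(Runtime *rt, unsigned gcflags)
    {
        /* Mark roots, sweep unmarked things, rebuild rt->gcFreeList.
         * With GC_KEEP_ATOMS set, the atom table is not swept, so atoms
         * held by an in-progress compile on another context survive. */
    }

    static GCThing *alloc_gc_thing(Runtime *rt)
    {
        GCThing *thing = rt->gcFreeList;
        if (!thing) {
            gc(rt, GC_KEEP_ATOMS);      /* the "last ditch" GC */
            thing = rt->gcFreeList;     /* retry after collecting */
            if (!thing)
                return NULL;            /* truly out of memory */
        }
        rt->gcFreeList = thing->next;   /* pop the free list head */
        return thing;
    }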
This looks like one of the crashes that we keep hitting here and for which I was unable to provide a nice testcase.
Attached file crash in obj_setSlot
Attachment #165339 - Attachment mime type: application/octet-stream → text/plain
Brendan, Thanks for looking into this. I didn't bother to capture stacks before. They are all shallow in this test with (I think) no big surprises. The js_AllocGCThing stack I've seen is very much like the one timeless posted. I'll capture anything interesting going forward. The js_AllocGCThing crash is clearly correlated with last-ditch GCs. I've been trying to trigger one of the other problems *without* a last-ditch GC - as a half-assed way to see if they are dependent on the last-ditch GC or not. So far, no luck. So, I'm trying to figure out if there is some path whereby a stalled regular GC (without the GC_KEEP_ATOMS flag) can get restarted by the last-ditch GC such that GC_KEEP_ATOMS is not in effect when it should be.
Attached file crash in Variables
For instance, I notice that the rt->gcKeepAtoms check is above the restart: label. Can there be a case where control might goto restart: and a compile has started since the last time that rt->gcKeepAtoms check was done?
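To make the question concrete, a self-contained sketch of the control flow being asked about (assumed from the description above, not verbatim jsgc.c; the rt->gcLevel re-entry condition is a hypothetical stand-in):

    #define GC_KEEP_ATOMS 0x1

    typedef struct Runtime { int gcKeepAtoms; int gcLevel; } Runtime;

    static void gc_shape(Runtime *rt, unsigned gcflags)
    {
        if (rt->gcKeepAtoms)            /* sampled once, above the label */
            gcflags |= GC_KEEP_ATOMS;
    restart:
        /* ... mark roots, sweep, rebuild the free list ... */
        if (rt->gcLevel > 1) {          /* hypothetical nested-GC signal */
            rt->gcLevel = 1;
            goto restart;               /* re-runs without re-sampling */
        }
    }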
jband: I was hot on your heels! Patch next. /be
There may be another bug lurking, but this should be a huge help. /be
Status: NEW → ASSIGNED
Priority: -- → P1
Target Milestone: --- → mozilla1.8alpha5
I want to track this for branch inclusion when the fix is proven. /be
Flags: blocking1.7.x?
Flags: blocking-aviary1.0?
Attached file all threads for type2
As an additional data point, I was running a Solaris version of this test with the following changes:

    #define THREAD_COUNT 90
    #define LOOP_COUNT 5000
    #define GC_FREQUENCY 90
    #define IDENTIFIER_COUNT 5000
    #define MAX_IDENTIFIER_LEN 512

and I came across a core dump with:

    Assertion failure: scope->ownercx == NULL, at jslock.c:1084
    Abort

Unfortunately the core file was huge and truncated, and as a result I can't get the stack trace. In talking to jband he mentioned that he had seen this assertion hit in his running of the test on Windows, although he also hadn't captured the stack.
> Assertion failure: scope->ownercx == NULL, at jslock.c:1084

There's no such assertion in current code. I removed it in the patch for bug 229200, replacing it with defensive code that reassociates the scope with the context that's unlocking it. jband, is your code nesting object (scope) locks using two or more contexts? If anyone gets the stack, please post it. You might also try the patch from bug 229200 and see whether it has any ill effects. I still don't like its leniency, or lack of crispness. /be
Brendan, No. The code is attached above and is pretty simple. Contexts are used on one thread only and there is no request nesting (except that JS_GC is called within a request, if one wants to call that nesting). I happened to see that same assertion botch once on Win32, but it was early on in 'crafting' the test, so I discounted it as likely caused by something I was doing wrong at that point. But bodell's instance on Solaris suggests that my instance might have been real. Sorry I didn't get the stack that time.

So, I tried a variation of your patch where I left the setting of the gcFlag above the label in place and also added the same block below the label. I included an assertion like:

    restart:
        /* If a suspended compile is running on another context, keep atoms. */
        if (rt->gcKeepAtoms) {
            JS_ASSERT(gcflags & GC_KEEP_ATOMS);
            gcflags |= GC_KEEP_ATOMS;
        }

I have seen the js_AllocGCThing and the ReferenceError problem with this code in place (and without botching the assertion). So, apparently this was not the specific fix that will light up my life.

I also added some test code to verify that js_atom_sweeper finds all atoms to be in a marked state if the js_GC caller has (gcflags & GC_KEEP_ATOMS). That assertion has not botched even though I've seen both the js_AllocGCThing and the ReferenceError problem with that test code in place.

I did *not* manage to get any of these problems to occur when a last-ditch GC had *not* 'recently' taken place. So, last-ditch GC very much seems to be a component of the problem.

I added a printf just before calling the last-ditch GC, and also printfs before and after the 'regular' calls to gc. The js_AllocGCThing crash tends to happen amid a flurry of last-ditch gc attempts (most are apparently blocked and waiting when the crash happens). I have *not* tended to see a printf indicating a 'regular' gc preceding a js_AllocGCThing crash (though this text might be stuck in the stream on its way to the console). I *have* pretty consistently been seeing a regular gc preceding the ReferenceError problem. Like...

    ...
    doing last-ditch gc
    doing last-ditch gc
    doing last-ditch gc
    start gc on thread 0 at loop 35
    Error: ReferenceError: UEWTRLPDXTDUQZXILTDIWKMQPHXOHVAVXZTVGALVSKDXAEBIAQHIVENABXXIZYAJYETRHZRODBXLCYWXXSKMXNWHGPKSYKEHTKGRYFDLYJVTIKEJWQNELMAVULWLNQFJKMCLZGFVUVIZTAZRZBDRLEWRLZHPZELNFDFHILSQIWLDKHWRNCGGLCPII is not defined
    finish gc on thread 0 at loop 35
    start gc on thread 39 at loop 40
    finish gc on thread 39 at loop 40
    ...

I haven't put in any reasonable way to tell the real interleaving. But it looks very possible that I am still seeing a problem with last-ditch and 'regular' gc fighting it out in some way in the case where the ReferenceError occurs. I certainly do *not* see a detectable problem for every last-ditch GC.

BTW, I have still only seen that "QuoteString" crash happen one time, while the other two problems can generally be recreated within a minute or so.

Also, I should mention that in these stacks and ones we've seen in our real product, there remains a guess that the reordering of the linked list within a bucket that is done in JS_HashTableRawLookup might be playing a part in causing heap corruption when doing a lookup in a table that has gotten itself into a bad state. That might not be what we are looking at here, but this is all in the neighborhood of the JSRuntime's atom table and I thought I'd bring it up.

So, I don't have it figured out. But I still think we are in the right area. I hope this data helps.
I hit the "QuoteString" crash again. Here some data... - it is running the top level script. - the opcode is JSOP_NAME - the atom vector looks reasonable. len == 680. It is choking on 626. - the specific atom looks fine to me except that its JSString is garbage... - entry.next = 0 - entry.keyHash = 2538396477 - entry.key = 0x03034331c - entry.value = 0 - flags = 0 - number = 1285079 (this is high but typical for the atoms here) - I did the bit twiddling to get at the JSString and it is garbage. The pointer is into a reasonable place, but the contents their are not JSString-like. Nearby atoms in the vector have reasonable JSStrings. I'll include a dump that starts at the address of the garbage pointed to by the JSString*. Maybe you'll see something I don't (like some other kind of GCThing?). My guess that this is a reasonable place in the heap is based on the fact that the jschar data of some of the ALL CAPS identifiers my test uses follow closely after that address in the dump. Is that a valid assumption? Or should the string data blocks be further away from the GCThing location? I forget. So, I'm still left thinking that either... a) The string of an atom *somehow* didn't get marked. OR b) Maybe the freelist usage was whacked and this same GCThing got handed to someone else? Perhaps that is more consistent with js_AllocGCThing problem?
Comment on attachment 165352 [details] [diff] [review]
proposed fix per jband's latest comment

This can't help if the GC lock works, because no one can start compiling outside of a request, and all requests must wait for the GC to finish, so if it restarts, all requests are still waiting. Clearly something's wrong with last-ditch MT correctness. Suspect multiple racing last-ditches. /be
Attachment #165352 - Attachment is obsolete: true
I added an assertion in js_AllocGCThing:

    retry:
        thing = rt->gcFreeList;
        if (thing) {
            rt->gcFreeList = thing->next;
            JS_ASSERT(rt->gcFreeList != (void*)1);

I have seen this botch a couple of times. This shows a GCThing item in the list is in a 'bad' state rather than just rt->gcFreeList getting poked with '1' somewhere.

Interestingly, I've seen thing->flagp == 0x610a3b60 on more than one run. I haven't puzzled out how that value is set and used at this point, so I'm not sure if this is reasonable or not. Maybe you can tell me if there is evidence that this is an active GCThing, or a free one that has been whacked, or what.

Also, I put the following check after the freelist rebuild in js_GC and have not seen it botch:

    {
        JSGCThing *p;
        for (p = rt->gcFreeList; p; p = p->next) {
            JS_ASSERT(p != (void*)1);
        }
    }
Here's a stack for the botch of JS_ASSERT(rt->gcFreeList != (void*)1); I added to js_AllocGCThing.
Comment on attachment 165476 [details]
stack for botch of JS_ASSERT(rt->gcFreeList != (void*)1);

The 1 is a string length, which overlays the next pointer of JSGCThing. IOW, the last thing on the freelist before this js_AllocGCThing tried pulling 1 off the list was a live JSString. This can happen when a last-ditch GC nests in the interpreter at an unsafe point. There were recent fixes (what rev of the engine are you using?) to JSOP_IN and JSOP_ADD, but we're focused on the compiler here, and it seems the interpreter is not implicated. /be
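A tiny self-contained demonstration of that overlay (hypothetical simplified layouts; the real jsgc.h/jsstr.h differ in detail, but the point is that the first word of a GC cell is shared):

    #include <assert.h>
    #include <stddef.h>

    /* Simplified stand-ins: the GC thing's `next` pointer and the
     * string's `length` occupy the same leading word of a GC cell. */
    typedef struct GCThing { struct GCThing *next; void *flagp; } GCThing;
    typedef struct FakeString { size_t length; void *chars; } FakeString;

    int main(void)
    {
        GCThing cell;
        FakeString *str = (FakeString *)&cell;

        /* A live one-char string erroneously left threaded on the free
         * list: storing its length stamps the overlapping `next` word... */
        str->length = 1;

        /* ...so the next allocation that pops this cell reads next == 1,
         * which becomes the bogus small-int rt->gcFreeList. */
        assert(cell.next == (GCThing *)1);
        return 0;
    }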
I'm told we are using JS_150_RC6. I'm thinking about the rest :)
A couple more data points...

- I verified that the same js_AllocGCThing crash occurs using the mozilla trunk.
- I verified that the same js_AllocGCThing crash occurs even when no 'regular' gcs are done; i.e. when all gcs are last-ditch gcs.
In the interests of further narrowing the search... I confirmed that it can still hit the js_AllocGCThing crash when doing only JS_CompileScript (rather than JS_EvaluateScript) AND also reducing the generated script to only include minimal function declarations. So the script is of the form:

    ...
    function FOO(){}
    function BAR(){}
    ...

It tends to take longer to crash, and the rt->gcFreeList value tends to be a number larger than one (but less than my max identifier length for the generated script). I changed my assertion to the below to verify that otherwise things look pretty much the same:

    retry:
        thing = rt->gcFreeList;
        if (thing) {
            rt->gcFreeList = thing->next;
            JS_ASSERT(!rt->gcFreeList || (int)rt->gcFreeList > 1000);

This leaves a pretty short path through the compiler. I'll attach a stack.
So, this is not a multithreaded bug! The gcFreeList gets corrupted by a last-ditch gc even when this is run with only one thread ever using JS.

I added debug code to iterate and check the members of gcFreeList in js_AllocGCThing and also after the list rebuild in js_GC. The aim is to catch the corruption as soon as possible after it happens. It never botches the assertion in js_GC. The botch in js_AllocGCThing seems to have a consistent stack, and the value of the number has been consistently 0x14. This is with randomly generated identifier strings from run to run, so it is not likely to be the string length of the previous identifier.

I am not a great student of the parser. I would *love* it if someone would pick this up and solve the problem! I'll attach a diff to the trunk that adds my asserts, the stack (using my slightly patched trunk build), and the current version of the test program that I'm using. Note that it can take many thousand loop iterations before it botches the assertion.
jband: thanks again -- I'm the loser in charge here, I'll noodle on this and figure it out, ideally before the cantina tomorrow. Stop by! /be
Flags: blocking-aviary1.0?
Hey Brendan. Any progress? Thanks, John.
Resummarizing. Still have no clue, but I saw this in trunk firefox when testing the mark-stack-overflow torture test in bug 203278. /be
Summary: crashes and atom problems with mutithreaded tests → crashes where rt->gcFreeList turns into a small int after last-ditch GCs
1.7.5 has shipped. Moving request to 1.7.6.
Flags: blocking1.7.5? → blocking1.7.6?
This is a crash fix that's good for the branches. /be
Attachment #171823 - Flags: review+
Attachment #171823 - Flags: approval1.7.6?
Attachment #171823 - Flags: approval-aviary1.0.1?
Comment on attachment 171823 [details] [diff] [review]
aviary branch patch from bug 278725 for js_NewObject

a=mkaply for 1.7.6
Attachment #171823 - Flags: approval1.7.6? → approval1.7.6+
Blocks: 279636
Checked into the 1.7 branch. What about the AVIARY 1.0.1 branch? Asa, is there going to be a blocking-aviary1.0.1 flag? I've set the approval-aviary1.0.1 flag on the patch (which btw, was a patch against the AVIARY 1.0 branch, and so does not apply as-is to the 1.7 branch, because 1.7 has taken fixes for 1.7.6 that are not on the aviary 1.0.1 radar...). Cc'ing a few other drivers. /be
Flags: blocking1.7.6? → blocking1.7.6+
Comment on attachment 171823 [details] [diff] [review]
aviary branch patch from bug 278725 for js_NewObject

I haven't created the blocking flag because it initially seemed like it would just collect a lot of non-security nominations, and I assumed that any blockers would have patches and so the approval flag would be sufficient. Now that Bugzilla has the ability to restrict a flag to a group, I could restrict it to drivers and security and that would probably work pretty well. Should I go ahead with that?

a=asa for 1.0.1
Attachment #171823 - Flags: approval-aviary1.0.1? → approval-aviary1.0.1+
Fixed on the AVIARY_1_0_1_20050124_BRANCH branch too. /be
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Flags: blocking-aviary1.0.1+
Resolution: --- → FIXED
Flags: testcase-