Closed Bug 268535 Opened 21 years ago Closed 21 years ago

crashes where rt->gcFreeList turns into a small int after last-ditch GCs

Categories

(Core :: JavaScript Engine, defect, P1)

x86
Windows XP
defect

Tracking


RESOLVED FIXED
mozilla1.8alpha5

People

(Reporter: jband_mozilla, Assigned: brendan)

References

Details

(Keywords: fixed-aviary1.0.1, fixed1.7.6, js1.5)

Attachments

(16 files, 1 obsolete file)

I created a multithreaded test app in an attempt to reproduce atom-related crashes we've been seeing. It didn't expose crashes with the exact same stacks, but it does show some interesting problems. I'll attach the source. I don't have a generic Makefile for it, but it is only one .c file that builds as a standalone executable using only SpiderMonkey and NSPR. It creates one JSRuntime and a bunch of threads that each create a JSContext and loop through the evaluation of scripts, each containing a bunch of randomly generated identifiers (all within JS requests). The scripts currently look like...

    function FOO() {var FOO_1; var FOO_2}; function BAR() {var BAR_1; var BAR_2}; FOO(); BAR();

Of course, it creates a ton of those lines, and the identifier lengths are also randomized. I believe that my test code represents a valid use of the API and that all scripts are themselves valid.

I've seen this test fail in a few ways:

1) Crash in js_AllocGCThing (jsgc.c:479) where rt->gcFreeList is equal to 1 (but should be a valid pointer or NULL). This *seems* to happen when one or more other threads are doing last-ditch GCs. I've seen this a number of times.

2) A "ReferenceError" reported by the error reporter (e.g. "ReferenceError: FOO is not defined"). The identifier in question in these cases is shown as text by the error reporter. It is always actually a valid identifier that is in the script (the test code guarantees that, and I've validated the script contents in some of these cases to be sure). This error indicates that the interpreter failed to find an atom at runtime that it absolutely should have found. I've seen this a bunch of times. This *might* indicate a corrupted atom.

3) A crash in "QuoteString" (jsopcode.c:402) called by "js_AtomToPrintableString" in the case where the interpreter is trying to build the string to report a ReferenceError (as above). This looks like a case where the atom that the interpreter is trying to look up is not only not found as a property, but the atom used for the search actually represents a bogus JSString while the interpreter is still trying to use it. I've seen this once so far.

There are a bunch of #defines at the top for thread counts, GC frequency, identifier length, etc. I've been twiddling this stuff a lot and seeing different effects. This test is just a quick hack at this point. I'm hopeful that this will help in figuring out the problems we've been seeing, and that this might be a starting point for a multithreaded smoketest (at least).
Attached file source for test
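For readers without the attachment handy, a minimal sketch of the harness shape described in comment 0 (this is not the actual attachment: make_script, the counts, and the global-class boilerplate are hypothetical stand-ins; it assumes a JS_THREADSAFE SpiderMonkey build with NSPR):

    #include <stdlib.h>
    #include <string.h>
    #include "jsapi.h"
    #include "prthread.h"

    #define THREAD_COUNT 20   /* stand-ins for the tunable #defines */
    #define LOOP_COUNT   1000

    static JSRuntime *rt;

    static JSClass global_class = {
        "global", 0,
        JS_PropertyStub, JS_PropertyStub, JS_PropertyStub, JS_PropertyStub,
        JS_EnumerateStub, JS_ResolveStub, JS_ConvertStub, JS_FinalizeStub
    };

    static char *make_script(void)
    {
        /* Stand-in for the attachment's random-identifier generator. */
        return strdup("function FOO() {var FOO_1; var FOO_2}; FOO();");
    }

    /* Each thread: one JSContext, then a loop evaluating generated
     * scripts, all inside a JS request. */
    static void worker(void *arg)
    {
        JSContext *cx = JS_NewContext(rt, 8192);
        JSObject *glob;
        jsval rval;
        int i;

        (void)arg;
        JS_BeginRequest(cx);
        glob = JS_NewObject(cx, &global_class, NULL, NULL);
        JS_InitStandardClasses(cx, glob);
        for (i = 0; i < LOOP_COUNT; i++) {
            char *src = make_script();
            JS_EvaluateScript(cx, glob, src, strlen(src), "gen.js", 1, &rval);
            free(src);
        }
        JS_EndRequest(cx);
        JS_DestroyContext(cx);
    }

    int main(void)
    {
        PRThread *threads[THREAD_COUNT];
        int i;

        rt = JS_NewRuntime(8L * 1024L * 1024L);  /* one shared JSRuntime */
        for (i = 0; i < THREAD_COUNT; i++) {
            threads[i] = PR_CreateThread(PR_USER_THREAD, worker, NULL,
                                         PR_PRIORITY_NORMAL, PR_GLOBAL_THREAD,
                                         PR_JOINABLE_THREAD, 0);
        }
        for (i = 0; i < THREAD_COUNT; i++)
            PR_JoinThread(threads[i]);
        JS_DestroyRuntime(rt);
        return 0;
    }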
Keywords: js1.5
jband: thanks for the bug! Can you attach any stacks and dumps of relevant variables that you might have handy? Thanks again. /be
The ReferenceError=>QuoteString path is odd. If only last-ditch GCs are nesting within one request and interleaving with other requests, then there should be no atom GC'ing -- the last-ditch call to js_GC from js_AllocGCThing passes GC_KEEP_ATOMS. If a full GC is racing, then something must be running outside of a request, or the request/GC interlock is broken somehow. It seems much more likely that there's a hole in the model, a race or other hard-to-hit window of vulnerability. /be
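For context, a simplified self-contained model of the last-ditch path being discussed (shortened names, not the real jsgc.c; the one detail grounded in the comment above is that the last-ditch js_GC call from js_AllocGCThing passes GC_KEEP_ATOMS):

    #include <stddef.h>

    #define GC_KEEP_ATOMS 0x1

    typedef struct GCThing { struct GCThing *next; } GCThing;
    typedef struct Runtime { GCThing *gcFreeList; } Runtime;

    static void gc(Runtime *rt, unsigned gcflags)
    {
        /* Mark roots, sweep unmarked things, rebuild rt->gcFreeList.
         * With GC_KEEP_ATOMS set, the atom table is not swept, so atoms
         * held by an in-progress compile on another context survive. */
    }

    static GCThing *alloc_gc_thing(Runtime *rt)
    {
        GCThing *thing = rt->gcFreeList;
        if (!thing) {
            gc(rt, GC_KEEP_ATOMS);      /* the "last ditch" GC */
            thing = rt->gcFreeList;     /* retry after collecting */
            if (!thing)
                return NULL;            /* truly out of memory */
        }
        rt->gcFreeList = thing->next;   /* pop the free list head */
        return thing;
    }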
This looks like one of the crashes that we keep hitting here and for which I was unable to provide a nice testcase.
Attached file crash in obj_setSlot
Attachment #165339 - Attachment mime type: application/octet-stream → text/plain
Brendan, Thanks for looking into this. I didn't bother to capture stacks before. They are all shallow in this test with (I think) no big surprises. The js_AllocGCThing stack I've seen is very much like the one timeless posted. I'll capture anything interesting going forward. The js_AllocGCThing crash is clearly correlated with last-ditch GCs. I've been trying to trigger one of the other problems *without* a last-ditch GC - as a half-assed way to see if they are dependent on the last-ditch GC or not. So far, no luck. So, I'm trying to figure out if there is some path whereby a stalled regular GC (without the GC_KEEP_ATOMS flag) can get restarted by the last-ditch GC such that GC_KEEP_ATOMS is not in effect when it should be.
Attached file crash in Variables
For instance, I notice that the rt->gcKeepAtoms check is above the restart: label. Can there be a case where control might goto restart: and a compile has started since the last time that rt->gcKeepAtoms check was done?
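To make the question concrete, a self-contained sketch of the control flow being asked about (assumed from the description above, not verbatim jsgc.c; the rt->gcLevel re-entry condition is a hypothetical stand-in):

    #define GC_KEEP_ATOMS 0x1

    typedef struct Runtime { int gcKeepAtoms; int gcLevel; } Runtime;

    static void gc_shape(Runtime *rt, unsigned gcflags)
    {
        if (rt->gcKeepAtoms)            /* sampled once, above the label */
            gcflags |= GC_KEEP_ATOMS;
    restart:
        /* ... mark roots, sweep, rebuild the free list ... */
        if (rt->gcLevel > 1) {          /* hypothetical nested-GC signal */
            rt->gcLevel = 1;
            goto restart;               /* re-runs without re-sampling */
        }
    }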
jband: I was hot on your heels! Patch next. /be
There may be another bug lurking, but this should be a huge help. /be
Status: NEW → ASSIGNED
Priority: -- → P1
Target Milestone: --- → mozilla1.8alpha5
I want to track this for branch inclusion when the fix is proven. /be
Flags: blocking1.7.x?
Flags: blocking-aviary1.0?
Attached file all threads for type2
As an additional data point, I was running a Solaris version of this test with the following changes:

    #define THREAD_COUNT 90
    #define LOOP_COUNT 5000
    #define GC_FREQUENCY 90
    #define IDENTIFIER_COUNT 5000
    #define MAX_IDENTIFIER_LEN 512

and I came across a core dump with:

    Assertion failure: scope->ownercx == NULL, at jslock.c:1084
    Abort

Unfortunately the core file was huge and truncated, and as a result I can't get the stack trace. In talking to jband he mentioned that he had seen this assertion hit in his running of the test on Windows, although he also hadn't captured the stack.
> Assertion failure: scope->ownercx == NULL, at jslock.c:1084

There's no such assertion in current code. I removed it in the patch for bug 229200, replacing it with defensive code that reassociates the scope with the context that's unlocking it. jband, is your code nesting object (scope) locks using two or more contexts? If anyone gets the stack, please post it. You might also try the patch from bug 229200 and see whether it has any ill effects. I still don't like its leniency, or lack of crispness. /be
Brendan, No. The code is attached above and is pretty simple. Contexts are used on one thread only and there is no request nesting (except that JS_GC is called within a request, if one wants to call that nesting). I happened to see that same assertion botch once on Win32, but it was early on in 'crafting' the test, so I discounted it as likely caused by something I was doing wrong at that point. But bodell's instance on Solaris suggests that my instance might have been real. Sorry I didn't get the stack that time.

So, I tried a variation of your patch where I left the setting of the gcFlag above the label in place and also added the same block below the label. I included an assertion like:

    restart:
        /* If a suspended compile is running on another context, keep atoms. */
        if (rt->gcKeepAtoms) {
            JS_ASSERT(gcflags & GC_KEEP_ATOMS);
            gcflags |= GC_KEEP_ATOMS;
        }

I have seen the js_AllocGCThing and the ReferenceError problem with this code in place (and without botching the assertion). So, apparently this was not the specific fix that will light up my life.

I also added some test code to verify that js_atom_sweeper finds all atoms to be in a marked state if the js_GC caller has (gcflags & GC_KEEP_ATOMS). That assertion has not botched even though I've seen both the js_AllocGCThing and the ReferenceError problem with that test code in place.

I did *not* manage to get any of these problems to occur when a last-ditch GC had *not* 'recently' taken place. So, last-ditch GC very much seems to be a component of the problem.

I added a printf just before calling the last-ditch GC, and also printfs before and after the 'regular' calls to gc. The js_AllocGCThing crash tends to happen amid a flurry of last-ditch gc attempts (most are apparently blocked and waiting when the crash happens). I have *not* tended to see a printf indicating a 'regular' gc preceding a js_AllocGCThing crash (though this text might be stuck in the stream on its way to the console). I *have* pretty consistently been seeing a regular gc preceding the ReferenceError problem. Like...

    ...
    doing last-ditch gc
    doing last-ditch gc
    doing last-ditch gc
    start gc on thread 0 at loop 35
    Error: ReferenceError: UEWTRLPDXTDUQZXILTDIWKMQPHXOHVAVXZTVGALVSKDXAEBIAQHIVENABXXIZYAJYETRHZRODBXLCYWXXSKMXNWHGPKSYKEHTKGRYFDLYJVTIKEJWQNELMAVULWLNQFJKMCLZGFVUVIZTAZRZBDRLEWRLZHPZELNFDFHILSQIWLDKHWRNCGGLCPII is not defined
    finish gc on thread 0 at loop 35
    start gc on thread 39 at loop 40
    finish gc on thread 39 at loop 40
    ...

I haven't put in any reasonable way to tell the real interleaving. But it looks very possible that I am still seeing a problem with last-ditch and 'regular' gc fighting it out in some way in the case where the ReferenceError occurs. I certainly do *not* see a detectable problem for every last-ditch GC.

BTW, I have still only seen that "QuoteString" crash happen one time, while the other two problems can generally be recreated within a minute or so.

Also, I should mention that in these stacks and ones we've seen in our real product, there remains a guess that the reordering of the linked list within a bucket that is done in JS_HashTableRawLookup might be playing a part in causing heap corruption when doing a lookup in a table that has gotten itself into a bad state. That might not be what we are looking at here, but this is all in the neighborhood of the JSRuntime's atom table and I thought I'd bring it up.

So, I don't have it figured out. But I still think we are in the right area. I hope this data helps.
I hit the "QuoteString" crash again. Here some data... - it is running the top level script. - the opcode is JSOP_NAME - the atom vector looks reasonable. len == 680. It is choking on 626. - the specific atom looks fine to me except that its JSString is garbage... - entry.next = 0 - entry.keyHash = 2538396477 - entry.key = 0x03034331c - entry.value = 0 - flags = 0 - number = 1285079 (this is high but typical for the atoms here) - I did the bit twiddling to get at the JSString and it is garbage. The pointer is into a reasonable place, but the contents their are not JSString-like. Nearby atoms in the vector have reasonable JSStrings. I'll include a dump that starts at the address of the garbage pointed to by the JSString*. Maybe you'll see something I don't (like some other kind of GCThing?). My guess that this is a reasonable place in the heap is based on the fact that the jschar data of some of the ALL CAPS identifiers my test uses follow closely after that address in the dump. Is that a valid assumption? Or should the string data blocks be further away from the GCThing location? I forget. So, I'm still left thinking that either... a) The string of an atom *somehow* didn't get marked. OR b) Maybe the freelist usage was whacked and this same GCThing got handed to someone else? Perhaps that is more consistent with js_AllocGCThing problem?
Comment on attachment 165352 [details] [diff] [review]
proposed fix per jband's latest comment

This can't help if the GC lock works, because no one can start compiling outside of a request, and all requests must wait for the GC to finish, so if it restarts, all requests are still waiting. Clearly something's wrong with last-ditch MT correctness. Suspect multiple racing last-ditches. /be
Attachment #165352 - Attachment is obsolete: true
I added an assertion in js_AllocGCThing:

    retry:
        thing = rt->gcFreeList;
        if (thing) {
            rt->gcFreeList = thing->next;
            JS_ASSERT(rt->gcFreeList != (void*)1);

I have seen this botch a couple of times. This shows a GCThing item in the list is in a 'bad' state rather than just rt->gcFreeList getting poked with '1' somewhere.

Interestingly, I've seen thing->flagp == 0x610a3b60 on more than one run. I haven't puzzled out how that value is set and used at this point, so I'm not sure if this is reasonable or not. Maybe you can tell me if there is evidence that this is an active GCThing, or a free one that has been whacked, or what.

Also, I put the following check after the freelist rebuild in js_GC and have not seen it botch:

    {
        JSGCThing *p;
        for (p = rt->gcFreeList; p; p = p->next) {
            JS_ASSERT(p != (void*)1);
        }
    }
Here's a stack for the botch of JS_ASSERT(rt->gcFreeList != (void*)1); I added to js_AllocGCThing.
Comment on attachment 165476 [details]
stack for botch of JS_ASSERT(rt->gcFreeList != (void*)1);

The 1 is a string length, which overlays the next pointer of JSGCThing. IOW, the last thing on the freelist before this js_AllocGCThing tried pulling 1 off the list was a live JSString. This can happen when a last-ditch GC nests in the interpreter at an unsafe point. There were recent fixes (what rev of the engine are you using?) to JSOP_IN and JSOP_ADD, but we're focused on the compiler here, and it seems the interpreter is not implicated. /be
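A tiny self-contained demonstration of that overlay (hypothetical simplified layouts; the real jsgc.h/jsstr.h differ in detail, but the point is that the first word of a GC cell is shared):

    #include <assert.h>
    #include <stddef.h>

    /* Simplified stand-ins: the GC thing's `next` pointer and the
     * string's `length` occupy the same leading word of a GC cell. */
    typedef struct GCThing { struct GCThing *next; void *flagp; } GCThing;
    typedef struct FakeString { size_t length; void *chars; } FakeString;

    int main(void)
    {
        GCThing cell;
        FakeString *str = (FakeString *)&cell;

        /* A live one-char string erroneously left threaded on the free
         * list: storing its length stamps the overlapping `next` word... */
        str->length = 1;

        /* ...so the next allocation that pops this cell reads next == 1,
         * which becomes the bogus small-int rt->gcFreeList. */
        assert(cell.next == (GCThing *)1);
        return 0;
    }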
I'm told we are using JS_150_RC6. I'm thinking about the rest :)
A couple more data points...

- I verified that the same js_AllocGCThing crash occurs using the mozilla trunk.
- I verified that the same js_AllocGCThing crash occurs even when no 'regular' gcs are done; i.e. when all gcs are last-ditch gcs.
In the interests of further narrowing the search... I confirmed that it can still hit the js_AllocGCThing crash when doing only JS_CompileScript (rather than JS_EvaluateScript) AND also reducing the generated script to only include minimal function declarations. So the script is of the form:

    ...
    function FOO(){}
    function BAR(){}
    ...

It tends to take longer to crash, and the rt->gcFreeList value tends to be a number larger than one (but less than my max identifier length for the generated script). I changed my assertion to the below to verify that otherwise things look pretty much the same:

    retry:
        thing = rt->gcFreeList;
        if (thing) {
            rt->gcFreeList = thing->next;
            JS_ASSERT(!rt->gcFreeList || (int)rt->gcFreeList > 1000);

This leaves a pretty short path through the compiler. I'll attach a stack.
So, this is not a multithreaded bug! The gcFreeList gets corrupted by a last-ditch gc even when this is run with only one thread ever using JS.

I added debug code to iterate and check the members of gcFreeList in js_AllocGCThing and also after the list rebuild in js_GC. The aim is to catch the corruption as soon as possible after it happens. It never botches the assertion in js_GC. The botch in js_AllocGCThing seems to have a consistent stack, and the value of the number has been consistently 0x14. This is with randomly generated identifier strings from run to run, so it is not likely to be the string length of the previous identifier.

I am not a great student of the parser. I would *love* it if someone would pick this up and solve the problem! I'll attach a diff to the trunk that adds my asserts, the stack (using my slightly patched trunk build), and the current version of the test program that I'm using. Note that it can take many thousand loop iterations before it botches the assertion.
jband: thanks again -- I'm the loser in charge here, I'll noodle on this and figure it out, ideally before the cantina tomorrow. Stop by! /be
Flags: blocking-aviary1.0?
Hey Brendan. Any progress? Thanks, John.
Resummarizing. Still have no clue, but I saw this in trunk firefox when testing the mark-stack-overflow torture test in bug 203278. /be
Summary: crashes and atom problems with mutithreaded tests → crashes where rt->gcFreeList turns into a small int after last-ditch GCs
1.7.5 has shipped. Moving request to 1.7.6.
Flags: blocking1.7.5? → blocking1.7.6?
This is a crash fix that's good for the branches. /be
Attachment #171823 - Flags: review+
Attachment #171823 - Flags: approval1.7.6?
Attachment #171823 - Flags: approval-aviary1.0.1?
Comment on attachment 171823 [details] [diff] [review]
aviary branch patch from bug 278725 for js_NewObject

a=mkaply for 1.7.6
Attachment #171823 - Flags: approval1.7.6? → approval1.7.6+
Blocks: 279636
Checked into the 1.7 branch. What about the AVIARY 1.0.1 branch? Asa, is there going to be a blocking-aviary1.0.1 flag? I've set the approval-aviary1.0.1 flag on the patch (which btw, was a patch against the AVIARY 1.0 branch, and so does not apply as-is to the 1.7 branch, because 1.7 has taken fixes for 1.7.6 that are not on the aviary 1.0.1 radar...). Cc'ing a few other drivers. /be
Flags: blocking1.7.6? → blocking1.7.6+
Comment on attachment 171823 [details] [diff] [review]
aviary branch patch from bug 278725 for js_NewObject

I haven't created the blocking flag because it initially seemed like it would just collect a lot of non-security nominations, and I assumed that any blockers would have patches and so the approval flag would be sufficient. Now that Bugzilla has the ability to restrict a flag to a group, I could restrict it to drivers and security and that would probably work pretty well. Should I go ahead with that?

a=asa for 1.0.1
Attachment #171823 - Flags: approval-aviary1.0.1? → approval-aviary1.0.1+
Fixed on the AVIARY_1_0_1_20050124_BRANCH branch too. /be
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Flags: blocking-aviary1.0.1+
Resolution: --- → FIXED
Flags: testcase-