Closed Bug 670702 Opened 13 years ago Closed 6 years ago

GC minidump findings/analysis

Categories: Core :: JavaScript Engine, defect
Priority: Not set
Severity: normal
Status: RESOLVED WONTFIX
People: Reporter: billm; Assignee: Unassigned
Attachments: 1 file

It seems useful to have a single bug for findings from the latest minidump analysis. This will be that bug.

First, the stack traces for previous GCs are not useful. There was one PushMarkStack crash (52d9b486-9d38-45c7-891b-77ca02110710) where we crashed in the first GC. In all the other cases I looked at, the previous GC happened based on a timer, so there was nothing interesting on the stack.

The ring buffer information about previous GCs was a little useful. In every case I saw, there were no shape-regenerating GCs or compartment GCs. I suspect this is due to Gregor's recent timer-based GC patch. Since there are no compartment GCs, we can say with some certainty that the crashes are not caused by pointers incorrectly crossing compartments.

However, there have been a number of intentional crashes caused by witnessing invalid cross-compartment pointers. It's possible, though, that the assertion is too strong and nothing bad is actually happening. I'll need to look at these more closely.

Many of the crashes still seem to be tied to invalid scripts. This is true of the crashes in JSCompartment::purge, MarkAtomRange, and ScanRope (though not all of them). I think these crashes warrant more diagnostics. Maybe we should poison scripts when they're freed, and also add magic cookies to them so we can tell whether what we have is even a script at all.

Interestingly, I have not seen a crash yet where the crash address is 0xdadadada. This means that we are not trying to dereference memory that has been GCed. This is quite unexpected, at least to me.
Cool stuff. How can I see the crashes?
(In reply to comment #1)
> Cool stuff. How can I see the crashes?

The instrumentation is present in the 20100709 nightlies. It added a few things:
- When the GC traces across compartments when it shouldn't, we intentionally crash in JS_Assert.
- Objects are poisoned with 0xdadadada after the GC sweeps them (just as they are normally in debug builds).
- There's also a special region of the minidump where some extra info gets saved:
  * Each GC is recorded in a ring buffer, along with the time and whether it was a compartment GC.
  * A stack trace is taken at the end of each GC, so now if there's a crash within the GC, we have the current stack as well as a stack for the previous GC.
  * If we ever OOM, we save the stack.

The dumps are available on crash-stats as usual. You have to check the buildid to make sure it's new enough. I have some tools for extracting the extra stack data from the minidumps. I can post them if you're interested.
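(For anyone wondering what that extra region roughly contains: the sketch below shows one way such a per-GC ring buffer could look. It is just an illustration; the struct names, field layout, and buffer size are hypothetical, not the actual instrumentation code.)

  // Hypothetical sketch of the per-GC ring buffer described above.
  #include <cstdint>
  #include <cstring>
  #include <ctime>

  struct GCRecord {
      int64_t when;              // time the GC finished
      bool    isCompartmentGC;   // single-compartment GC vs. full GC
  };

  struct GCRingBuffer {
      static const unsigned kEntries = 32;
      GCRecord records[kEntries];
      uint32_t next;

      void init() { std::memset(records, 0, sizeof(records)); next = 0; }

      // Called at the end of every GC; wraps around once full, so the
      // minidump always contains the most recent kEntries GCs.
      void recordGC(bool isCompartmentGC) {
          records[next % kEntries].when = (int64_t)std::time(nullptr);
          records[next % kEntries].isCompartmentGC = isCompartmentGC;
          next++;
      }
  };

  // The crash reporter is then pointed at this buffer (plus a stack captured
  // at the end of each GC) so that it shows up as an extra minidump region.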
MarkObjectRange is another apparent js_TraceScript crash.
(In reply to comment #2)
> (In reply to comment #1)
> > Cool stuff. How can I see the crashes?
> 
> The instrumentation is present in the 20100709 nightlies. It added a few
> things:
> - When the GC traces across compartments when it shouldn't, we intentionally
> crash in JS_Assert.
> - Objects are poisoned with 0xdadadada after the GC sweeps them (just as
> they are normally in debug builds).

I remember the memset being a 30% sweep regression, so it was never enabled in optimized builds. I guess this will be removed for the next release?
Yes, this stuff will all come out soon.
Assignee: general → wmccloskey
Blake has confirmed that the cross-compartment pointer assertion signals an actual problem in the code. We're going to add some additional instrumentation to try to get more data on that. I filed bug 671080 for that.

I filed the script poisoning stuff as bug 671113.
Blocks: 663691
Depends on: 662646
I got the first crash report from the script poisoning diagnostics back. (These are somewhat rare, about 0-1 per day.) The results are interesting but puzzling.

The poisoning patch worked as follows: I inserted a number of sanity checks when we create and destroy scripts; we didn't hit those. I also overwrote scripts with garbage when they are deleted. Finally, right before the point where we've been crashing in JSCompartment::purge, I added an extra loop that iterates over all scripts and checks that they are not garbage (by verifying that a cookie field has the right value).

We now crash in this check, right before JSCompartment::purge: it detects that one of the cookies is wrong. The minidump contains the contents of the bad script as well as its predecessor in the list of scripts. As far as I can tell, everything looks valid except for a single-bit error in one of the cookies: rather than having the value 0x00c00cee, it has 0x20c00cee. I have no idea why this would happen.
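(Roughly speaking, the extra check has the shape of the sketch below. The 0x00c00cee cookie value comes from this comment; the type names, the list representation, and the helper itself are hypothetical.)

  // Hypothetical sketch of the pre-purge cookie walk described above.
  #include <cstdint>
  #include <cstdio>
  #include <cstdlib>

  static const uint32_t SCRIPT_COOKIE = 0x00c00cee;

  struct FakeScript {
      uint32_t    cookie;   // SCRIPT_COOKIE at creation, garbage after a bad free
      FakeScript* next;     // a compartment's scripts, as a singly linked list
      // ... the rest of the script data ...
  };

  // Walk every script and crash on purpose if a cookie has been corrupted,
  // so both the bad script and its predecessor end up in the minidump.
  void CheckScriptCookies(FakeScript* listHead) {
      FakeScript* prev = nullptr;
      for (FakeScript* script = listHead; script; script = script->next) {
          if (script->cookie != SCRIPT_COOKIE) {
              std::fprintf(stderr, "bad script cookie %08x (prev=%p)\n",
                           (unsigned)script->cookie, (void*)prev);
              std::abort();   // intentional crash so crash-stats captures it
          }
          prev = script;
      }
  }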

It's probably best to wait for more of these and see what happens. Maybe this one was just a random instance of memory corruption, and the rest will be different.
I found three more crashdumps. (By the way, we reeeeeeally need a way to search based on crash address.) The signatures for these are DestroyScript and js_TraceScript (both of which inline CheckScript).

The first two are pretty useful. We crash inside DestroyScript, which is called from the GC while finalizing a JSFunction. The script is entirely poisoned, which means that we already freed it at some prior time.

The third one is stranger. We crash in js_TraceScript, while being called from UnmarkGrayChildren. The first 20 bytes or so (up to JSScript::nfixed) seem to be garbage, while the rest of the script looks valid. So it would not appear to be a freed script, since it would have been poisoned. This appears to be a more generic case of memory corruption.

I think it makes sense to focus on the DestroyScript crashes, where it's fairly clear what's going on. I'll try to come up with some more diagnostics for this.
Some new info related to comment 8. First, fixing bug 672436 eliminated all but one of the cases described there. The only thing left is the case where we crash trying to DestroyScript a script containing 0xda, with fun_finalize as the caller. That's the case that sounded like a double-free on scripts.

Bug 673625 has landed; it added ownership assertions for scripts, designed to track down the double free. However, these assertions haven't been as helpful as I hoped: we never trigger any of them, and we're still crashing in DestroyScript with a 0xda script.

How could this happen? Well, the sequence of events must be something like this:
  1. Script is created and its owner is unset.
  2. Later, someone destroys the script for some reason and it gets poisoned.
  3. We GC a function object that points to the already destroyed script.

Between steps 2 and 3, we cannot have set the owner to anything. If we tried to do so, it would look like the existing owner was 0xdadadada, and we're asserting that the owner is unset when we set it.

The DestroyScript call in step 2 asserts that the owner is correct for the given DestroyScript caller. So if we give the script an owner between steps 1 and 2, it must agree with the first DestroyScript caller. And we're never allowed to change the owner once we've set it. We clearly created a function object and attached a script to it, or else step 3 wouldn't happen. And doing so *should* cause us to set an owner for that script. So there are three possibilities:
  1. We GC the function object after we attach the script but before we set the owner. This seems pretty unlikely. I'll tighten up the code a little to make absolutely sure this isn't happening.
  2. We do set the owner to the function object and then we GC it twice. This seems almost impossible though. The first time we GC the function object, we would poison it. So even if we GC it again, the script pointer would be invalid the second time.
  3. Somehow I missed a path that creates a function and attaches a script to it. But that should cause assertions in DestroyScript because the owner isn't properly set. I'm not seeing those.

There is another possibility. Maybe the "script" that DestroyScript is called on in step 3 is actually just a pointer into dead GC memory. I stupidly used the same 0xda value to poison both scripts and GC things.

I'll file a new instrumentation bug to try and see what's going on here. This is getting a little frustrating, but I think it would be foolish to give up now.
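For reference, the ownership scheme discussed above boils down to two assertions plus the poisoning. The sketch below shows its shape with hypothetical names; the real checks live in SpiderMonkey and look different.

  // Hypothetical sketch of the script-ownership assertions described above.
  #include <cassert>
  #include <cstring>

  struct Script {
      void* owner;   // null until someone claims the script
      // ... the rest of the script ...
  };

  // Attaching an owner is only legal if no owner has been set yet.  A script
  // that was already destroyed (and poisoned) would show an "owner" of
  // 0xdadadada here and fail the assertion.
  void SetScriptOwner(Script* script, void* owner) {
      assert(script->owner == nullptr);
      script->owner = owner;
  }

  // At destroy time the caller states who it believes owns the script, and we
  // require that this matches what was recorded (or that no owner was set).
  void DestroyScript(Script* script, void* caller) {
      assert(script->owner == nullptr || script->owner == caller);
      std::memset(script, 0xda, sizeof(Script));   // poison; real code also frees
  }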

I've also found some other crashes of interest. They're crashes inside DestroyScript, where we seemingly have a good script (i.e., the cookies are valid) but we die while trying to access memory that should be within the bounds of the script. The signatures for this are memset and JSScript::numNotes. I haven't been able to come up with a plausible hypothesis for why this might happen.
This comment is about compartment assertions:

I've seen a few assertions where we trace through a function object that points to a script in a different compartment. Here's one example:
  7282c546-4eff-4623-9345-e0f182110728

One really weird thing that happened is that the main compartment assertions that were occurring on proxy_TraceObject stopped happening after the 20110726030825 build (i.e., this is the last build where I've seen them). There's nothing I know of that should have fixed them, so this is pretty weird.
Regarding the "really weird thing" in comment 10:
I actually went and found the exact set of things that landed between the 7/26 and 7/27 nightlies.

  http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=58c04967ac5b&tochange=e4c8a1e7b373

It turns out that Blake's patch in bug 673503 landed during that time. This was almost certainly the fix for these assertions. When I looked this up before, I had been misreading the regression range. It helps to be careful, I guess.
I looked over some of the crashes a bit more this morning and found a few things.

1. I saw a few cases where we have a compartment mismatch between a function and the script that it points to.

2. A surprising number of the PushMarkStack crashes occur when loading from addresses that look sort of like 0xdadadada. One example is 0xdadfdb68. There are enough that this is more than random chance. So it looks like we're tracing through memory that was GCed previously. Compartment mismatches are a possible cause.

I haven't actually enabled full compartment assertion checking in opt builds yet. That will have a more significant performance impact than what we've seen so far, but I think it would pay off. I'll put something together for this.
I looked at one crash that has the new script instrumentation. The instrumentation poisons the script so that the first 4 bytes tell you who is responsible for destroying it and the remaining bytes are 0xdb.

When we crash in fun_finalize, we get a script pointer that points to memory containing nothing but 0xda. This could be caused by two things:

1. The script pointer for the function points into the middle of a script. This seems somewhat unlikely.
2. When we do the poisoning, we're writing too far and writing over another script. This seems possible, since I've seen crashes that occur while doing the memset.

Maybe it makes sense to only poison the first sizeof(JSScript) bytes of the script. I can also add a field that contains the allocated length of the script and add frequent assertions that this length is equal to JSScript::totalSize().
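Something along the lines of the sketch below is what that could look like; the allocSize field, the names, and the trivial stand-in for totalSize() are all hypothetical:

  // Hypothetical sketch: poison only the header, and cross-check the size.
  #include <cassert>
  #include <cstdint>
  #include <cstring>

  struct FakeScript {
      uint32_t whoDestroyed;   // first word records who destroyed the script
      uint32_t allocSize;      // recorded once, at allocation time
      // ... the fields that the real JSScript::totalSize() walks ...

      // Stand-in: the real totalSize() recomputes the size by traversing the
      // script's variable-length data, which is exactly what can go wrong.
      uint32_t totalSize() const { return allocSize; }
  };

  // Cheap assertion to sprinkle around: if the traversal inside totalSize()
  // has gone bad, the recomputed size no longer matches the recorded one.
  inline void CheckScriptSize(const FakeScript* script) {
      assert(script->allocSize == script->totalSize());
  }

  // Poison only sizeof(FakeScript) bytes, so a bogus computed length can no
  // longer make the memset run past the allocation into a neighboring script.
  inline void PoisonScript(FakeScript* script, uint32_t whoDestroyed) {
      std::memset(script, 0xdb, sizeof(FakeScript));
      script->whoDestroyed = whoDestroyed;
  }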
Crash Signature: [@ CrashInJS ]
More on comment 12:

I looked back over some of the script compartment assertions. I think I was wrong in how I characterized them. I thought they were caused by the script and the function belonging to different compartments. However, there are actually two assertions:
  1. We first assert that script->compartment == fun->compartment()
  2. Then we assert that script->compartment == rt->gcCheckCompartment
It's the second one that's failing. That suggests that we're
Oops, I didn't mean for that to be a cliffhanger.

That suggests that we're tracing through a function that's in the wrong compartment, but we don't realize the problem until we get to the script.

Turning on a more complete set of compartment checks in the GC will solve this problem. I think I'll wait until the next release train to do that.
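In sketch form, the two checks described above run in this order while tracing a function. The types below are simplified stand-ins, but the member names mirror the ones in the comment:

  // Simplified sketch of the two compartment assertions described above.
  #include <cassert>

  struct Compartment {};
  struct Runtime  { Compartment* gcCheckCompartment; };  // set for per-compartment GCs
  struct Script   { Compartment* compartment; };
  struct Function { Compartment* compartment; Script* script; };

  void TraceFunctionScript(Runtime* rt, Function* fun) {
      Script* script = fun->script;

      // 1. The script must live in the same compartment as its function.
      //    This check passes in the crashes being discussed.
      assert(script->compartment == fun->compartment);

      // 2. During a single-compartment GC, everything traced must belong to
      //    the compartment being collected.  This is the assertion that fires:
      //    the *function* was already in the wrong compartment, and we only
      //    notice once we reach its script.
      if (rt->gcCheckCompartment)
          assert(script->compartment == rt->gcCheckCompartment);
  }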
I filed bug 681071 about the compartment problems.

Some interesting things happened related to scripts. Previously, script poisoning overwrote the entire script, as follows:
  memset(script, 0xda, script->totalSize());

However, I noticed that this was causing a lot of weird things to happen. We were crashing inside the memset as well as when computing totalSize. The problems seem to come from the fact that we're computing totalSize() by iterating over the source notes in the script, and somehow that traversal can go wrong.

So I recently changed the memset to the following:
  memset(script, 0xda, sizeof(JSScript));

This made most (maybe all) of the script cookie assertions disappear. There are still some crashes when dealing with source notes while computing script size for about:memory.

So I think we need some new assertions to track down this problem.
I looked at some recent minidumps today (the ones that have come in since bug 679593 landed). I found:

* Two script cookie assertions. I wasn't able to get data out of the scripts, so it's hard to know what's happening here.

* One assertion from within fun_trace where the script's owner is not the function being traced.

* One assertion from fun_trace where the script's owner is correct (i.e., == to the function being traced) but the compartment is different from the function's compartment.
I was told this would be the right place for this test. It triggers this assert:

Assertion failure: false, at jsscript.cpp:302

I tested this on mozilla-central revision a351ae35f2c4 (with the shell build fix from mozilla-inbound rev fff3dc9478ce), 32-bit debug build. For exact instructions on running it, see the README inside the tarball.

Note that the original assertion was different:

Assertion failure: allocated(), at ../../jsgc.h:740

If you think that is a different bug, let me know.
Thanks so much, Christian! It sounds like you found something serious. Could you file it as a separate bug? I would like to keep this one open for crash analysis, even if the script thing gets fixed.
Depends on: 684682
(In reply to Bill McCloskey (:billm) from comment #21)
> Thanks so much, Christian! It sounds like you found something serious. Could
> you file it as a separate bug? I would like to keep this one open for crash
> analysis, even if the script thing gets fixed.

Sure, opened bug 684682 :)
Shouldn't the "CrashInJS" signature be limited to Nightly or did I misremember?
It now happens a whole lot in Aurora as https://crash-stats.mozilla.com/report/list?signature=CrashInJS shows and I'm even seeing crashes in 8.0 in the list.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #23)
> Shouldn't the "CrashInJS" signature be limited to Nightly or did I
> misremember?
> It now happens a whole lot in Aurora as
> https://crash-stats.mozilla.com/report/list?signature=CrashInJS shows and
> I'm even seeing crashes in 8.0 in the list.

Is there a way that we can categorize CrashInJS bugs based on the function one below it in the stack trace? That would be very useful.

In the crashes you linked, that's a different assertion Luke added to Aurora. Luke, these look like AbortIfWrongThread crashes. The stack trace in the Windows crashes isn't very helpful, but the Linux crashes seem to have a good stack trace.

The ones in 8.0 may be some existing assertions that we've had for a long time from the tracejit.
I'll try to put some extra JS fuzzing effort into Aurora 32-bit opt builds and see if I can find a test there.
(In reply to Christian Holler (:decoder) from comment #25)
> I'll try to put some extra JS fuzzing effort into Aurora 32-bit opt builds
> and see if I can find a test there.

The crashes in comment 23 are browser-only, so I don't think that fuzzing will help.
(In reply to Bill McCloskey (:billm) from comment #24)
> Is there a way that we can categorize CrashInJS bugs based on the function
> one below it in the stack trace? That would be very useful.

That's easy. We just need to file a Socorro bug to add it to the prefix skiplist, and then the signatures will be "CrashInJS | foo".
Depends on: 715713
It's the #8 top crasher in 10.0b2.
Unassigning myself from old bugs. Don't worry, not quitting.
Assignee: wmccloskey → nobody
Is this work concluded, and can this bug be closed? (ditto bug 663691)

All the blocking bugs here are closed.
Summary: GC minidump findings → GC minidump findings/analysis
Closing because no crash has been reported in the last 12 weeks.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX