Closed Bug 653330 Opened 13 years ago Closed 13 years ago

Sudden browser hang, usually inside of [@ js_TraceWatchPoints]

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

RESOLVED DUPLICATE of bug 653309

People

(Reporter: dholbert, Unassigned)

Details

(Keywords: hang, regression)

I've had two sudden browser-hangs today, when I wasn't doing anything.  In both cases, I allowed Firefox at least a minute to recover, and then killed it when it didn't.

The first hang was on my desktop at work, with lots of tabs open.
The second was on my desktop at home, when I wasn't around -- I just came home to an unresponsive browser. (It was fine when I left it this morning, with just 2 tabs open -- http://planet.mozilla.org/ and http://www.w3.org/TR/SVG/paths.html -- but when I returned home in the evening, it was unresponsive.)

I killed both with "kill -11" to interrupt the hang w/ stack traces. The results were:
 (work) bp-17b66c7e-4417-4742-9bc0-747842110427
 (home) bp-acc92db0-5ee5-43c8-8f51-267002110427
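
For readers who want to script the same trick: "kill -11" sends signal 11, which is SIGSEGV on Linux, so the process dies the way a real crash would instead of exiting quietly. The sketch below is only an illustration, not part of the original report. The sleeping child process is a stand-in for the hung browser, and the helper name interrupt_with_sigsegv is made up for this example.

```python
import signal
import subprocess
import sys


def interrupt_with_sigsegv(proc):
    """Send SIGSEGV to a child process and return its exit status.

    On POSIX, subprocess reports "terminated by signal N" as return code -N.
    """
    proc.send_signal(signal.SIGSEGV)
    return proc.wait(timeout=10)


if __name__ == "__main__":
    # Hypothetical stand-in for a hung browser: a child that only sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    status = interrupt_with_sigsegv(child)
    print("terminated by SIGSEGV:", status == -signal.SIGSEGV)
```

A plain "kill [pid]" sends SIGTERM instead, which is a normal termination request rather than a crash signal.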

Both stacks were inside of js_GC calling js_TraceWatchPoints, so I'm assuming they're related.

I haven't noticed this happening before, and then it happened twice today, so I'm assuming this is a recent regression.  The pushlog for today's nightly includes a fairly sizeable TraceMonkey merge, so I suspect that might be related:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=c5e8cc100248&tochange=c833fb1623ca

Mozilla/5.0 (X11; Linux i686; rv:6.0a1) Gecko/20110427 Firefox/6.0a1
Summary: Unexpected browser hang in [@ js_TraceWatchPoints ] → Sudden browser hang in [@ js_TraceWatchPoints ]
Note that this is a hang, so it probably won't cause any noticeable blips on crash-stats: the standard methods of killing a frozen browser (ctrl+c, gnome's force quit UI, "kill [pid]") don't generate crash reports, at least on Linux.
I just hit this 3 more times in the last 10 minutes.
bp-b9fc0755-c188-422f-9628-f52412110427 
 (^ js_GC calling js::GCMarker::drainMarkStack )

bp-9a5fba47-e36b-4aba-8edb-da3c02110427
 (^ js_GC calling js_TraceWatchPoints calling js::gc::MarkObject)

bp-0ea796cd-1ba3-4573-af0a-743992110427
 (^ js_GC calling js_TraceWatchPoints)
Summary: Sudden browser hang in [@ js_TraceWatchPoints ] → Sudden browser hang, usually inside of [@ js_TraceWatchPoints]
Hardware: x86 → All
It's definitely hanging when marking the weak stuff. I have two theories about what this might be:

1. The new WeakMap uses IsAboutToBeFinalized to test if the thing is already marked. If it's a static string, this will return false. Since we can't mark it, we'll loop around forever.

2. Stuff that crosses compartments could also be a problem during compartmental GCs. I'm assuming weak maps don't do this. However, the watch point list looks like it's global across compartments. I think we could loop forever in this case as well.

I'll try to come up with a test case tomorrow.
Unfortunately, neither of these ideas panned out.

It seems like this might be caused by some sort of compartment mismatch. The most likely thing I can think of is that there's a watchpoint whose closure parameter is from the wrong compartment.
(In reply to comment #2)
> bp-9a5fba47-e36b-4aba-8edb-da3c02110427
>  (^ js_GC calling js_TraceWatchPoints calling js::gc::MarkObject)

FWIW, I did a crash-stats search on this crash signature for the last 4 weeks, and it only came up with 3 hits -- my crash yesterday, and 2 others today:
  bp-173247e6-3cf2-4a75-9e6c-408942110428
  bp-840c76a0-e77a-4520-b272-033fe2110428
The 2 new ones are from different machines (neither of which is mine), and their stacks match each other exactly, though they don't quite match mine.  In all three cases, the build ID is 20110427030633.
I'm getting this on my machine too, running Mac OS X 10.6.7.

I got it to crash trying to view an article on Huffington Post 3 times in a row, and on trying to view the article the 4th time it worked.

After force-quitting firefox I got this in the report (the one you have an option to send to Apple):
  Thread 900        DispatchQueue 100
  User stack:
    3 ??? [0x36218b80]
      1 js_TraceWatchPoints + 310 (in XUL) [0x11a3016]
      1 js::gc::MarkShape(JSTracer*, js::Shape const*, char const*) + 44 (in XUL) [0x11e0ddc]
      1 js_TraceWatchPoints + 228 (in XUL) [0x11a2fc4]
    2 ??? [0x267668c0]
      1 js::GCMarker::drainMarkStack() + 26 (in XUL) [0x11e0f5a]
      1 js::GCMarker::drainMarkStack() + 15 (in XUL) [0x11e0f4f]
    2 ??? [0x379a91f0]
      1 js_TraceWatchPoints + 238 (in XUL) [0x11a2fce]
      1 js_TraceWatchPoints + 301 (in XUL) [0x11a300d]
    1 ??? [0x2c2b1ea0]
      1 js_TraceWatchPoints + 267 (in XUL) [0x11a2feb]
    1 js_TraceWatchPoints + 272 (in XUL) [0x11a2ff0]
  Kernel stack:
    9 hndl_allintrs + 242 [0xffffff80002e2262]
      9 interrupt + 153 [0xffffff80002cf259]
        9 lapic_interrupt + 180 [0xffffff80002d5c30]
          9 mp_kdp_exit + 740 [0xffffff80002d6878]
            9 sync_iss_to_iks + 192 [0xffffff80002cefc0]


This first appeared after the most recent tracemonkey merge mentioned in comment 0.
(In reply to comment #6)
> I got it to crash trying to view an article on Huffington Post 3 times in a
> row, and on trying to view the article the 4th time it worked.

(Just for clarity's sake -- I'm assuming you meant "hang", not "crash", given that you mentioned having to force-quit Firefox.)
(In reply to comment #7)
> (In reply to comment #6)
> > I got it to crash trying to view an article on Huffington Post 3 times in a
> > row, and on trying to view the article the 4th time it worked.
> 
> (Just for clarity's sake -- I'm assuming you meant "hang", not "crash", given
> that you mentioned having to force-quit Firefox.)

Yes, my mistake, I meant hang.
Just got a report from highlight_me on IRC that sounded like this bug. (As with my home-desktop experience in Comment 0, he left Firefox with 3 uncomplicated tabs unattended for an extended period of time, and when he came back, it was hanging.)  This was with today's nightly build.

It looks like there's no easy way to forcibly generate a crash report on Windows XP (his platform), so I'm not sure there's a way for him to submit a crash report if it happens again, but I'm assuming it's this bug.

Updating platform to 'All', as we've now had reports on Linux, Mac, & WinXP.
OS: Linux → All
A patch in bug 637985 replaces all this code, fwiw. (But the current code looks correct to me, and I wrote the patch, so who knows.)

dholbert, can you reproduce this reliably?
(In reply to comment #10)
> dholbert, can you reproduce this reliably?

No, sadly.

But I did hit a sudden hang yesterday that looked a lot like this bug. (I kill -11'd it, but sadly the crash report wasn't helpful, due to bug 654595)
Actually, I can reproduce what _looks_ like this bug by visiting...
http://www.computerworld.com/s/article/9216294/Mozilla_patches_Firefox_4_fixes_programming_bungle
...and clicking "next page", in my normal browsing profile. (no issues with a fresh profile.)  I got this backtrace: bp-3a95b084-7d07-423b-9622-ee8d62110504

Last week, I'd hit a JS assertion at that page, when running a debug build. At that point, billm did a little inspection and told me that it was a (different) known bug.  Maybe the opt-build hang is related to that bug?
My current hypothesis is that this is some sort of compartment mismatch. If you call a mark function during a per-compartment GC, the mark bit won't be set if the object is in a compartment that's different from the one being collected.

This debugger code could get messed up here if any of its watchpoints point across compartments. It would repeatedly set the modified flag and try to mark the object, but it wouldn't get marked. So the same thing would happen the next time around.
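
To make the hypothesis above concrete, here is a toy model of the loop. It is a sketch under stated assumptions, not the actual SpiderMonkey code: Cell, WatchPoint, mark, trace_watchpoints and mark_weak_to_fixpoint are all hypothetical names, and the max_passes cap exists only so the demonstration terminates (the loop being modeled has no such cap).

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Cell:
    """A GC thing living in some compartment (hypothetical model type)."""
    compartment: str
    marked: bool = False


@dataclass(eq=False)
class WatchPoint:
    """A watchpoint on obj that keeps shape and closure alive."""
    obj: Cell
    shape: Cell
    closure: Cell


def mark(cell, collecting):
    """Per-compartment mark: a no-op for cells outside the collected compartment."""
    if cell.compartment == collecting:
        cell.marked = True


def trace_watchpoints(watchpoints, collecting):
    """One pass over the list. Returns True ("modified") if it saw unmarked cells."""
    modified = False
    for wp in watchpoints:
        if not wp.obj.marked:
            continue  # watched object is dead so far; nothing to keep alive
        for cell in (wp.shape, wp.closure):
            if not cell.marked:
                modified = True
                mark(cell, collecting)  # may silently do nothing
    return modified


def mark_weak_to_fixpoint(watchpoints, collecting, max_passes=1000):
    """Repeat passes until nothing changes.

    Returns the number of passes, or None if the cap was hit, which stands
    in for the hang.
    """
    for passes in range(1, max_passes + 1):
        if not trace_watchpoints(watchpoints, collecting):
            return passes
    return None


if __name__ == "__main__":
    watched = Cell("A", marked=True)
    same = [WatchPoint(watched, Cell("A"), Cell("A"))]
    cross = [WatchPoint(watched, Cell("A"), Cell("B"))]
    print(mark_weak_to_fixpoint(same, "A"))   # converges
    print(mark_weak_to_fixpoint(cross, "A"))  # hits the cap
```

In the cross-compartment case every pass reports "modified" because the closure is still unmarked, yet no pass can ever set its mark bit, so the fixpoint is never reached.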

I pushed some code to the tryserver that puts in some additional compartment assertions. It also has some fixup code that should avoid the hang. Could you try it out, dholbert?
  http://tbpl.mozilla.org/?tree=Try&rev=3034e0acebf5
The assertions are enabled in opt builds, so those are okay to use.
I tried the x86_64 opt version of Bill's TryServer build w/ my normal profile
> Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110503 Firefox/6.0a1
> Built from http://hg.mozilla.org/try/rev/3034e0acebf5
...and comment 12 would *not* hang for me, after repeated tries.  (I didn't get any assertions or warnings printed out either, though -- not sure if you were expecting any output)

Then I tried again with my nightly, just to be sure, and it hung the first time I tried comment 12.

So, with my normal browsing profile, nightly builds have a 100% hang-rate for me on comment 12, whereas Bill's TryServer build has a 0% hang rate.
I also tried the debug x86_64 version of Bill's TryServer build, for good measure, and it also performed great (no hang).  Also, I saw no JS asserts or warning output at the point where the nightly build would've been hanging.
I made a new build with some more fatal assertions. Hopefully one of these will trigger and we can narrow down what's happening.
  http://tbpl.mozilla.org/?tree=Try&rev=767020f67bfe
Performing Comment 12 on Bill's new x86_64 build gives me an abort, with this output:
Assertion failure: wp->closure->isMarked(), at /builds/slave/try-lnx64/build/js/src/jsdbgapi.cpp:713
and this crash report:  bp-7b1924da-7de8-438b-b40b-51f242110504
(no symbols in crash report, so hopefully the assertion is enough?)
(the lack-of-symbols might be from bug 654595)
Here's another build. This one includes some printfs right before the assertion fails. Could you try this and include the output?
  http://tbpl.mozilla.org/?tree=Try&rev=1696caaa85dc

I still don't really know what's going on. I never expected that particular assertion to fail. If the new info isn't helpful, maybe we can sit down tomorrow with a debugger. This is pretty frustrating.
Hm, so I tried 4 times with the new build, and each time I just got this output:
> Assertion failure: wp->shape->isMarked(), at /builds/slave/try-lnx64/build/js/src/jsdbgapi.cpp:701

This is a different instance of the same assertion. (but before billm's printfs, sadly)

I'll try a few more times, and I'll report back if I hit the post-printf one from comment 17 again.
(4 more tries, still no instances of the later abort)

I also tried another 4 times with the build I was using in Comment 17, and *that* one aborted at the earlier assertion 2x and the later assertion 2x.  I just can't seem to hit the later assertion with the newer printf-enabled build, though.
Weird. Just in case, could you try again and check the log? There's a very slight chance that you'll still get some printf data (starting with the line "ASSERTION INFO") but it won't actually assert there.
Tried again, same result. (I'm using the opt builds, so there's zero output, aside from 3 lines from an addon and then the assertion failure.)
Ok, I can reproduce the ComputerWorld hang from comment 12, in a fresh profile, with these steps:
 1. Visit http://noscript.net/getit & install noscript "development build" (scroll down a little to find it) (probably affects normal build, too)
 2. Restart to complete installation.
 3. Visit computerworld URL from comment 12.
 4. In noscript notification bar, allow scripts from computerworld.com & facebook.com
 5. Click "Next page" at the bottom of the article body.
 6. if you're still alive, click "previous page"

I'm just including the above steps for posterity, 'cause I think billm has a possible fix for this (woot!), from talking to him on IRC.
This turns out to be an instance of bug 653309. That bug causes us to mark across compartments, even during a per-compartment GC. So if we're GCing compartment A, we can end up marking an object X in compartment B. The problem happens if X has a watchpoint on it. Then we'll see that X is marked, so we'll try to mark its watchpoint's shape, closure, etc. But we won't actually set the mark bit on them, because they're in the wrong compartment. We will set the modified flag. This is what causes the infinite loop.

I've reproduced the bug using Daniel's STR in comment 23. With Blake's patch applied, the problem goes away.
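
For posterity, the precondition described above can be sketched as a small graph model. This is an illustration only, not engine code: Obj, mark_reachable and stuck_watchpoints are hypothetical names, and the leak_across_compartments flag is just a stand-in for the behavior attributed to bug 653309.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Obj:
    """A GC thing with outgoing edges (hypothetical model type)."""
    name: str
    compartment: str
    edges: list = field(default_factory=list)
    marked: bool = False


def mark_reachable(root, collecting, leak_across_compartments):
    """Mark everything reachable from root during a GC of `collecting`.

    With leak_across_compartments=True, objects outside the collected
    compartment get marked too (the assumed effect of bug 653309).
    """
    stack = [root]
    while stack:
        obj = stack.pop()
        if obj.marked:
            continue
        if obj.compartment != collecting and not leak_across_compartments:
            continue  # correct behavior: stay inside the collected compartment
        obj.marked = True
        stack.extend(obj.edges)


def stuck_watchpoints(watchpoints, collecting):
    """Return watchpoints that would keep the weak-marking loop spinning.

    Each watchpoint is an (obj, shape, closure) tuple. It is "stuck" when
    obj is marked but shape or closure is unmarked and lives outside the
    collected compartment, so the per-compartment mark can never set its bit.
    """
    return [
        wp for wp in watchpoints
        if wp[0].marked
        and any(c.compartment != collecting and not c.marked for c in wp[1:])
    ]


if __name__ == "__main__":
    root = Obj("root", "A")
    x = Obj("X", "B")
    root.edges.append(x)  # cross-compartment edge from A into B
    wps = [(x, Obj("shape", "B"), Obj("closure", "B"))]
    mark_reachable(root, "A", leak_across_compartments=True)
    print([wp[0].name for wp in stuck_watchpoints(wps, "A")])
```

When the traversal stays inside compartment A, X is never marked, its watchpoint is skipped, and nothing is left stuck, which matches the observation that the hang disappears once the cross-compartment marking is fixed.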
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → DUPLICATE