Closed Bug 1154263 Opened 5 years ago Closed 3 years ago

Intermittent indefinite hangs that use no CPU in Nightly


(Core :: XPCOM, defect)

Windows 7
Not set



Tracking Status
firefox40 --- affected


(Reporter: ehoogeveen, Unassigned)




(2 files)

Since yesterday, along with bug 1154080, I started seeing occasional indefinite hangs when launching Nightly or sometimes when visiting a page. Firefox does not use any CPU while it is frozen, so it seems as though the process is waiting for something. I've seen these both with e10s enabled and disabled, although I usually have it turned off for add-on compatibility reasons.

I tried sampling the stack with MSVC 2013 while the process was stuck, and the main thread always seems to be trying to enter a critical section from RegisterExecutableMemory. I'll attach the stack for each thread (though most are likely not interesting) and a minidump.

Unfortunately I have a hard time reproducing this hang, so I'm not sure if I'll be able to get a better regression range than 'in the past few days'. I initially got this hang while trying to track down bug 1154080, but I get it fairly frequently while loading up my profile. I think I also got this hang in an inbound build with bug 1049290 backed out, so I don't think it's responsible (but I don't have symbols for the tinderbox build, so I didn't look too deep).
Attached file Minidump
This is a minidump saved with Visual Studio 2013. It applies to the win64 2015-04-13 Nightly based on revision 2c9708e6b54d.
One tidbit that might be relevant: I don't know if I saw this before installing the Gecko Profiler to debug bug 1154080. Unfortunately it's intermittent enough that it might be hard to confirm, but it might be profiler-related.
I just got sick of all the random hangs in the nightly and have reverted back to Firefox Beta. I feel sad!
Ajay, the latest Nightly should fix the Adblock Plus-related hangs, which were temporary but made the browser very slow.

I haven't seen this hang since disabling the Gecko Profiler, so I think it might be implicated. Unfortunately that means I don't have a regression range - and since I can't reproduce consistently, I don't know that I can get one! Leaving this in the JIT component for now because of the stack.
I just installed the 4/15 nightly and things are back to normal. The old issues of slow response and hanging tabs seem to have been fixed. Woohoo! Thanks all.
What's the status of this issue? The stack doesn't seem to indicate anything specifically profiler-related.

Are you still seeing 0-cpu hangs when using the profiler in nightly?

(Just for reference, I run with the Gecko Profiler on 100% of the time on Nightly on my Windows 8 laptop, and I haven't noticed this.) The bug indicates x86 and Win7, which shouldn't be too far off from my setup.
Sorry, I missed this. I'm using the 64-bit Nightly, which might be relevant. shu noted on IRC that NS_StackWalk appears to be waiting on a lock, and I know that 64-bit stack walking on Windows was changed in bug 1088343 and bug 1123533, though the latter looks harmless enough.

I haven't had time to try and bisect this, but I'll see if I can still reproduce. I haven't had these hangs since disabling the profiler add-on though.
I can confirm that the hangs still happen. I tried to bisect them, and was able to reproduce as far back as the 2015-02-28 Nightly, but got the following bogus range on inbound:
(that is, I was able to reproduce with 495753e0d44f, but obviously the start of the range is incorrect)

So this dates back to *at least* February 23, but that's not super helpful. Unfortunately this is really intermittent, so all I can do is keep trying builds and hope they fail.
OK, I found a set of pages (lots of pinned gmail tabs) that helped me reproduce this more consistently, and narrowed the range down to:

Unfortunately the 2015-01-10 through 2015-01-13 Nightlies crash on startup, so that's the best I can do, but this range does contain bug 1088343, which seems like the most likely culprit.

njn, any idea? (for when you're back)
Blocks: 1088343
Component: JavaScript Engine: JIT → XPCOM
Flags: needinfo?(n.nethercote)
Ah, I see what you mean about how it does play if you drag it onto Firefox. So perhaps there's a race condition with accessing the file metadata, or something?
Ack, wrong bug.
> njn, any idea? (for when you're back)

It's very plausible that bug 1088343 is relevant. Before that bug's patches landed, stack walking totally didn't work on Win64, so all the stacks obtained by the profiler were junk; those patches changed our Win64 code to use a different system function (RtlVirtualUnwind()) to obtain stack traces.

Unfortunately, I don't know much about RtlVirtualUnwind(), and I was only able to get it working by cribbing from other examples. There are no critical sections involved with the RtlVirtualUnwind() call, though there are with the Win32 alternative (StackWalk64()); I don't know if that's relevant.

Since the JITs are involved, Luke *might* have an idea -- Luke, this looks like a bad interaction between the profiler's stack walking and the JIT's registration of executable code...
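One way to picture the suspected interaction is a lock-ordering deadlock, where each thread holds one lock and waits for the other. The sketch below is a toy model in Python, not Gecko code. The two locks and their names (`function_table_lock`, `sampler_lock`) are hypothetical, and nothing in this bug confirms which locks are actually involved. Timeouts are used so the demo terminates instead of hanging.

```python
import threading


def run_lock_order_demo(timeout=0.2):
    """Two threads take two locks in opposite order; report who got the second lock."""
    # Hypothetical locks: stand-ins for whatever the JIT registration path
    # and the profiler's stack walker each hold. Names are assumptions.
    function_table_lock = threading.Lock()
    sampler_lock = threading.Lock()
    barrier = threading.Barrier(2)
    results = {}

    def worker(name, first, second):
        with first:
            barrier.wait()  # both threads now hold their first lock
            got_second = second.acquire(timeout=timeout)
            results[name] = got_second
            if got_second:
                second.release()
            barrier.wait()  # keep holding `first` until both attempts are done

    threads = [
        threading.Thread(target=worker,
                         args=("main", function_table_lock, sampler_lock)),
        threading.Thread(target=worker,
                         args=("sampler", sampler_lock, function_table_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Without the timeouts, both threads would block forever while consuming no CPU, which matches the symptom reported here. Whether this is the actual mechanism remains an open question in this bug.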
Flags: needinfo?(n.nethercote) → needinfo?(luke)
Ah, so this sounds like the "JIT Exception Filter" I had to add so that breakpad works on Win64.  The situation is described here:
So I'm guessing that, by using RtlVirtualUnwind(), you're causing us to call the unwind handler we registered for the JIT code here:
which calls a hook:
which calls the unhandled exception filter:
which I guess causes the hang (I would have expected a crash report).

FWIW, Chrome does the same thing (bug 844196 comment 98). IIUC, IE preserves fp in JIT code, which means that they actually *can* do real unwinding for JIT code (see Yuhong Bao's comments in bug 844196). x64 has enough regs that we could probably do that without a huge perf hit, but it's not exactly a tiny change.
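For readers unfamiliar with why preserving fp makes real unwinding possible: if every frame saves the caller's frame pointer, a walker can follow that chain without any unwind metadata or handlers. The following is a minimal sketch over simulated memory, assuming the common x86-64 layout where `[fp]` holds the caller's saved frame pointer and `[fp + 8]` holds the return address. The function name and the `memory` dictionary are illustrative only and do not correspond to anything in Gecko or SpiderMonkey.

```python
def walk_frame_pointers(memory, fp, max_frames=64):
    """Follow the saved-frame-pointer chain and collect return addresses.

    `memory` maps an address to the 8-byte value stored there (a toy model).
    """
    return_addresses = []
    seen = set()
    while fp != 0 and fp not in seen and len(return_addresses) < max_frames:
        seen.add(fp)  # guard against a corrupted, cyclic chain
        saved_fp = memory.get(fp)
        return_addr = memory.get(fp + 8)
        if saved_fp is None or return_addr is None:
            break  # unreadable memory: stop rather than guess
        return_addresses.append(return_addr)
        fp = saved_fp
    return return_addresses
```

With three frames chained as `0x1000 -> 0x2000 -> 0x3000 -> 0`, the walker returns the three stored return addresses in innermost-first order. Code that does not maintain the frame pointer breaks this chain, which is why such frames need separate unwind information.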
Flags: needinfo?(luke)
Do we have a path forward here? Can we figure out what lock is being accessed and do something about the reentrancy? This bug means I can't leave the profiler enabled on the win64 Nightly during normal use (which would be useful for diagnosing random hangs).
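As an illustration of what "doing something about the reentrancy" could look like, here is a small sketch of a re-entrancy guard: a lookup that notices it is being called on a thread that already holds the lock and bails out instead of blocking. This assumes the hang is a same-thread re-entry on a non-reentrant lock, which this bug does not confirm. `GuardedRegistry`, `SKIPPED`, and the `on_registered` callback are all hypothetical names for this toy model.

```python
import threading

# Sentinel meaning "lookup was skipped to avoid re-entering the lock".
SKIPPED = object()


class GuardedRegistry:
    """Toy registry of code ranges guarded by a non-reentrant lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ranges = []
        self._state = threading.local()  # per-thread "I hold the lock" flag

    def register(self, start, end, on_registered=None):
        with self._lock:
            self._state.holding = True
            try:
                self._ranges.append((start, end))
                if on_registered is not None:
                    on_registered()  # may call lookup() on this same thread
            finally:
                self._state.holding = False

    def lookup(self, address):
        if getattr(self._state, "holding", False):
            return SKIPPED  # re-entrant call: bail out instead of deadlocking
        with self._lock:
            for start, end in self._ranges:
                if start <= address < end:
                    return (start, end)
        return None
```

The trade-off in this design is that a sample taken during registration is dropped rather than walked, which seems preferable to an indefinite hang. A guard like this only covers same-thread re-entry and would not help with a deadlock between two different threads.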
Closed: 3 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1263595