Closed Bug 429592 (hang-detector) Opened 16 years ago Closed 13 years ago

catch process hangs

Categories

(Toolkit :: Crash Reporting, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla11
Tracking Status
firefox11 + fixed

People

(Reporter: Dolske, Assigned: benjamin)

References

(Blocks 1 open bug)

Details

(Whiteboard: [Snappy:p1][qa-])

Attachments

(8 files)

Due to some recent change, Minefield has been hanging on me a few times daily. It doesn't seem to actually crash, but just pegs the CPU and is completely unresponsive.

It would be interesting to be able to catch and report this kind of problem, as we do with crashes. For example, have a watchdog timer fire if the application fails to service the event loop within 30 seconds.

A variation of this would be for general detection and reporting of when the application is responding sluggishly. For example, send a report if it took longer than 5 seconds to service the event loop (even if the browser recovered).

Hard hangs (30+ seconds) might be nicely handled for the user by killing and restarting the application, like a crash, while sluggish response is more of a metrics-collection thing.

What kind of data to report? A single stack is better than nothing, a few sequential stacks collected X ms apart would be better, and a full-fledged sampling run would be amazing.
personally i'd suggest borrowing the algorithm (or implementation) from windbg's !analyze -v -hang

:)
Our integration of Breakpad currently expects to write a dump and exit the app, so if you're expecting it to recover, that will need to be changed. We're also not equipped to do anything more than write a single dump and send it at the moment. We could quite easily provide a method on nsICrashReporter to write a dump with a bit of extra info saying that it's a suspected hang (although we'd probably need to fix bug 397199 to make the info useful).
sicking and damon were asking about the status of this at a recent crash kill.  it would help right now in particular when we seem to have a few OOPP hangs.
This bug is about arbitrary hangs, which are very difficult to catch. We are already detecting and aborting for hangs which involve IPC calls to a plugin (although currently we don't get any crash reports from it, it usually un-hangs the browser process). Bug 544936 tracks getting minidumps from both processes involved in an IPC hang.
Blocks: 686025
Summary: breakpad (or something like it) should catch process hangs → catch process hangs
Attached patch patchSplinter Review
Attached is a crude patch to kill the browser if we don't process an event within 30s (or well really 30s..60s, depending on what interval the hang starts). This isn't great and we should make this timer more aggressive over time, but for now it should avoid accidental browser killing due to **** plugins.

A browser with no windows open produces 87 events in 30s for me, so this should be pretty safe in all scenarios. Since we already wake up 87 times, this extra wakeup twice a minute is not a significant power issue.

This might mess with very very very long running JavaScript. If this turns into a problem in the wild, we can address it (since it will get us crash data!)
Attachment #562643 - Flags: review?(benjamin)
Plugin hangs can be up to 45s in our default config, so if blocking on a plugin counts here (and I think it would), then we probably need to do something different.

If we can manage to *not* count plugin-hangs, then I think 30s is right, or honestly lower if we can avoid hurting from the long JS scripts (I fear we can't :-( ).  At 30s I think most people will have force-killed the browser, but I guess we'll get some unattended ones or from more patient people.
This patch looks pretty sensible. We'll probably want to add an additional crashreporter annotation so we can display these differently in crash-stats. Currently the NS_DebugBreak code will add a freeform textual annotation that includes the string you passed to NS_RUNTIMEABORT, but that's a pain to deal with programmatically. We should do something like CrashReporter::AnnotateCrashReport("Hang", "1"); to indicate that this is a browser hang.
Assignee: nobody → gal
Comment on attachment 562643 [details] [diff] [review]
patch

So, will this be OK when Firefox is being debugged?  I imagine it's quite common for Firefox to not process events for 30s when it's being debugged, will this just cause the debug session to be aborted?

Also, as a nit, please use 0 to initialize sEventCounter instead of NULL.  :-)
Comment on attachment 562643 [details] [diff] [review]
patch

A. Do we really want to crash at this point? We could submit a "crash" report without actually crashing.

B. What happens if there are no events to process, e.g. the computer is sleeping/going to sleep or just unused, or in the background on Android?

C. Waking up every 30 seconds doesn't sound like a good idea in general on battery-powered devices.
Also, FWIW alex keybl is spinning up our winqual integration which gets reports every time windows and/or the user kills firefox for "not responding". Ccing him.
Not tracking for Firefox 8 or 9 given that this is not ready for check-in and is not specific to these releases.
clegneato, the blocked bug is a case where we hang and we seem to have no visibility into that at all (considering we didn't detect it). If we have such windows data, are we analyzing it?
#13.C: Read the comment above please. We wake up at minimum 87 times in 30s (empty browser, all windows closed). Waking up 88 times isn't going to kill us. We should in general reduce wakeup count, but this seems like a wakeup well spent.
Andreas, I spoke to Alex this morning and he is just starting to look into the winqual stuff right now. He will need a bit of time to figure out what info it gives us but he's on it.
I can't see an obvious way to do this which takes a sleeping laptop into account, and I'm pretty sure that this patch will kill any browser open while a laptop goes to sleep. Do we get reliable notifications for suspensions on any platform?

We should try the option of not force-killing the browser, but bringing up the crash reporter while the browser is still running. Getting the UI right might be tricky, though, since you really want to then give the user the option of killing Firefox from within the crashreporter window. We may need a quick-and-dirty solution for now and followup work for better UI.
It won't since we would have to sleep exactly after a watchdog tick with no events processed yet and then wake up just at the right time that the first thing that runs is the watchdog thread runs before gecko gets any events. This already doesn't happen in practice, but of course needs a bit more work to be more bullet proof. I can do 15s ticks and two subsequent periods must be without events processed. That way its impossible to hit the window (you would have to sleep twice with perfect timing).
Production Firefox 5, 6, and 7 have a bug that frequently hangs them for over 30 seconds (686025 / 679498) in many real life circumstances.  About the only thing more annoying than having to wait 30 seconds for your browser to wake up would be having it kill itself.

The detection and reporting part of this bug makes an enormous amount of sense; killing the process? Not so much in my opinion.
I wouldn't wait around for a 30s pause. Thats totally utterly unacceptable. I am surprised some users do. We can see if we can wire this into breakpad so that a report is sent but the process is allowed to continue. That would solve bsmedberg's worries too. Ted, any advise?
I often wait for more than 30s to have something unfreeze, often Nightly or SeaMonkey, but even the whole OS UI sometimes, when my harddisk is churning like wild and my CPU cores are mostly consumed with I/O wait.
From a product standpoint, a hang is worse than a crash -- assuming it's not going to recover, most users do not know how to kill the browser and some even resort to rebooting their computers to get back to a working state.  I'd like us to better understand any time we're experiencing hangs longer than 10 or 15 seconds and I'd like us to kill the process if the hang is a lot longer than that. I think a 30 second kill, maybe a 40 second kill, would be a fine place to start.
That won't work that well for hangs that grow longer slowly with time, so the user has time to get accustomed to them.  They'll be mad that the browser killed itself when they know to expect it will come back.
(In reply to MasterLeep from comment #25)
> That won't work that well for hangs that grow longer slowly with time, so
> the user has time to get accustomed to them.  They'll be mad that the
> browser killed itself when they know to expect it will come back.

I'm not quite sure what you're saying here. Users should not get accustomed to long hangs. We should fix long hangs or turn them into crashes so that users have an escape from the long hangs. Long hangs are not acceptable. Most users will not know how to deal with them and the few that are willing to ride them out are not the users we should be focusing on. We should be focused on the many who will not want to wait for the app to "come back".
My hypothesis, observed in my family, is that non technical users will ride out hangs if they grow in duration over time.  It is then an annoying browser behavior but a manageable one.  I am quite convinced that these users would prefer the browser to recover from the hang than to commit suicide.

Your proposal would make sense if these hangs are quite rare and users do not know how to deal with them.  I suspect that there are many more such hangs, and users accustomed to them, than you realize.
Asa, what about Ben's idea in comment 19? Basically having some sort of crash reporter-like UI pop up saying something like:

  We noticed that Firefox has gotten stuck. Would you like to restart it? And
  would you also like to report the problem so Mozilla can try to fix it for a
  future version of Firefox?

That should solve the "I don't know how to kill Firefox, so now I need to reboot" problem, but without creating a new "WTH! Sometimes I wait and it recovers, but other times I wait in is vain because it then just spontaneously dies anyway!" problem.
It looks like everyone is talking hypothesis here, including Asa's "assuming it's not going to recover" which might be a wrong assumption given all info I heard so far.
Can we at least add a patch that for now makes us *report* hangs to us even if it may not yet kill the process? Then we'd at least have some data on how common the problem is at all.
Reporting hangs without killing the process is going to be a lot more work. We'll need different UI for that. I have two suggestions:
1) Immediately write a dump and launch the existing crash reporter client (that you would get if your whole browser crashed), but modify the strings to indicate that the browser was hung, and not crashed. Offer a button to kill/restart the browser, or just send a report.

2) Immediately write a dump, queue a notification so that when the app recovers, we provide some in-browser UI telling the user that we noticed the hang, and they can send it to us so we can fix the issue.
We should be able to leverage Windows Error Reporting (WER, or sometimes referenced as winqual) to analyze many process hangs. See [1] for info on the hang behavior in Vista and later. It implies that there is no need for the user to do a special force-kill on Firefox in order to get hang reporting - clicking the close button on an unresponsive application should report hang info to the winqual site, which we could then datamine. I don't have any data to back this up, but I feel like a majority of users know to close/reopen an app when it's not working. It would be great to verify that we do in fact see the "ghost window" behavior on Vista/7 when Firefox hangs. It's unclear at this time what the hang reporting behavior is in XP.

I don't believe WER tracks any instances where Firefox becomes unresponsive and the user patiently waits for interactivity to return (which makes conversation in this bug still relevant). It also obviously does not help us with hangs on other platforms, which will likely need to be tracked separately.  

I'm currently tracking the investigation of WER and integrating it into our bug/crash tracking at [2]. Feel free to add any questions/comments into that document and I'll try to follow up with appropriate answers once researched.


[1] http://blogs.msdn.com/b/meason/archive/2010/01/04/windows-error-reporting-for-hangs.aspx
[2] https://wiki.mozilla.org/Release_Management/WER_Investigation
:jwatt and :ted, I think I'm OK with those experiences from a Product standpoint but I'd like UX to chime in. 

Faaborg, Limi, this is probably a reasonable priority to get a design spec for. Can one of you take this on?
(In reply to Alex Keybl [:akeybl] from comment #31)
> We should be able to leverage Windows Error Reporting (WER, or sometimes
> referenced as winqual) to analyze many process hangs. See [1] for info on

This is good (and we should do this), but this a) doesn't give us visibility into the problem on other OSes, and b) doesn't integrate into our existing reporting in Socorro.
An additional comment - a lot of browser hangs on Windows are caused by plugin deadlocks. We currently have code in place on mc that aborts plugin processes that detect browser hangs. While we currently do not grab stacks for these, code posted in bug 679238 will add that functionality.
(In reply to Ted Mielczarek [:ted, :luser] from comment #33)
> b) doesn't integrate into our existing reporting in Socorro.

Figuring out the best way to do this is part of my WER investigation.
(In reply to Alex Keybl [:akeybl] from comment #35)
> (In reply to Ted Mielczarek [:ted, :luser] from comment #33)
> > b) doesn't integrate into our existing reporting in Socorro.
> 
> Figuring out the best way to do this is part of my WER investigation.

Isn't there a problem with the permissions on how is allowed to see any of this data? Socorro data is primarily targeted to be publicly visible, and only small details like URLs are designed to be limited to a rather small group. Having almost everything public is by design and heavily wanted. If that can't be done with WER data, then we'll need to investigate our own solution anyhow, next to needing a solution for other platforms, esp. Android.
Aborting the browser in case of a hang, and thus potentially adding more crashes in cases where the hang is temporary, scares me a lot.

Anecdotal evidence about us browser developers being eager to pull the force-quit trigger doesn't seem like a very convincing data point. I think that your average user is going to be much more reluctant to force-quit out of concern that it'll hurt the computer/browser.

If we're going go to down the path of force-quitting the browser in case of a hang, can we get telemetry data on this first? I.e. measure how often the following set of events happen:

1. The browser is locked up for 30 seconds.
2. The user does not force-quit the browser.
3. The browser eventually unlocks.

Such data could also provide input on if 30 seconds is the right cut-off point, or if it should be higher/lower.
(In reply to Jonas Sicking (:sicking) from comment #37)
> Aborting the browser in case of a hang, and thus potentially adding more
> crashes in cases where the hang is temporary, scares me a lot.

I think it is unacceptable that we hang for 30 seconds. If we hang for even 10 or 20 seconds, a force-quit and restoring restart is going to be a better user experience than a 1 minute hang that recovers. Restarts are quick. We restore data well.
Is it very unlikely for the user to lose any state if the browser restarts?

Asa, you have said a few times that hanging for 30 seconds is unacceptable.  I wish it were so, but it isn't, because these bugs have been out there for many months.  Is that because they weren't accepted by Firefox development as real bugs?  If so, improving detection of hangs in the Mozilla QA process is badly needed.

I'm not sure if the fixes are going to be patched into 7 either.
(In reply to Asa Dotzler [:asa] from comment #38)
> Restarts are quick. We
> restore data well.

No and not well enough. Until all my tabs have reloaded after a restart, usually way more than 30 seconds have gone by, and I need to submit one or two HTTP Auth prompts. Having that multiple times a day would be nasty. And I might not be alone with that. Also, I see that often we don't correctly restore what was loaded in a page (some pages even log me out on a restart, even after a crash) and we don't always restore scroll position on a page or text entered in a text field (even if it work often, it doesn't always), and even scroll position is often interesting to me to know how far I've looked through some long page or so.

All in all, even in my experience as a quite advanced user, restarts of the browser are disturbing workflow more than 30 seconds or even a minute of the browser being frozen. Both are nasty, but I'd prefer (temporary) freezing over restarting every time. I've sighed on hangs, but I have often cursed on restarts after crashes or so. That alone should say enough.
There is no reason not to at least turn this on for trunk, aurora, and probably beta. Can we end the debate on what the right thing to do here is and just get this patch into m-c? If we're doing our job, the browser won't be hanging enough at these lengths to matter and all the hand-wringing about what users prefer in an already horribly broken state can be pushed off for now.
> We restore data well.

No, we don't.  Each restart takes me at least 5-6 minutes to recover from because we don't restore window positions correctly, for example.  It's enough of a barrier that I'm usually a few weeks behind on actually installing updates because the restart experience is so painful.

More importantly, what happens when the hang temporary (basically a performance bug) and is due to a web page the user _must_ use (e.g. for work, a bank site, etc)?  We'd like to fix the issue, obviously, but making the user unable to use the page seems like a bad idea.
> There is no reason not to at least turn this on for trunk

There are tools I use commonly that would trigger this 30s hang timeout...  right now I just use a separate Firefox instance when I need to use one of those, but I suppose I could try using Chrome instead.

> If we're doing our job

We're not, if "doing our job" means never having these hangs, or not having them repeatably on particular sites.
I strongly disagree that the idea that a decision in this bug on acceptable lengths of hangs should have anything to do with "tools I use" or "restart takes me 5-6 minutes of windows arranging" or "I need to submit a bunch of HTTP auth promts".  We are not our users and it would be irresponsible to our hundreds of millions of users to make these decisions based on the inconvenience to ourselves. If we need to have an about:config entry or some extension to change timers for developers, let's do that, but not hitting this problem head on because it'll be uncomfortable to us is wrong.

We should decide what the right user experience is for our hundreds of millions of users and make that happen. Then we should figure out how to work around that for those of us who need more flexibility.
Asa, you're assuming that our users don't use Spaces, don't ever have any auth prompts, and don't ever have to use any websites that can cause 30-second hangs.

These are clearly false assumptions for some fraction of our users.

Your argument is that it's OK to make it completely impossible for that fraction to use the browser if it somewhat (arguably, but let's posit it) improves the random-30-second-hang experience for the other users, correct?
Or to put it another way...  my wife would certainly not appreciate Firefox killing itself just because she happens to be loading a slow-performing web page (youtube, say) while at the same time importing some movies into imovie.  Nothing to do with developers; I think you just severely underestimate how often this can happen in practice on underpowered hardware.
Comment on attachment 562643 [details] [diff] [review]
patch

ok, given the current situation, I think we want the following minimum changes:

* check a pref for the timeout, default to disabled (-1) in debug builds
* do the double-15 check to avoid the most common possible race condition with a very slow or hibernating computer

We also need a followup to add a "Unresponsive=1" annotation to the crash report, but that is currently problematic because the AnnotateCrashReport API is main-thread-only (need to add a lock).

Andreas, let me know if you want me to make the changes, especially the pref change which may not be completely trivial.
Attachment #562643 - Flags: review?(benjamin) → review-
Additionally, good state restoration is a function of the website as well as of the browser. For example bugzilla does not restore well since it uses too much dynamic DOM manipulation in its review UI. Likewise google calendar completely looses all its state when restored, including any meetings you're in the process of setting up.

Gmail likewise looses any text entered in mail composition area. However it aggressively saves information to the server, in which case it restores from there on restart. However anything written since the last save is lost. This is especially bad if you've been offline for a while.

I would expect this to hold true for a lot of sites out there. Any time the site does "complex" DOM manipulation form restoration gets very unreliable and it basically becomes the sites responsibility to do state restoration using localStorage or history.state.
review comments sound reasonable, I won't have time to work on this for a while, if someone wants to steal, please do
Bsmedberg - can you take this bug and make the changes?
There were some discussions at the all-hands about doing some sort of a user study to figure out if hang problems are perceived to be a major issue for Windows users of Firefox. Curious if we've kicked anything like that off?
We are discussing that as well but we have had enough independent reports of hang problems, even just internally, that we feel it's worth it to do this to collect data. We should really be doing both though.
We know we have hangs. We just fixed a big hang bug. We don't need to ask people how they perceive hangs. We need to identify and fix hangs. So lets start with the stack collection for it. It would be great if someone can steal this bug from me. I won't have time for at least a few weeks, sorry.
Ok, bsmedberg, can you take this and get something checked into m-c. If not, who can we assign it to that can take care of it ASAP.
Assignee: gal → benjamin
Status: NEW → ASSIGNED
(In reply to Asa Dotzler [:asa] from comment #24)
> From a product standpoint, a hang is worse than a crash -- assuming it's not
> going to recover, most users do not know how to kill the browser and some
> even resort to rebooting their computers to get back to a working state. 
> I'd like us to better understand any time we're experiencing hangs longer
> than 10 or 15 seconds and I'd like us to kill the process if the hang is a
> lot longer than that. I think a 30 second kill, maybe a 40 second kill,
> would be a fine place to start.

In following comments the arguments against this are all around the notion that the browser might come back from the hang, and that the user might want to wait for the browser to come back.  Without good telemetry data, I can't say for sure, but both of these sound very much like boundary cases.  Isn't it more likely that within 30 seconds users will have attempted to launch another browser to complete their simple task, or have attempted a reboot of the computer?  Also, I don't see a whole lot of patience in our user base, so I kind of doubt people would be willing to wait 30-40 seconds in hope of the browser coming back.

And of course, what percent of the time does the browser actually come back?  In my anecdotal experience it has either hung entirely, or it will respond within 30 seconds (even if you are doing a very processor intensive operation on the site, 30 seconds is a pretty long time).
In current Firefox 7 clearing the history can easily lock the browser for more than 30 seconds, yet it is entirely understandable that a user would wait for this process to complete.

It would be most annoying for the browser to kill itself during this operation.  How many more cases are there like this -- many of which may depend on large profiles, etc.?
30 seconds is also not a long time if the system is swapping.
Note that the final version of this patch will probably not kill the browser if the slowdown is system-wide, because the hang watchdog thread will probably not be scheduled twice in that window without at least one new browser event completing. But I agree that we need to be cautious about watching for false positives.
Attachment #565336 - Flags: review?(ted.mielczarek) → review+
r?cjones for the threading/monitor usage and general sanity check, r?bent to double-check that our chromium event loop integration won't trigger false positives
Attachment #566590 - Flags: review?(jones.chris.g)
Attachment #566590 - Flags: review?(bent.mozilla)
Comment on attachment 566590 [details] [diff] [review]
Patch, unsets the hang monitor when we're blocked waiting for an event, rev. 2

Review of attachment 566590 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/threads/HangMonitor.cpp
@@ +87,5 @@
> +  }
> +#endif
> +
> +  CrashReporter::AnnotateCrashReport(NS_LITERAL_CSTRING("HangMonitor"),
> +                                     NS_LITERAL_CSTRING("1"));

I probably would have just gone with "Hang=1". Not sure how confusing this will be on the Socorro side.
I was going to do that, but don't we already use Hang=1 to help identify plugin hang pairs? I was trying to use something different just to avoid complications.
Bug 694345 is filed on socorro signature changes to make these reports useful.
Blocks: 694345
Comment on attachment 566590 [details] [diff] [review]
Patch, unsets the hang monitor when we're blocked waiting for an event, rev. 2

>diff --git a/xpcom/build/nsXPComInit.cpp b/xpcom/build/nsXPComInit.cpp

>     mozilla::Omnijar::CleanUp();
> 
>+    HangMonitor::Stop();
>+

Drop the "mozilla::" for Omnijar plz.

>diff --git a/xpcom/threads/HangMonitor.cpp b/xpcom/threads/HangMonitor.cpp

>+Monitor* gMonitor;
>+PRInt32 gTimeout;
>+PRThread* gThread;
>+bool gShutdown;

Comments for these plz.

>+PRIntervalTime gTimestamp;

There needs to be an explanation here of why monotonicity isn't needed
here, why you're not using TimeStamp.  I don't believe it is, except
for the note below.  I guess you're trying to avoid the TimeStamp
locking on windows.

>+
>+void
>+Crash()
>+{
>+  if (gDebugDisableHangMonitor)
>+    return;
>+
>+#ifdef XP_WIN
>+  if (::IsDebuggerPresent()) {
>+    return;
>+  }
>+#endif

Nice touch :).

>+void
>+ThreadMain(void*)
>+{

>+    if (gTimestamp != PR_INTERVAL_NO_WAIT &&

gTimestamp can overflow to PR_INTERVAL_NO_WAIT, which would cause
false negatives here.  I don't think we care.

>+      ++waitCount;
>+      if (waitCount == 2) {

if (++waitCount == 2) {

>+        PRInt32 delay =
>+          PRInt32(PR_IntervalToSeconds(PR_IntervalNow() - gTimestamp));
>+        if (delay > gTimeout) {

False negatives here too.  Don't really care.

Please note that the reason we have to go through this relatively
complicated dance with timestamps (and the additional cost of syscalls
to get them), instead of a simpler protocol that used boolean flags,
is that Monitors are technically subject to spurious wake ups.  (It
would never happen in practice.)

>diff --git a/xpcom/threads/HangMonitor.h b/xpcom/threads/HangMonitor.h
>new file mode 100644
>--- /dev/null
>+++ b/xpcom/threads/HangMonitor.h
>+/**
>+ * Start monitoring hangs. Should be called by the XPCOM startup process only.
>+ */
>+void Start();
>+
>+/**
>+ * Stop monitoring hangs and join the thread.
>+ */
>+void Stop();
>+

This can't be turned on/off/on (per assertions), so please call these
Startup()/Shutdown() for clarity.

>+/**
>+ * Notify the hang monitor of new activity which should reset its internal
>+ * timer.
>+ */
>+void Notify();

Call this |NotifyActivity()| please.

>diff --git a/xpcom/threads/nsThread.h b/xpcom/threads/nsThread.h

>+  nsThread(bool aMainThread, PRUint32 aStackSize);

Please use something like |enum { MAIN_THREAD, NOT_MAIN_THREAD }|
instead of a bool.

These are all basically nits, r=me.
Attachment #566590 - Flags: review?(jones.chris.g) → review+
FTR, the simpler protocol is

bool armed, activity, deathRow;

main():
  while 1:
    if !armed:
      wait()
    else if !activity && deathRow:
      crash()
    else:
      deathRow = !activity
      activity = false
      wait(timeout/2)

notifyActivity():
  armed = activity = true

notifyIdle():
  armed = false

which we /could/ use if this one ends up moving talos, which I very seriously doubt.
Comment on attachment 566590 [details] [diff] [review]
Patch, unsets the hang monitor when we're blocked waiting for an event, rev. 2

Review of attachment 566590 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/libpref/src/init/all.js
@@ +1462,5 @@
>  pref("dom.max_chrome_script_run_time", 20);
>  pref("dom.max_script_run_time", 10);
>  
>  #ifndef DEBUG
> +pref("hangmonitor.timeout", 30);

Add a comment that this is in seconds? Or else maybe name 'timeoutSecs' like we do for dom.ipc.plugins.timeoutSecs.

::: xpcom/threads/HangMonitor.cpp
@@ +58,5 @@
> +Monitor* gMonitor;
> +PRInt32 gTimeout;
> +PRThread* gThread;
> +bool gShutdown;
> +PRIntervalTime gTimestamp;

Shouldn't you initialize these? I'm pretty sure that unless they're marked 'static' they're not guaranteed to be auto-zeroed on all platforms.

Also, since you're changing gTimestamp on multiple threads without a lock I think you should make it volatile too.

@@ +62,5 @@
> +PRIntervalTime gTimestamp;
> +
> +// PrefChangedFunc
> +int
> +PrefChanged(const char*, void*)

Hm, this will change gTimeout, but doesn't seem to reset the wait count... I guess if you're unlucky that means that you could crash sooner than you wanted. Maybe add a 'lastTimeout' check to the thread loop as well?

@@ +78,5 @@
> +void
> +Crash()
> +{
> +  if (gDebugDisableHangMonitor)
> +    return;

Nit: You're not bracing here, but you do just below... I vote for braces everywhere, but consistency is more important.

@@ +106,5 @@
> +
> +  while (true) {
> +    if (gShutdown) {
> +      return; // Exit the thread
> +    }

Nit: Brace inconsistency again.

@@ +115,5 @@
> +      ++waitCount;
> +      if (waitCount == 2) {
> +        PRInt32 delay =
> +          PRInt32(PR_IntervalToSeconds(PR_IntervalNow() - gTimestamp));
> +        if (delay > gTimeout) {

Hm... PR_IntervalNow() wraps around pretty easily since it's a 32bit number. If that happens then it looks like this logic will make delay very large (PR_IntervalToSeconds expects an unsigned int) and we'll crash before we want to. TimeStamp was introduced to solve this problem, could we use that instead?

@@ +116,5 @@
> +      if (waitCount == 2) {
> +        PRInt32 delay =
> +          PRInt32(PR_IntervalToSeconds(PR_IntervalNow() - gTimestamp));
> +        if (delay > gTimeout) {
> +          Crash();

Do you want to call Crash (or, really, AnnotateCrashReport) while holding this lock? Seems safer to add a MonitorAutoUnlock around it. Or, since you have the lock as a global, you could only do the unlock in Crash right before calling AnnotateCrashReport.

@@ +137,5 @@
> +  }
> +}
> +
> +void
> +Start()

Think it's worth asserting that we're on the main thread here? And in Stop, Notify, NotiftyIdle too?

@@ +148,5 @@
> +
> +  NS_ASSERTION(!gMonitor, "Hang monitor already initialized");
> +  gMonitor = new Monitor("HangMonitor");
> +
> +  Preferences::RegisterCallback(PrefChanged, kHangMonitorPrefName, NULL);

I don't understand how this works... This normally won't actually call the PrefChanged function until someone actually makes a change (usually callers manually call the callback right after they register). Has this changed or something?

@@ +159,5 @@
> +
> +  gThread = PR_CreateThread(PR_USER_THREAD,
> +                            ThreadMain,
> +                            NULL, PR_PRIORITY_LOW, PR_GLOBAL_THREAD,
> +                            PR_JOINABLE_THREAD, 0);

It's possible that this could fail, right? I'd at least assert here because you'll probably crash on PR_JoinThread if this stays null.

@@ +195,5 @@
> +
> +void
> +NotifyIdle()
> +{
> +  gTimestamp = PR_INTERVAL_NO_WAIT;

Maybe add a comment here saying that this also causes the wait count to be reset? It took me a second to figure that out.

::: xpcom/threads/HangMonitor.h
@@ +53,5 @@
> +/**
> + * Notify the hang monitor of new activity which should reset its internal
> + * timer.
> + */
> +void Notify();

Nit: The more I read this the more I think that 'Notify' should be renamed to something like 'EnableMonitoring' and 'NotifyIdle' should be renamed to 'DisableMonitoring' (maybe '[Begin|End]Monitoring'?). 'Notify' makes me think of waking a thread up or posting another event. Not a big deal, just a thought.

::: xpcom/threads/nsThread.cpp
@@ +574,5 @@
>  
>    NS_ENSURE_STATE(PR_GetCurrentThread() == mThread);
>  
> +  if (mIsMainThread && mayWait && !ShuttingDown())
> +    HangMonitor::NotifyIdle();

I wonder if you shouldn't move this closer to the GetEvent() call... Just below this line we'll call into sGlobalObserver and mObserver, both of which could theoretically hang, right?
Comment on attachment 566590 [details] [diff] [review]
Patch, unsets the hang monitor when we're blocked waiting for an event, rev. 2

Also... Shouldn't we disable this hang detector whenever we call into the plugin process? Otherwise the plugin hang detector is kind of useless if it has the same or longer timeout values. This patch sets hang detector to 30 sec and leaves the plugin hang detector to 45 seconds which seems wrong.
Additionally, the fact that we upped the plugin-timeout (from iirc 30) to 45 seconds shows that people are annoyed enough when their *plugin* crashes after being hung for 30 seconds. I think we have every reason to believe that they'll be even more annoyed if their browser crashes after being hung for 30 seconds.

Can we *please* get some data here before we start intentionally killing people's browsers.
> for the note below.  I guess you're trying to avoid the TimeStamp
> locking on windows.

yes

> gTimestamp can overflow to PR_INTERVAL_NO_WAIT, which would cause
> false negatives here.  I don't think we care.

gTimestamp can't overflow because we never modify it with interger math and NSPR promises to never return PR_INTERVAL_NO_WAIT from PR_IntervalNow.

> 
> >+      ++waitCount;
> >+      if (waitCount == 2) {
> 
> if (++waitCount == 2) {

I not made this change because I find it less readable.

> 
> >+        PRInt32 delay =
> >+          PRInt32(PR_IntervalToSeconds(PR_IntervalNow() - gTimestamp));
> >+        if (delay > gTimeout) {
> 
> False negatives here too.  Don't really care.

Bent pointed out a false-positive case with overflow here, which I believe I have fixed.

> to get them), instead of a simpler protocol that used boolean flags,
> is that Monitors are technically subject to spurious wake ups.  (It
> would never happen in practice.)

It also means you'd have to wake up the monitor thread a lot more, unless I misunderstand something. Anyway this way is really cheap and works well.


> >+/**
> >+ * Notify the hang monitor of new activity which should reset its internal
> >+ * timer.
> >+ */
> >+void Notify();
> 
> Call this |NotifyActivity()| please.

Between bent's comment and yours I have now named them NotifyActivity() and Suspend().

> Shouldn't you initialize these? I'm pretty sure that unless they're marked
> 'static' they're not guaranteed to be auto-zeroed on all platforms.

No, static data is guaranteed to be 0-initialized.

> Also, since you're changing gTimestamp on multiple threads without a lock I
> think you should make it volatile too.

Technically I don't think this was necessary because there's a write barrier at the beginning of the loop which will force the compiler to reread the variable. But I have made it volatile and read it once in the loop to avoid constantly rereading the volatile value.

> > +// PrefChangedFunc
> > +int
> > +PrefChanged(const char*, void*)
> 
> Hm, this will change gTimeout, but doesn't seem to reset the wait count... I
> guess if you're unlucky that means that you could crash sooner than you
> wanted. Maybe add a 'lastTimeout' check to the thread loop as well?

I don't see how I could possibly crash sooner than I wanted. It checks the timestamp against now every time around the loop.


> @@ +115,5 @@
> > +      ++waitCount;
> > +      if (waitCount == 2) {
> > +        PRInt32 delay =
> > +          PRInt32(PR_IntervalToSeconds(PR_IntervalNow() - gTimestamp));
> > +        if (delay > gTimeout) {
> 
> Hm... PR_IntervalNow() wraps around pretty easily since it's a 32bit number.
> If that happens then it looks like this logic will make delay very large
> (PR_IntervalToSeconds expects an unsigned int) and we'll crash before we
> want to. TimeStamp was introduced to solve this problem, could we use that
> instead?

No, because you'd have to have locks on gTimestamp every time you set it, but I've added an overflow check which should solve the problem.
 
> > +Start()
> 
> Think it's worth asserting that we're on the main thread here? And in Stop,
> Notify, NotiftyIdle too?

I have added these checks to NotifyActivity and Suspend, but not to startup/shutdown since they are called only by the component manager and the shutdown case happens after we shut down the main thread, which would cause the assertion to trigger oddly.

> ::: xpcom/threads/nsThread.cpp
> @@ +574,5 @@
> >  
> >    NS_ENSURE_STATE(PR_GetCurrentThread() == mThread);
> >  
> > +  if (mIsMainThread && mayWait && !ShuttingDown())
> > +    HangMonitor::NotifyIdle();
> 
> I wonder if you shouldn't move this closer to the GetEvent() call... Just
> below this line we'll call into sGlobalObserver and mObserver, both of which
> could theoretically hang, right?

Actually this is why I wanted you specifically to look at this.

mObserver *will* intentionall block if we have no XPCOM events to process and we're waiting for OS native events also (ends up here on Windows: http://hg.mozilla.org/mozilla-central/annotate/656c921623a8/widget/src/windows/nsAppShell.cpp#l350

We don't want to trigger the hang monitor if the app is just sitting there waiting for input. This patch may also need to call HangMonitor::NotifyActivity here: http://hg.mozilla.org/mozilla-central/diff/9b2a99adc05e/widget/src/windows/nsAppShell.cpp in order to catch all hangs which may occur while processing native events. But I want to make sure that this will all work correctly with the chromium event loop integration which I only kinda understand.

As for Jonas' concern: this patch is the only good way we have to get data. We want to land this on nightly and aurora and be prepared to disable it via pref if the value is wrong or there are false positives.

We should *not* disable this for plugin hangs, since incorrect plugin hangs caused by Windows are one of the primary issues we're trying to diagnose. We should probably make the plugin timeout 10 seconds shorter than whatever value we finally end up using for the main-process timeout.
I don't see a reason we couldn't use simple telemetry to catch the case when this patch would have aborted firefox, but where we later start processing events again, i.e. where we unfreeze.

I understand that measuring the times where we freeze and never unfreeze before exiting since logging telemetry needs to happen on the main thread (right?).

But logging the times when we unfreeze, and thus this patch would have caused a "new" crash for the user, seems imminently measurable.
Comment on attachment 568496 [details] [diff] [review]
Patch, unsets the hang monitor when we're blocked waiting for an event, rev. 3

Review of attachment 568496 [details] [diff] [review]:
-----------------------------------------------------------------

Ok, this looks good. I think we're going to be fine with the frankenloop.

::: xpcom/threads/HangMonitor.cpp
@@ +126,5 @@
> +    PRIntervalTime timestamp = gTimestamp;
> +
> +    PRIntervalTime now = PR_IntervalNow();
> +
> +    fprintf(stderr, 

Remove before checkin of course.

@@ +163,5 @@
> +    if (gTimeout <= 0) {
> +      timeout = PR_INTERVAL_NO_TIMEOUT;
> +    }
> +    else {
> +      timeout = PR_SecondsToInterval(gTimeout / 2);

I think this isn't quite right. Say my timeout is 31 seconds... This will require three cycles of 15 seconds before it actually crashes. I think instead you could just do PR_MillisecondsToInterval(gTimeout * 500) and that would get you closer.
Attachment #568496 - Flags: review?(bent.mozilla) → review+
BTW, how does this play with what we already have in nightly only from bug 677711?
bsmedberg landed this as:
 https://hg.mozilla.org/integration/mozilla-inbound/rev/52b481205766
 https://hg.mozilla.org/integration/mozilla-inbound/rev/564e841f1f57
but I backed it out as:
 https://hg.mozilla.org/integration/mozilla-inbound/rev/35cfc34bdc44
 https://hg.mozilla.org/integration/mozilla-inbound/rev/7c5f9e6c34c2
because it triggered Linux64 opt orange, with an abort related to (added by?) this bug:
> ABORT: HangMonitor triggered: file ../../../xpcom/threads/HangMonitor.cpp, line 106
https://tbpl.mozilla.org/php/getParsedLog.php?id=7006186&tree=Mozilla-Inbound

It also had an orange run of  Linux "Bm", likely for the same reason, but that log's not available on tbpl yet, so I can't confirm that.  Bq builds were also red, but that may (?) be of less concern since that's not tier-1.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #73)
> BTW, how does this play with what we already have in nightly only from bug
> 677711?

The child has a default 15 second timeout, so it should kick in before this does. If we are still hung 15 seconds later, this code would then kick in.
(In reply to Jim Mathies [:jimm] from comment #75)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #73)
> > BTW, how does this play with what we already have in nightly only from bug
> > 677711?
> 
> The child has a default 15 second timeout, so it should kick in before this
> does. If we are still hung 15 seconds later, this code would then kick in.

OK, good. Just wanted to make sure the two patches are not tripping on each other's toes.
bsmedberg, what's the next step with this?
https://tbpl.mozilla.org/?tree=Try&rev=a3ccd9a859ae is the current set of failures; I disabled the hang detector in the startupcache test because it uses a configuration which hangs on purpose waiting for a timer. I need to patch the --disable-crashreporter configuration (easy). The two linux "B" oranges are random. So the remaining issue is the hang detector triggering during "T(di)": there appears to be a modal dialog being run, which should disable the hang detector, but I haven't fully diagnosed this yet. I'm setting up local talos to try this.
There were two "di" hang stacks, which are different:

https://tbpl.mozilla.org/php/getParsedLog.php?id=7070308&full=1&branch=try the main thread is shutting down and mozilla::Storage::Statement::Finalize is on the main thread stack. The storage thread is off in sqlite-land. This looks like a "valid" hang detector trigger in that shutdown is hanging for 30 seconds.

https://tbpl.mozilla.org/php/getParsedLog.php?id=7070351&full=1&branch=try the main thread appears to be showing a modal dialog, but we're not sure what dialog this might be.

I cannot reproduce either of these failure modes locally (or even get close to 30 seconds runtime) locally running ts_places_generated_max, although I have much faster disk and CPU than the talos machines.

We could just disable the hang detector for this particular Talos test, although there are definitely problems to be fixed here somewhere...
Blocks: 504770
(In reply to Jonas Sicking (:sicking) from comment #70)
> I don't see a reason we couldn't use simple telemetry to catch the case when
> this patch would have aborted firefox, but where we later start processing
> events again, i.e. where we unfreeze.

Like in bug 668653? We should do that anyway, so we get basic data for shorter hangs and for users on branches without the hang killer.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #79)
> There were two "di" hang stacks, which are different:
> 
> https://tbpl.mozilla.org/php/getParsedLog.php?id=7070308&full=1&branch=try
> the main thread is shutting down and mozilla::Storage::Statement::Finalize
> is on the main thread stack. The storage thread is off in sqlite-land. This
> looks like a "valid" hang detector trigger in that shutdown is hanging for
> 30 seconds.

Do you happen to know which changeset was this run against? since a bunch of stuff changed recently in the shutdown path, the above stack looks coming from the old code, may you tell me which patches should I push to Try to check what happens with the new code?
https://tbpl.mozilla.org/?tree=Try&rev=e8384dd986f4 appears to have passed completely. I'll try to get this into inbound today and hope it sticks.
Backed out because of "extensions/regress-336409-2.js | Exited with code 1 during test run" in 32-bit Linux jsreftests:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a61d75d70c13
https://tbpl.mozilla.org/php/getParsedLog.php?id=7216280&tree=Mozilla-Inbound
Comment on attachment 572905 [details] [diff] [review]
Disable the hang monitor during test suites which also disable the DOM script timeout, rev. 1

Review of attachment 572905 [details] [diff] [review]:
-----------------------------------------------------------------

2 questions:
1) what about talos?
2) is this preference supported on mobile (xul and native)?
Attachment #572905 - Flags: review?(jmaher) → review+
If talos is hanging for 30+ seconds, something is wrong... it's only artificial tests which should be doing that.

This preference is a core pref so yes it should be general. Note that this hang detector will not catch Java frontend hangs.

But the dromaeo failure from comment 86 indicates that there is probably a false-positive condition in the code. I have a theory as to the scenario:

main event loop enters "native" event processing and ends up in cocoa here:

http://hg.mozilla.org/mozilla-central/annotate/4fb61ebbf8ff/widget/src/cocoa/nsAppShell.mm#l678
At this point the hang detector is suspended by the main event loop (this is a known case of false negatives).

Inside this native event somebody spins a nested/modal event loop. This causes the hang detector to be enabled after an XPCOM message is processed.

When exit back to this code, there are no XPCOM message pending, only native messages, and we don't suspend the hang detector while waiting on native messages.

This theory is kinda weird, but it basically means that in all the widget implementations, not just Windows, we need to suspend the hang detector before entering the "wait for next event" call. This doesn't look to hard for Cocoa widgets which I'll do shortly, but it will be more complicated for GTK, in which we probably need to break gtk_main_context_iteration into its parts as described at http://www.gtk.org/api/2.6/glib/glib-The-Main-Event-Loop.html#id2915982
Comment on attachment 573903 [details] [diff] [review]
Add more correct hang monitoring for Cocoa widgets, rev. 1

With one small quibble, this looks fine to me.

Which isn't to say that I'm sure this will do what you want.  Just
that (my quibble aside) I can't see anything wrong with it.

Instead of this:

>         if (gCocoaAppModalWindowList)
>           currentAppModalSession = gCocoaAppModalWindowList->CurrentSession();
>+
>+        mozilla::HangMonitor::NotifyActivity();
>+
>         if (currentAppModalSession) {
>           [NSApp _modalSession:currentAppModalSession sendEvent:nextEvent];
>         } else {
>           [NSApp sendEvent:nextEvent];
>         }

It'd be more correct to do this:

>         if (gCocoaAppModalWindowList)
>           currentAppModalSession = gCocoaAppModalWindowList->CurrentSession();
>         if (currentAppModalSession) {
>+
>+          mozilla::HangMonitor::NotifyActivity();
>+
>           [NSApp _modalSession:currentAppModalSession sendEvent:nextEvent];
>         } else {
>           [NSApp sendEvent:nextEvent];
>         }

"[NSApp sendEvent:nextEvent]" will call "-[GeckoNSApplication
sendEvent:]".  But "[NSApp _modalSession:currentAppModalSession
sendEvent:nextEvent]" won't.
Attachment #573903 - Flags: review?(smichaud) → review+
(Following up comment #91)

One possible source of trouble is that calls to nsAppShell::ProcessNextNativeEvent() with aMayWait == TRUE appear to be very rare.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #89)
> This theory is kinda weird, but it basically means that in all the widget
> implementations, not just Windows, we need to suspend the hang detector
> before entering the "wait for next event" call. This doesn't look to hard
> for Cocoa widgets which I'll do shortly, but it will be more complicated for
> GTK, in which we probably need to break gtk_main_context_iteration into its
> parts as described at
> http://www.gtk.org/api/2.6/glib/glib-The-Main-Event-Loop.html#id2915982

bsmedberg asked me to look at this for the GTK port.

Splitting the main event loop would be quite fiddly, especially to make it
work with gtk_dialog_run(), but the approach in this patch is simple.  I expect the Suspend and NotifyActivity in nsThread would no longer be necessary but I'm not sure what the other platforms require.

I haven't verified this does what it should yet.  I was seeing waitCount reach
49800 in mozilla::HangMonitor::ThreadMain, which seems odd.

I can investigate further tomorrow, but I'm posting this as a WIP at least.
Attachment #574232 - Flags: feedback?
Attachment #574232 - Attachment is patch: true
Attachment #574232 - Flags: feedback? → feedback?(benjamin)
Comment on attachment 574232 [details] [diff] [review]
disable hang monitor during GLib main loop poll

Yeah, that looks perfect. I don't think there's any reason to remove the calls in the outer XPCOM event loop code; extra calls don't hurt.
Attachment #574232 - Flags: feedback?(benjamin) → feedback+
smichaud, I'd like to keep the current version because the extra call to NotifyActivity is not harmful at all, and this version makes it clear that we want to cover both sides of the conditioanl.
Attachment #574232 - Flags: review?(roc)
Status update: the latest try run shows two failures, both potentially "real":

https://tbpl.mozilla.org/?tree=Try&rev=0bf27523532d

https://tbpl.mozilla.org/php/getParsedLog.php?id=7388052&tree=Try&full=1 is a Windows Talos dirty ts_places_generated_max, the hang detector is triggering during a places call into storage/sqlite. I have decided to wallpaper over this one by disabling the hang monitor for this particular Talos test, see bug 702368.

https://tbpl.mozilla.org/php/getParsedLog.php?id=7389972&tree=Try&full=1 is a Mac dromaeo Talos. While running dromaeo/dom-attr.html the hang detector fires during cycle collection. I don't yet understand this one yet, but my current theory is that some combination of a dromaeo test and a cycle collection is taking longer than 30 seconds. Normally this would trigger the DOM script timeout, but that appears to be disabled for Talos run, which means that the hang monitor kicks in. I suppose I could just disable the hang monitor for all Talos... bz being the dromaeo expert, does my hunch sound reasonable?
Let's file bugs for both tests. Neither places nor CC are allowed to take 30s ever.
I don't think that the dromaeo thing is really a bug: content script including dromaeo is out of our direct control, but it normally doesn't trigger the hang detector because the DOM slow-script dialog is triggered first. I believe that we explicitly disable the DOM slow script dialog during Talos tests.

I will file a bug on the places shutdown hang.
Attachment #575230 - Flags: review?(jmaher)
Comment on attachment 575230 [details] [diff] [review]
Disable hangmonitor for Talos in general, rev. 1

Review of attachment 575230 [details] [diff] [review]:
-----------------------------------------------------------------

looks great!  Sorry there were so many .config files to modify
Attachment #575230 - Flags: review?(jmaher) → review+
Comment on attachment 575230 [details] [diff] [review]
Disable hangmonitor for Talos in general, rev. 1

http://hg.mozilla.org/build/talos/rev/5dfaf26ff78a
Attachment #575230 - Flags: checkin+
Depends on: 702351
Whiteboard: [Snappy]
Whiteboard: [Snappy] → [Snappy:p1]
This may have caused a 3% Tp5 MozAfterPaint (Private Bytes) regression on Linux and Linux64, at least accordingly to the graphs.
I suspect that this is because of the additional thread. What is the default stack size of threads? I don't think that this regression by itself should cause a backout, although I will ask about reducing the stack size.
(In reply to Ed Morley [:edmorley] from comment #105)
> https://hg.mozilla.org/mozilla-central/rev/f6e4a9643786

Having a variable called mIsMainThread that's not a boolean (instead, it's an enum) seems pretty confusing.  It also seems likely to lead to errors since the enum values used are exactly the opposite of the boolean values you'd expect (MAIN_THREAD is 0, NOT_MAIN_THREAD is 1).
Depends on: 705154
I believe this is causing bug 705154
Blocks: 705154, 705240
No longer depends on: 705154
No longer blocks: 705154, 705240
Depends on: 705154, 705240
Depends on: 705365
Depends on: 705258
Depends on: 705287
It has significantly increased the number of crashes:
11.0a1/20111123:  714 crashes for the top 100 crashers
11.0a1/20111124: 2437 crashes for the top 100 crashers
With this landed, now I have to endure Firefox crashing now-and-then when clearing recent history, because bug 648232 is not fixed.
Depends on: 648232
Given that this is appears to be causing an uptick in crashes that don't appear to be related to hangs, shouldn't this be backed out until further testing is done?
This appears to be causing a large number of crashes on startup:

https://crash-stats.mozilla.com/report/list?signature=chromehang%20|%20NtGdiGetFontData
(In reply to John Daggett (:jtd) from comment #111)
> This appears to be causing a large number of crashes on startup:
> 
> https://crash-stats.mozilla.com/report/
> list?signature=chromehang%20|%20NtGdiGetFontData

Does this mean that those users were hanging for a long period on startup doing font lookups?
Looks like NtGdiGetFontData can take forever. Thats definitely something we shouldn't be doing on the main thread, and even less so during startup. Lets add some simple mechanism to temporarily disable the watchdog, and in parallel, can we move this call off the main thread (and re-enable when thats done).
(with temporarily disable I obviously meant disable for the duration of that syscall)
(In reply to Andreas Gal :gal from comment #113)
> Looks like NtGdiGetFontData can take forever. Thats definitely something we
> shouldn't be doing on the main thread, and even less so during startup. Lets
> add some simple mechanism to temporarily disable the watchdog, and in
> parallel, can we move this call off the main thread (and re-enable when
> thats done).

Getting data from fonts should never take 30+ seconds, something is seriously screwed if such is the case.  And the comments associated with the crash reports don't seem to indicate hangs, they seem to indicate random crashes.  Are we absolutely sure the timing code is correct here?
Comments from the crash signature in comment 111:

"Firefox Nightly is very unstable. Please fix."
"Crashed on, or shortly after a restart following nightly update. It was unattended at the time, doing nothing that was user-initiated."
"My Firefox crash with Kaspersky Internet Security"
"it just crashed when i was on xbox marketplace"
"I think it was Flash problem, but I closed plugin-container.exe and firefox wouldn't unfreeze. Other times, when faced to the same... ''issue'', closing plugincontainer through the ctrl+alt+supr would unfreeze firefox and show the 'the plugin has crashed' message"
"Happened immediately after previous crash upon reopening. Seems likely that it is an Adobe Flash Player related problem as the pages that didn't load before it crashed all had videos of some description which used the Player."
"Fucking slowed my computer down like shit 8gb ram, i7 quad and nightly dies? the fuck"
"Just started Firefox, clicked on a youtube-video and it crashed."
Win32 nightly crashes, build = 20111123
https://crash-stats.mozilla.com/query/query?product=Firefox&version=Firefox%3A11.0a1&build_id=20111123031034&do_query=1

Counts for top ten crashes:

58 49 49 44 34 31 29 25 25 22

Win32 nightly crashes, build = 20111124
https://crash-stats.mozilla.com/query/query?product=Firefox&version=Firefox%3A11.0a1&build_id=20111124031031&do_query=1

Counts for top ten crashes (all chromehang!!!):

427 312 239 138 117 100 95 74 52 48

I don't think we should be using our nightly users as guinea pigs like this.  I think we should doing a lot more conservative testing before rolling this out.
(In reply to John Daggett (:jtd) from comment #117)
> I don't think we should be using our nightly users as guinea pigs like this.
> I think we should doing a lot more conservative testing before rolling this
> out.

I agree. I see no comments addressing John's concerns. We should back this out for the next nightly build.
This comment is not meant to oppose a backout, I am merely commenting on the crash stacks.

1) The font enumeration during startup is a bug. Please file a bug and start work on it. That is known to take forever, and should not happen on the main thread.

2) I see three clusters of crashes:

a) Allocation (=> machine is probably swapping/paging itself to death?)
b) Disk I/O (=> waiting for slow IO?)
c) Main thread waiting on another thread (=> wtf???)

These seem all valid problems to me. We can discuss whether we should disable the crash/hang monitoring until we fixed all these stacks, but from what I can see these are all valid crash sites in the sense that they point to bad code.
Depends on: 705577
I agree that were getting a lot of valuable data, a hang detector is something we ^really^ want, but currently the user impact in Comment 116, Comment 117 indicates that were getting this data at the cost of our Nightly users. We should address the problem we've found first then re-land this.

Actually perhaps we should move this a preference and request power users to flip the flag. Then once we've stabilized this feature, and our hangs, we could re-enable this by default.
http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/HangMonitor.cpp#141

>   if (timestamp != PR_INTERVAL_NO_WAIT &&
>       timestamp == lastTimestamp &&
>       gTimeout > 0) {
>     ++waitCount;
>     if (waitCount == 2) {
>       PRInt32 delay =
>         PRInt32(PR_IntervalToSeconds(now - timestamp));
>       if (delay > gTimeout) {
>         MonitorAutoUnlock unlock(*gMonitor);
>         Crash();
>       }
>     }
>   }

Are we *absolutely* sure here PR_IntervalNow() is monotonically
increasing across all cores on all hardware?  If timestamp happens to be
just one greater than now, a negative value results but because
PR_IntervalToSeconds divides by ticks-per-second, it will produce a
result that when converted to a signed integer will not be a negative
value but a large positive value and will trigger the crash.

At the very least this code should probably be rewritten as:

    if (waitCount == 2) {
      PRIntervalTime diff = now - timestamp;
      PRInt32 delay = PRInt32(diff);
      if (delay > 0 && PR_IntervalToSeconds(diff) > gTimeout) {
        MonitorAutoUnlock unlock(*gMonitor);
        Crash();
      }
    }

This code doesn't need to assume PR_IntervalNow() is monotonically
increasing under all conditions.
robarnold might know more about whether PR_IntervalNow() is monotonic across cores. Rescheduling to a different core is a distinct possibility at the crash sites we have seen, so this might be a reasonable explanation.
Perhaps we should back out the change, create a separate meta bug tracking the investigation of new crashes introduced on 11/23 and absent as of the backout, and perform the investigations there.

As Benoit suggests, our goal should be to turn this into a pref in the short term and then once again turn this on by default once we feel the hang issues are understood.
(In reply to John Daggett (:jtd) from comment #121)
> Are we *absolutely* sure here PR_IntervalNow() is monotonically
> increasing across all cores on all hardware?

Absolutely not, I filed a bug about autoscroll going backwards whenever clock goes backwards months ago, but unfortunately nobody wants to go there and comment upon it. Bug 652769.
Sorry for double post but I should have added that the interesting part is analysis in bug 652769 comment 5.
tl;dr version is that there's a low precision timer (which keeps actual current realtime) and a high precision timer (which keeps precise but arbitrary-valued relative time).
High precision timer is calibrated using the realtime timer on startup, and is being used as return value. However, whenever low precision timer doesn't agree with the high precision timer (to within its poor precision), high precision timer gets re-calibrated again.
On my old PC, for some reason, this happens randomly and quite a lot, up to several times per second. There seems to be random drift that eventually gets too large. This leads to the timer being non-monotonic.
Perhaps we should be using mozilla/TimeStamp.h instead that guarantees to be monotonic if possible.
In IRC robarnold mentioned that Timestamp doesn't have a Windows impl just yet, he's reviewing one now.
(In reply to Radek 'sysKin' Czyz from comment #124)
> (In reply to John Daggett (:jtd) from comment #121)
> > Are we *absolutely* sure here PR_IntervalNow() is monotonically
> > increasing across all cores on all hardware?
> 
> Absolutely not, I filed a bug about autoscroll going backwards whenever
> clock goes backwards months ago, but unfortunately nobody wants to go there
> and comment upon it. Bug 652769.

That's a different time source (there are so many to choose from in Gecko!). I just checked the implementation of PR_IntervalNow [1]. It uses Windows' multimedia timer function which I believe is monotonic across cores (QueryPerformanceCounter is the problematic one).

[1] http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/ntinrval.c#71
(In reply to John Daggett (:jtd) from comment #127)
> In IRC robarnold mentioned that Timestamp doesn't have a Windows impl just
> yet, he's reviewing one now.

Those curious can follow along in bug 676349. Monotonicity and high resolution are hard on Windows.
(In reply to John Daggett (:jtd) from comment #127)
> In IRC robarnold mentioned that Timestamp doesn't have a Windows impl just
> yet, he's reviewing one now.

TimeStamp does have a Windows implementation, it's just low resolution (10ms).
(In reply to Rob Arnold [:robarnold] from comment #128)
> That's a different time source (there are so many to choose from in Gecko!).

Ah, apologies for the noise then. I had a feeling I should check, but I'm at work so couldn't.
(In reply to Andreas Gal :gal from comment #119)
> 2) I see three clusters of crashes:
> 
> a) Allocation (=> machine is probably swapping/paging itself to death?)
> b) Disk I/O (=> waiting for slow IO?)
> c) Main thread waiting on another thread (=> wtf???)

Just for reference, I looked at a few of the less common issues.  I saw some in GCs, a few in CCs, one that looked like it was in nsDocument::Destroy.  Which all sound plausible.
Depends on: GCHangs
(In reply to Benoit Girard (:BenWa) from comment #126)
> Perhaps we should be using mozilla/TimeStamp.h instead that guarantees to be
> monotonic if possible.

Please see above, we can't use TimeStamp due to lock performance.
(In reply to ben turner [:bent] from comment #133)
> (In reply to Benoit Girard (:BenWa) from comment #126)
> > Perhaps we should be using mozilla/TimeStamp.h instead that guarantees to be
> > monotonic if possible.
> 
> Please see above, we can't use TimeStamp due to lock performance.

I don't think the existing Windows implementation needs locking. However, the new one will. I don't think this task needs high resolution on the timestamp so perhaps we could have a lock-free low-resolution version specifically for this.
Depends on: 705613
Depends on: 705616
Depends on: 705543
Depends on: 705517
(In reply to John Daggett (:jtd) from comment #115)
> And the comments associated with the
> crash reports don't seem to indicate hangs, they seem to indicate random
> crashes.

Note that many users don#t even know the difference between a hang and a crash, and given that we are killing the app and showing a crash report dialog, it surely looks like a crash to them.

I think we should turn this off completely while we analyze the good data we did get from the days it has been active now, and re-enable later when we found out solutions to them. Right now, this mechanism alone seems to have made Nightly more than three times as unstable overall than it was before, and a few days of that are enough burden to testers with slow computer who are probably affect way more than others with fast machines.
I believe that is a decision Asa should make. He has argued forcefully that it would be better for the user to crash than to see a 30s pause in our app. Perhaps the data here indicating that 30s pauses are common would cause him to reconsider, but I believe that is ultimately a product decision.

I am on PTO today, but if Asa decides that this should be turned off, please just set the pref to 0. There is no reason to back out the code.
To be clear about the current status, because some of the details are in other bugs: the hang detector is not working correctly on Mac, and has been disabled as of the Saturday nightlies, bug 705154. Also as of Saturday's nightly I changed the plugin hang timeout to be 25s, shorter than the chromehang timeout, so that plugin hangs only kill the plugin and not the whole browser (bug 705365). As far as I know, the hang detector appears to be working correctly on Windows and Linux.

The current arguments in this bug should be focused on Windows/Linux, and whether the volume of reports from Saturday and Sunday's builds means that users were successfully tolerating 30s pauses before this patch and are now crashing "undesirably".
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #137)
> As far as I know, the hang detector appears to be working correctly on
> Windows and Linux.

I think the issue is less about the hang detector working or not working and more about the fact that hangs are commonplace with Firefox; thus all the crashing.  If all the now dependent bugs can be timely fixed, this becomes a non-issue.

And yeah I agree that setting the pref to 0 is preferable to a needless back-out.
(In reply to IU from comment #138)
> If all the now dependent bugs can be timely fixed, this becomes a non-issue.
It will be hard as 55 dependent crash signatures in the top 100 crashers, representing 60% of crash volume, have no related bugs.
See https://crash-stats.mozilla.com/query/query?product=Firefox&version=Firefox%3A11.0a1&build_id=20111127031032&do_query=1
Depends on: 705695
Depends on: 705717
Depends on: 705730
Blocks: 705748
Depends on: 705757
Depends on: 705761
Depends on: 705771
Depends on: 705781
Depends on: 705211
Alias: hang-detector
I overheard someone talking on Air Mozilla today before the weekly meeting (Asa, I think?) that it might be possible for the hang detector to submit the hang information back to Mozilla without then triggering the crash reporter.

Could that be done? (Maybe hooked to the toolkit.telemetry.enabled preference?)
So the crashkill team discussed this at length this morning with Asa. Here's a summary...

There has been much speculation about the impact of browser hangs. We hear lots of anecdotal evidence that there are issues but it's not something we really have our heads wrapped around or even measure at all. Getting a better handle on this bubbled up to a priority - hence the reason we were working on a patch. We are intentionally crashing and knew that the crash rate would go up...maybe significantly.

We have logged a bug to pref this off for now since it's my understanding that Ben and others want to verify that it's working correctly - bug 705748. Despite the fact that we disabled it on Mac, I think it still gives us good data so we are going to analyze it and hopefully fix a few things.

The crashkill team is logging bugs for the top 20 issues so we can get the investigations going. We are only looking at data for builds after Nov 26th when we made changes to fix the Mac and the plugins issues that Ben talks about in Comment 137. If we find and fix issues, we would need to turn the patch back on to help validate this. It was also Asa's request to turn it back on maybe in a couple of weeks for a 48 hr period on a weekday since we get more users.

I know there have been discussions about logging a crash but not actually crashing so I will let others comment on - bug 705888. Ben was tasked with getting something up and running quickly so we could collect some data. If we can make this change reasonably easily, great. I will let Asa make the call on whether he would want to wait for this change before turning it back on again.

This patch (in it's current state) was not meant to be migrated to other channels. We simply didn't know what it would give us so we hadn't mapped out a full strategy. The goal here is to help us pin point where some of the problems are and get those addressed.
I think the idea of crashing the browser after a fixed amount of time
away from the main event thread, even just for nightly builds, is a terrible idea
for a couple simple reasons.

I think we should only be doing experiments like this with users who
have explicitly been informed about the hangkill thread and opted in,
both for our sake and theirs.  We should really be trying to get these
users more involved with helping us eliminate these problems.  Simply
crashing ends the communication, it just says "use another browser, this
one is unstable".  Developers looking at chromehang bugs are stuck
trying to reconstruct the scenario that led up to the hang with little
hope for more feedback.

Distinguishing between a system under extreme load and a browser
misbehaving is hard. Try running Firefox on an underpowered netbook with
a crapware virus scanner running, the startup time will be glacial
because the system itself is basically still starting up concurrently
and the virus scanner is exacerbating all i/o operations.  Crashing in extreme load
situations is not really helpful, it essentially creates lots of false
positives that are absolute murder to try to reproduce. Crashstacks are
essentially a single sample of what's going on, they tell us where we
are at the 30 sec mark but not much about what was going on before that
and nothing about whether the system was under i/o or cpu strain at the
time.

I think having a passive form of the hang monitor thread in releases
would be much more useful, one that could sample the main event thread
when hangs occurred and send the data to us if permitted by the user.
It seems like more lock-related things need to be added to whatever thread 0 hang skip list is used to generate more useful categories.  I don't know how that is done.  For instance, here's one stack from Thread 0 that seems like it is clearly hanging in the cycle collector:

0	ntdll.dll	KiFastSystemCallRet	
1	ntdll.dll	NtWaitForSingleObject	
2	kernel32.dll	WaitForSingleObjectEx	
3	kernel32.dll	WaitForSingleObject	
4	nspr4.dll	PR_WaitCondVar	nsprpub/pr/src/threads/combined/prucv.c:547
5	xul.dll	mozilla::CondVar::Wait	obj-firefox/dist/include/mozilla/CondVar.h:103
6	xul.dll	nsCycleCollectorRunner::Collect	xpcom/base/nsCycleCollector.cpp:3625

...but it shows up with the signature [@ chromehang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar ].  In other crashes, it is mozilla::ipc::GeckoChildProcessHost::SyncLaunch that is sitting under the lock muck in the main thread.  Though maybe that is also generic lock much and it is really mozilla::plugins::PluginProcessParent::Launch, which is sitting under that, that is of interest.
You can file bugs in Webtools : Socorro to get more things added to the SkipList, which changes signature generation:
https://wiki.mozilla.org/Breakpad/Skip_List_How_To
So the hang detector uses the same SkipList processing?  It just applies it to the stack on thread 0 instead of the crashing stack, or something along those lines?  Thanks, I'll look into that.
(In reply to Andrew McCreight [:mccr8] from comment #145)
> So the hang detector uses the same SkipList processing?

Yes, it crashes the process and that sends a normal crash report, it just gets "chromehang |" prefixed by default.
Depends on: 706441
re: comment 142 about catching lots of data from under powered netbooks/old systems.

we should do some analysis on the hang data that we have to determine the distribution of systems that are reporting the hangs.  kairo, or maybe I can whip something up on data from the .csv files over the next couple of weeks.

in the end these are still hangs, and reflective of a poor user experience.  if we find a disproportionate number of hangs on old systems, or new netbooks, then maybe we need to adjust system requirements or at least warn users they may not get the best experience on the system where they just installed firefox to help set the right expectations.
Depends on: 706535
Depends on: 706727
Depends on: 706096
Blocks: 706916
Depends on: 706931
(In reply to chris hofmann from comment #147)
> in the end these are still hangs, and reflective of a poor user experience. 
> if we find a disproportionate number of hangs on old systems, or new
> netbooks, then maybe we need to adjust system requirements or at least warn
> users they may not get the best experience on the system where they just
> installed firefox to help set the right expectations.

My point is simply that a lot of low-powered environments *suck* at cold startup.  You basically can't touch the machine for a good 60 seconds after startup.  If multiple apps start up concurrently and we're part of that, we not "hanging" we just are one part of a larger suckful state.  Which is why I think it's important to understand the conditions at the time of a hang, to distinguish between "Firefox sucks and everything else is great" vs. "everything sucks when xyz happens". We want to give high priority to the former, no priority to the latter.
Depends on: 707436
Depends on: 705277
Depends on: 710705
Depends on: 711086
Depends on: 717382
Depends on: 717386
Depends on: 718750
Depends on: 723982
Depends on: 723999
Depends on: 724014
Depends on: 724027
Depends on: 724044
Depends on: 725291
Depends on: 726119
Depends on: 729643
Given that this seems pretty broad in scope, is there anything QA can do to verify this fix?
Whiteboard: [Snappy:p1] → [Snappy:p1][qa?]
This was preffed off, there is really nothing for QA to do at the moment.
Whiteboard: [Snappy:p1][qa?] → [Snappy:p1][qa-]
Depends on: 735257
Depends on: 732122
Depends on: 732405
Depends on: 734240
Depends on: 735247
Depends on: 735437
Depends on: 735440
Depends on: 735441
Depends on: 735446
Depends on: 748856
Depends on: 749224
Depends on: 753933
Depends on: 754922
Depends on: 738751
Depends on: 739594
Depends on: 743653
Depends on: 750672
Depends on: 753504
Depends on: 755436
Depends on: 757910
Depends on: 760081
Depends on: 762487
Depends on: 764371
Depends on: 764997
Depends on: 876425
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: