Closed Bug 1716946 Opened 3 years ago Closed 10 months ago

Investigate why hanging the main process' main thread doesn't eventually lead to a crash

Categories

(Toolkit :: Crash Reporting, defect, P2)

defect

Tracking

()

RESOLVED FIXED
116 Branch
Tracking Status
firefox116 --- fixed

People

(Reporter: gsvelto, Assigned: afranchuk)

References

Details

Attachments

(2 files)

As per title. Bug 1714779 is odd, we should have a hang detector which crashes Firefox after a while so that we get notified of the issue... but it appears not to be working, or maybe I misremember how it works? Either way it's very, very bad that a complete hang of the main process flies under our radar until it hits the release channel so making this a priority.

Mike, you know the background hang reporter well, do you think it could be used to detect and act on this kind of stuff? Like crashing Firefox if the main process' main thread doesn't respond for a specified amount of time.

Flags: needinfo?(mconley)

Hi gsvelto,

Sorry for the delay answering this. 1 month! My bad.

In theory, yes, BHR could do this, because it's already doing the work of noticing when a thread is hanging and then making decisions on what to do with that thread.

BHR is supposed to be somewhat passive, however - it observes and reports, but doesn't otherwise manipulate the hanging threads.

I don't think it would be too difficult to have some kind of main-thread hang threshold after which we intentionally crash the browser. If we wanted to go that route, I think we'd need to decide what that threshold is - the rest would be pretty straight-forward.

cc'ing dthayer because I know he has also worked on the BHR backend.

Flags: needinfo?(mconley)

We actually already do detect this on Nightly, and you can view the results here. (The &annotations=Unrecovered%3true bit in the url controls this)

I think surfacing this with the usual crash reporter system is a good idea, but BHR only runs on Nightly. Enabling it past that could be worth doing, but we haven't allocated resources to doing so and proving out that there's no perf cost for doing so. If you wanted to crash FF on a permahang, you'd probably want to do so here. However, that will happen after about 8 seconds of permahanging, which I think is fine for generating a report, but maybe not so fine for crashing the whole browser. So we may want to up that number if we want to start crashing.

Thanks to both! I think it would be worth crashing the main process eventually when perma-hung because we'd have a very visible signal that something's going wrong. My reasoning came from bug 1714779 - which was fixed in bug 1708224 - but not before going all the way to release. I agree that 8 seconds are too short of a time for that, 30 or even 60 might be enough though. We need to get some crashes to notice, but we don't need a lot and we don't want to needlessly crash Firefox just because the user has a slow machine.

Severity: -- → S2

Demoting this, it's bad but I don't have time to work on it.

Priority: P1 → P2
Blocks: 1766613
See Also: → 1793525
Duplicate of this bug: 1766613

This would have helped with bug 1823159. Some thoughts about this:

  • On Windows this should be relatively simple to implement. WER detect hangs and we can capture them directly (we already did in fact, but we disabled it in bug 1714334 because we don't know how to handle them).
  • Both Linux and macOS have OS facilities that detect hangs in graphical applications but I don't know how to leverage them (or if they can be leveraged at all).
  • Assuming we start accepting these crashes we need a way to flag them so that they're clearly labeled on Socorro
See Also: → 1823159

(In reply to Gabriele Svelto [:gsvelto] from comment #0)

As per title. Bug 1714779 is odd, we should have a hang detector which crashes Firefox after a while so that we get notified of the issue... but it appears not to be working, or maybe I misremember how it works? Either way it's very, very bad that a complete hang of the main process flies under our radar until it hits the release channel so making this a priority.

Although your above post is not related to bug 1817847 that I reported as soon as the first 111 beta hit my PC, I agree. And I thought it odd that my report gained no traction across the entire 111 beta, and only once 111 went RTM, more folks started reporting the freeze and it received attention. After all, isn't a beta meant to find bugs out in the wild that hadn't shown up on earlier channels? Now I understand you folks probably don't have the resources to immediately investigate each bug report that comes in (and I totally get that), but perhaps at least have bugbug request a dump in cases of freezes (or something similar), because I would've gladly provided that, or anything else that could've helped squash the bug. I don't want bugs like this to deter more folks from using Firefox.

perhaps at least have bugbug request a dump in cases of freezes (or something similar), because I would've gladly provided that, or anything else that could've helped squash the bug. I don't want bugs like this to deter more folks from using Firefox.

I've looked a bit at this in the interest of doing the incident analysis on bug 1817847, and we're at the point where to have such a severe problem hit release, a lot of different things have to all go wrong in sequence. It started with bugbug seeing "previous session should restore" in your description and triaging the bug in the wrong component (that's Artificial Intelligence for you!). The triage owner there was slow to react, and had to throw it back to General. This meant we lost about 3 weeks before QA reached out to you to ask about (standard) dumps. You also identified ESET as the culprit at some point (https://bugzilla.mozilla.org/show_bug.cgi?id=1817847#c11) which combined with zero other reports caused that bug to get less attention. That's a bit the nature of a race, the second report also fingered a MS update (KB5023706) as the cause and it's only because we noticed several independent reports of the same symptoms all being blamed on different things that alarms went off. Asking people for dumps can be tricky, it depends on how technical the user is and sometimes doesn't work very well (see e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=1822502#c16). When Yannis posted the stack, he missed the recursive lock. When I saw it, I noticed the Migration part and immediately thought of bug 1799470, but then discarded the idea (wrongly) after looking at the code. And so on. And finally, there's this bug, where comment 5 kind of summarizes the resourcing situation. At least we're in a better shape here now so I have some hopes that we can look at this in the near future.

(I'll mark both of our comments offtopic now, but I wanted to give a bit of background on what happened here. Please do keep reporting such issues even if sometimes we'll not manage to triage them correctly right away. We had zero people reproducing internally, so without you and other users providing those dumps this would have been a worse nightmare than it already was)

(In reply to Gabriele Svelto [:gsvelto] from comment #7)

This would have helped with bug 1823159. Some thoughts about this:

  • On Windows this should be relatively simple to implement. WER detect hangs and we can capture them directly (we already did in fact, but we disabled it in bug 1714334 because we don't know how to handle them).
  • Both Linux and macOS have OS facilities that detect hangs in graphical applications but I don't know how to leverage them (or if they can be leveraged at all).
  • Assuming we start accepting these crashes we need a way to flag them so that they're clearly labeled on Socorro

Is the use of OS facilities preferable over using some mostly-OS-generic integration with BHR (or some other independent watchdog solution)? In most cases detecting UI hangs can boil down to having a periodic callback registered in the event/render loop (assuming the GUI rendering is single-threaded, which I believe across these platforms is always true).

(In reply to Alex Franchuk from comment #10)

Is the use of OS facilities preferable over using some mostly-OS-generic integration with BHR (or some other independent watchdog solution)? In most cases detecting UI hangs can boil down to having a periodic callback registered in the event/render loop (assuming the GUI rendering is single-threaded, which I believe across these platforms is always true).

There's a few things about BHR that would need to be done to use it: first of all it's nightly only and I don't know if we could extend it to the beta/release channels. In the case of bug 1823159 it seems that a few instances of the problem manifested themselves in beta, so maybe beta would be enough but we would need to check. Either way it would be nice to have something that works in release too.

Another thing with BHR is that it doesn't yet do what we need (see comment 2) and it has power consumption implications because it needs to polls the threads it monitors.

Given the above I think that OS facilities might be a better option but - given I don't know if we have the appropriate mechanisms on Linux/macOS - we still need to explore our options before committing to an implementation.

Reading the BHR code, it looks like the main driver for only enabling on Nightly was the power consumption and potentially a lot of telemetry traffic, however https://bugzilla.mozilla.org/show_bug.cgi?id=965392#c10 indicates that maybe power consumption is no longer an issue. The detection method was also changed to use a timer which is cancelled when work completes, which means that there aren't wakeups firing all the time which would be bad for power.

That said, turning on all BHR stuff in non-nightly is likely unnecessary to address main hangs. If we were to use BHR rather than OS facilities, I think the course of action would be to generalize much of the existing BackgroundHangMonitor into a super class which allows you to provide callbacks to use when a hang or permahang is detected. Such a class would be unrelated to telemetry, and would be flexible enough to allow us to do something specific for the main thread and reuse the pertinent parts of the BHR code without changing a lot of existing functionality.

As far as OS facilities:

  • Windows: As mentioned, windows can report hangs with WER, though I've had some difficulty finding any way to tune it (but maybe there's no need). Also I failed to actually find documentation of the WER_FAULT_REPORTING flag related to hangs, which is a bit frustrating.
  • MacOS: MacOS has a watchdog which is supposed to kill hanging applications, with termination reason 0x8badf00d.
  • Linux: I couldn't find any syscall/api dedicated to catching hung processes in linux nor GTK.

(In reply to Alex Franchuk from comment #12)

  • Linux: I couldn't find any syscall/api dedicated to catching hung processes in linux nor GTK.

You might be able to put something together with timerfd + O_ASYNC + F_SETSIG; I haven't tried it but the combination of those features should let you raise a chosen signal at a given time and dynamically postpone/disarm.

The sandbox blocks O_ASYNC and related fcntls — see bug 1328896 and bug 1408438 — but that doesn't matter for the parent process, and if we want to do this for child process main threads as well, it should be enough to set up the timer during startup before the sandbox is entered, and allow timerfd_settime in the sandbox policy if it isn't already.

(In reply to Gabriele Svelto [:gsvelto] from comment #7)

  • Both Linux and macOS have OS facilities that detect hangs in graphical applications but I don't know how to leverage them (or if they can be leveraged at all).

On Linux that seems to be a Wayland thing: some(?) compositors will ping the client (there's a protocol message for this) and if it doesn't reply they'll close the connection; GTK will kill the client process once it sees the socket is closed, and annoyingly it does this with _exit rather than abort so we won't get a crash report (see also the upstream bug report).

Given that improving this situation is pretty easy on windows using WER, and windows is the largest portion of users, I think the plan can be:

  1. Detect hangs on Windows by using non-fatal exceptions from WER.
  2. Detect hangs on Mac by adding watchdog support (which amounts to checking for the watchdog termination cause).
  3. Possibly add Linux support, though this would fundamentally differ from Windows/Mac since we'd manually implement the "this process is hung" logic. The timerfd that :jld mentioned should work; it amounts to doing essentially the same thing as a portion of the background hang monitor, the main difference being that it doesn't use nsITimer. I'm still fairly naive when it comes to firefox internals, so if it's the case that nsITimer can't be used very early in startup (e.g. due to requiring some runtime to be initialized), then timerfd would be better. The timer should likely be started and cancelled in the GTK event loop (unless there's some other more appropriate firefox event loop).
    • Newer Linux kernels support hung task detection, however it only operates on tasks which are in the TASK_UNINTERRUPTIBLE state (i.e. in a syscall). So it's essentially hung kernel/driver/hardware detection.

In all cases, when a hang is detected we should generate a minidump, kill the process, and bring up the crash reporter. We should make sure a crash ping is sent, too. I think it's fine to use a crash ping for this (as opposed to a separate hang ping), since the user experiences a crash. We will want to distinguish the event as a hang in the ping.

For WER-detected hangs, should we try to be "smart" and change the crashing thread to be the main UI thread? We don't actually know that the UI is what caused WER to detect a hang, though on the other hand:

  1. it's quite likely that UI threads are the only means by which WER can detect hangs, and
  2. the thread that is considered the "crashing thread" is likely incorrect (as seen in bug 1709423).
Assignee: nobody → afranchuk
Status: NEW → ASSIGNED
Depends on: 1826703

I read as much source code as I could find, but I could not figure out whether the macOS EXC_CRASH (SIGKILL) termination reason is exposed in the exception information. This is what is necessary to determine that the crash is the result of the watchdog (which indicates a hang). Given the information that is in the struct passed over the exception port, I don't think it is there unless it happens to be undocumented in the message trailer.

I also tried running an instrumented crash handler that would dump the exception information with a hang inserted, however it didn't seem like the crash handler was being invoked (though the system did correctly kill the process). The system logs showed the process was killed but they didn't indicate the termination reason (and no pop-up came up as the macOS documentation shows).

For now, let's try just windows support.

Attachment #9327280 - Attachment description: Bug 1716946 pt1 - Windows hanging UI thread support r=gsvelto → Bug 1716946 - Windows hanging UI thread support r=gsvelto

As an addendum for any macOS experts, there are two problems I encountered:

  1. We install an exception handler in a process using task_set_exception_ports and as I understand it, it's set with relation to the child process after a fork. This works for various crashes (it's our crash reporting mechanism for main process crashes after all), however in my testing it wasn't called when a process was killed due to a hang. That being said, it's possible that something was misbehaving with lldb (while I was debugging) or the exception handling was going down a code path I wasn't expecting. This can be investigated further.
  2. There's simply no documentation about how, or if it's possible at all, to get the termination reason (as is referenced here and here). Based on the way the error messages are formatted in these cases, it seems as if the termination reason code is a separate code from the exception codes (since the error messages show dual 0 exception codes). Programmatically determining the termination reason is necessary to distinguish hangs.

(In reply to Alex Franchuk from comment #18)

  1. There's simply no documentation about how, or if it's possible at all, to get the termination reason (as is referenced here and here). Based on the way the error messages are formatted in these cases, it seems as if the termination reason code is a separate code from the exception codes (since the error messages show dual 0 exception codes). Programmatically determining the termination reason is necessary to distinguish hangs.

We probably have to dig again into Apple's kernel sources - the only way to gather reliable information about this. Theoretically we can capture EXC_CRASH exceptions since bug 1724928 however as you've noticed it's impossible for the termination reason to be carried by that exception, because the exception codes are empty. Additionally the documentation somehow mentions that EXC_CRASH is accompanied by an EXC_CORPSE_NOTIFY exception and that's complicated to say the least. IIRC the Darwin kernel handles exceptions differently depending on when they're delivered when a process is being torn down so chances are that particular exception is never delivered to the process itself or is turned into a different exception we don't/can't catch (such as EXC_CORPSE_NOTIFY).

FYI I had a very quick look at the relevant code and it's possible that what's happening is that a EXC_CRASH exception is delivered to the process but gets turned into an EXC_CORPSE_NOTIFY exception by the kernel. Catching that exception and retrieving the exception data at that point is complicated. Crashpad's code describes this fairly well and it's a lot of work. If there's a better way to do deal with the watchdog and we can pry out of Apple I would prefer it rather than mess with EXC_CORPSE_NOTIFY.

(In reply to Gabriele Svelto [:gsvelto] from comment #19)

We probably have to dig again into Apple's kernel sources - the only way to gather reliable information about this. Theoretically we can capture EXC_CRASH exceptions since bug 1724928 however as you've noticed it's impossible for the termination reason to be carried by that exception, because the exception codes are empty. Additionally the documentation somehow mentions that EXC_CRASH is accompanied by an EXC_CORPSE_NOTIFY exception and that's complicated to say the least. IIRC the Darwin kernel handles exceptions differently depending on when they're delivered when a process is being torn down so chances are that particular exception is never delivered to the process itself or is turned into a different exception we don't/can't catch (such as EXC_CORPSE_NOTIFY).

FWIW I did some extensive digging in Apple's open source code. Unfortunately whatever writes 0x8badf00d as a termination reason is not present (I could not find the constant anywhere). And on the other end of things, the system-level crash reporter and logging is also not open source, so I couldn't locate any code that reads the termination reason.

I propose we add a new bug for adding hang reporting to macOS (which will require crash reporting to work out-of-process) and land the windows patch. We can either close this bug when landing or create a new bug dedicated to windows, move the patch there, and land it (keeping this bug as a cross-OS one).

Agreed, let's get this bug moving. Let's open separate bugs for macOS and Linux and finish the Windows work here. Since bug 1667997 hasn't moved I'll prepare a trimmed own solution for bug 1826703 where we just generate the appropriate signature to tell these hangs apart from other crashes.

See Also: → 1837901
See Also: → 1837903

I just remembered something important: since we want to add a crash annotation we need a data review first.

Attachment #9338887 - Flags: data-review?(willkg)

Comment on attachment 9338887 [details]
Hang crash annotation data request

Data Review Form (to be filled by Data Stewards)

  1. Is there or will there be documentation that describes the schema for the ultimate data set in a public, complete, and accurate way?

Yes. It's documented in CrashAnnotations.yaml.

  1. Is there a control mechanism that allows the user to turn the data collection on and off? (Note, for data collection not needed for security purposes, Mozilla provides such a control mechanism) Provide details as to the control mechanism available.

Yes. The data is coming in via two different routes:

  • Hang annotation in crash reports: this is default off and follows crash reporting preferences
  • Hang in crash ping via glean: this is default on and follows telemetry preferences
  1. If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes.

  1. Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1. Technical data.

  1. Is the data collection request for default-on or default-off?
  • via crash reports is default off
  • via telemetry is default on
  1. Does the instrumentation include the addition of any new identifiers (whether anonymous or otherwise; e.g., username, random IDs, etc. See the appendix for more details)?

No.

  1. Is the data collection covered by the existing Firefox privacy notice? If unsure: escalate to legal if:

Yes.

  1. Does the data collection use a third-party collection tool? If yes, escalate to legal.

No.

Attachment #9338887 - Flags: data-review?(willkg) → data-review+

I'm reverting the dependency between bug 1826703 and this bug because we need the crash data within Socorro first, then we can proceed to identifying this reports. This means that this bug is fundamentally ready to land.

Blocks: 1826703
No longer depends on: 1826703
Pushed by afranchuk@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3406ed84fd90
Windows hanging UI thread support r=gsvelto
Status: ASSIGNED → RESOLVED
Closed: 10 months ago
Resolution: --- → FIXED
Target Milestone: --- → 116 Branch
Depends on: 1842041
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: