Closed Bug 1303077 Opened 4 years ago Closed 4 years ago

Gather client-side stacks for the content process when a hang occurs after paint is forced

Categories

(Firefox :: Tabbed Browser, defect, P2)

defect

Tracking

()

RESOLVED DUPLICATE of bug 1310880
Firefox 52
Tracking Status
firefox52 --- fixed

People

(Reporter: mconley, Assigned: mconley)

References

(Blocks 3 open bugs)

Details

Attachments

(1 file)

According to gsvelto, this kind of wizardry is possible now.
Almost, I'm about to land bug 1280477 which introduces an executable that can analyze minidumps and spit out stacks from them. There's a few gotchas right now though:

- It's designed to populate main crash events, so it'll need some tweaking to analyze any kind of minidump and return output in a way that's not just appending a JSON blob to an existing file

- It analyzes *the entire minidump* so if you're only interested in one process or one thread we've got no option to filter it out, but this could be added

- We don't have proper metadata-based stack-walking on many platforms so when working on release builds that have no frame pointers the stacks are walked using stack scanning which might or might not yield a proper stack trace. We're planning on adding proper metadata-based stack-walking but it's a long term goal as it requires significant infrastructure work

I'll soon modify this stuff to deal with minidump generated from content process crashes (bug 1293656); once that's done we'll have 90% of what's needed to use it to get client-side stacks from any sort of hangs.
Depends on: 1280477
Note that using Breakpad for this might be a little heavyweight--it's going to capture a dump of the entire process. We have all the machinery for that, but I don't know what the perf impact looks like. We haven't ever worried about that in the past because it's only been used in cases of crashes/hangs. The profiler machinery feels like it might be more suitable to this, but I can't recall if it requires cooperation from the content process to get a stack out of it (which obviously doesn't work well if that process is hung).
We could use the hang monitor channel to send a message asking a tab to walk the stack and send a stack trace back. This would be a pretty simple service to implement since the hang monitor already runs on a separate thread in the child process. It could call MozStackWalk and send the result back as a string.
Supposing we do this, what's the right ingestion end point for all of these stacks? The ChromeHangs stuff (which, I think, is part of BHR? I think?) might be one place.

Bug 1225851 offers an alternative, where stacks can be sent up to Telemetry, but a (very quick) scan of the bug shows that:

a) It hasn't landed yet, and
b) It has a requirement of one stack per "key", so unless we want to bump our key per spinner, this might not be the right ingestion point.
See Also: → 1225851
Cc'ing gfritzsche, who might find this stuff interesting.
(In reply to Mike Conley (:mconley) - (needinfo me!) from comment #4)
> Bug 1225851 offers an alternative, where stacks can be sent up to Telemetry,
> but a (very quick) scan of the bug shows that:
> 
> a) It hasn't landed yet, and
> b) It has a requirement of one stack per "key", so unless we want to bump
> our key per spinner, this might not be the right ingestion point.

Re a), it's closing in now and we could speed it up by taking over parts.
For b), the "one stack per key" limitation was to simplify design as the original request didn't need more. We could change that if needed.

Another thing:
Both chromeHangs and bug 1225851 only collect the stack for one thread.
I assume you'd need the state of all threads for diagnosis?
Priority: -- → P2
Depends on: 1225851
See Also: 1225851
(In reply to Georg Fritzsche [:gfritzsche] from comment #6)
> 
> Another thing:
> Both chromeHangs and bug 1225851 only collect the stack for one thread.
> I assume you'd need the state of all threads for diagnosis?

Hard to say - it might be different from cause to cause. Good thing to keep in mind though, so thanks.

Hey billm, can you describe how you plan on gathering the stack from the content process, and what kind of API you're hoping for so we can stash the stack in Telemetry? It looks like the patch in bug 1225851 adds a captureStack method to nsITelemetry, but only for the thread it's called from. If the main thread in the content process is blocked, (if I understand things correctly) I suspect we'll need to use a slightly different mechanism to get its stack.

I just want to get a sense of what kind of changes I need to make in that patch before I proceed.
Flags: needinfo?(wmccloskey)
My basic plan was to send a message on the hang monitor channel asking for a stack. That would be received on the hang monitor thread in the child. It would invoke a method to capture the main thread's stack. So we don't actually need to pass a thread into the capture routine--we could default to the main thread. But I guess there's no harm in asking for a specific thread. We'll also need the capture routine to be threadsafe (it looks like it isn't exactly threadsafe right now, although it sort of tries to be) and to work in the child process.
Flags: needinfo?(wmccloskey)
We'll need to file a follow-up for thread-safety anyway, i filed bug 1307746 on that.

Is the main threads stack usually sufficient for the scenarios you are looking at?
I wonder if scenarios come up where the main thread is actually just waiting on other threads.
(In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> Is the main threads stack usually sufficient for the scenarios you are
> looking at?
> I wonder if scenarios come up where the main thread is actually just waiting
> on other threads.

I suspect the main thread is likely to be sufficient. If it's not, we can add that functionality later. I'd rather not hold up the bug any more than necessary.
So I've just posted a patch for review that, I believe, annotates hangs that occur on the main thread during the time between a force paint and the next time something from the content process is composited.

Note that I'm kind of a newbie to Gecko multi-threading. I know enough about it to be dangerous.

I'm hoping that aklotz can review the annotation stuff, and billm can make sure I'm popping and locking in the right places.
r+ for "popping and locking" ;-)
Comment on attachment 8800282 [details]
Bug 1303077 - Annotate hangs that occur on the content process main thread during paint forcing.

https://reviewboard.mozilla.org/r/85246/#review83828

This looks mostly good to me, but I don't see a corresponding call to UnregisterAnnotator anywhere here. You'll need to call that on the main thread when HangMonitorChild is destroyed. Failing to do so will cause BHR to crash if a hang occurs after the HangMonitorChild is gone.

r- just because I'd like to take another look at it once that call is in place.
Attachment #8800282 - Flags: review?(aklotz) → review-
Comment on attachment 8800282 [details]
Bug 1303077 - Annotate hangs that occur on the content process main thread during paint forcing.

https://reviewboard.mozilla.org/r/85246/#review83828

Ah, good catch! Thanks!
Comment on attachment 8800282 [details]
Bug 1303077 - Annotate hangs that occur on the content process main thread during paint forcing.

https://reviewboard.mozilla.org/r/85246/#review83848

Thanks!
Attachment #8800282 - Flags: review?(aklotz) → review+
Comment on attachment 8800282 [details]
Bug 1303077 - Annotate hangs that occur on the content process main thread during paint forcing.

https://reviewboard.mozilla.org/r/85246/#review83872

Is the idea here that we'll sort through the existing hang reports and now we'll be able to tell which ones are for tab switches?

::: dom/ipc/ProcessHangMonitor.cpp:1294
(Diff revision 2)
> +{
> +  MOZ_RELEASE_ASSERT(NS_IsMainThread());
> +  MOZ_RELEASE_ASSERT(XRE_IsContentProcess());
> +
> +  if (HangMonitorChild* child = HangMonitorChild::Get()) {
> +    child->ClearForcePaint();

This isn't really correct. There are cases where we receive a ForcePaint message but choose not to paint (if the message is stale or the docshell was already active for example). I think it would be better to unset the "in progress" flag at the end of the interrupt callback.
Attachment #8800282 - Flags: review?(wmccloskey) → review+
Comment on attachment 8800282 [details]
Bug 1303077 - Annotate hangs that occur on the content process main thread during paint forcing.

https://reviewboard.mozilla.org/r/85246/#review83872

Yep. I'll either use a Spark cluster to generate a list of stacks by frequency, or (maybe more sustainably) work with someone like rvitillo to modify the API endpoint that https://telemetry.mozilla.org/hang/bhr/ reads from so that it can display content process hangs and show / filter by annotations.

> This isn't really correct. There are cases where we receive a ForcePaint message but choose not to paint (if the message is stale or the docshell was already active for example). I think it would be better to unset the "in progress" flag at the end of the interrupt callback.

I worry that flipping the bool at the end of the callback doesn't cast a wide enough net. I want to annotate hangs that occur between the time that the user clicks on a tab, and the time that the tab content displays itself.

Locally, when I switch tabs between, say, YouTube, and the HTML5 spec, I can get the spinner to display itself (due to big layout calculations) but the hangs are not annotated because (I think) we had successfully exited RecvSetDocShellIsActive.

So I worry that knowing that we exited RecvSetDocShellIsActive isn't enough to say that we want to stop annotating hangs.

Perhaps the naming is wrong - instead of using mForcePaintInProgress, I should call it mTabPaintInProgress?

Is what I'm saying making sense, or am I missing something?
ni? for comment 19
Flags: needinfo?(wmccloskey)
(In reply to Mike Conley (:mconley) - (needinfo me!) from comment #19)
> > This isn't really correct. There are cases where we receive a ForcePaint message but choose not to paint (if the message is stale or the docshell was already active for example). I think it would be better to unset the "in progress" flag at the end of the interrupt callback.
> 
> I worry that flipping the bool at the end of the callback doesn't cast a
> wide enough net. I want to annotate hangs that occur between the time that
> the user clicks on a tab, and the time that the tab content displays itself.
> 
> Locally, when I switch tabs between, say, YouTube, and the HTML5 spec, I can
> get the spinner to display itself (due to big layout calculations) but the
> hangs are not annotated because (I think) we had successfully exited
> RecvSetDocShellIsActive.

If we're hung in layout, we never should have even started RecvSetDocShellIsActive. It only runs if we're running the event loop or if we're stuck in JS.

We actually paint inside of RecvSetDocShellIsActive. So if you unset the flag at the end of it, we will have already painted.

I'm not sure why you wouldn't be getting the hang annotation in that case, but it shouldn't be related to where you set the flag. I think more investigation is needed.
Flags: needinfo?(wmccloskey)
Comment on attachment 8800282 [details]
Bug 1303077 - Annotate hangs that occur on the content process main thread during paint forcing.

https://reviewboard.mozilla.org/r/85246/#review83872

> I worry that flipping the bool at the end of the callback doesn't cast a wide enough net. I want to annotate hangs that occur between the time that the user clicks on a tab, and the time that the tab content displays itself.
> 
> Locally, when I switch tabs between, say, YouTube, and the HTML5 spec, I can get the spinner to display itself (due to big layout calculations) but the hangs are not annotated because (I think) we had successfully exited RecvSetDocShellIsActive.
> 
> So I worry that knowing that we exited RecvSetDocShellIsActive isn't enough to say that we want to stop annotating hangs.
> 
> Perhaps the naming is wrong - instead of using mForcePaintInProgress, I should call it mTabPaintInProgress?
> 
> Is what I'm saying making sense, or am I missing something?

Ah, I understand what's going wrong. We don't currently support annotating _in progress_ hangs. That's why I wasn't seeing annotations when I expected.

So this patch is going to only annotate hangs that occur between the point where we request a force paint, and when the force paint completes. I suspect the more common case for spinners is that the content process is already hung at the time that force paint is requested. I'll file a follow-up bug to allow us to annotate in-progress hangs.
No longer depends on: 1225851, 1280477
Summary: Gather client-side stacks for the content process when the tab switch spinner is shown → Gather client-side stacks for the content process when a hang occurs after paint is forced
Blocks: 1310011
Pushed by mconley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/aeadb583041d
Annotate hangs that occur on the content process main thread during paint forcing. r=aklotz,billm
https://hg.mozilla.org/mozilla-central/rev/aeadb583041d
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 52
Blocks: 1310255
See Also: → 1310809
https://hg.mozilla.org/integration/mozilla-inbound/rev/b4a6dca904d39041cff05765be2fa25f05f0c752
Backed out changeset aeadb583041d (bug 1303077) because the work in the upcoming bug 1310880 supercedes it.
I backed this out because the work in bug 1310880 effectively does the same thing, but better.
Resolution: FIXED → DUPLICATE
Duplicate of bug: 1310880
You need to log in before you can comment on or make changes to this bug.