Open Bug 1310255 Opened 8 years ago Updated 2 years ago

Build or modify current BHR / chromehangs dashboard to show Gecko_Child_ForcePaint hangs

Categories

(Toolkit :: Telemetry, defect, P3)

Version: 49 Branch

People

(Reporter: mconley, Unassigned)

References

(Depends on 2 open bugs)

Attachments

(3 files, 10 obsolete files)

Bug 1310011 and bug 1303077 give BHR the ability to get annotated hangs from the main thread of the content process during a tab switch with e10s enabled. Bug 1310246 should give us those annotations as well for chromehangs.

The reason we're doing this is to try to figure out why some users are seeing tab switch spinners with e10s enabled (see bug 1300411). The pseudostacks from BHR and native stacks from chromehangs (from Nightly) should hopefully give us data on where we need to be spending our time putting in hooks to optimize for paint.

We're going to need to get this data out of Telemetry once it starts coming in. Neither the BHR dashboard (https://telemetry.mozilla.org/hang/bhr/) nor the chromehangs dashboard (https://telemetry.mozilla.org/chromehangs) seem to be designed to display only content process hangs, and neither seems to have the ability to filter by annotation.

If it's straightforward, we should add those abilities. It doesn't have to be pretty; it just needs to give us the stacks and filter out the ones we don't care about (any stacks that don't have the ForcePaintInProgress annotation).

If it's not straightforward, I'm also open to hacking up a quick and dirty dashboard, but I'll need the API endpoint to query the data from.
Cc'ing canuckistani and rweiss as requested during the e10s cross-functional.
Hey rvitillo, I talked to jchen a bit about this, and brought up two things:

1) The BHR thread hangs dashboard might be using the old (pre-unified) Telemetry format, so he's unsure how long it's liable to keep working
2) It looks like the mapreduce job at https://github.com/darchons/hang-telemetry-tools/blob/master/mapreduce-bhr.py makes no mention of the annotations on the thread hangs, so it doesn't look like the dashboard has the ability to try to filter by them.

Is (1) a big concern, and how hard would it be to include the annotations in (2) in the stacks that the BHR dashboard gathers? Is it also possible to make it easier to differentiate child thread hangs from parent process thread hangs?

I think I'm willing to put in the effort to re-jig the BHR dashboard to do what I need, but I'm less confident about the API endpoint.
Flags: needinfo?(rvitillo)
(In reply to Mike Conley (:mconley) - (high latency on reviews and needinfos) from comment #2)
> Hey rvitillo, I talked to jchen a bit about this, and brought up two things:
> 
> 1) The BHR thread hangs dashboard might be using the old (pre-unified)
> Telemetry format, so he's unsure how long it's liable to keep working
> 2) It looks like the mapreduce job at
> https://github.com/darchons/hang-telemetry-tools/blob/master/mapreduce-bhr.
> py makes no mention of the annotations on the thread hangs, so it doesn't
> look like the dashboard has the ability to try to filter by them.
> 
> Is (1) a big concern, and how hard would it be to include the annotations in
> (2) in the stacks that the BHR dashboard gathers? Is it also possible to
> make it easier to differentiate child thread hangs from parent process
> thread hangs?

I think that job has to be rewritten to use Spark. The problems I see with it in its current state are the following:
- it's not scheduled with Airflow, so we don't know whether it has been running successfully;
- it's using our very old mapreduce framework which we no longer support;
- it's using pre-unified Telemetry which we no longer support;
- neither the job nor the dashboard has a clear owner.
Flags: needinfo?(rvitillo)
Attached file In lieu of a dashboard (obsolete) —
I did some Spark / iPython hackery to extract this stuff. This looks at pings submitted from the Nightly channel on October 15th, sorted by stack frequency.

Treat this data with some suspicion, as I haven't had anybody proof my (admittedly hackish) iPython notebook just yet.
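
For reference, here is roughly what that extraction can look like. This is a minimal sketch, not the actual notebook: it assumes the 2016-era python_moztelemetry get_pings API on the Telemetry Spark cluster and a saved-session ping layout of payload/childPayloads[*]/threadHangStats; the thread name "Gecko_Child" and the annotation shape are assumptions.

from moztelemetry import get_pings  # sc is the notebook's SparkContext

pings = get_pings(sc, app="Firefox", channel="nightly",
                  submission_date="20161015", doc_type="saved_session",
                  fraction=1.0)

def annotated_hang_stacks(ping):
    # Yield pseudostacks (as tuples) for content-process hangs carrying the
    # ForcePaintInProgress annotation; the annotation shape is assumed here.
    for child in ping.get("payload", {}).get("childPayloads") or []:
        for thread in child.get("threadHangStats") or []:
            if thread.get("name") != "Gecko_Child":  # content main thread (assumed name)
                continue
            for hang in thread.get("hangs") or []:
                notes = hang.get("annotations") or []
                if any(a.get("ForcePaintInProgress") == "true" for a in notes):
                    yield tuple(hang.get("stack") or [])

# Count how often each distinct pseudostack shows up, most frequent first.
stack_counts = sorted(pings.flatMap(lambda p: list(annotated_hang_stacks(p)))
                           .countByValue().items(),
                      key=lambda kv: kv[1], reverse=True)
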
Comment on attachment 8801853 [details]
In lieu of a dashboard

Here's the iPython notebook, with results at the bottom:

https://gist.github.com/mikeconley/1888194c341e3d4da87207540603e708
Attachment #8801853 - Attachment is obsolete: true
I will be the first to admit that this notebook could be cleaned up a bit, but I _think_ it's showing us useful things.

Can somebody double-check my work to make sure I've done this right, and we're not looking at bogus stacks here? chutten maybe?
Flags: needinfo?(chutten)
* Use get_pings_properties to filter the pings' contents down to just the fields you're interested in. This'll improve runtime for when/if you run this on larger populations.
* Are you worried about a particular client being represented too often? You'll want to pass the RDD through get_one_ping_per_client (after get_pings_properties) if so. (( The question I predict is "Should I be so worried?" And the answer is... if the idea is to take from this an ordered "Hit List" of worst offenders which, when fixed, will improve lives for the largest _population_ (instead of clearing the most hangs, which is different), then you'll want get_one_ping_per_client to quiet down the chattier clients. ))
* Is python's sort specifically good if the collection is already mostly sorted? If not, I'd recommend not sorting as-you-go
* For ||ization's sake, you may wish to maintain the collection as an RDD for longer. As soon as you collectAsMap, all but one of your cluster's nodes sit idle
* get_stack_hits is unused
* For fun, maybe print out a count of stacks, frames so brains like mine can get a handle on how long the long tail is on these.

But, yeah, it looks legit to me.

For extra Software Engineering points you could build something like an ordered trie to figure out what the most common hangs are by longest common stack prefix. But I'm not sure that'd tell us much more than we can already learn from this (that there's something in init, something in RPC, something in GC, and something in paint).
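
To make those suggestions concrete, here is a hedged sketch of how they could be folded in, under the same assumptions as the earlier sketch (2016-era python_moztelemetry helpers, pings as built above, illustrative field paths):

from moztelemetry import get_pings_properties, get_one_ping_per_client

# Trim each ping down to just the fields the analysis needs.
subset = get_pings_properties(pings, ["clientId", "payload/childPayloads"])

# Optionally de-bias chatty clients: keep one ping per client.
subset = get_one_ping_per_client(subset)

def hang_stacks(ping):
    # Yield pseudostacks from the trimmed ping (keys match the property paths).
    for child in ping.get("payload/childPayloads") or []:
        for thread in child.get("threadHangStats") or []:
            for hang in thread.get("hangs") or []:
                yield tuple(hang.get("stack") or [])

# Stay in the RDD for the aggregation so the whole cluster participates,
# and sort once at the end instead of as-you-go.
top_stacks = (subset
              .flatMap(lambda p: list(hang_stacks(p)))
              .map(lambda stack: (stack, 1))
              .reduceByKey(lambda a, b: a + b)
              .sortBy(lambda kv: kv[1], ascending=False)
              .take(50))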
Flags: needinfo?(chutten)
Priority: -- → P3
In bug 1310880, we got rid of the annotation; instead, there's now a dedicated Gecko_Child_ForcePaint "thread" listed that only collects hangs during the force paint window.

So if we're going to build some kind of dashboard, we should take the data from that thread in Telemetry, instead of using the annotation (which has been removed).
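
A rough sketch of what that change means for the extraction, assuming the new thread shows up in threadHangStats like any other BHR thread, so the filter moves from the annotation to the thread name:

def force_paint_hangs(ping):
    # Yield hang entries from the dedicated Gecko_Child_ForcePaint "thread";
    # the layout is assumed to match the other BHR threads in threadHangStats.
    for child in ping.get("payload", {}).get("childPayloads") or []:
        for thread in child.get("threadHangStats") or []:
            if thread.get("name") == "Gecko_Child_ForcePaint":
                for hang in thread.get("hangs") or []:
                    yield hang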
Summary: Build or modify current BHR / chromehangs dashboard to show content process hangs with ForcePaintInProgress annotation → Build or modify current BHR / chromehangs dashboard to show Gecko_Child_ForcePaint hangs
Attached file Very early stacks (obsolete) —
The patch to introduce Gecko_Child_ForcePaint only became available today, but I got excited and decided to see if there was any Telemetry data yet.

It turns out "yes", and I've attached the stacks we've got so far. Probably too early to make any claims about weighting and priority since the N is so small, but I thought this was kind of interesting.
These stacks are really, really early - like, we should wait until a weekend's worth of data has come in - but I still find what we've got interesting.

In particular, I see that the most frequent top-frame in the hang stacks is (chrome script). Here's that chunk from what I got from Telemetry:

{'frame': u'(chrome script)',
      'hits': 42,
      'stacks': [((u'Startup::XRE_InitChildProcess',
         u'IPDL::PContent::RecvLoadProcessScript',
         u'(chrome script)'),
        14),
       ((u'Startup::XRE_InitChildProcess',
         u'IPDL::PBrowser::RecvLoadRemoteScript',
         u'(chrome script)'),
        14),
       ((u'Startup::XRE_InitChildProcess',
         u'IPDL::PBrowser::RecvAsyncMessage',
         u'(chrome script)'),
        5),
       ((u'Startup::XRE_InitChildProcess',
         u'IPDL::PContent::RecvPBrowserConstructor',
         u'(chrome script)'),
        4),
       ((u'Startup::XRE_InitChildProcess',
         u'IPDL::PContent::RecvAsyncMessage',
         u'(chrome script)'),
        3),
       ((u'Startup::XRE_InitChildProcess',
         u'Events::ProcessGeckoEvents',
         u'(content script)',
         u'(chrome script)'),
        1),
       ((u'Startup::XRE_InitChildProcess',
         u'Timer::Fire',
         u'(content script)',
         u'(chrome script)'),
        1)]},

I thought that the work in bug 1279086 would have made the above stacks impossible to obtain during the force paint window? Is it possible that we fail to interrupt the JS engine in some cases?
Flags: needinfo?(wmccloskey)
It's possible that we're not in JS at this time. We could be in a deeper stack frame that's not annotated. Maybe ctypes code or OS.File or something else that might block for a while. It's a shame we're not getting the actual filename/lineno for the chrome script. It looks like that only happens if the filename is too long?

http://searchfox.org/mozilla-central/rev/84075be5067b68dc6cb3b89f999645650e68c05b/xpcom/threads/ThreadStackHelper.cpp#491

That seems like something we should be able to fix.

I also realized another potential issue. I'll comment in bug 1310880. I don't think it's the problem here since we're supposedly in script when this happens. But we should fix it.
Flags: needinfo?(wmccloskey)
Attached file Weekend stacks (obsolete) —
These are the weekend stacks (from before bug 1312080 landed).
Attachment #8803442 - Attachment is obsolete: true
See Also: → 1312540
Depends on: 1312597
Attached file Stack snapshot (October 25, 11AM ET) (obsolete) —
I've got a Spark job processing stacks every night, and I guess the next step is to massage the ipynb file into something that a dashboard can display. I'll hopefully be able to do that soon.

In the meantime, here's a snapshot from today at 11AM.
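
One low-effort way to bridge the nightly job and a dashboard is to have the job emit a dated JSON snapshot that a static page can fetch. The filename and schema below are illustrative, not an agreed format:

import datetime
import json

# top_stacks is the (stack, hits) list produced by the nightly Spark job
# (see the earlier sketches); the filename and schema are made up.
snapshot = {
    "date": datetime.date.today().isoformat(),
    "stacks": [{"stack": list(stack), "hits": hits} for stack, hits in top_stacks],
}
with open("force_paint_hangs-%s.json" % snapshot["date"], "w") as out:
    json.dump(snapshot, out, indent=2)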
Attachment #8804032 - Attachment is obsolete: true
Attachment #8804036 - Attachment is obsolete: true
Bug 1312597 is in today's Nightly, and it looks like it's doing a good job of breaking up the (chrome script) block that was the #2 cause in yesterday's hang stacks list.

This has moved the GMP blocking stuff into 2nd place.
Attachment #8804330 - Attachment is obsolete: true
At blassey's suggestion, I re-jigged my iPython notebook to multiply histogram values by their indexes and sum them (because the "sum" property appears to be busted, wtf?).

This kinda changes stuff. Bug 1313125 suddenly seems more interesting to hop on. I've posted a patch there.
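
For context, the weighting looks roughly like this: approximate the total accumulated hang time for a stack by multiplying each histogram bucket label (hang duration in ms) by its count and summing, rather than trusting the histogram's "sum" field. The field names below are assumptions about the BHR hang histogram shape, and hangs_rdd is a hypothetical RDD of individual hang entries:

def accumulated_time_ms(hang):
    # Approximate total hang time by weighting each bucket label (hang duration
    # in ms) by its count; assumes a {"histogram": {"values": {bucket: count}}}
    # shape, since the histogram's "sum" field looked unreliable.
    values = hang.get("histogram", {}).get("values") or {}
    return sum(int(bucket) * count for bucket, count in values.items())

# Aggregate by stack, ordered by total accumulated time rather than frequency.
# hangs_rdd is a hypothetical RDD of individual hang entries (see sketches above).
by_time = (hangs_rdd
           .map(lambda hang: (tuple(hang["stack"]), accumulated_time_ms(hang)))
           .reduceByKey(lambda a, b: a + b)
           .sortBy(lambda kv: kv[1], ascending=False))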
Attached file 20161118 (obsolete) —
Here are the stacks from Nov 5 (so after bug 1312880, bug 1312540 and bug 1313125 landed) to last night (Nov 17).
Attachment #8804863 - Attachment is obsolete: true
Attachment #8807236 - Attachment is obsolete: true
Attached file 20161118-by-time.json (obsolete) —
Here are the stacks from Nov 5 (so after bug 1312880, bug 1312540 and bug 1313125 landed) to last night (Nov 17), but organized by total accumulated hang time as opposed to frequency.
Looking at just Windows now. Here are the stacks organized by time.
Attachment #8812201 - Attachment is obsolete: true
Attachment #8812202 - Attachment is obsolete: true
Attached file 20161121 by frequency
... and frequency.
Attached file Symbolicated Native Stacks (obsolete) —
Depends on: 1344003
Severity: normal → S3