Open
Bug 1310255
Opened 8 years ago
Updated 2 years ago
Build or modify current BHR / chromehangs dashboard to show Gecko_Child_ForcePaint hangs
Categories
(Toolkit :: Telemetry, defect, P3)
NEW
People
(Reporter: mconley, Unassigned)
References
(Depends on 2 open bugs)
Details
Attachments
(3 files, 10 obsolete files)
Bug 1310011 and bug 1303077 give BHR the ability to get annotated hangs from the main thread of the content process during a tab switch with e10s enabled. Bug 1310246 should give us those annotations for chromehangs as well.

The reason we're doing this is to try to figure out why some users are seeing tab switch spinners with e10s enabled (see bug 1300411). The pseudostacks from BHR and the native stacks from chromehangs (from Nightly) should hopefully give us data on where we need to spend our time putting in hooks to optimize for paint.

We're going to need to get this data out of Telemetry once it starts coming in. Neither the BHR dashboard (https://telemetry.mozilla.org/hang/bhr/) nor the chromehangs dashboard (https://telemetry.mozilla.org/chromehangs) seems to be designed to display only content process hangs, and neither seems to have the ability to filter by annotation. If it's straightforward, we should add those abilities. It doesn't have to be pretty; it just needs to give us the stacks and filter out the ones we don't care about (any stacks that don't have the ForcePaintInProgress annotation). If it's not straightforward, I'm also open to hacking up a quick and dirty dashboard, but I'll need the API endpoint to query the data from.
Reporter
Comment 1•8 years ago
Cc'ing canuckistani and rweiss as requested during the e10s cross-functional.
Reporter
Comment 2•8 years ago
Hey rvitillo, I talked to jchen a bit about this, and brought up two things:

1) The BHR thread hangs dashboard might be using the old (pre-unified) Telemetry format, so he's unsure how long it's liable to keep working.
2) It looks like the mapreduce job at https://github.com/darchons/hang-telemetry-tools/blob/master/mapreduce-bhr.py makes no mention of the annotations on the thread hangs, so the dashboard doesn't appear to have the ability to filter by them.

Is (1) a big concern, and how hard would it be to include the annotations from (2) in the stacks that the BHR dashboard gathers? Is it also possible to make it easier to differentiate child process thread hangs from parent process thread hangs? I think I'm willing to put in the effort to re-jig the BHR dashboard to do what I need, but I'm less confident about the API endpoint.
Flags: needinfo?(rvitillo)
Comment 3•8 years ago
(In reply to Mike Conley (:mconley) - (high latency on reviews and needinfos) from comment #2)
> Hey rvitillo, I talked to jchen a bit about this, and brought up two things:
>
> 1) The BHR thread hangs dashboard might be using the old (pre-unified)
> Telemetry format, so he's unsure how long it's liable to keep working
> 2) It looks like the mapreduce job at
> https://github.com/darchons/hang-telemetry-tools/blob/master/mapreduce-bhr.py
> makes no mention of the annotations on the thread hangs, so it doesn't
> look like the dashboard has the ability to try to filter by them.
>
> Is (1) a big concern, and how hard would it be to include the annotations in
> (2) in the stacks that the BHR dashboard gathers? Is it also possible to
> make it easier to differentiate child thread hangs from parent process
> thread hangs?

I think that job has to be rewritten to use Spark. The problems I see with it in its current state are the following:
- it's not scheduled with Airflow, and as such we don't know if it has been running successfully or not;
- it's using our very old mapreduce framework, which we no longer support;
- it's using pre-unified Telemetry, which we no longer support;
- it doesn't have a clear owner (neither the job nor the dashboard).
Flags: needinfo?(rvitillo)
Reporter
Comment 4•8 years ago
I did some Spark / iPython hackery to extract this stuff. This looks at pings that were submitted from the Nightly channel on October 15th, sorted by stack frequency. Treat this data with some suspicion, as I haven't had anybody proof my (admittedly hackish) iPython notebook just yet.
Reporter
Comment 5•8 years ago
Comment on attachment 8801853 [details]
In lieu of a dashboard

Here's the iPython notebook, with results at the bottom: https://gist.github.com/mikeconley/1888194c341e3d4da87207540603e708
Attachment #8801853 -
Attachment is obsolete: true
Reporter
Comment 6•8 years ago
I will be the first to admit that this notebook could be cleaned up a bit, but I _think_ it's showing us useful things. Can somebody double-check my work to make sure I've done this right, and that we're not looking at bogus stacks here? chutten, maybe?
Flags: needinfo?(chutten)
Comment 7•8 years ago
* Use get_pings_properties to filter the pings' contents down to just the fields you're interested in. This'll improve runtime for when/if you run this on larger populations.
* Are you worried about a particular client being represented too often? If so, you'll want to pass the RDD through get_one_ping_per_client (after get_pings_properties). ((The question I predict is "Should I be so worried?" And the answer is... if the idea is to take from this an ordered "hit list" of worst offenders which, when fixed, will improve lives for the largest _population_ (instead of clearing the most hangs, which is different), then you'll want get_one_ping_per_client to quiet down the chattier clients.))
* Is Python's sort specifically good if the collection is already mostly sorted? If not, I'd recommend not sorting as-you-go.
* For ||ization's sake, you may wish to maintain the collection as an RDD for longer. As soon as you collectAsMap, all but one of your cluster's nodes sit idle.
* get_stack_hits is unused.
* For fun, maybe print out a count of stacks and frames so brains like mine can get a handle on how long the long tail is on these.

But, yeah, it looks legit to me. For extra Software Engineering points you could build something like an ordered trie to figure out what the most common hangs are by longest common stack prefix. But I'm not sure that'd tell us much more than we can already learn from this (that there's something in init, something in RPC, something in GC, and something in paint).
Flags: needinfo?(chutten)
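A minimal, pure-Python sketch of the "one ping per client" idea from comment 7: keep a single ping per clientId before counting hang stacks, so chatty clients don't dominate the totals. The real get_one_ping_per_client helper (from moztelemetry) operates on a Spark RDD; the ping shape below is a made-up stand-in, not the actual ping schema.

```python
from collections import Counter

def one_ping_per_client(pings):
    # Keep the first ping we encounter for each client.
    seen = {}
    for ping in pings:
        seen.setdefault(ping["clientId"], ping)
    return list(seen.values())

def count_top_frames(pings):
    # Tally the top frame of each hang stack across all pings.
    counts = Counter()
    for ping in pings:
        for stack in ping["stacks"]:
            counts[stack[-1]] += 1
    return counts

pings = [
    {"clientId": "a", "stacks": [("XRE_InitChildProcess", "(chrome script)")] * 3},
    {"clientId": "a", "stacks": [("XRE_InitChildProcess", "(chrome script)")] * 3},
    {"clientId": "b", "stacks": [("XRE_InitChildProcess", "Timer::Fire")]},
]

print(count_top_frames(pings))                       # client "a" counted twice
print(count_top_frames(one_ping_per_client(pings)))  # client "a" counted once
```

With deduplication, the chatty client "a" contributes 3 hits instead of 6, which is the "quiet down the chattier clients" effect described above.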
Updated•8 years ago
Priority: -- → P3
Reporter
Comment 8•8 years ago
In bug 1310880, we got rid of the annotation; instead, there's now a dedicated Gecko_Child_ForcePaint "thread" listed that only collects hangs during the force paint window. So if we're going to build some kind of dashboard, we should take the data from that thread in Telemetry instead of using the annotation (which has been removed).
Summary: Build or modify current BHR / chromehangs dashboard to show content process hangs with ForcePaintInProgress annotation → Build or modify current BHR / chromehangs dashboard to show Gecko_Child_ForcePaint hangs
Reporter
Comment 9•8 years ago
The patch to introduce Gecko_Child_ForcePaint only became available today, but I got excited, and decided to see if there was any Telemetry data just yet. It turns out "yes", and I've attached the stacks we've got so far. Probably too early to make any claims about weighting and priority since the N is so small, but I thought this was kind of interesting.
Reporter
Comment 10•8 years ago
These stacks are really, really early - like, we should wait until a weekend's worth of data has come in - but I still find what we've got interesting. In particular, I see that the most frequent top frame in the hang stacks is (chrome script). Here's that chunk from what I got from Telemetry:

{'frame': u'(chrome script)',
 'hits': 42,
 'stacks': [((u'Startup::XRE_InitChildProcess', u'IPDL::PContent::RecvLoadProcessScript', u'(chrome script)'), 14),
            ((u'Startup::XRE_InitChildProcess', u'IPDL::PBrowser::RecvLoadRemoteScript', u'(chrome script)'), 14),
            ((u'Startup::XRE_InitChildProcess', u'IPDL::PBrowser::RecvAsyncMessage', u'(chrome script)'), 5),
            ((u'Startup::XRE_InitChildProcess', u'IPDL::PContent::RecvPBrowserConstructor', u'(chrome script)'), 4),
            ((u'Startup::XRE_InitChildProcess', u'IPDL::PContent::RecvAsyncMessage', u'(chrome script)'), 3),
            ((u'Startup::XRE_InitChildProcess', u'Events::ProcessGeckoEvents', u'(content script)', u'(chrome script)'), 1),
            ((u'Startup::XRE_InitChildProcess', u'Timer::Fire', u'(content script)', u'(chrome script)'), 1)]}

I thought that the work in bug 1279086 would have made the above stacks impossible to obtain during the force paint window. Is it possible that we fail to interrupt the JS engine in some cases?
Flags: needinfo?(wmccloskey)
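For illustration, the record above can be sanity-checked in plain Python: each 'stacks' entry is a (pseudostack, hits) pair, and the pair counts sum to the record's 'hits' field. The data is copied from the comment; the aggregation-by-entry-point step is my own sketch, not part of the notebook.

```python
from collections import Counter

# Record copied from the Telemetry chunk in this comment.
record = {
    "frame": "(chrome script)",
    "hits": 42,
    "stacks": [
        (("Startup::XRE_InitChildProcess", "IPDL::PContent::RecvLoadProcessScript", "(chrome script)"), 14),
        (("Startup::XRE_InitChildProcess", "IPDL::PBrowser::RecvLoadRemoteScript", "(chrome script)"), 14),
        (("Startup::XRE_InitChildProcess", "IPDL::PBrowser::RecvAsyncMessage", "(chrome script)"), 5),
        (("Startup::XRE_InitChildProcess", "IPDL::PContent::RecvPBrowserConstructor", "(chrome script)"), 4),
        (("Startup::XRE_InitChildProcess", "IPDL::PContent::RecvAsyncMessage", "(chrome script)"), 3),
        (("Startup::XRE_InitChildProcess", "Events::ProcessGeckoEvents", "(content script)", "(chrome script)"), 1),
        (("Startup::XRE_InitChildProcess", "Timer::Fire", "(content script)", "(chrome script)"), 1),
    ],
}

# The per-stack hit counts should add up to the frame's total hits.
total = sum(hits for _stack, hits in record["stacks"])
assert total == record["hits"]  # 42

# Group hits by the second frame (the IPDL/event entry point under startup).
by_entry = Counter()
for stack, hits in record["stacks"]:
    by_entry[stack[1]] += hits
print(by_entry.most_common(3))
```

The grouping shows that most of these (chrome script) hangs enter through IPDL message handlers, which is the kind of breakdown a dashboard view would want to surface.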
It's possible that we're not in JS at this time. We could be in a deeper stack frame that's not annotated - maybe ctypes code or OS.File or something else that might block for a while.

It's a shame we're not getting the actual filename/lineno for the chrome script. It looks like that only happens if the filename is too long? http://searchfox.org/mozilla-central/rev/84075be5067b68dc6cb3b89f999645650e68c05b/xpcom/threads/ThreadStackHelper.cpp#491 That seems like something we should be able to fix.

I also realized another potential issue; I'll comment in bug 1310880. I don't think it's the problem here, since we're supposedly in script when this happens, but we should fix it.
Flags: needinfo?(wmccloskey)
Reporter
Comment 12•8 years ago
These are the weekend stacks (from before bug 1312080 landed).
Attachment #8803442 -
Attachment is obsolete: true
Reporter
Comment 13•8 years ago
Reporter
Comment 14•8 years ago
I've got a Spark job processing stacks every night, and I guess the next step is to massage the ipynb file into something that a dashboard can display. I'll hopefully be able to do that soon. In the meantime, here's a snapshot from today at 11AM.
Attachment #8804032 -
Attachment is obsolete: true
Attachment #8804036 -
Attachment is obsolete: true
Reporter
Comment 15•8 years ago
Bug 1312597 is in today's Nightly, and it looks like it's doing a good job of breaking up the (chrome script) block that was the #2 cause in yesterday's hang stacks list. This has moved the GMP blocking stuff into 2nd place.
Attachment #8804330 -
Attachment is obsolete: true
Reporter
Comment 16•8 years ago
I've got these stacks being generated daily.

Pseudostacks: http://nbviewer.jupyter.org/urls/s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/TabSpinnerStacks/data/TabSpinnerNightly-v1.2.ipynb

For some really long hangs, BHR attempts to get native stacks. Here are those as well, also being generated daily: http://nbviewer.jupyter.org/urls/s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/TabSpinnerNativeStacks/data/TabSpinnerNightly-NativeStacks-v1.0.ipynb
Reporter
Comment 17•8 years ago
At blassey's suggestion, I re-jigged my iPython notebook to multiply histogram values by their indexes and sum them (because the "sum" property appears to be busted, wtf?). This kinda changes stuff: bug 1313125 suddenly seems more interesting to hop on. I've posted a patch there.
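A hedged sketch of the re-weighting described above: instead of trusting the histogram's (apparently busted) "sum" property, approximate total hang time by multiplying each bucket's lower bound by its count and summing. The bucket boundaries below are made up for illustration; BHR's real buckets are defined by the probe, and the histogram shape here is a simplified stand-in.

```python
def approx_total_hang_ms(histogram):
    # histogram: {bucket_lower_bound_ms (as string): count}
    # Weight each bucket's count by its lower bound to estimate total time.
    return sum(int(bound) * count for bound, count in histogram.items())

# Illustrative buckets, not BHR's actual boundaries.
hang_hist = {"128": 4, "256": 2, "512": 1}
print(approx_total_hang_ms(hang_hist))  # 4*128 + 2*256 + 1*512 = 1536
```

Since each count is weighted by the bucket's lower bound, this is a lower-bound estimate of accumulated hang time, which is still good enough for ranking stacks against each other.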
Reporter
Comment 18•8 years ago
Here are the stacks from Nov 5 (so after bug 1312880, bug 1312540 and bug 1313125 landed) to last night (Nov 17).
Attachment #8804863 -
Attachment is obsolete: true
Attachment #8807236 -
Attachment is obsolete: true
Reporter
Comment 19•8 years ago
Here are the stacks from Nov 5 (so after bug 1312880, bug 1312540 and bug 1313125 landed) to last night (Nov 17), but organized by total accumulated hang time as opposed to frequency.
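The two orderings used in these reports can be sketched as sorting the same set of aggregated hang stacks by different keys. The field names below are illustrative, not the actual notebook's schema.

```python
# Hypothetical aggregated hang records: a stack that fires rarely but hangs
# for a long time vs. one that fires often but briefly.
stacks = [
    {"stack": ("XRE_InitChildProcess", "GC"), "hits": 3, "total_ms": 9000},
    {"stack": ("XRE_InitChildProcess", "(chrome script)"), "hits": 10, "total_ms": 2000},
]

# Ranked by frequency (how often the hang fires)...
by_frequency = sorted(stacks, key=lambda s: s["hits"], reverse=True)
# ...vs. ranked by total accumulated hang time.
by_hang_time = sorted(stacks, key=lambda s: s["total_ms"], reverse=True)

print(by_frequency[0]["stack"])  # the chattiest stack
print(by_hang_time[0]["stack"])  # the most time-consuming stack
```

The two rankings can disagree, which is exactly why both views are attached: frequent short hangs and rare long hangs point at different fixes.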
Reporter
Comment 20•8 years ago
Looking at just Windows now. Here are the stacks organized by time.
Attachment #8812201 -
Attachment is obsolete: true
Attachment #8812202 -
Attachment is obsolete: true
Reporter
Comment 21•8 years ago
... and frequency.
Reporter
Comment 22•7 years ago
Reporter
Comment 23•7 years ago
Attachment #8842490 -
Attachment is obsolete: true
Updated•2 years ago
Severity: normal → S3