Open Bug 1310255 Opened 8 years ago Updated 2 years ago

Build or modify current BHR / chromehangs dashboard to show Gecko_Child_ForcePaint hangs

Categories

(Toolkit :: Telemetry, defect, P3)

Version: 49 Branch

People

(Reporter: mconley, Unassigned)

References

(Depends on 2 open bugs)

Attachments

(3 files, 10 obsolete files)

Bug 1310011 and bug 1303077 give BHR the ability to get annotated hangs from the main thread of the content process during a tab switch with e10s enabled. Bug 1310246 should give us those annotations as well for chromehangs.

The reason we're doing this is to try to figure out why some users are seeing tab switch spinners with e10s enabled (see bug 1300411). The pseudostacks from BHR and native stacks from chromehangs (from Nightly) should hopefully give us data on where we need to be spending our time putting in hooks to optimize for paint.

We're going to need to get this data out of Telemetry once it starts coming in. Neither the BHR dashboard (https://telemetry.mozilla.org/hang/bhr/) nor the chromehangs dashboard (https://telemetry.mozilla.org/chromehangs) seem to be designed to display only content process hangs, and neither seems to have the ability to filter by annotation.

If it's straightforward, we should add those abilities. It doesn't have to be pretty; it just needs to give us the stacks and filter out the ones we don't care about (any stacks that don't have the ForcePaintInProgress annotation).

If it's not straightforward, I'm also open to hacking up a quick and dirty dashboard, but I'll need the API endpoint to query the data from.
Cc'ing canuckistani and rweiss as requested during the e10s cross-functional.
Hey rvitillo, I talked to jchen a bit about this, and brought up two things:

1) The BHR thread hangs dashboard might be using the old (pre-unified) Telemetry format, so he's unsure how long it's liable to keep working
2) It looks like the mapreduce job at https://github.com/darchons/hang-telemetry-tools/blob/master/mapreduce-bhr.py makes no mention of the annotations on the thread hangs, so it doesn't look like the dashboard has the ability to try to filter by them.

Is (1) a big concern, and how hard would it be to include the annotations in (2) in the stacks that the BHR dashboard gathers? Is it also possible to make it easier to differentiate child thread hangs from parent process thread hangs?

I think I'm willing to put in the effort to re-jig the BHR dashboard to do what I need, but I'm less confident about the API endpoint.
Flags: needinfo?(rvitillo)
(In reply to Mike Conley (:mconley) - (high latency on reviews and needinfos) from comment #2)
> Hey rvitillo, I talked to jchen a bit about this, and brought up two things:
> 
> 1) The BHR thread hangs dashboard might be using the old (pre-unified)
> Telemetry format, so he's unsure how long it's liable to keep working
> 2) It looks like the mapreduce job at
> https://github.com/darchons/hang-telemetry-tools/blob/master/mapreduce-bhr.
> py makes no mention of the annotations on the thread hangs, so it doesn't
> look like the dashboard has the ability to try to filter by them.
> 
> Is (1) a big concern, and how hard would it be to include the annotations in
> (2) in the stacks that the BHR dashboard gathers? Is it also possible to
> make it easier to differentiate child thread hangs from parent process
> thread hangs?

I think that job has to be rewritten to use Spark. The problems I see with it in its current state are the following:
- it's not scheduled with Airflow, so we don't know whether it has been running successfully;
- it's using our very old mapreduce framework which we no longer support;
- it's using pre-unified Telemetry which we no longer support;
- neither the job nor the dashboard has a clear owner.
Flags: needinfo?(rvitillo)
Attached file In lieu of a dashboard (obsolete) —
I did some Spark / iPython hackery to extract this stuff. This looks at pings submitted from the Nightly channel on October 15th, sorted by stack frequency.

Treat this data with some suspicion, as I haven't had anybody proof my (admittedly hackish) iPython notebook just yet.
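
For reference, here is roughly what that extraction can look like. This is a minimal sketch, not the actual notebook: it assumes the 2016-era python_moztelemetry get_pings API on the Telemetry Spark cluster and a saved-session ping layout of payload/childPayloads[*]/threadHangStats; the thread name "Gecko_Child" and the annotation shape are assumptions.

from moztelemetry import get_pings  # sc is the notebook's SparkContext

pings = get_pings(sc, app="Firefox", channel="nightly",
                  submission_date="20161015", doc_type="saved_session",
                  fraction=1.0)

def annotated_hang_stacks(ping):
    # Yield pseudostacks (as tuples) for content-process hangs carrying the
    # ForcePaintInProgress annotation; the annotation shape is assumed here.
    for child in ping.get("payload", {}).get("childPayloads") or []:
        for thread in child.get("threadHangStats") or []:
            if thread.get("name") != "Gecko_Child":  # content main thread (assumed name)
                continue
            for hang in thread.get("hangs") or []:
                notes = hang.get("annotations") or []
                if any(a.get("ForcePaintInProgress") == "true" for a in notes):
                    yield tuple(hang.get("stack") or [])

# Count how often each distinct pseudostack shows up, most frequent first.
stack_counts = sorted(pings.flatMap(lambda p: list(annotated_hang_stacks(p)))
                           .countByValue().items(),
                      key=lambda kv: kv[1], reverse=True)
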
Comment on attachment 8801853 [details]
In lieu of a dashboard

Here's the iPython notebook, with results at the bottom:

https://gist.github.com/mikeconley/1888194c341e3d4da87207540603e708
Attachment #8801853 - Attachment is obsolete: true
I will be the first to admit that this notebook could be cleaned up a bit, but I _think_ it's showing us useful things.

Can somebody double-check my work to make sure I've done this right, and we're not looking at bogus stacks here? chutten maybe?
Flags: needinfo?(chutten)
* Use get_pings_properties to filter the pings' contents down to just the fields you're interested in. This'll improve runtime for when/if you run this on larger populations.
* Are you worried about a particular client being represented too often? You'll want to pass the RDD through get_one_ping_per_client (after get_pings_properties) if so. (( The question I predict is "Should I be so worried?" And the answer is... if the idea is to take from this an ordered "Hit List" of worst offenders which, when fixed, will improve lives for the largest _population_ (instead of clearing the most hangs, which is different), then you'll want get_one_ping_per_client to quiet down the chattier clients. ))
* Is python's sort specifically good if the collection is already mostly sorted? If not, I'd recommend not sorting as-you-go
* For ||ization's sake, you may wish to maintain the collection as an RDD for longer. As soon as you collectAsMap, all but one of your cluster's nodes sit idle
* get_stack_hits is unused
* For fun, maybe print out a count of stacks, frames so brains like mine can get a handle on how long the long tail is on these.

But, yeah, it looks legit to me.

For extra Software Engineering points you could build something like an ordered trie to figure out what the most common hangs are by longest common stack prefix. But I'm not sure that'd tell us much more than we can already learn from this (that there's something in init, something in RPC, something in GC, and something in paint).
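
To make those suggestions concrete, here is a hedged sketch of how they could be folded in, under the same assumptions as the earlier sketch (2016-era python_moztelemetry helpers, pings as built above, illustrative field paths):

from moztelemetry import get_pings_properties, get_one_ping_per_client

# Trim each ping down to just the fields the analysis needs.
subset = get_pings_properties(pings, ["clientId", "payload/childPayloads"])

# Optionally de-bias chatty clients: keep one ping per client.
subset = get_one_ping_per_client(subset)

def hang_stacks(ping):
    # Yield pseudostacks from the trimmed ping (keys match the property paths).
    for child in ping.get("payload/childPayloads") or []:
        for thread in child.get("threadHangStats") or []:
            for hang in thread.get("hangs") or []:
                yield tuple(hang.get("stack") or [])

# Stay in the RDD for the aggregation so the whole cluster participates,
# and sort once at the end instead of as-you-go.
top_stacks = (subset
              .flatMap(lambda p: list(hang_stacks(p)))
              .map(lambda stack: (stack, 1))
              .reduceByKey(lambda a, b: a + b)
              .sortBy(lambda kv: kv[1], ascending=False)
              .take(50))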
Flags: needinfo?(chutten)
Priority: -- → P3
In bug 1310880, we got rid of the annotation; instead, there's now a dedicated Gecko_Child_ForcePaint "thread" listed that only collects hangs during the force paint window.

So if we're going to build some kind of dashboard, we should take the data from that thread in Telemetry, instead of using the annotation (which has been removed).
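
A rough sketch of what that change means for the extraction, assuming the new thread shows up in threadHangStats like any other BHR thread, so the filter moves from the annotation to the thread name:

def force_paint_hangs(ping):
    # Yield hang entries from the dedicated Gecko_Child_ForcePaint "thread";
    # the layout is assumed to match the other BHR threads in threadHangStats.
    for child in ping.get("payload", {}).get("childPayloads") or []:
        for thread in child.get("threadHangStats") or []:
            if thread.get("name") == "Gecko_Child_ForcePaint":
                for hang in thread.get("hangs") or []:
                    yield hang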
Summary: Build or modify current BHR / chromehangs dashboard to show content process hangs with ForcePaintInProgress annotation → Build or modify current BHR / chromehangs dashboard to show Gecko_Child_ForcePaint hangs
Attached file Very early stacks (obsolete) —
The patch to introduce Gecko_Child_ForcePaint only became available today, but I got excited and decided to see if there was any Telemetry data yet.

It turns out "yes", and I've attached the stacks we've got so far. Probably too early to make any claims about weighting and priority since the N is so small, but I thought this was kind of interesting.
These stacks are really, really early - like, we should wait until a weekend's worth of data has come in - but I still find what we've got interesting.

In particular, I see that the most frequent top-frame in the hang stacks is (chrome script). Here's that chunk from what I got from Telemetry:

{'frame': u'(chrome script)',
      'hits': 42,
      'stacks': [((u'Startup::XRE_InitChildProcess',
         u'IPDL::PContent::RecvLoadProcessScript',
         u'(chrome script)'),
        14),
       ((u'Startup::XRE_InitChildProcess',
         u'IPDL::PBrowser::RecvLoadRemoteScript',
         u'(chrome script)'),
        14),
       ((u'Startup::XRE_InitChildProcess',
         u'IPDL::PBrowser::RecvAsyncMessage',
         u'(chrome script)'),
        5),
       ((u'Startup::XRE_InitChildProcess',
         u'IPDL::PContent::RecvPBrowserConstructor',
         u'(chrome script)'),
        4),
       ((u'Startup::XRE_InitChildProcess',
         u'IPDL::PContent::RecvAsyncMessage',
         u'(chrome script)'),
        3),
       ((u'Startup::XRE_InitChildProcess',
         u'Events::ProcessGeckoEvents',
         u'(content script)',
         u'(chrome script)'),
        1),
       ((u'Startup::XRE_InitChildProcess',
         u'Timer::Fire',
         u'(content script)',
         u'(chrome script)'),
        1)]},

I thought that the work in bug 1279086 would have made the above stacks impossible to obtain during the force paint window? Is it possible that we fail to interrupt the JS engine in some cases?
Flags: needinfo?(wmccloskey)
It's possible that we're not in JS at this time. We could be in a deeper stack frame that's not annotated. Maybe ctypes code or OS.File or something else that might block for a while. It's a shame we're not getting the actual filename/lineno for the chrome script. It looks like that only happens if the filename is too long?

http://searchfox.org/mozilla-central/rev/84075be5067b68dc6cb3b89f999645650e68c05b/xpcom/threads/ThreadStackHelper.cpp#491

That seems like something we should be able to fix.

I also realized another potential issue. I'll comment in bug 1310880. I don't think it's the problem here since we're supposedly in script when this happens. But we should fix it.
Flags: needinfo?(wmccloskey)
Attached file Weekend stacks (obsolete) —
These are the weekend stacks (from before bug 1312080 landed).
Attachment #8803442 - Attachment is obsolete: true
See Also: → 1312540
Depends on: 1312597
Attached file Stack snapshot (October 25, 11AM ET) (obsolete) —
I've got a Spark job processing stacks every night, and I guess the next step is to massage the ipynb file into something that a dashboard can display. I'll hopefully be able to do that soon.

In the meantime, here's a snapshot from today at 11AM.
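
One low-effort way to bridge the nightly job and a dashboard is to have the job emit a dated JSON snapshot that a static page can fetch. The filename and schema below are illustrative, not an agreed format:

import datetime
import json

# top_stacks is the (stack, hits) list produced by the nightly Spark job
# (see the earlier sketches); the filename and schema are made up.
snapshot = {
    "date": datetime.date.today().isoformat(),
    "stacks": [{"stack": list(stack), "hits": hits} for stack, hits in top_stacks],
}
with open("force_paint_hangs-%s.json" % snapshot["date"], "w") as out:
    json.dump(snapshot, out, indent=2)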
Attachment #8804032 - Attachment is obsolete: true
Attachment #8804036 - Attachment is obsolete: true
Bug 1312597 is in today's Nightly, and it looks like it's doing a good job of breaking up the (chrome script) block that was the #2 cause in yesterday's hang stacks list.

This has moved the GMP blocking stuff into 2nd place.
Attachment #8804330 - Attachment is obsolete: true
At blassey's suggestion, I re-jigged my iPython notebook to multiply histogram values by their indexes and sum them (because the "sum" property appears to be busted, wtf?).

This kinda changes stuff. Bug 1313125 suddenly seems more interesting to hop on. I've posted a patch there.
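
For context, the weighting looks roughly like this: approximate the total accumulated hang time for a stack by multiplying each histogram bucket label (hang duration in ms) by its count and summing, rather than trusting the histogram's "sum" field. The field names below are assumptions about the BHR hang histogram shape, and hangs_rdd is a hypothetical RDD of individual hang entries:

def accumulated_time_ms(hang):
    # Approximate total hang time by weighting each bucket label (hang duration
    # in ms) by its count; assumes a {"histogram": {"values": {bucket: count}}}
    # shape, since the histogram's "sum" field looked unreliable.
    values = hang.get("histogram", {}).get("values") or {}
    return sum(int(bucket) * count for bucket, count in values.items())

# Aggregate by stack, ordered by total accumulated time rather than frequency.
# hangs_rdd is a hypothetical RDD of individual hang entries (see sketches above).
by_time = (hangs_rdd
           .map(lambda hang: (tuple(hang["stack"]), accumulated_time_ms(hang)))
           .reduceByKey(lambda a, b: a + b)
           .sortBy(lambda kv: kv[1], ascending=False))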
Attached file 20161118 (obsolete) —
Here are the stacks from Nov 5 (so after bug 1312880, bug 1312540 and bug 1313125 landed) to last night (Nov 17).
Attachment #8804863 - Attachment is obsolete: true
Attachment #8807236 - Attachment is obsolete: true
Attached file 20161118-by-time.json (obsolete) —
Here are the stacks from Nov 5 (so after bug 1312880, bug 1312540 and bug 1313125 landed) to last night (Nov 17), but organized by total accumulated hang time as opposed to frequency.
Looking at just Windows now. Here are the stacks organized by time.
Attachment #8812201 - Attachment is obsolete: true
Attachment #8812202 - Attachment is obsolete: true
Attached file 20161121 by frequency
... and frequency.
Attached file Symbolicated Native Stacks (obsolete) —
Depends on: 1344003
Severity: normal → S3