[meta] Create (or resurrect) BHR dashboard

NEW
Assigned to

Status

Webtools
Telemetry Dashboard
4 months ago
7 days ago

People

(Reporter: mconley, Assigned: dthayer)

Tracking

(Blocks: 2 bugs)

Trunk
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [fce-active][qf:p1])

Attachments

(1 attachment, 4 obsolete attachments)

(Reporter)

Description

4 months ago
We collect BackgroundHangMonitor data via Telemetry for all pre-release builds, and we've been doing so for a while. This data includes pseudostacks and hang-time histograms for the main thread (both in the parent and content processes), compositor thread, vsync thread, the ProcessHangMonitor thread, and the ImageBridgeChild thread. There are probably others.

That data used to be displayed on the BHR dashboard, but that dashboard broke when Unified Telemetry came about. Then the dashboard was (rightly) decommissioned because it was just flat-out broken.

We want to bring that dashboard back from the dead. Specifically, ehsan is interested in getting a visualization to see how we're doing on things like sync IPC hangs in the wild.
(Reporter)

Comment 1

4 months ago
ni?ing ddurst as suggested.
Flags: needinfo?(ddurst)
Harald, do you think you would have cycles to help with this?  This dashboard will be the first one we'll be using for Quantum Flow work...
Flags: needinfo?(hkirschner)
Blocks: 1337841

Updated

4 months ago
Assignee: nobody → dothayer
Flags: needinfo?(ddurst)
Whiteboard: [fce-active]
I would probably split this issue in 2 work streams to aggregate hang data from incoming pings:

1) New aggregate for hang signatures for tracking hangs on a signature level

This should be guided by engineers and their hang use cases, starting from just simply aggregating hang signatures with duration.

Happy to help to make this into a more elaborate diagnostic dashboard when we have the aggregate to easily draw data from.

2) New aggregate for hangs-per-minute for reporting on hangs as release quality metrics

Related: https://github.com/mozilla/telemetry-streaming/issues/20
Flags: needinfo?(hkirschner)
(Reporter)

Comment 4

4 months ago
Created attachment 8843097 [details]
parent-20170302.json

Hey chutten,

I took a 10% sample of all pings on Nightly for the past 3 days, and then extracted the BHR data for the main thread in the parent process. I've attached the results to this bug.

When I try to divide the number of hangs with a particular signature (in this case, the signature for me is the top frame from the pseudostack) by the total number of session minutes for all pings gathered, I end up getting a really, really, _really_ small number. Like, less than 0.00 for the most frequent stack signature.

I assume I did something wrong here. Can you see in this pile of Python where I went wrong?

https://github.com/mikeconley/bhr-dash-data/blob/master/ipynb/Gather-Nightly-BHR-Stacks.ipynb
Flags: needinfo?(chutten)
(Assignee)

Comment 5

4 months ago
@mconley:

I think your issue is that in this section:

```
def score(grouping):
    grouping['hang_sum'] = grouping['hang_sum'] / total_sessions_length_m
    scored_stacks = []
    for stack_tuple in grouping['stacks']:
        score = stack_tuple[1] / total_sessions_length_m
        scored_stacks.append((stack_tuple[0], score))
        
    grouping['stacks'] = scored_stacks
    return grouping


scored_content_top_frames = {k: score(g) for k, g in content_top_frames.iteritems()}
scored_parent_top_frames = {k: score(g) for k, g in content_top_frames.iteritems()}
```

you modify |grouping| in place. Normally that would work as long as you only ran it once, but there's a typo at the bottom where you iterate content_top_frames a second time instead of iterating parent_top_frames, meaning you end up dividing by the total session length squared.

Comment 6

4 months ago
Also, generally speaking we're expecting a small number. Amongst only hanging clients, hangs over 100ms only happened single-digit times per minute during the e10s a/b experiments[1]. Expanding the denominator to all clients' usage and reducing the numerator to only one signature's hang count should result in some satisfyingly-small numbers.

This is why crashes are expressed per thousand usage hours instead of per minute. They're just so darn infrequent (thank goodness)

We may wish to consider similar units for this style of analysis, just to lift it out of scientific notation. :)

[1]: https://github.com/mozilla/e10s_analyses/blob/master/beta/48/week7/e10s_experiment.ipynb
Flags: needinfo?(chutten)
FWIW based on my local measurements, the hangs are *much* more frequent on the main thread of the content process than the main thread of the parent process, so we may be getting a larger number there.
(Reporter)

Comment 8

4 months ago
Created attachment 8843459 [details]
Parent and content process main thread hangs

Thanks for the help, everybody!

Here is a snapshot of main thread hangs for both the parent and content process main threads.

This is from a sample of 100%, so it's everybody, which means that this set is _large_. I didn't filter out the small potatoes - it's all potatoes.

It should be sorted in descending by "score", which is a sense of how many hangs with the pseudostack top-frame per 1000 session minutes of the entire sampled population. That'll be the "hang_sum".

Let me know if this needs to be pared down - I know the JSON files are quite large, and I didn't try to pretty print them or anything.
Attachment #8843097 - Attachment is obsolete: true
Flags: needinfo?(ehsan)
(Reporter)

Comment 9

4 months ago
Created attachment 8845631 [details]
20170309-bhr-snapshot.zip

My script wasn't sorting the results, and wasn't gathering results from the parent process properly. So I took another snapshot today. Here it is.
Attachment #8843459 - Attachment is obsolete: true
(Reporter)

Comment 10

4 months ago
Comment on attachment 8845631 [details]
20170309-bhr-snapshot.zip

Let me format these for better readability.
Attachment #8845631 - Attachment is obsolete: true
(Reporter)

Comment 11

4 months ago
Created attachment 8845655 [details]
content-20170301-20170309.json.zip

Grrrr, my Spark cluster is being a pain. :( I'll have to try to get the parent samples tomorrow. Here are the content ones.
(Reporter)

Comment 12

4 months ago
Created attachment 8846002 [details]
20170310-bhr-snapshot.zip

Better snapshot.
Attachment #8845655 - Attachment is obsolete: true
Whiteboard: [fce-active] → [fce-active][qf:p1]
(Reporter)

Comment 13

3 months ago
I've scheduled the job to run daily, and to produce a zip file of the most recent snapshot. I'll post a link here when the first one is done.
(Assignee)

Comment 14

3 months ago
@mconley / anyone interested:

I'm paused on this right now as we're working on a study to do with Flash blocking, but I thought I'd give a brief update on the dashboard side of things.

I adapted your work to grab hang stats by day, just to give me a month's data to work with for the visuals. However it's easy to pass in an arbitrary start and end, so if we ran this daily we wouldn't need to reach thirty days back for anything. Additionally, I added an approximate measure of the total milliseconds hanging, so that we can also plot the average milliseconds per hang, which could be useful.

Anyway, here's the gist if you'd like to review it: https://gist.github.com/squarewave/df4608de29dc13eb865f95dc42f3571e

And here's a screenshot of the dashboard so far: https://d3uepj124s5rcx.cloudfront.net/items/1u072q0u1n0u380Z1q1v/charts2.PNG

Let me know if that looks like a useful visualization of things. (Also, since it's not obvious from the UI - the bottom two charts reflect the stats of the most recent bar clicked on in the top chart.)
Kyle, are you able to help with building a diagnostic dashboard for BHR data?
Flags: needinfo?(klahnakoski)
(Reporter)

Comment 16

3 months ago
Hey dthaylor,

This looks great! Reading the ipynb, I believe you're also correcting for a concern that both ehsan and I had with my original ipynb, which is that it doesn't capture both the frequency and severity of the hangs. So that's good - that appears to be addressed here.

This does, in my opinion, look like a very useful visualization!
Depends on: 1346415
Comment hidden (off-topic)
I might be able to help. Where is the data?
(Reporter)

Comment 19

3 months ago
(In reply to Doug Thayer [:dthayer] from comment #14)
> @mconley / anyone interested:
> 
> Let me know if that looks like a useful visualization of things. (Also,
> since it's not obvious from the UI - the bottom two charts reflect the stats
> of the most recent bar clicked on in the top chart.)

At the risk of introducing scope creep, I have one minor request. The current ipynb is looking at threadHangStacks with the name "Gecko" and "Gecko_Child" (in the childPayloads). These map to the main threads in the parent process and content processes, respectively.

These are massively useful in order to find out what things are janking the browser, as we paint and process user events on those main threads. So having that data is great, and this is a great first step.

There are, however, other items in those threadHangStacks in both the parent and childPayloads that would be useful to select here. For the most part, these represent different threads. For example, I know that there is one for the compositor thread / compositor process main thread. There's also a "special" BHR monitor in content processes on the main thread that monitors for hangs during tab switch (this has the name "Gecko_Child_ForcePaint").

It would be monumentally useful if we could get access to _all_ of these "threads", in both the main payload and in the child payloads, and be able to select and visualize them individually.
Note to self: mconley's native stack results: https://s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/bhr-stacks/data/snapshot-daily.zip
Flags: needinfo?(ehsan)
(In reply to :Ehsan Akhgari from comment #20)
> Note to self: mconley's native stack results:
> https://s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/bhr-stacks/
> data/snapshot-daily.zip

*pseudo-stack results*
(Assignee)

Comment 22

3 months ago
Quick update on this: I created a PR to mozilla-reports for the notebook.

https://github.com/mozilla/mozilla-reports/pull/41

And here's a screenshot of the dashboard right now:

https://pageshot.net/RQKoxnR9St6ks65X/localhost

It _should_ be showing more threads than that. Currently diagnosing why I'm only seeing three threads show up (most likely the notebook's fault and not the dashboard's).

If you have any feedback please let me know!
(Reporter)

Comment 23

3 months ago
(In reply to Doug Thayer [:dthayer] from comment #22)
> 
> If you have any feedback please let me know!

"This is amazing" and "I want to get my hands on it as soon as possible". :)

Thank you for your work!
(Reporter)

Comment 24

3 months ago
Oh, I should also point out that mystor and ehsan are working on getting more native stacks[1] into BHR. I wonder if we could find a way of presenting those in this dashboard as well.

Note that there's a symbolication step that's required before the information is useful. The Snappy Symbolication Server[2] is what we generally use to symbolicate things like profiles, but when I've been pulling native stacks from BHR myself, I've been running my own server locally.

[1]: http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/main-ping.html#threadhangstats - see the "native stacks" bits.
[2]: https://github.com/mozilla/Snappy-Symbolication-Server
(Assignee)

Comment 25

3 months ago
Yep, it's on my roadmap!

I've played with Snappy a bit, so I was just waiting on more data in telemetry to start work on getting those symbolicated.

You mention you've been running Snappy locally - is this to avoid taxing the hosted Snappy servers too much? Do you think the job that we schedule for this will need to steer away from using the existing Snappy servers for the same reason?
(Reporter)

Comment 26

3 months ago
(In reply to Doug Thayer [:dthayer] from comment #25)
> Yep, it's on my roadmap!

Whew. :)

> You mention you've been running Snappy locally - is this to avoid taxing the
> hosted Snappy servers too much?

Yeah, I made that choice after talking to ddurst about my plans. He suggested running it locally since he's not entirely confident in the current implementation that's stood up at symbolapi.m.o (which has been known to go down from time to time). You might want to talk with him about it directly.

I know that PeterB is working on re-engineering Snappy, and is looking for people to road test it.

Hey PeterB, do you have your server stood up somewhere that dthayer could potentially query against for testing?
Flags: needinfo?(peterbe)
Doug, thanks a lot for working on this, the screenshot looks really nice!  I'm excited to try this out.  Is there a saved version of the page or an experimental instance running somewhere that I can play with for a while?  I'd like to get a sense of how easy it is to interact with the page in order to correlate the information displayed in the various parts, and I'm also curious to see what it's like switching between threads, etc.  Thanks!
(Assignee)

Comment 28

3 months ago
(In reply to :Ehsan Akhgari from comment #27)
> Doug, thanks a lot for working on this, the screenshot looks really nice! 
> I'm excited to try this out.  Is there a saved version of the page or an
> experimental instance running somewhere that I can play with for a while? 
> I'd like to get a sense of how easy it is to interact with the page in order
> to correlate the information displayed in the various parts, and I'm also
> curious to see what it's like switching between threads, etc.  Thanks!

Here's what I'm working with right now. It's a bit messy and still needs some odds and ends like loading indicators, etc., and the layout should probably adapt for narrower windows than what I'm using. Also if the performance is a problem, there's definitely more processing we can do on the notebook side to help out with that, so just let me know.

https://drive.google.com/a/mozilla.com/file/d/0B6b7hsu66Vd3R3NyTjFnSG5BZm8/view?usp=sharing
(In reply to Mike Conley (:mconley) (Catching up on reviews and needinfos) from comment #26)
> (In reply to Doug Thayer [:dthayer] from comment #25)
> > Yep, it's on my roadmap!
> 
> Whew. :)
> 
> > You mention you've been running Snappy locally - is this to avoid taxing the
> > hosted Snappy servers too much?
> 
> Yeah, I made that choice after talking to ddurst about my plans. He
> suggested running it locally since he's not entirely confident in the
> current implementation that's stood up at symbolapi.m.o (which has been
> known to go down from time to time). You might want to talk with him about
> it directly.
> 
> I know that PeterB is working on re-engineering Snappy, and is looking for
> people to road test it.
> 
> Hey PeterB, do you have your server stood up somewhere that dthayer could
> potentially query against for testing?

I do! It's http://snappy2-zero.herokuapp.com/ and it's based on the most recent Snappy-Symbolication-Server that is hosted on someones macmini somewhere under a desk somewhere. 

It's hosted on Heroku and its symbol cache is a bit limited but it's the worst thing that can happen is that it's slightly slowed down. 

Also, consider it a prototype as it sits there on Heroku. It'll be production grade later this year and hosted properly in AWS.
Flags: needinfo?(peterbe)
(Assignee)

Comment 30

3 months ago
@mconley,

Under what scenarios does hangs[].nativeStack.stacks have more than one entry? I can't find instances of this in a 1% sample of pings, but since we're only getting nativeStacks for hangs > 8s, that's not too revealing.
Flags: needinfo?(mconley)
(Reporter)

Comment 31

3 months ago
(In reply to Doug Thayer [:dthayer] from comment #30)
> Under what scenarios does hangs[].nativeStack.stacks have more than one
> entry? I can't find instances of this in a 1% sample of pings, but since
> we're only getting nativeStacks for hangs > 8s, that's not too revealing.

Actually, never. This line of code makes it so that currently we only ever collect a single native stack ever:

http://searchfox.org/mozilla-central/rev/ee7cfd05d7d9f83b017dd54c2052a2ec19cbd48b/xpcom/threads/BackgroundHangMonitor.cpp#478-480
Flags: needinfo?(mconley)
(Assignee)

Comment 32

3 months ago
Another quick update on this:

I'm waiting on review for the job, but as far as features go I think it's complete. It can be seen here: https://github.com/squarewave/background-hang-reporter-job/blob/master/background_hang_reporter_job/main.py

It is symbolicating all of the stacks in one request to Snappy, which I _think_ should ease Snappy's burden a little bit. We'll have to see how big that request gets though with a 100% sample of Nightly and a 128 ms threshold on native stack collection.
(Assignee)

Comment 33

3 months ago
Another update - I ended up just skipping Snappy entirely and symbolicating directly in the job with symbols downloaded from https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/. It's much faster, but I can see some downsides. I don't imagine much will change with how we want to symbolicate our stacks, but I could see us moving the symbol server somewhere else and needing to change the URL here and in Snappy's config. I don't think that's too big of a problem though.
(In reply to Doug Thayer [:dthayer] from comment #33)
> Another update - I ended up just skipping Snappy entirely and symbolicating
> directly in the job with symbols downloaded from
> https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/
> . It's much faster, but I can see some downsides. I don't imagine much will
> change with how we want to symbolicate our stacks, but I could see us moving
> the symbol server somewhere else and needing to change the URL here and in
> Snappy's config. I don't think that's too big of a problem though.

I'm working on the new snappy server. Technically it'll be the first of a couple of features (all around symbols) under a project called "Symbol Server". 

Your code is pretty much the same in terms of the inner guts. (Mind you you missed the wonderful python builtin bissect to find the nearest address in sorted addresses :)

"My" version is going to be awesome :) Including...
* Hosted and managed by Cloud OPs on some beefy nodes really near the S3 source. 
* Has a fast LRU cache that keeps most of the commonly fetched symbols in a Redis database.
* The potential to talk to S3 directly instead of relying on good ol' HTTP GET https://us-west-2.a.c/specificbucket/$symbol/$id which might give us a lot more leverage to draw from "dynamic buckets". I.e. if you have the credentials it could symbolicate using proprietary buckets. 
* We might be able to do some smarts to support symbols from try builds additionally. Just in our thoughts for now. 
* Since this service will sit near the new symbol upload service (currently run within Socorro) we might have the potential to keep the caches really hot and thus make your lookups really fast. Especially when symbolicate'ing against things like try builds.

(I'm a bit oblivious about much that's going on in this bug but just wanted to say hi from the future symbolication server)
(Assignee)

Comment 35

3 months ago
Continuing the conversation from https://github.com/mozilla/mozilla-reports/pull/41 here.

> I think there's an easy fix for the aggregation code. At each stage track
> (hang_sum, hang_count, usage_hours) instead of the two ratios you're currently
> tracking. The issue is that the ratios (hang_ms_per_hour, hang_count_per_hour)
> do not aggregate well.

The problem with this approach is the flatMap on line 88:

```
def filter_for_hangs_of_type(pings):
    return pings.flatMap(lambda p: only_hangs_of_type(p))
```

We map each ping to a list of hangs for that ping, and we don't want to include the subsessionLength of one pings multiple times or else it will be over-represented in the sum later. This is why I was thinking of doing a second pass. The other alternative is just to divide the sum that we include with each individual hang by the number of hangs for that ping, which would be a tiny bit confusing to read but would probably be better over all. Thoughts?

Also I was wondering what the best way to include a package in the job is. I was thinking install_requires would bring eventlet in when we run bdist_egg, but it doesn't seem to. I'm not too familiar with python packaging, so help in this area would be appreciated. I can just add a `!pip install` or something, but that seems wrong.
Flags: needinfo?(rharter)
(Assignee)

Comment 36

3 months ago
(For now I ended up going down the route of dividing the usage_hours by the number of hangs for a given ping, and that should be reflected in the latest commit on the GitHub repo.)
(In reply to Doug Thayer [:dthayer] from comment #35)
> Continuing the conversation from
> https://github.com/mozilla/mozilla-reports/pull/41 here.
> 
> > I think there's an easy fix for the aggregation code. At each stage track
> > (hang_sum, hang_count, usage_hours) instead of the two ratios you're currently
> > tracking. The issue is that the ratios (hang_ms_per_hour, hang_count_per_hour)
> > do not aggregate well.
> 
> The problem with this approach is the flatMap on line 88:
> 
> ```
> def filter_for_hangs_of_type(pings):
>     return pings.flatMap(lambda p: only_hangs_of_type(p))
> ```
> 
> We map each ping to a list of hangs for that ping, and we don't want to
> include the subsessionLength of one pings multiple times or else it will be
> over-represented in the sum later. This is why I was thinking of doing a
> second pass. The other alternative is just to divide the sum that we include
> with each individual hang by the number of hangs for that ping, which would
> be a tiny bit confusing to read but would probably be better over all.
> Thoughts?

I see the problem now. A second pass makes sense to me. I think dividing by the number of hangs will distort the un-aggregated values. 

> Also I was wondering what the best way to include a package in the job is. I
> was thinking install_requires would bring eventlet in when we run bdist_egg,
> but it doesn't seem to. I'm not too familiar with python packaging, so help
> in this area would be appreciated. I can just add a `!pip install` or
> something, but that seems wrong.

Unfortunately, calling `!pip install` from the notebook won't help here either. The package needs to be installed before starting the spark cluster or the child nodes will not have the package. To the best of my knowledge, you'll need to download and build each package you want to include, which can get unwieldy.

Instead, we could schedule this job on Airflow [2], which is more reliable and allows you to execute generic shell statements. There's an example script here [1] showing what that would look like. This would allow you to save and load a requirements.txt file.

[0] https://github.com/squarewave/background-hang-reporter-job/blob/master/scheduling/load_and_run.ipynb
[1] https://github.com/squarewave/background-hang-reporter-job/blob/master/scheduling/airflow.sh
[2] https://github.com/mozilla/telemetry-airflow
Flags: needinfo?(rharter)
(Reporter)

Comment 38

2 months ago
Any updates here? Is this bug blocked?
Flags: needinfo?(dothayer)
(Assignee)

Comment 39

2 months ago
It's not necessarily blocked, but it has been a bit on the back-burner waiting for Bug 1346415 to land. The job is now symbolicating stacks and I wanted to eyeball test how the dashboard will behave once we have more native stacks to work with. I'm hoping it won't be too noisy once native stacks are in, but it could be since native stacks will have more room to be different from each other in ways that we don't as users actually care about. Hopefully I'm making sense here.
Flags: needinfo?(dothayer)
(In reply to Doug Thayer [:dthayer] from comment #39)
> It's not necessarily blocked, but it has been a bit on the back-burner
> waiting for Bug 1346415 to land. The job is now symbolicating stacks and I
> wanted to eyeball test how the dashboard will behave once we have more
> native stacks to work with. I'm hoping it won't be too noisy once native
> stacks are in, but it could be since native stacks will have more room to be
> different from each other in ways that we don't as users actually care
> about. Hopefully I'm making sense here.

I was talking with bsmedberg yesterday a bit about what we would be looking for from a background hang data pipeline and dashboard. The basic ideas which we had was to classify the native stacks for hang reports based on common roots. For example:
- If a large number of hangs have identical stacks - that probably means that it's a simple hang like a lock being held for a long time or waiting on synchronous network I/O. In that case we can just categorize based on the full native stack.
- Other hangs can then be categorized manually by identifying the "stable root" of the hang, When you choose one, it should find other hangs which have that shared stable root and group them together.
I imagine that doing something like that would require some sort of UI - maybe you or someone else could do something like that in a follow-up?

In addition, it would be nice with this type of categorization if we could associate bug numbers with these hangs.

I'm hoping to land bug 1346415 really soon but just made one last change earlier today - I'm hoping to land it pretty much as soon as that change is r+-ed.
(Assignee)

Comment 41

2 months ago
So, I'm think a UI similar to a profiler might be handy for this.

Maybe a workflow like this:

- Select a date
- See a tree view, with a root node and branches for the bottom frame of each stack in the profiler data. To the left of each node we'll have stats on total time spent in this node and all descendant nodes, as well as time spent in this node alone (like in a profiler)
- After selecting a node, you can see the stats for that stack over the last month

I imagine it would look a lot like https://perf-html.io.

Thoughts?
(In reply to Doug Thayer [:dthayer] from comment #41)
> So, I'm think a UI similar to a profiler might be handy for this.
> 
> Maybe a workflow like this:
> 
> - Select a date
> - See a tree view, with a root node and branches for the bottom frame of
> each stack in the profiler data. To the left of each node we'll have stats
> on total time spent in this node and all descendant nodes, as well as time
> spent in this node alone (like in a profiler)
> - After selecting a node, you can see the stats for that stack over the last
> month
> 
> I imagine it would look a lot like https://perf-html.io.
> 
> Thoughts?

I hadn't thought of designing the UI like that. I really like the idea :).

One problem which might come is is that right now we can collect multiple hangs which have the same pseudostack and we give them the same native stack. It would be better for us to send a separate native stack for every hang but that would require some more client side work (to collect this extra information), and probably a separate ping mechanism to avoid the 1M maximum telemetry ping size. For now though I suppose we could just associate the total approximate length of all hangs (from the histogram) with each native stack. 

Ehsan had some more thoughts on this when I was talking to him earlier.
Flags: needinfo?(ehsan)
(Assignee)

Comment 43

2 months ago
At a micro scale this should cause pings to misrepresent hangs, but on a macro scale it should even out, no? Since if native stack A gets credit for native stack B's hangs, it will only do so some percentage of the time which is equal to A's share of the hang time anyway, no?
(In reply to Doug Thayer [:dthayer] from comment #43)
> At a micro scale this should cause pings to misrepresent hangs, but on a
> macro scale it should even out, no? Since if native stack A gets credit for
> native stack B's hangs, it will only do so some percentage of the time which
> is equal to A's share of the hang time anyway, no?

I think this is probably correct with a large enough sample size, yes.

In the future I would still like to have every native stack associated with the amount of time that its hang took, but we can get to that in the future.
(Assignee)

Comment 45

2 months ago
Been digging into this a bit today, and I think it shouldn't be too crazy of a lift to get a profiler-like view of hang data by putting this into perf.html.

- Modify my BHR job to fake it by outputting data in the format that perf.html expects for saved profiles.
- See what it looks like to load that. It will obviously be a bit clunky because perf.html thinks that it's dealing with a profile that has a real timeline. However, the tree view of hangs should be working at this point.
- Add a route handler in perf.html that will download one of these from S3 instead of having to browse for it.
- Strip out the flame graphs in the top section if we're showing hang data, since those only make sense with a single profile.
- Add in a graph to show how hang time for the currently selected node has changed over the past month, per thread. This would require a change to the data format that perf.html expects.

mstange, is this a completely crazy idea, or am I right that this could work without adding too much complexity to the perf.html code?
Flags: needinfo?(mstange)
Moderately crazy, I'd say :)

We're adding more and more time-based data views to perf.html and it would be really unfortunate to have to condition them all on some isFakeProfile flag. I'd prefer one of these two approaches instead: Either fork perf.html and build it into a UI that's targeted at showing BHR data, or add a completely different top level view + redux state branch to perf.html that's just targeted at showing BHR data and shares some of the React components with the profile view.
Flags: needinfo?(mstange)
(Assignee)

Comment 47

2 months ago
Here's a link to a "profile" generated from the last 30 days of BHR data. It is only displaying native stacks, so you'll only see hangs represented which lasted longer than eight seconds. The "ms" in the profile should be interpreted as milliseconds of hanging per 1000 usage hours.

https://perfht.ml/2pZu1l7

I'd love any help in eyeballing it to see if it feels like this will be a useful representation to pursue in a less hacky manner.
(Assignee)

Comment 48

2 months ago
Scratch that - use this link instead: https://perfht.ml/2pZrZRU
Nice! I really like the idea of the UI you proposed. Unfortunately I can't really interact with the example profile because it seems to hang my whole system for a few seconds whenever the selection changes. We might be overwhelming something with the thread stack graph drawing. Can you work around that by assigning different times to the different callstacks, so that the graph draws smaller rectangles that are next to each other, instead of one large rectangle being stacked on top of itself a large number of times? I know there isn't really a meaningful time stamp value to associate with a sample, but any arbitrary order should be enough to avoid this problem.
(Assignee)

Comment 50

2 months ago
I modified it like you said, and it didn't affect the performance much on my machine. I created a profile for only one day's data though, which should perform much better:

https://perfht.ml/2pZHx8a
(Assignee)

Comment 51

2 months ago
Sorry for the noise on this - making an artificial timeline surfaced a bug in my processing to get it to the profile format. Linking a new version. (As you might be able to tell this is all a bit rough right now.)

https://perfht.ml/2pZTydN
First things first, I *really* like this idea Doug, well done!  I have been thinking about how we can visualize this data for so long and for all of this time I've also been looking at perf-html.io tabs all the time and I couldn't put these two things together, thanks for thinking outside of the box!

I think there are maybe a few modifications that we may want to make to the way that the UI models the timeline.  For example, I'm not sure if I understand what the timeline itself maps to, would it map to data coming from each day?  But then we wouldn't have the temporal data that would allow us to sort these samples on the X axis.  Or do we?

Another thing which I'm trying to wrap my head around is how are we going to represent hangs of different lengths on top of each other?  I think it would be nice if we could somehow make it so that hangs that are more severe show up as larger in the UI.  One way to do it is to simply make hangs consume space proportional to the number of times they have occurred (as if the number of times they have occurred corresponds to the number of samples in the normal perf-html UI.)

I think these are probably all UI challenges that we can solve by iterating over it, I'm super excited to see what comes out of this effort!
Flags: needinfo?(ehsan)
Has there been any recent progress on this?
(Assignee)

Comment 54

2 months ago
Hey Ehsan, I've had a few other things on my plate so this was a bit in the background while waiting for Bug 1346415 to land. However recently I've forked the perf.html code and have been cutting away what I don't need, and I've started on the path of creating a top section of the profile that will suit our needs.

(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #52)
> I think there are maybe a few modifications that we may want to make to the
> way that the UI models the timeline.  For example, I'm not sure if I
> understand what the timeline itself maps to, would it map to data coming
> from each day?  But then we wouldn't have the temporal data that would allow
> us to sort these samples on the X axis.  Or do we?

I'm not sure what you're referring to regarding temporal data that's not day-by-day. My thoughts were that the timeline would represent the last N days (N = 30 probably) of hang telemetry data for whatever is highlighted in the call tree. Essentially, you would see a graph of milliseconds of hang time per hour of usage on top of a graph of number of hangs per hour of usage for the last day.

> Another thing which I'm trying to wrap my head around is how are we going to
> represent hangs of different lengths on top of each other?  I think it would
> be nice if we could somehow make it so that hangs that are more severe show
> up as larger in the UI.  One way to do it is to simply make hangs consume
> space proportional to the number of times they have occurred (as if the
> number of times they have occurred corresponds to the number of samples in
> the normal perf-html UI.)

I'm not 100% sure I understand this. My thoughts were that we want a measure of the severity of hangs (total milliseconds of hang per hour of usage) and a measure of the frequency of hangs (number of hangs per hour of usage, where a hang is defined as anything that reports with a native stack), both of which could be represented in the graph listed above, as well as on the left side of the call tree (where Running time and Self have single values each right now, there would be two values each - one for ms and one for count.) Does that seem reasonable, or do you think that won't cover the measures that you need?
(Assignee)

Comment 55

2 months ago
So, I'm noticing that sMaxNativeFrames gums this up a bit, since the profile call tree ends up having far more roots because of it. It's set to 25 right now, and I imagine we don't want to increase that by much. In any case, I think I can try to stitch together stacks that look incomplete by looking at other stacks that contain the root of the stack that we're looking at, but I just thought I'd give a heads-up here in case the 25 number is something that can be upped significantly.
I've finally gotten around to automating the process of collecting these native stacks, symbolicating them, and posting the results on s3. You can get to it by going to https://people-mozilla.org/~mlayzell/bhr which will redirect into s3.

The `json` link at the top will have the raw JSON dump with symbolicated stacks. You might find it convenient to consume that data (although it is quite large - a couple of gigabytes per day of JSON iirc) for the BHR dashboard while testing. It would also be good to start seeing if the profiler UI can handle that amount of data. We can start looking into ways to optimize the data if it cannot.

(In reply to Doug Thayer [:dthayer] from comment #55)
> So, I'm noticing that sMaxNativeFrames gums this up a bit, since the profile
> call tree ends up having far more roots because of it. It's set to 25 right
> now, and I imagine we don't want to increase that by much. In any case, I
> think I can try to stitch together stacks that look incomplete by looking at
> other stacks that contain the root of the stack that we're looking at, but I
> just thought I'd give a heads-up here in case the 25 number is something
> that can be upped significantly.

The reason for the relatively low limit here is because BHR stacks are reported inside of the main telemetry ping. If this ping exceeds 1MB in size, then the entire ping is discarded, so this places a pretty strict limit on the amount of data we want to collect.

I've started talking to people about moving the BHR pings outside of the main telemetry ping, so that we don't have to worry about blowing out the entire telemetry ping. Once that happens I want to look into upping the stack depth considerably, and looking into using the profiler's infrastructure for interleaving the native, JS, and pseudostacks into a single stack which will be easier to visualize in the profiler based UI.

Other than that it would be good to know if you have any other requests for how I can improve the data you're building off of, and it would be awesome if we could start hosting this perf-html based UI (even just a testing one) so that we can start using it to look at the data, as I imagine that even in its current form it's more convenient than the UI I threw together for https://people-mozilla.org/~mlayzell/bhr.
Flags: needinfo?(dothayer)
(Assignee)

Comment 57

a month ago
(In reply to Michael Layzell [:mystor] from comment #56)
> I've finally gotten around to automating the process of collecting these
> native stacks, symbolicating them, and posting the results on s3. You can
> get to it by going to https://people-mozilla.org/~mlayzell/bhr which will
> redirect into s3.
> 
> The `json` link at the top will have the raw JSON dump with symbolicated
> stacks. You might find it convenient to consume that data (although it is
> quite large - a couple of gigabytes per day of JSON iirc) for the BHR
> dashboard while testing. It would also be good to start seeing if the
> profiler UI can handle that amount of data. We can start looking into ways
> to optimize the data if it cannot.
> 
> (In reply to Doug Thayer [:dthayer] from comment #55)
> > So, I'm noticing that sMaxNativeFrames gums this up a bit, since the profile
> > call tree ends up having far more roots because of it. It's set to 25 right
> > now, and I imagine we don't want to increase that by much. In any case, I
> > think I can try to stitch together stacks that look incomplete by looking at
> > other stacks that contain the root of the stack that we're looking at, but I
> > just thought I'd give a heads-up here in case the 25 number is something
> > that can be upped significantly.
> 
> The reason for the relatively low limit here is because BHR stacks are
> reported inside of the main telemetry ping. If this ping exceeds 1MB in
> size, then the entire ping is discarded, so this places a pretty strict
> limit on the amount of data we want to collect.
> 
> I've started talking to people about moving the BHR pings outside of the
> main telemetry ping, so that we don't have to worry about blowing out the
> entire telemetry ping. Once that happens I want to look into upping the
> stack depth considerably, and looking into using the profiler's
> infrastructure for interleaving the native, JS, and pseudostacks into a
> single stack which will be easier to visualize in the profiler based UI.
> 
> Other than that it would be good to know if you have any other requests for
> how I can improve the data you're building off of, and it would be awesome
> if we could start hosting this perf-html based UI (even just a testing one)
> so that we can start using it to look at the data, as I imagine that even in
> its current form it's more convenient than the UI I threw together for
> https://people-mozilla.org/~mlayzell/bhr.

Unfortunately I think we've duplicated work a bit regarding the collection and symbolication of native stacks. But since you've done it too I'll probably be including you on the final review of the job that will go on airflow.

Regarding sMaxNativeFrames, I've created a bug 1363571 for it. I feel like we could trade reducing kMaximumNativeHangStacks for increasing sMaxNativeFrames and the data would be more useful. Sending the data in its own ping is probably a better long term solution, but it would be nice if we could have something better right now. In the meantime I've already written the code to stitch the stacks together by guessing, which is probably good enough to work with, so I'm hoping to post up the forked BHR dashboard by end of day today, and then it would indeed be great to get that hosted.
Flags: needinfo?(dothayer)
(Assignee)

Comment 58

a month ago
(Er, by forked BHR dashboard I mean the forked perf.html which has been adapted to look at BHR data)
(In reply to Doug Thayer [:dthayer] from comment #57)
> Unfortunately I think we've duplicated work a bit regarding the collection
> and symbolication of native stacks. But since you've done it too I'll
> probably be including you on the final review of the job that will go on
> airflow.

Your code is probably much nicer than mine, as mine is just a simple singlethreaded python script running on a server in a cron job. I'll definitely take a look at your improved script when you're done with it.

> Regarding sMaxNativeFrames, I've created a bug 1363571 for it. I feel like
> we could trade reducing kMaximumNativeHangStacks for increasing
> sMaxNativeFrames and the data would be more useful. Sending the data in its
> own ping is probably a better long term solution, but it would be nice if we
> could have something better right now. In the meantime I've already written
> the code to stitch the stacks together by guessing, which is probably good
> enough to work with, so I'm hoping to post up the forked BHR dashboard by
> end of day today, and then it would indeed be great to get that hosted.

I can definitely look into that. I imagine we could probably bump the number of frames up to 100 or so if we halved the maximum number of native stacks (as the majority of the native stack overhead iirc comes from the module map). It's all guesswork anyway.

I'll file a bug about it and look into how close we're getting to kMaximumNativeHangStacks right now.
(In reply to Doug Thayer [:dthayer] from comment #54)
> Hey Ehsan, I've had a few other things on my plate so this was a bit in the
> background while waiting for Bug 1346415 to land. However recently I've
> forked the perf.html code and have been cutting away what I don't need, and
> I've started on the path of creating a top section of the profile that will
> suit our needs.

Hey Doug, firstly sorry again for the continued laggy responses, and so happy to see the progress here!

> (In reply to :Ehsan Akhgari (super long backlog, slow to respond) from
> comment #52)
> > I think there are maybe a few modifications that we may want to make to the
> > way that the UI models the timeline.  For example, I'm not sure if I
> > understand what the timeline itself maps to, would it map to data coming
> > from each day?  But then we wouldn't have the temporal data that would allow
> > us to sort these samples on the X axis.  Or do we?
> 
> I'm not sure what you're referring to regarding temporal data that's not
> day-by-day. My thoughts were that the timeline would represent the last N
> days (N = 30 probably) of hang telemetry data for whatever is highlighted in
> the call tree. Essentially, you would see a graph of milliseconds of hang
> time per hour of usage on top of a graph of number of hangs per hour of
> usage for the last day.

Hmm I think I was confused here, I'm coming from the viewpoint of a heavy perf-html user, so I just assumed without a good reason that we're going to have a graphs view very similar to the existing one on top of the call tree view in perf-html but Michael explained to me that it's probably not going to be the case and I just extrapolated from my past experience!

What you are explaining above sounds very interesting indeed!

> > Another thing which I'm trying to wrap my head around is how are we going to
> > represent hangs of different lengths on top of each other?  I think it would
> > be nice if we could somehow make it so that hangs that are more severe show
> > up as larger in the UI.  One way to do it is to simply make hangs consume
> > space proportional to the number of times they have occurred (as if the
> > number of times they have occurred corresponds to the number of samples in
> > the normal perf-html UI.)
> 
> I'm not 100% sure I understand this. My thoughts were that we want a measure
> of the severity of hangs (total milliseconds of hang per hour of usage) and
> a measure of the frequency of hangs (number of hangs per hour of usage,
> where a hang is defined as anything that reports with a native stack), both
> of which could be represented in the graph listed above, as well as on the
> left side of the call tree (where Running time and Self have single values
> each right now, there would be two values each - one for ms and one for
> count.) Does that seem reasonable, or do you think that won't cover the
> measures that you need?

I think that sounds very reasonable, yes.  Honestly after we start using the dashboard for actually analyzing the BHR data we may find out deficiencies with the way that the data is presented, etc. but we can iterate on the dashboard as we gain experience using it, I can't hypothesize any issues with what you're describing here at the moment.  :-)
(Assignee)

Comment 61

a month ago
Alright, so I think the dashboard is in a fairly usable state[1]. It's now displaying a timeline over the last 30 days of hang time (dark) and hang count (light). Definitely needs some tweaking to make it clear what is what, but the basic pieces are there.

A few notes, however:
- This is stitching stacks which have been truncated to what we guess are their parents based on looking at other stacks. The way this works isn't remarkably sophisticated, so this could distort some things. Keep that in mind if you're looking at a stack longer than 25 frames.
- There's a dramatic change in the timeline that happens right around when we started sending more native stacks. So be careful not to read anything into that.
- Unfortunately with the restriction down to 15 native stacks, I suspect the data will quickly become distorted since we normalize on subsession length.

[1]: https://github.com/squarewave/hangs.html
(Assignee)

Comment 62

a month ago
Additionally, if anyone is interested, the code that grabs the BHR data, symbolicates it, and translates it into a profile format (though I have tweaked the gecko profiler format to suit the new dashboard's needs) is here:

https://github.com/squarewave/background-hang-reporter-job

Don't trust the tests, they really need to be updated.
(In reply to Doug Thayer [:dthayer] from comment #61)
> Alright, so I think the dashboard is in a fairly usable state[1]. It's now
> displaying a timeline over the last 30 days of hang time (dark) and hang
> count (light). Definitely needs some tweaking to make it clear what is what,
> but the basic pieces are there.
> 
> A few notes, however:
> - This is stitching stacks which have been truncated to what we guess are
> their parents based on looking at other stacks. The way this works isn't
> remarkably sophisticated, so this could distort some things. Keep that in
> mind if you're looking at a stack longer than 25 frames.
> - There's a dramatic change in the timeline that happens right around when
> we started sending more native stacks. So be careful not to read anything
> into that.
> - Unfortunately with the restriction down to 15 native stacks, I suspect the
> data will quickly become distorted since we normalize on subsession length.
> 
> [1]: https://github.com/squarewave/hangs.html

Thanks! I tried building with `build-prod` and posting the resulting HTML files on my people.mozilla.org account so it would be easier to link to and play with, but it seems to instantly 301-redirect to perf-html.io when I do that - not sure why.

From my playing with it locally, it looks neat but I am not sure what most of the data means. For most of the "samples" it seems as though all of the time and percentage numbers are 0, which is rather confusing. It would be nice if we could instead make this "number of seconds of time spent hanging with this stack frame present" or similar.

The number of root frames is also very unfortunate. Hopefully we can move BHR out of the main ping soon so that we can both stop blowing out the size of the main ping, and collect more accurate stacks.
(Assignee)

Comment 64

a month ago
(In reply to Michael Layzell [:mystor] from comment #63)
> Thanks! I tried building with `build-prod` and posting the resulting HTML
> files on my people.mozilla.org account so it would be easier to link to and
> play with, but it seems to instantly 301-redirect to perf-html.io when I do
> that - not sure why.

Hmm, not 100% sure why that is. I just hosted it here though:

https://squarewave.github.io/?thread=0

> From my playing with it locally, it looks neat but I am not sure what most
> of the data means. For most of the "samples" it seems as though all of the
> time and percentage numbers are 0, which is rather confusing. It would be
> nice if we could instead make this "number of seconds of time spent hanging
> with this stack frame present" or similar.

I'm thinking it's probably going to be a good idea to filter out all roots whose total time rounds to 0.0ms, but I haven't done that yet. Regarding what the data as a whole means, it would be nice to figure out a way for the UI to clarify that more, but essentially the Time column is the percentage of total hang time that this stack and its descendants were responsible for. Self should be turned into a percentage and named Self time, because the units are somewhat meaningless, but it's similar to the profiler - it's the time spent in hangs that actually had this exact stack, rather than descendants of this stack. And Count is the percentage of all hangs that this stack and its descendants were responsible for.

Basically, Time is a measure of severity, and Count is a measure of frequency.

The reason why I didn't use an absolute measure for these (something like "number of seconds...") is that we don't have the right kind of data to speak in absolute terms. We can say that relative to these other stacks, this stack used up X% of the time, but until we're getting all stacks in, I think it's going to be misleading to say something like "number of seconds of hanging per thousand hours of usage."

I'm going to be working on making the UI more clear about what it means, but any more feedback would be really helpful!
(Assignee)

Comment 65

a month ago
To anyone whom the stacks are confusing right now:

mstange pointed out that the stacks looked inverted by default. Turns out they're actually just half-inverted (they start being inverted at exactly the 25th frame up). This was due to a bug in my stitching code. I'm running the job again on the last 30 days right now to address it.
(Assignee)

Comment 66

a month ago
The latest (corrected) data can now be seen on the dashboard:

https://squarewave.github.io/?thread=0

My todo list as it stands:

- Make the UI better at clarifying itself
  - Clarify what the different shades of blue mean
  - Clarify what the time numbers are
- Fix the glaring performance problems on the page
  - Invert call stacks, for instance
- Write tests and get further review on the data gathering pieces

Updated

a month ago
Summary: Create (or resurrect) BHR dashboard → [meta] Create (or resurrect) BHR dashboard
(Assignee)

Comment 67

a month ago
For anyone interested, the UI should be more intelligible now (hopefully), and everything on the page as it stands should be functioning correctly and performantly (or at least more performantly than before). If anyone gets a chance to try actually using this for real investigation, let me know how that works!

https://squarewave.github.io
(In reply to Doug Thayer [:dthayer] from comment #67)
> For anyone interested, the UI should be more intelligible now (hopefully),
> and everything on the page as it stands should be functioning correctly and
> performantly (or at least more performantly than before). If anyone gets a
> chance to try actually using this for real investigation, let me know how
> that works!
> 
> https://squarewave.github.io

Looks really awesome now :) more functional for sure.

I looked through it with the earlier version and found ~6 bugs (which we already knew about, but it was nice to verify that they appeared in BHR too). It's already proving super useful :). Thanks!
Hey, Doug. How hard would it be to add an option to each native stack frame to get the pseudostacks which are also associated with those native stacks? It could be very useful for being able to see what JS frames are causing a particular problem.
Flags: needinfo?(dothayer)
Depends on: 1365749
(Assignee)

Comment 70

a month ago
(In reply to Michael Layzell [:mystor] from comment #69)
> Hey, Doug. How hard would it be to add an option to each native stack frame
> to get the pseudostacks which are also associated with those native stacks?
> It could be very useful for being able to see what JS frames are causing a
> particular problem.

Yeah I think that should be pretty doable. I'm picturing another graph at the top that shows pseudostacks by frequency. Thoughts? Also it might be good to just have an option to switch between pseudostacks and native stacks in the tree view?
Flags: needinfo?(dothayer)
(In reply to Doug Thayer [:dthayer] from comment #70)
> Yeah I think that should be pretty doable. I'm picturing another graph at
> the top that shows pseudostacks by frequency. Thoughts? Also it might be
> good to just have an option to switch between pseudostacks and native stacks
> in the tree view?

What I was imagining was to also have an option when selecting a native stack frame in the tree view, to be able to get a popup which lists the different pseudostacks which contained that native stack frame, so, for example, you could figure out what was going on in a clump of JS::RunScript stack frames. Ideally these would probably have relative frequencies next to them, so you can tell what the most common vs. least common pseudostack is. I'm not sure exactly what that would look like in this interface.

It might also be nice to, after symbolicating the different stacks, merge consecutive JS::RunScript/Interpret etc. frames into a (script running) frame. Having to dig through all of them is a tad inconvenient. Then you could just right click on the interesting thing which is being called, and figure out what call stacks are calling it.

BTW. bug 1365749 has landed now, so you should be getting 150 stack frame depth starting in last night's nightly.
(Assignee)

Comment 72

28 days ago
Script frames are now condensed, and I added a pseudostacks pane to list the pseudostacks associated with a given node and all of its descendants.

I'm trying to assess if the "(script)" clumping was too aggressive. The list of things to condense can be found here: https://github.com/squarewave/background-hang-reporter-job/blob/master/background_hang_reporter_job/profile.py#L8. It felt like just a lot of noise with those frames present, but that may just be me not getting much from them.

Additionally, the aggregation job now accounts for Bug 1363883, so we should be able to see data from pings which have consolidated their memoryMaps.
Doug, in bug 987728 comment 11 I used this dashboard to diagnose the issue in the matter of a few minutes!

On the issue on clumping the frames, I actually disagree with Michael, at least in the default view.  What's noise to one is information to another.  I found these stacks very confusing and I actually had to go to the bug and read it to understand how the code is transmuting across the "(script)" stuff!  Also, stripping js::, JS:: and JS_ basically excludes the entire JS team from consuming BHR!  ;-)

Here are two alternative proposals:

1) Let's not do this.
2) Let's bring back perf-html's C++/JS drop-down and have a Default/Condensed mode selector, where default is my preferred mode and Condensed is Michael's.  Then if someone suggest other useful filterings we can add them too!

In general I think tools should show all of the information by default and provide decent filtering options.  This has worked pretty well in the call tree view in perf-html in practice IMO.

I also have a feature request!  Can you please make the pseudostacks columns regular selectable text?  I wanted to paste the pseudostacks on the bug, but couldn't.

Really great work, I'm going to play with it more.  Sorry I have been distracted with some other stuff for a little while (work week and the follow-up chaos!)  Super excited about where this is going to take us.  :-)
(Assignee)

Comment 74

7 days ago
Minor update:

- Put in a few tweaks to get it to cope better with the increased data from bug 1365749
- Undid the script condense work for now, will look into adding a condense mode to the UI at a later time
- Made the pseudo stack view's text selectable
You need to log in before you can comment on or make changes to this bug.