Create (or resurrect) BHR dashboard

Status: NEW
Product: Webtools :: Telemetry Dashboard
Opened: 2 months ago
Last modified: 3 days ago

People

(Reporter: mconley, Assigned: dthayer)

Tracking

(Depends on: 1 bug, Blocks: 2 bugs)
Version: Trunk

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [fce-active][qf:p1])

Attachments

(1 attachment, 4 obsolete attachments)

(Reporter)

Description

2 months ago
We collect BackgroundHangMonitor data via Telemetry for all pre-release builds, and we've been doing so for a while. This data includes pseudostacks and hang-time histograms for the main thread (both in the parent and content processes), compositor thread, vsync thread, the ProcessHangMonitor thread, and the ImageBridgeChild thread. There are probably others.

That data used to be displayed on the BHR dashboard, but that dashboard broke when Unified Telemetry came about. Then the dashboard was (rightly) decommissioned because it was just flat-out broken.

We want to bring that dashboard back from the dead. Specifically, ehsan is interested in getting a visualization to see how we're doing on things like sync IPC hangs in the wild.
(Reporter)

Comment 1

2 months ago
ni?ing ddurst as suggested.
Flags: needinfo?(ddurst)
Harald, do you think you would have cycles to help with this?  This dashboard will be the first one we'll be using for Quantum Flow work...
Flags: needinfo?(hkirschner)
Blocks: 1337841

Updated

2 months ago
Assignee: nobody → dothayer
Flags: needinfo?(ddurst)
Whiteboard: [fce-active]
I would probably split this issue into two work streams for aggregating hang data from incoming pings:

1) New aggregate for hang signatures for tracking hangs on a signature level

This should be guided by engineers and their hang use cases, starting from simply aggregating hang signatures with durations.

Happy to help to make this into a more elaborate diagnostic dashboard when we have the aggregate to easily draw data from.

2) New aggregate for hangs-per-minute for reporting on hangs as release quality metrics

Related: https://github.com/mozilla/telemetry-streaming/issues/20
Flags: needinfo?(hkirschner)
(Reporter)

Comment 4

2 months ago
Created attachment 8843097 [details]
parent-20170302.json

Hey chutten,

I took a 10% sample of all pings on Nightly for the past 3 days, and then extracted the BHR data for the main thread in the parent process. I've attached the results to this bug.

When I try to divide the number of hangs with a particular signature (in this case, the signature for me is the top frame from the pseudostack) by the total number of session minutes for all pings gathered, I end up getting a really, really, _really_ small number. Like, small enough that it displays as 0.00 even for the most frequent stack signature.

I assume I did something wrong here. Can you see in this pile of Python where I went wrong?

https://github.com/mikeconley/bhr-dash-data/blob/master/ipynb/Gather-Nightly-BHR-Stacks.ipynb
Flags: needinfo?(chutten)
(Assignee)

Comment 5

2 months ago
@mconley:

I think your issue is that in this section:

```
def score(grouping):
    grouping['hang_sum'] = grouping['hang_sum'] / total_sessions_length_m
    scored_stacks = []
    for stack_tuple in grouping['stacks']:
        score = stack_tuple[1] / total_sessions_length_m
        scored_stacks.append((stack_tuple[0], score))
        
    grouping['stacks'] = scored_stacks
    return grouping


scored_content_top_frames = {k: score(g) for k, g in content_top_frames.iteritems()}
scored_parent_top_frames = {k: score(g) for k, g in content_top_frames.iteritems()}
```

you modify |grouping| in place. Normally that would work as long as you only ran it once, but there's a typo at the bottom where you iterate content_top_frames a second time instead of iterating parent_top_frames, meaning you end up dividing by the total session length squared.
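For reference, a minimal sketch of a fix (building a new dict instead of mutating, and iterating the right variable on the second line); this follows the notebook's names but isn't the actual corrected code:

```
def score(grouping):
    # Build a new dict instead of mutating the input, so scoring can never
    # accidentally divide the same grouping by the session length twice.
    scored = dict(grouping)
    scored['hang_sum'] = grouping['hang_sum'] / total_sessions_length_m
    scored['stacks'] = [(stack, count / total_sessions_length_m)
                        for stack, count in grouping['stacks']]
    return scored


scored_content_top_frames = {k: score(g) for k, g in content_top_frames.iteritems()}
scored_parent_top_frames = {k: score(g) for k, g in parent_top_frames.iteritems()}
```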

Comment 6

2 months ago
Also, generally speaking we're expecting a small number. Amongst only hanging clients, hangs over 100ms only happened single-digit times per minute during the e10s a/b experiments[1]. Expanding the denominator to all clients' usage and reducing the numerator to only one signature's hang count should result in some satisfyingly-small numbers.

This is why crashes are expressed per thousand usage hours instead of per minute. They're just so darn infrequent (thank goodness).

We may wish to consider similar units for this style of analysis, just to lift it out of scientific notation. :)

[1]: https://github.com/mozilla/e10s_analyses/blob/master/beta/48/week7/e10s_experiment.ipynb
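(As a rough illustration of the unit change, with a made-up rate:)

```
# 1000 usage hours = 60,000 minutes, so the conversion factor is 60,000.
hangs_per_minute = 3e-5                       # hypothetical rate for one signature
hangs_per_1000_hours = hangs_per_minute * 60 * 1000
print(hangs_per_1000_hours)                   # 1.8 -- readable, no scientific notation
```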
Flags: needinfo?(chutten)
FWIW based on my local measurements, the hangs are *much* more frequent on the main thread of the content process than the main thread of the parent process, so we may be getting a larger number there.
(Reporter)

Comment 8

2 months ago
Created attachment 8843459 [details]
Parent and content process main thread hangs

Thanks for the help, everybody!

Here is a snapshot of main thread hangs for both the parent and content process main threads.

This is from a sample of 100%, so it's everybody, which means that this set is _large_. I didn't filter out the small potatoes - it's all potatoes.

It should be sorted in descending order by "score", which is a measure of how many hangs with that pseudostack top frame occurred per 1000 session minutes across the entire sampled population. That's the "hang_sum" field.

Let me know if this needs to be pared down - I know the JSON files are quite large, and I didn't try to pretty print them or anything.
Attachment #8843097 - Attachment is obsolete: true
Flags: needinfo?(ehsan)
(Reporter)

Comment 9

2 months ago
Created attachment 8845631 [details]
20170309-bhr-snapshot.zip

My script wasn't sorting the results, and wasn't gathering results from the parent process properly. So I took another snapshot today. Here it is.
Attachment #8843459 - Attachment is obsolete: true
(Reporter)

Comment 10

2 months ago
Comment on attachment 8845631 [details]
20170309-bhr-snapshot.zip

Let me format these for better readability.
Attachment #8845631 - Attachment is obsolete: true
(Reporter)

Comment 11

2 months ago
Created attachment 8845655 [details]
content-20170301-20170309.json.zip

Grrrr, my Spark cluster is being a pain. :( I'll have to try to get the parent samples tomorrow. Here are the content ones.
(Reporter)

Comment 12

2 months ago
Created attachment 8846002 [details]
20170310-bhr-snapshot.zip

Better snapshot.
Attachment #8845655 - Attachment is obsolete: true
Whiteboard: [fce-active] → [fce-active][qf:p1]
(Reporter)

Comment 13

2 months ago
I've scheduled the job to run daily, and to produce a zip file of the most recent snapshot. I'll post a link here when the first one is done.
(Assignee)

Comment 14

2 months ago
@mconley / anyone interested:

I'm paused on this right now as we're working on a study to do with Flash blocking, but I thought I'd give a brief update on the dashboard side of things.

I adapted your work to grab hang stats by day, just to give me a month's data to work with for the visuals. However, it's easy to pass in an arbitrary start and end, so if we ran this daily we wouldn't need to reach thirty days back for anything. Additionally, I added an approximate measure of the total milliseconds spent hanging, so that we can also plot the average milliseconds per hang, which could be useful.
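(For the record, the derived measure is just the ratio of the two aggregates; a trivial sketch with made-up numbers, not the notebook's actual column names:)

```
# hang_ms_total and hang_count are assumed per-day aggregates for one thread.
def avg_ms_per_hang(hang_ms_total, hang_count):
    return float(hang_ms_total) / hang_count if hang_count else 0.0

avg_ms_per_hang(1250000, 5000)  # -> 250.0 ms per hang on average
```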

Anyway, here's the gist if you'd like to review it: https://gist.github.com/squarewave/df4608de29dc13eb865f95dc42f3571e

And here's a screenshot of the dashboard so far: https://d3uepj124s5rcx.cloudfront.net/items/1u072q0u1n0u380Z1q1v/charts2.PNG

Let me know if that looks like a useful visualization of things. (Also, since it's not obvious from the UI - the bottom two charts reflect the stats of the most recent bar clicked on in the top chart.)
Kyle, are you able to help with building a diagnostic dashboard for BHR data?
Flags: needinfo?(klahnakoski)
(Reporter)

Comment 16

2 months ago
Hey dthayer,

This looks great! Reading the ipynb, I believe you're also correcting for a concern that both ehsan and I had with my original ipynb, which is that it doesn't capture both the frequency and severity of the hangs. So that's good - that appears to be addressed here.

This does, in my opinion, look like a very useful visualization!
Depends on: 1346415
Comment hidden (off-topic)
I might be able to help. Where is the data?
(Reporter)

Comment 19

a month ago
(In reply to Doug Thayer [:dthayer] from comment #14)
> @mconley / anyone interested:
> 
> Let me know if that looks like a useful visualization of things. (Also,
> since it's not obvious from the UI - the bottom two charts reflect the stats
> of the most recent bar clicked on in the top chart.)

At the risk of introducing scope creep, I have one minor request. The current ipynb is looking at the threadHangStats entries with the names "Gecko" and "Gecko_Child" (the latter in the childPayloads). These map to the main threads in the parent process and content processes, respectively.

These are massively useful in order to find out what things are janking the browser, as we paint and process user events on those main threads. So having that data is great, and this is a great first step.

There are, however, other items in those threadHangStats in both the parent payload and the childPayloads that would be useful to select here. For the most part, these represent different threads. For example, I know that there is one for the compositor thread / compositor process main thread. There's also a "special" BHR monitor on the content process main thread that watches for hangs during tab switch (this has the name "Gecko_Child_ForcePaint").

It would be monumentally useful if we could get access to _all_ of these "threads", in both the main payload and in the child payloads, and be able to select and visualize them individually.
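A minimal sketch of what enumerating every BHR thread could look like, assuming the main-ping layout documented for threadHangStats (field names here are assumptions, not the notebook's code):

```
def bhr_threads(ping):
    """Yield (process, thread_name, thread_hang_stats) for parent and children."""
    payload = ping.get('payload', {})
    for thread in payload.get('threadHangStats', []):
        yield ('parent', thread.get('name'), thread)
    for child in payload.get('childPayloads', []):
        for thread in child.get('threadHangStats', []):
            yield ('content', thread.get('name'), thread)
```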
Note to self: mconley's native stack results: https://s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/bhr-stacks/data/snapshot-daily.zip
Flags: needinfo?(ehsan)
(In reply to :Ehsan Akhgari from comment #20)
> Note to self: mconley's native stack results:
> https://s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/bhr-stacks/
> data/snapshot-daily.zip

*pseudo-stack results*
(Assignee)

Comment 22

a month ago
Quick update on this: I created a PR to mozilla-reports for the notebook.

https://github.com/mozilla/mozilla-reports/pull/41

And here's a screenshot of the dashboard right now:

https://pageshot.net/RQKoxnR9St6ks65X/localhost

It _should_ be showing more threads than that. Currently diagnosing why I'm only seeing three threads show up (most likely the notebook's fault and not the dashboard's).

If you have any feedback please let me know!
(Reporter)

Comment 23

a month ago
(In reply to Doug Thayer [:dthayer] from comment #22)
> 
> If you have any feedback please let me know!

"This is amazing" and "I want to get my hands on it as soon as possible". :)

Thank you for your work!
(Reporter)

Comment 24

a month ago
Oh, I should also point out that mystor and ehsan are working on getting more native stacks[1] into BHR. I wonder if we could find a way of presenting those in this dashboard as well.

Note that there's a symbolication step that's required before the information is useful. The Snappy Symbolication Server[2] is what we generally use to symbolicate things like profiles, but when I've been pulling native stacks from BHR myself, I've been running my own server locally.

[1]: http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/main-ping.html#threadhangstats - see the "native stacks" bits.
[2]: https://github.com/mozilla/Snappy-Symbolication-Server
(Assignee)

Comment 25

a month ago
Yep, it's on my roadmap!

I've played with Snappy a bit, so I was just waiting on more data in telemetry to start work on getting those symbolicated.

You mention you've been running Snappy locally - is this to avoid taxing the hosted Snappy servers too much? Do you think the job that we schedule for this will need to steer away from using the existing Snappy servers for the same reason?
(Reporter)

Comment 26

a month ago
(In reply to Doug Thayer [:dthayer] from comment #25)
> Yep, it's on my roadmap!

Whew. :)

> You mention you've been running Snappy locally - is this to avoid taxing the
> hosted Snappy servers too much?

Yeah, I made that choice after talking to ddurst about my plans. He suggested running it locally since he's not entirely confident in the current implementation that's stood up at symbolapi.m.o (which has been known to go down from time to time). You might want to talk with him about it directly.

I know that PeterB is working on re-engineering Snappy, and is looking for people to road test it.

Hey PeterB, do you have your server stood up somewhere that dthayer could potentially query against for testing?
Flags: needinfo?(peterbe)
Doug, thanks a lot for working on this, the screenshot looks really nice!  I'm excited to try this out.  Is there a saved version of the page or an experimental instance running somewhere that I can play with for a while?  I'd like to get a sense of how easy it is to interact with the page in order to correlate the information displayed in the various parts, and I'm also curious to see what it's like switching between threads, etc.  Thanks!
(Assignee)

Comment 28

a month ago
(In reply to :Ehsan Akhgari from comment #27)
> Doug, thanks a lot for working on this, the screenshot looks really nice! 
> I'm excited to try this out.  Is there a saved version of the page or an
> experimental instance running somewhere that I can play with for a while? 
> I'd like to get a sense of how easy it is to interact with the page in order
> to correlate the information displayed in the various parts, and I'm also
> curious to see what it's like switching between threads, etc.  Thanks!

Here's what I'm working with right now. It's a bit messy and still needs some odds and ends like loading indicators, etc., and the layout should probably adapt for narrower windows than what I'm using. Also if the performance is a problem, there's definitely more processing we can do on the notebook side to help out with that, so just let me know.

https://drive.google.com/a/mozilla.com/file/d/0B6b7hsu66Vd3R3NyTjFnSG5BZm8/view?usp=sharing
(In reply to Mike Conley (:mconley) (Catching up on reviews and needinfos) from comment #26)
> (In reply to Doug Thayer [:dthayer] from comment #25)
> > Yep, it's on my roadmap!
> 
> Whew. :)
> 
> > You mention you've been running Snappy locally - is this to avoid taxing the
> > hosted Snappy servers too much?
> 
> Yeah, I made that choice after talking to ddurst about my plans. He
> suggested running it locally since he's not entirely confident in the
> current implementation that's stood up at symbolapi.m.o (which has been
> known to go down from time to time). You might want to talk with him about
> it directly.
> 
> I know that PeterB is working on re-engineering Snappy, and is looking for
> people to road test it.
> 
> Hey PeterB, do you have your server stood up somewhere that dthayer could
> potentially query against for testing?

I do! It's http://snappy2-zero.herokuapp.com/ and it's based on the most recent Snappy-Symbolication-Server code, which is currently hosted on someone's Mac mini under a desk somewhere.

It's hosted on Heroku and its symbol cache is a bit limited, but the worst that can happen is that it's slightly slowed down.

Also, consider it a prototype as it sits there on Heroku. It'll be production-grade later this year and hosted properly in AWS.
Flags: needinfo?(peterbe)
(Assignee)

Comment 30

a month ago
@mconley,

Under what scenarios does hangs[].nativeStack.stacks have more than one entry? I can't find instances of this in a 1% sample of pings, but since we're only getting nativeStacks for hangs > 8s, that's not too revealing.
Flags: needinfo?(mconley)
(Reporter)

Comment 31

a month ago
(In reply to Doug Thayer [:dthayer] from comment #30)
> Under what scenarios does hangs[].nativeStack.stacks have more than one
> entry? I can't find instances of this in a 1% sample of pings, but since
> we're only getting nativeStacks for hangs > 8s, that's not too revealing.

Actually, never. This line of code makes it so that currently we only ever collect a single native stack ever:

http://searchfox.org/mozilla-central/rev/ee7cfd05d7d9f83b017dd54c2052a2ec19cbd48b/xpcom/threads/BackgroundHangMonitor.cpp#478-480
Flags: needinfo?(mconley)
(Assignee)

Comment 32

a month ago
Another quick update on this:

I'm waiting on review for the job, but as far as features go I think it's complete. It can be seen here: https://github.com/squarewave/background-hang-reporter-job/blob/master/background_hang_reporter_job/main.py

It is symbolicating all of the stacks in one request to Snappy, which I _think_ should ease Snappy's burden a little bit. We'll have to see how big that request gets though with a 100% sample of Nightly and a 128 ms threshold on native stack collection.
(Assignee)

Comment 33

a month ago
Another update - I ended up just skipping Snappy entirely and symbolicating directly in the job with symbols downloaded from https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/. It's much faster, but I can see some downsides. I don't imagine much will change with how we want to symbolicate our stacks, but I could see us moving the symbol server somewhere else and needing to change the URL here and in Snappy's config. I don't think that's too big of a problem though.
(In reply to Doug Thayer [:dthayer] from comment #33)
> Another update - I ended up just skipping Snappy entirely and symbolicating
> directly in the job with symbols downloaded from
> https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/
> . It's much faster, but I can see some downsides. I don't imagine much will
> change with how we want to symbolicate our stacks, but I could see us moving
> the symbol server somewhere else and needing to change the URL here and in
> Snappy's config. I don't think that's too big of a problem though.

I'm working on the new snappy server. Technically it'll be the first of a couple of features (all around symbols) under a project called "Symbol Server". 

Your code is pretty much the same in terms of the inner guts. (Mind you, you missed the wonderful Python builtin bisect for finding the nearest address in a sorted list of addresses. :)
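(For readers following along, the bisect trick looks roughly like this; the symbol-table layout here is an assumption, not Snappy's actual code:)

```
import bisect

# addrs: sorted list of function start addresses for one module;
# names[i]: the symbol corresponding to addrs[i] (e.g. parsed from a .sym file).
def nearest_symbol(addrs, names, offset):
    i = bisect.bisect_right(addrs, offset) - 1   # rightmost start address <= offset
    return names[i] if i >= 0 else hex(offset)
```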

"My" version is going to be awesome :) Including...
* Hosted and managed by Cloud Ops on some beefy nodes really near the S3 source.
* Has a fast LRU cache that keeps most of the commonly fetched symbols in a Redis database.
* The potential to talk to S3 directly instead of relying on good ol' HTTP GET https://us-west-2.a.c/specificbucket/$symbol/$id, which might give us a lot more leverage to draw from "dynamic buckets", i.e. if you have the credentials it could symbolicate using proprietary buckets.
* We might be able to do some smarts to support symbols from try builds as well. Just in our thoughts for now.
* Since this service will sit near the new symbol upload service (currently run within Socorro), we might have the potential to keep the caches really hot and thus make your lookups really fast, especially when symbolicating against things like try builds.

(I'm a bit oblivious to much of what's going on in this bug, but I just wanted to say hi from the future symbolication server.)
(Assignee)

Comment 35

27 days ago
Continuing the conversation from https://github.com/mozilla/mozilla-reports/pull/41 here.

> I think there's an easy fix for the aggregation code. At each stage track
> (hang_sum, hang_count, usage_hours) instead of the two ratios you're currently
> tracking. The issue is that the ratios (hang_ms_per_hour, hang_count_per_hour)
> do not aggregate well.

The problem with this approach is the flatMap on line 88:

```
def filter_for_hangs_of_type(pings):
    return pings.flatMap(lambda p: only_hangs_of_type(p))
```

We map each ping to a list of hangs for that ping, and we don't want to include the subsessionLength of one ping multiple times, or else it will be over-represented in the sum later. This is why I was thinking of doing a second pass. The other alternative is just to divide the sum that we include with each individual hang by the number of hangs for that ping, which would be a tiny bit confusing to read but would probably be better overall. Thoughts?
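Concretely, that second alternative would look something like this (simplified field names, not the actual job code):

```
def hangs_with_usage_share(ping):
    hangs = only_hangs_of_type(ping)
    if not hangs:
        return []
    # Split the ping's usage across its hangs so that summing the usage
    # column over all emitted hang records counts each ping exactly once.
    usage_hours = ping['payload']['info']['subsessionLength'] / 3600.0
    share = usage_hours / len(hangs)
    return [(hang, share) for hang in hangs]


def filter_for_hangs_of_type(pings):
    return pings.flatMap(hangs_with_usage_share)
```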

Also I was wondering what the best way to include a package in the job is. I was thinking install_requires would bring eventlet in when we run bdist_egg, but it doesn't seem to. I'm not too familiar with python packaging, so help in this area would be appreciated. I can just add a `!pip install` or something, but that seems wrong.
Flags: needinfo?(rharter)
(Assignee)

Comment 36

27 days ago
(For now I ended up going down the route of dividing the usage_hours by the number of hangs for a given ping, and that should be reflected in the latest commit on the GitHub repo.)
(In reply to Doug Thayer [:dthayer] from comment #35)
> Continuing the conversation from
> https://github.com/mozilla/mozilla-reports/pull/41 here.
> 
> > I think there's an easy fix for the aggregation code. At each stage track
> > (hang_sum, hang_count, usage_hours) instead of the two ratios you're currently
> > tracking. The issue is that the ratios (hang_ms_per_hour, hang_count_per_hour)
> > do not aggregate well.
> 
> The problem with this approach is the flatMap on line 88:
> 
> ```
> def filter_for_hangs_of_type(pings):
>     return pings.flatMap(lambda p: only_hangs_of_type(p))
> ```
> 
> We map each ping to a list of hangs for that ping, and we don't want to
> include the subsessionLength of one ping multiple times, or else it will be
> over-represented in the sum later. This is why I was thinking of doing a
> second pass. The other alternative is just to divide the sum that we include
> with each individual hang by the number of hangs for that ping, which would
> be a tiny bit confusing to read but would probably be better overall.
> Thoughts?

I see the problem now. A second pass makes sense to me. I think dividing by the number of hangs will distort the un-aggregated values. 

> Also I was wondering what the best way to include a package in the job is. I
> was thinking install_requires would bring eventlet in when we run bdist_egg,
> but it doesn't seem to. I'm not too familiar with python packaging, so help
> in this area would be appreciated. I can just add a `!pip install` or
> something, but that seems wrong.

Unfortunately, calling `!pip install` from the notebook won't help here either. The package needs to be installed before starting the spark cluster or the child nodes will not have the package. To the best of my knowledge, you'll need to download and build each package you want to include, which can get unwieldy.

Instead, we could schedule this job on Airflow [2], which is more reliable and allows you to execute generic shell statements. There's an example script here [1] showing what that would look like. This would allow you to save and load a requirements.txt file.

[0] https://github.com/squarewave/background-hang-reporter-job/blob/master/scheduling/load_and_run.ipynb
[1] https://github.com/squarewave/background-hang-reporter-job/blob/master/scheduling/airflow.sh
[2] https://github.com/mozilla/telemetry-airflow
Flags: needinfo?(rharter)
(Reporter)

Comment 38

6 days ago
Any updates here? Is this bug blocked?
Flags: needinfo?(dothayer)
(Assignee)

Comment 39

5 days ago
It's not necessarily blocked, but it has been a bit on the back-burner waiting for Bug 1346415 to land. The job is now symbolicating stacks and I wanted to eyeball test how the dashboard will behave once we have more native stacks to work with. I'm hoping it won't be too noisy once native stacks are in, but it could be since native stacks will have more room to be different from each other in ways that we don't as users actually care about. Hopefully I'm making sense here.
Flags: needinfo?(dothayer)
(In reply to Doug Thayer [:dthayer] from comment #39)
> It's not necessarily blocked, but it has been a bit on the back-burner
> waiting for Bug 1346415 to land. The job is now symbolicating stacks and I
> wanted to eyeball test how the dashboard will behave once we have more
> native stacks to work with. I'm hoping it won't be too noisy once native
> stacks are in, but it could be since native stacks will have more room to be
> different from each other in ways that we don't as users actually care
> about. Hopefully I'm making sense here.

I was talking with bsmedberg yesterday a bit about what we would be looking for from a background hang data pipeline and dashboard. The basic idea we had was to classify the native stacks for hang reports based on common roots. For example:
- If a large number of hangs have identical stacks, that probably means it's a simple hang, like a lock being held for a long time or waiting on synchronous network I/O. In that case we can just categorize based on the full native stack.
- Other hangs can then be categorized manually by identifying the "stable root" of the hang. When you choose one, it should find other hangs which share that stable root and group them together.
I imagine that doing something like that would require some sort of UI; maybe you or someone else could build that in a follow-up?

In addition, with this type of categorization it would be nice if we could associate bug numbers with these hangs.

I'm hoping to land bug 1346415 really soon but just made one last change earlier today - I'm hoping to land it pretty much as soon as that change is r+-ed.
(Assignee)

Comment 41

5 days ago
So, I'm thinking a UI similar to a profiler might be handy for this.

Maybe a workflow like this:

- Select a date
- See a tree view, with a root node and branches for the bottom frame of each stack in the profiler data. To the left of each node we'll have stats on total time spent in this node and all descendant nodes, as well as time spent in this node alone (like in a profiler)
- After selecting a node, you can see the stats for that stack over the last month

I imagine it would look a lot like https://perf-html.io.
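As a rough sketch of the aggregation such a tree view needs (not tied to any particular dashboard code; the frame ordering is an assumption):

```
def build_hang_tree(stacks):
    """stacks: iterable of (frames, hang_ms), frames ordered root -> leaf.
    Returns a nested dict of total/self hang time per node."""
    root = {'total': 0.0, 'self': 0.0, 'children': {}}
    for frames, hang_ms in stacks:
        root['total'] += hang_ms
        node = root
        for frame in frames:
            node = node['children'].setdefault(
                frame, {'total': 0.0, 'self': 0.0, 'children': {}})
            node['total'] += hang_ms
        node['self'] += hang_ms  # the leaf frame gets the self time
    return root
```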

Thoughts?
(In reply to Doug Thayer [:dthayer] from comment #41)
> So, I'm thinking a UI similar to a profiler might be handy for this.
> 
> Maybe a workflow like this:
> 
> - Select a date
> - See a tree view, with a root node and branches for the bottom frame of
> each stack in the profiler data. To the left of each node we'll have stats
> on total time spent in this node and all descendant nodes, as well as time
> spent in this node alone (like in a profiler)
> - After selecting a node, you can see the stats for that stack over the last
> month
> 
> I imagine it would look a lot like https://perf-html.io.
> 
> Thoughts?

I hadn't thought of designing the UI like that. I really like the idea :).

One problem which might come up is that right now we can collect multiple hangs which have the same pseudostack, and we give them all the same native stack. It would be better for us to send a separate native stack for every hang, but that would require some more client-side work (to collect this extra information), and probably a separate ping mechanism to avoid the 1 MB maximum telemetry ping size. For now, though, I suppose we could just associate the total approximate length of all hangs (from the histogram) with each native stack.
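A rough sketch of that approximation (assuming the usual telemetry exponential-histogram layout, where bucket labels are lower-bound hang durations in ms; not actual job code):

```
def approx_total_hang_ms(histogram):
    # histogram['values'] maps bucket lower bound (ms) -> number of hangs in
    # that bucket; using the lower bound gives a conservative estimate.
    return sum(int(bucket) * count
               for bucket, count in histogram.get('values', {}).items())
```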

Ehsan had some more thoughts on this when I was talking to him earlier.
Flags: needinfo?(ehsan)
(Assignee)

Comment 43

5 days ago
At a micro scale this should cause pings to misrepresent hangs, but on a macro scale it should even out, no? Since if native stack A gets credit for native stack B's hangs, it will only do so some percentage of the time which is equal to A's share of the hang time anyway, no?
(In reply to Doug Thayer [:dthayer] from comment #43)
> At a micro scale this should cause pings to misrepresent hangs, but on a
> macro scale it should even out, no? Since if native stack A gets credit for
> native stack B's hangs, it will only do so some percentage of the time which
> is equal to A's share of the hang time anyway, no?

I think this is probably correct with a large enough sample size, yes.

In the future I would still like to have every native stack associated with the amount of time that its hang took, but we can get to that in the future.
(Assignee)

Comment 45

5 days ago
Been digging into this a bit today, and I think it shouldn't be too crazy of a lift to get a profiler-like view of hang data by putting this into perf.html.

- Modify my BHR job to fake it by outputting data in the format that perf.html expects for saved profiles.
- See what it looks like to load that. It will obviously be a bit clunky because perf.html thinks that it's dealing with a profile that has a real timeline. However, the tree view of hangs should be working at this point.
- Add a route handler in perf.html that will download one of these from S3 instead of having to browse for it.
- Strip out the flame graphs in the top section if we're showing hang data, since those only make sense with a single profile.
- Add in a graph to show how hang time for the currently selected node has changed over the past month, per thread. This would require a change to the data format that perf.html expects.

mstange, is this a completely crazy idea, or am I right that this could work without adding too much complexity to the perf.html code?
Flags: needinfo?(mstange)
Moderately crazy, I'd say :)

We're adding more and more time-based data views to perf.html and it would be really unfortunate to have to condition them all on some isFakeProfile flag. I'd prefer one of these two approaches instead: Either fork perf.html and build it into a UI that's targeted at showing BHR data, or add a completely different top level view + redux state branch to perf.html that's just targeted at showing BHR data and shares some of the React components with the profile view.
Flags: needinfo?(mstange)
(Assignee)

Comment 47

3 days ago
Here's a link to a "profile" generated from the last 30 days of BHR data. It is only displaying native stacks, so you'll only see hangs represented which lasted longer than eight seconds. The "ms" in the profile should be interpreted as milliseconds of hanging per 1000 usage hours.

https://perfht.ml/2pZu1l7

I'd love any help in eyeballing it to see if it feels like this will be a useful representation to pursue in a less hacky manner.
(Assignee)

Comment 48

3 days ago
Scratch that - use this link instead: https://perfht.ml/2pZrZRU
Nice! I really like the idea of the UI you proposed. Unfortunately I can't really interact with the example profile because it seems to hang my whole system for a few seconds whenever the selection changes. We might be overwhelming something with the thread stack graph drawing. Can you work around that by assigning different times to the different callstacks, so that the graph draws smaller rectangles that are next to each other, instead of one large rectangle being stacked on top of itself a large number of times? I know there isn't really a meaningful time stamp value to associate with a sample, but any arbitrary order should be enough to avoid this problem.
(Assignee)

Comment 50

3 days ago
I modified it like you said, and it didn't affect the performance much on my machine. I created a profile for only one day's data though, which should perform much better:

https://perfht.ml/2pZHx8a
(Assignee)

Comment 51

3 days ago
Sorry for the noise on this - making an artificial timeline surfaced a bug in my processing to get it to the profile format. Linking a new version. (As you might be able to tell this is all a bit rough right now.)

https://perfht.ml/2pZTydN
First things first, I *really* like this idea, Doug, well done!  I have been thinking about how we can visualize this data for so long, and all this time I've also been looking at perf-html.io tabs, yet I couldn't put these two things together. Thanks for thinking outside of the box!

I think there are maybe a few modifications that we may want to make to the way that the UI models the timeline.  For example, I'm not sure if I understand what the timeline itself maps to. Would it map to data coming from each day?  But then we wouldn't have the temporal data that would allow us to sort these samples on the X axis.  Or do we?

Another thing which I'm trying to wrap my head around is how we are going to represent hangs of different lengths on top of each other.  I think it would be nice if we could somehow make it so that hangs that are more severe show up as larger in the UI.  One way to do it is to simply make hangs consume space proportional to the number of times they have occurred (as if the number of occurrences corresponded to the number of samples in the normal perf-html UI).

I think these are probably all UI challenges that we can solve by iterating over it, I'm super excited to see what comes out of this effort!
Flags: needinfo?(ehsan)