Limit the number of thread hang stats reported to Telemetry

RESOLVED FIXED in Firefox 43

Status

()

Toolkit
Telemetry
P1
normal
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: Dexter, Assigned: Dexter)

Tracking

(Blocks: 1 bug)

Trunk
mozilla44
Points:
1
Dependency tree / graph

Firefox Tracking Flags

(firefox42 wontfix, firefox43 fixed, firefox44 fixed)

Details

(Whiteboard: [unifiedTelemetry][measurement:client])

Attachments

(1 attachment, 2 obsolete attachments)

(Assignee)

Description

3 years ago
In bug 1187864 comment 3 we found out that some pings report too many thread hang stats.

Since the repeated "(chrome script)" stack entries are not really useful we should collapse those into one "(chrome script)" or "(chrome script <N>)".
(Assignee)

Updated

3 years ago
Blocks: 1186986
Points: --- → 2
Priority: -- → P2
Whiteboard: [unifiedTelemetry][measurement:client]
(Assignee)

Comment 1

3 years ago
We should also limit the number of reported hangs.
So, per bug 1187864 there are different things we can do here:
(1) limit the number of hang stat entries
(2) limit the stack depth to a sane number
(3) collapse repeated successive "(chrome script)" to one "(chrome script)" or "(chrome script <N>)"

For (3) i think we should just stay with collapsing into the simple "(chrome script)" (unless there are compelling reasons to make it more complicated).
I'd be fine with moving this part to a follow-up if it is no big win.
Points: 2 → 3
(Assignee)

Updated

3 years ago
Assignee: nobody → alessio.placitelli
(Assignee)

Updated

3 years ago
status-firefox42: --- → wontfix
status-firefox43: --- → affected
(Assignee)

Updated

3 years ago
Priority: P2 → P1
(Assignee)

Comment 3

3 years ago
Analysing the data from bug 1191846, we found out that fixing the stuff from point (2) and (3) in comment 2, along with the future fix from bug 1213780, is enough to bring down the size of the majority of the big pings to less than 1Mb.

This means that we're not limiting the number of thread hang stat entries, for the moment.

In order to find a reasonable maximum stack depth for (2), we've used a python notebook [1] to check the 95th percentile of BHR stack depths received by the Telemetry servers for the Nightly population. This value is 11, so that's the maximum depth our reported stacks will have.

Thanks Vladan and Georg for their feedback and suggestions.

[1] - https://gist.github.com/Dexterp37/fc4a043fb442d4e4be90
(Assignee)

Comment 4

3 years ago
Created attachment 8673517 [details] [diff] [review]
bug1211411.patch

This patch changes the BackgroundHangMonitor to clean up the reported hang stacks by collapsing repeated "(chrome script)"/"(content script)" entries.

It also limits the stack depth by removing the topmost frames if greater than the maximum allowed depth.
(Assignee)

Updated

3 years ago
Attachment #8673517 - Flags: review?(vladan.bugzilla)
Status: NEW → ASSIGNED
Comment on attachment 8673517 [details] [diff] [review]
bug1211411.patch

Review of attachment 8673517 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/threads/BackgroundHangMonitor.cpp
@@ +32,5 @@
>  #define BHR_BETA_MOD 100;
>  
> +// Maximum depth of the call stack in the reported thread hangs. This value represents
> +// the 95th pecentile of the thread hangs stack depths reported by Telemetry.
> +const int kMaxThreadHangStackDepth = 11;

static const int

@@ +417,5 @@
> +  // Limit the depth of the reported stack if greater than our limit. Only keep its
> +  // last entries.
> +  if (mHangStack.length() > kMaxThreadHangStackDepth) {
> +    const int elementsToRemove = mHangStack.length() - kMaxThreadHangStackDepth;
> +    mHangStack.erase(mHangStack.begin(), mHangStack.begin() + elementsToRemove);

you can replace this with:

mHangStack.resize(std::min(mHangStack.length, kMaxThreadHangStackDepth));
Attachment #8673517 - Flags: review?(vladan.bugzilla) → review+
(Assignee)

Comment 6

3 years ago
Created attachment 8674405 [details] [diff] [review]
bug1211411.patch

Thanks for the review. This patch addresses your first comment and, as discussed over IRC, I've kept the second change and added a little comment there.

Moreover, this patch adds the "(reduced stack)" entry to mark that the stack was reduced.
Attachment #8673517 - Attachment is obsolete: true
Attachment #8674405 - Flags: review?(vladan.bugzilla)
Attachment #8674405 - Flags: review?(vladan.bugzilla) → review+
(Assignee)

Comment 8

3 years ago
Created attachment 8674878 [details] [diff] [review]
bug1211411.patch

This patch fixes the test fallout by changing the int constant to a size_t.

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=78a986b87c96
Attachment #8674405 - Attachment is obsolete: true
Attachment #8674878 - Flags: review+
(Assignee)

Comment 9

3 years ago
https://hg.mozilla.org/integration/fx-team/rev/73dbb4a043f8698c835e9b0e2ca4817ecc03f4bf
Bug 1211411 - Limit the number of thread hang stats reported to Telemetry. r=vladan
(Assignee)

Updated

3 years ago
Points: 3 → 1
https://hg.mozilla.org/mozilla-central/rev/73dbb4a043f8
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
status-firefox44: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla44
Alessio: how do you feel about uplifting this patch and the patch in bug 1219751 to Beta 43? It's already on Aurora 44 afaict.

If you're ok with uplifting these, I'd like to get them in by the end of the week, in time for next week's A/B experiment
Blocks: 1222894
Status: RESOLVED → REOPENED
Flags: needinfo?(alessio.placitelli)
Resolution: FIXED → ---
(Assignee)

Comment 12

3 years ago
Comment on attachment 8674878 [details] [diff] [review]
bug1211411.patch

Approval Request Comment
[Feature/regressing bug #]: Limit the depth of thread hang stacks, consistently reducing the size of the pings sent to Telemetry servers.
[User impact if declined]: This isn't a user-facing feature. However, without this patches bug 1222894 could potentially cause sending a lot of oversized pings, wasting both storage and processing time server side, thus increasing costs.
[Describe test coverage new/current, TreeHerder]: This has been on m-c for 3 weeks without any reported issue.
[Risks and why]: Low risk. This only touches Telemetry & Thread Hang stacks, no front facing features or other systems.
[String/UUID change made/needed]: None.

Please note that this is part of an uplift stack required for bug 1222894. This is the stack (the patches correctly apply in order to mozilla-beta): 1213780, 1211411, 1215540, 1219751.
Flags: needinfo?(alessio.placitelli)
Attachment #8674878 - Flags: approval-mozilla-beta?
Comment on attachment 8674878 [details] [diff] [review]
bug1211411.patch

OK on m-c and m-a for a while now.  
Please uplift this as part of the telemetry work for beta 4.
Attachment #8674878 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/releases/mozilla-beta/rev/59e6a978773f
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago3 years ago
Resolution: --- → FIXED

Updated

3 years ago
status-firefox43: affected → fixed
You need to log in before you can comment on or make changes to this bug.