We may be blowing up our telemetry ping size due to increased BHR submissions

RESOLVED FIXED in mozilla55

Status

enhancement

People

(Reporter: Ehsan, Assigned: Ehsan)

Tracking

unspecified
mozilla55

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [measurement:client:tracking])

Michael discovered that we are seeing a surprisingly low number of telemetry reports with 300 native stacks from BHR. Our running theory is that the limit of 300 was way too high and that we're breaking all telemetry submissions by blowing up the ping size.

I'm going to land a patch on m-c to bring down this limit to 15 to stop the bleeding.  Apologies for this.
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/7e0e20683d5a
Bring down the number of native stacks submitted through BHR temporarily in the hopes of not blowing up the telemetry ping sizes on Nightly for Windows x86; r=mystor a=kwierso
Blocks: 1346415
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Whiteboard: [measurement:client]
Whiteboard: [measurement:client] → [measurement:client:tracking]
Is there an open bug to track the next steps for this?
Flags: needinfo?(michael)
In all likelihood, what was really happening here was bug 1364556 and we were just incorrectly assuming this is our fault...

Do you think that theory makes sense?
Flags: needinfo?(gfritzsche)
(In reply to Georg Fritzsche [:gfritzsche] from comment #2)
> Is there an open bug to track the next steps for this?

Filed bug 1365029
Flags: needinfo?(michael)
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #3)
> In all likelihood, what was really happening here was bug 1364556 and we
> were just incorrectly assuming this is our fault...
> 
> Do you think that theory makes sense?

Is the job getting the BHR data a Python job?
Bug 1335343 started breaking Python data jobs from ~ May 11 on.
If the missing data is from that date on, this makes sense.
Flags: needinfo?(gfritzsche)
(In reply to Georg Fritzsche [:gfritzsche] from comment #5)
> (In reply to :Ehsan Akhgari (super long backlog, slow to respond) from
> comment #3)
> > In all likelihood, what was really happening here was bug 1364556 and we
> > were just incorrectly assuming this is our fault...
> > 
> > Do you think that theory makes sense?
> 
> Is the job getting the BHR data a Python job?
> Bug 1335343 started breaking Python data jobs from ~ May 11 on.
> If the missing data is from that date on, this makes sense.

The reason we got scared is that when we implemented the background hang reporter, we set a limit of 300 native stacks sent to the server. In our personal usage we noticed that we usually get at least a couple of hangs within the first few minutes of running the browser. However, once we started looking at the data, we realized that, of the pings which reached my Spark script on analysis.telemetry.mozilla.org:

a) Not a single one filled all 300 native stacks, and 
b) Our average stack count was very low.

This could just be because our data is very clumpy, but it seemed to signal that we were exceeding the maximum size of telemetry pings whenever we collected a large number of native stacks, so we lowered the limit.
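
For anyone who wants to repeat the check, here is a minimal sketch of the kind of summary described above. It is not the actual Spark script: it is plain Python, and the ping layout (a dict with a `native_stacks` list) and the helper name `summarize_stack_counts` are assumptions made for illustration only.

```python
from statistics import mean

# The limit discussed in this bug; every other name here is hypothetical.
STACK_LIMIT = 300


def summarize_stack_counts(pings, limit=STACK_LIMIT):
    """Summarize how many native stacks each ping carries.

    `pings` is assumed to be an iterable of dicts with an optional
    "native_stacks" list. This layout is an assumption, not the real
    BHR ping schema.
    """
    counts = [len(p.get("native_stacks", [])) for p in pings]
    if not counts:
        return {"pings": 0, "at_limit": 0, "mean": 0.0, "max": 0}
    return {
        "pings": len(counts),
        # Pings that filled every available slot.
        "at_limit": sum(1 for c in counts if c >= limit),
        "mean": mean(counts),
        "max": max(counts),
    }
```

If `at_limit` is zero and `mean` is small across a large sample, that matches observations (a) and (b), although it cannot by itself distinguish dropped oversized pings from data that is simply clumpy.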

It would be really nice to have someone who understands telemetry help us figure out what is actually going on here. For example, it would be good to know if there was a dropoff in telemetry ping submissions from Nightly after the 128ms BHR patch landed on May 2.
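
A rough sketch of the dropoff check I have in mind, assuming we can get one submission date per ping. The function names, the 7-day window, and the input format are all assumptions for illustration, not an existing telemetry API.

```python
from collections import Counter
from datetime import date, timedelta


def mean_daily_volume(counts, start, end):
    """Mean pings per day over [start, end); days with no pings count as zero."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end must be after start")
    total = sum(n for d, n in counts.items() if start <= d < end)
    return total / days


def relative_change(submission_dates, landing, window_days=7):
    """Relative change in mean daily ping volume around `landing`.

    `submission_dates` is assumed to be an iterable of datetime.date,
    one per received ping. Returns None if the window before the
    landing date is empty.
    """
    counts = Counter(submission_dates)
    window = timedelta(days=window_days)
    before = mean_daily_volume(counts, landing - window, landing)
    after = mean_daily_volume(counts, landing, landing + window)
    if before == 0:
        return None
    return (after - before) / before
```

A clearly negative result for `relative_change(dates, date(2017, 5, 2))` (the year is assumed here) would be consistent with pings being dropped after the patch landed. A value near zero would point away from the ping size theory.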
Flags: needinfo?(gfritzsche)
Redirecting - Mark, can you follow up here or redirect?

Should we take the discussion to bug 1365029?
Flags: needinfo?(gfritzsche) → needinfo?(mreid)
Ben said he could help look into this. Thanks Ben!
Flags: needinfo?(mreid) → needinfo?(bmiroglio)
mystor and I looked into this and found that there were no issues with ping submissions. See Bug 1365749.
Flags: needinfo?(bmiroglio)