Analyze TELEMETRY_SEND_FAILURE data from Beta

RESOLVED FIXED

Status


Toolkit
Telemetry
P2
normal
RESOLVED FIXED
8 months ago
6 months ago

People

(Reporter: chutten, Assigned: chutten)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [measurement:client])


Description

8 months ago
As of bug 1367110 we have about a week's worth of data from Nightly 55 on how TelemetrySend can fail to send pings.

Once we can look at broader data from Beta, we should write up an analysis and bring it to the pipeline side, and maybe necko, for comments and suggestions.

Things to keep in mind:

1) It appears as though the different failure cases come in at different speeds. eChannelOpen is much faster to come in than eUnreachable, for instance.

2) These samples only come from users who failed to send in the past but have since succeeded in sending, which is how we hear about the failure at all. They are also self-reported.

3) Do errors pile up? If you have four pending pings, do we get four samples of eUnreachable or just one?

Updated

6 months ago
Blocks: 1385343

Comment 1

6 months ago
Beta 55's data is an interesting study: https://mzl.la/2eTyIfV

Build-over-build variance is very low except for the current build, due no doubt to eChannelOpen coming in much faster than eUnreachable: https://mzl.la/2eTAGNr

This makes sense if you think of eChannelOpen as some sort of weird temporary network issue and eUnreachable as a function of the network itself. The former we receive with some chance at every 10-minute retry interval; the latter we don't receive until after the user changes to a network that can reach inbound.

50% eUnreachable
40% eChannelOpen
10% timeout

Could be worse: 203M failure samples versus 265M successful transmissions.
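To put the figures above in proportion, here is a minimal Python sketch of the arithmetic. It assumes each failure sample counts as one send attempt (which overstates failures, given the inflation discussed in this comment), and the per-category counts are derived from the rounded percentages rather than measured. The function names are hypothetical, for illustration only.

```python
# Rounded figures quoted in this comment; nothing here is newly measured.
FAILURE_SAMPLES = 203_000_000
SUCCESSFUL_SENDS = 265_000_000
SHARES = {"eUnreachable": 0.50, "eChannelOpen": 0.40, "timeout": 0.10}


def failure_fraction(failures: int, successes: int) -> float:
    """Share of failures among all recorded outcomes (failures + successes)."""
    return failures / (failures + successes)


def samples_per_category(total: int, shares: dict) -> dict:
    """Approximate sample count per failure type from the rounded shares."""
    return {name: round(total * share) for name, share in shares.items()}


if __name__ == "__main__":
    print(f"failure share: {failure_fraction(FAILURE_SAMPLES, SUCCESSFUL_SENDS):.1%}")
    for name, count in samples_per_category(FAILURE_SAMPLES, SHARES).items():
        print(f"{name}: ~{count / 1e6:.1f}M samples")
```

Under that assumption, failures make up roughly 43% of recorded outcomes, which is an upper bound on the real failure rate rather than an estimate of it.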

These numbers are inflated, because a single failure doesn't stop us from attempting to send the rest of the pending pings. Considering that 73% of pending pings are less than a day old, and another 7% are between one and two days old (https://mzl.la/2h97cMi), _few_ pings stay pending for very long. As a result the inflation shouldn't be too terrible.

Unfortunately the inflation will be uneven. Transient failures like eChannelOpen and timeout are likely to be inflated by a factor equal to the number of pending pings on that tick (maybe 1 or 2 or 3).

But eUnreachable failures last longer (the buildid evolution suggests it may take as long as two weeks for all the eUnreachables to come in), so we have more ticks and more pings compounding the inflation. Inflation becomes the sum of compounded pings times the number of scheduler ticks we attempt to send on.
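The uneven inflation can be illustrated with a small Python sketch. It rests on an unconfirmed assumption: that every pending ping records one failure sample on every tick where sending fails, which is exactly the open question in item 3 of the description. The function name and the tick counts are hypothetical, chosen only to show the shape of the effect.

```python
def recorded_samples(pending_pings_per_tick):
    """Total failure samples, ASSUMING one sample per pending ping per failing tick.

    `pending_pings_per_tick` lists how many pings were pending on each
    scheduler tick where the send attempt failed.
    """
    return sum(pending_pings_per_tick)


# Transient failure (eChannelOpen or timeout): one bad tick with 3 pings pending.
transient = recorded_samples([3])

# Persistent failure (eUnreachable): four bad ticks while the backlog grows 1 -> 4.
persistent = recorded_samples([1, 2, 3, 4])

if __name__ == "__main__":
    print("transient samples:", transient)
    print("persistent samples:", persistent)
```

In this toy model one transient event yields 3 samples while one persistent event yields 10, so a raw comparison of sample counts would overweight eUnreachable relative to the number of affected users.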

--

Overall, this is very interesting but unfortunately very little of it is actionable. I've filed bug 1385343 to expand our understanding by going ever deeper, exposing ever more precise information about the networking failures.

Updated

6 months ago
Status: NEW → RESOLVED
Last Resolved: 6 months ago
Resolution: --- → FIXED