saved-session ping size and ping frequency are causing bandwidth issues on Android

NEW
Unassigned

Status

Toolkit
Telemetry
2 years ago
a year ago

People

(Reporter: mfinkle, Unassigned)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [TPE-1])

Looking at the last 7 days of Nightly, I see that we send a lot of raw data. Some people have ~34MB of raw pings sent per day. This happens on Wifi and Cell Data.

We need to look into how to mitigate this issue.

I did some exploration in this doc:
https://docs.google.com/document/d/1-YXxlKutU31BS5WjEjd0niDA3PE9oa8QfxqB6rGzQHA/edit#

Comment 1

2 years ago
To clarify, this is only about the old opt-in telemetry, right? So not an issue caused by the new core ping.
Flags: needinfo?(mark.finkle)

Comment 2

(In reply to :Margaret Leibovic from comment #1)
> To clarify, this is only about the old opt-in telemetry, right? So not an
> issue caused by the new core ping.

Yes. This is an issue with "saved-session" telemetry managed in Gecko's Telemetry system.
Flags: needinfo?(mark.finkle)

Comment 3

2 years ago
Do we have logs or a record of the Content-Encoding negotiation to ensure that we are sending compressed payloads? All we need is a couple of example HTTP header exchanges to see both the encoding and the length of the content.
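
For a rough feel of what compression could save, the sketch below gzips a JSON-serialized payload and compares byte counts. It is only an illustration: `makeSyntheticPayload`, `measureCompression` and the `HYPOTHETICAL_PROBE_*` names are made up, and the payload is synthetic, so real saved-session pings may compress quite differently.

```javascript
const zlib = require("zlib");

// Build a synthetic stand-in for a ping payload with many similar histogram
// entries. The shape is an assumption, not the real saved-session format.
function makeSyntheticPayload(histogramCount) {
  const histograms = {};
  for (let i = 0; i < histogramCount; i++) {
    histograms[`HYPOTHETICAL_PROBE_${i}`] = {
      range: [1, 1000],
      bucket_count: 10,
      values: { 0: i % 7, 1: i % 3, 10: 1 },
      sum: i,
    };
  }
  return { payload: { histograms } };
}

// Return the raw and gzip-compressed byte sizes of a JSON-serialized ping.
function measureCompression(ping) {
  const raw = Buffer.from(JSON.stringify(ping), "utf8");
  const compressed = zlib.gzipSync(raw);
  return { rawBytes: raw.length, gzipBytes: compressed.length };
}

const sizes = measureCompression(makeSyntheticPayload(500));
console.log(sizes);
```

Comparing these two numbers against the Content-Length seen in a captured request would show whether the payload on the wire is compressed.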

Comment 4

A simple mitigation strategy is one advanced by most app stores: only use data when reasonably certain it will travel on a Wi-Fi transport.
In Gecko JS, do we have info for "on Wifi" vs. "on cell data"?
A simple first cut would be to not upload the pings on cell data.

Re. ping size - this actually improved a lot over the last two quarters from the UT work.
We found we had some unbounded fields in there leading to rather extreme ping sizes.

The ping count per client seems high; in the short term, aiming for one per Gecko lifetime (instead of one per "app went to background") seems feasible.
Everything else would mean stitching data across sessions etc., which probably gets complicated and is of questionable value.
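
The "don't upload on cell data" first cut could look roughly like the sketch below. It is not TelemetrySend.jsm code: `ConnectionType`, `routePing` and the `send`/`archive` callbacks are hypothetical names, and the connection type is assumed to be supplied by the caller.

```javascript
// Hypothetical connection types; the names Gecko exposes may differ.
const ConnectionType = Object.freeze({
  WIFI: "wifi",
  CELLULAR: "cellular",
  UNKNOWN: "unknown",
});

// Route a ping based on the connection type: send it on Wi-Fi, otherwise
// keep it in the archive so that no data is lost.
function routePing(ping, connectionType, { send, archive }) {
  if (connectionType === ConnectionType.WIFI) {
    send(ping);
    return "uploaded";
  }
  archive(ping);
  return "archived";
}

// Usage with in-memory stand-ins for the real sender and archive.
const sent = [];
const archived = [];
const handlers = {
  send: (ping) => sent.push(ping),
  archive: (ping) => archived.push(ping),
};
routePing({ id: 1 }, ConnectionType.WIFI, handlers);
routePing({ id: 2 }, ConnectionType.CELLULAR, handlers);
console.log(sent.length, archived.length);
```

Treating an unknown connection type like cell data is a conservative choice made for this sketch; the opposite default would be equally easy to express.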

Comment 5

2 years ago
There is a proposed Network Information API that we could probably add support for: https://developer.mozilla.org/en-US/docs/Web/API/Network_Information_API

Comment 6

2 years ago
(In reply to Georg Fritzsche [:gfritzsche] from comment #4)
> In Gecko JS, do we have info for "on Wifi" vs. "on cell data"?
> A simple first cut would be to not upload the pings on cell data.

Yes. Here's an example:
http://mxr.mozilla.org/mozilla-central/source/mobile/android/modules/HomeProvider.jsm#85

Comment 7

(In reply to Georg Fritzsche [:gfritzsche] from comment #4)
> In Gecko JS, do we have info for "on Wifi" vs. "on cell data"?
> A simple first cut would be to not upload the pings on cell data.

Not uploading pings on cell data would be OK for some data, like histograms, but I don't think we want to lose UI telemetry. Maybe we could drop certain payloads on cell data and see what effect that has on ping size?

> Re. ping size - this actually improved a lot over the last two quarters from
> the UT work.
> We found we had some unbounded fields in there leading to rather extreme
> ping sizes.

Android is not using UT. Were the unbounded fields fixed in "saved-session" or only "main"?

> The ping count per client seems high, short-term shooting for one per Gecko
> life-time (instead of per "app went to background") seems feasible.
> Everything else would mean stitching data across sessions etc., which
> probably gets complicated and is of questionable value.

I'm less worried about ping count. It's the ping size that is the main driver. That said, we could look at just archiving when on cell data.

Comment 8

(In reply to Mark Finkle (:mfinkle) from comment #7)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #4)
> > In Gecko JS, do we have info for "on Wifi" vs. "on cell data"?
> > A simple first cut would be to not upload the pings on cell data.
> 
> Not uploading pings on cell data would be OK for some data, like histograms,
> but I don't think we want to lose UI telemetry. Maybe we could drop certain
> payloads on cell data and see what effect that has on ping size?

This might be a complicated road; it depends on which parts of the payload the analyses rely on.
We can also screen the top-level payload properties and see which we are definitely not using at all on Android:
https://gecko.readthedocs.org/en/latest/toolkit/components/telemetry/telemetry/main-ping.html

Suspects standing out to me:
chromeHangs
threadHangStats
fileIOReports
lateWrites
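
If some of these sections do turn out to be unused on Android, dropping them before upload could be as simple as the sketch below. The section names come from the list above; `SUSPECT_SECTIONS` and `pruneSections` are hypothetical names, and the ping shape (a top-level `payload` object) is assumed for illustration.

```javascript
// Sections suspected (not confirmed) to be unused on Android.
const SUSPECT_SECTIONS = [
  "chromeHangs",
  "threadHangStats",
  "fileIOReports",
  "lateWrites",
];

// Return a copy of the ping whose payload omits the given sections.
// The input ping is left untouched.
function pruneSections(ping, sectionsToDrop) {
  const keptPayload = {};
  for (const [name, value] of Object.entries(ping.payload || {})) {
    if (!sectionsToDrop.includes(name)) {
      keptPayload[name] = value;
    }
  }
  return { ...ping, payload: keptPayload };
}

const examplePing = {
  environment: {},
  payload: { histograms: { A: 1 }, threadHangStats: [1, 2, 3], lateWrites: {} },
};
console.log(Object.keys(pruneSections(examplePing, SUSPECT_SECTIONS).payload));
```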

> > Re. ping size - this actually improved a lot over the last two quarters from
> > the UT work.
> > We found we had some unbounded fields in there leading to rather extreme
> > ping sizes.
> 
> Android is not using UT. Were the unbounded fields fixed in "saved-session"
> or only "main"?

Both, it's mostly the same data.

> > The ping count per client seems high, short-term shooting for one per Gecko
> > life-time (instead of per "app went to background") seems feasible.
> > Everything else would mean stitching data across sessions etc., which
> > probably gets complicated and is of questionable value.
> 
> I'm less worried about ping count. It's the ping size that is the main
> driver. That said, we could look at just archiving when on cell data.

So, this would be a pretty simple thing to do in the short term (just blocking upload on cell data in TelemetrySend.jsm).
Should we do this part now or wait for a more involved design?
tracking-fennec: --- → ?
OS: Unspecified → Android
Hardware: Unspecified → All

Comment 9

I ran a slightly tweaked version of Alessio's script [1] on Fennec Nightly for the last week. 100% of the pings were in the "maximum" bucket. Not a surprise, since Fennec only uses "saved-session" pings.

The results: (payload section, median value, number of pings)

[('payload', 98876.0, 13115),
 ('payload/histograms', 75398.0, 13115),
 ('payload/threadHangStats', 17015.0, 13115),
 ('payload/keyedHistograms', 2952.0, 13115),
 ('environment', 1998.0, 13115),
 ('payload/simpleMeasurements', 732.0, 13115),
 ('payload/info', 630.0, 13115),
 ('environment/addons/activeAddons', 596.0, 7140),
 ('environment/addons', 507.0, 13076),
 ('environment/settings', 488.0, 13115),
 ('payload/UIMeasurements', 437.0, 11051),
 ('environment/addons/theme', 275.0, 670),
 ('environment/addons/activePlugins', 241.0, 1548),
 ('payload/addonDetails', 109.0, 13115),
 ('payload/chromeHangs', 97.0, 13115)]

[1] https://gist.github.com/Dexterp37/c2e1c1d4de4ba22bc4cf#file-bug-1215545-buckets-ipynb
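
A per-section size breakdown like the one above can be approximated as follows. This is not the linked script: `sectionSizes`, `byteLength` and `median` are hypothetical helpers, and size is assumed to mean the UTF-8 byte length of each section's JSON serialization, which may not match how the original numbers were computed.

```javascript
// UTF-8 byte length of a value's JSON serialization.
function byteLength(value) {
  return Buffer.byteLength(JSON.stringify(value) || "", "utf8");
}

// Sizes for each top-level section and each of its direct children,
// keyed like "payload" and "payload/histograms".
function sectionSizes(ping) {
  const sizes = {};
  for (const [topName, topValue] of Object.entries(ping)) {
    sizes[topName] = byteLength(topValue);
    if (topValue && typeof topValue === "object" && !Array.isArray(topValue)) {
      for (const [name, value] of Object.entries(topValue)) {
        sizes[`${topName}/${name}`] = byteLength(value);
      }
    }
  }
  return sizes;
}

// Median of a non-empty list of numbers.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

const samplePing = { payload: { histograms: { A: [1, 2, 3] }, info: {} }, environment: {} };
console.log(sectionSizes(samplePing));
```

Taking the reported medians at face value, payload/histograms is about 76% of the payload median (75398 / 98876) and payload/threadHangStats about 17% (17015 / 98876). Ratios of medians are only a rough guide, but they suggest where pruning would matter most.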

Updated

2 years ago
Assignee: nobody → mark.finkle
tracking-fennec: ? → 48+
I wonder if anyone is even looking at threadHangStats, addonDetails & chromeHangs for Android.
Those seem like candidates for dropping.
I'll take a look to get a sense of the amount of work necessary here.
Assignee: mark.finkle → michael.l.comella
I repeated Finkle's experiment in comment 9 with the latest builds and got the same result (still running so I don't have the break-down yet) – the pings are still quite large (> 15.5 Kb).

A summary of options from this thread:
* don’t upload on cell data (e.g. archive on cell data, upload on wifi)
* remove fields unused on fennec
* ensure all fields are bounded
* only upload certain payloads on cell data (but this could be complicated - comment 8)
We're going to wait until our Telemetry roadmap meeting to decide if this is important to do right now, or if we can wait until we move this to the Java uploader implementation in a few months (which would undo any work we do right now).

Comment 14

Unfortunately, we didn't specifically address this during the telemetry meeting.

While it's unclear how much the pings have improved in the past few months, looking at Finkle's doc in comment 0, it looks like it's a MB per day (on average) and a max of 34 MB per day, which could be pretty bad in extreme data scenarios (e.g. roaming).

Margaret, do you have an opinion on how we should move forward with this bug, if we even should?
Assignee: michael.l.comella → nobody
Flags: needinfo?(margaret.leibovic)

Comment 15

2 years ago
(In reply to Michael Comella (:mcomella) from comment #14)
> Unfortunately, we didn't specifically address this during the telemetry
> meeting.
> 
> While it's unclear how much the pings have improved in the past few months,
> looking at Finkle's doc in comment 0, it looks like it's a MB per day (on
> average) and a max of 34 MB per day, which could be pretty bad in extreme
> data scenarios (e.g. roaming).
> 
> Margaret, do you have an opinion on how we should move forward with this
> bug, if we even should?

I think we need to understand how much data we're currently sending, and have a way to monitor that over time. So, as a first step and bare minimum, let's get a system in place for this. Maybe Georg can help make a server-side analysis for this?

Then, we need to decide how much data is too much data to be sending. We could try to wait until there's a wifi connection to make an upload.
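
As a sketch of what such monitoring might compute, the snippet below totals uploaded bytes per client per day and flags totals above a budget. The record shape (`clientId`, `day`, `bytes`), the function names and the budget value are all assumptions for illustration; a real server-side analysis would run on the actual ping store.

```javascript
// Sum uploaded bytes per (client, day). Returns a Map keyed by "client|day".
function dailyBytesPerClient(records) {
  const totals = new Map();
  for (const { clientId, day, bytes } of records) {
    const key = `${clientId}|${day}`;
    totals.set(key, (totals.get(key) || 0) + bytes);
  }
  return totals;
}

// List the "client|day" keys whose total exceeds the given budget in bytes.
function keysOverBudget(totals, budgetBytes) {
  return [...totals]
    .filter(([, bytes]) => bytes > budgetBytes)
    .map(([key]) => key);
}

// Made-up sample records.
const records = [
  { clientId: "a", day: "2016-03-01", bytes: 600000 },
  { clientId: "a", day: "2016-03-01", bytes: 700000 },
  { clientId: "b", day: "2016-03-01", bytes: 100000 },
];
const totals = dailyBytesPerClient(records);
console.log(keysOverBudget(totals, 1000000)); // hypothetical 1 MB daily budget
```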

I think Barbara should be involved in helping make this decision. Given that the core ping is opt-out instead of opt-in, it's more important to address size issues with it than it would be with our other telemetry pings.
Flags: needinfo?(margaret.leibovic) → needinfo?(bbermes)

Comment 16

(In reply to :Margaret Leibovic from comment #15)
> I think Barbara should be involved in helping make this decision. Given that
> the core ping is opt-out instead of opt-in, it's more important to address
> size issues with it than it would be with our other telemetry pings.

I may be misinterpreting but this is for the saved-session ping, not the core ping.
Summary: Ping size and ping frequency are causing bandwidth issues on Android → saved-session ping size and ping frequency are causing bandwidth issues on Android

Comment 17

(In reply to Michael Comella (:mcomella) from comment #16)
> (In reply to :Margaret Leibovic from comment #15)
> > I think Barbara should be involved in helping make this decision. Given that
> > the core ping is opt-out instead of opt-in, it's more important to address
> > size issues with it than it would be with our other telemetry pings.
> 
> I may be misinterpreting but this is for the saved-session ping, not the
> core ping.

It is. From the incoming data, we concluded that we don't have ping size or frequency concerns with the "core" ping client-side.

Comment 18

2 years ago
Apologies, I didn't pay close enough attention here.

If this is for opt-in telemetry, I'm less concerned, although I do think we should at least have a system in place to know how much data we're sending. Data-driven data decisions! :)
Flags: needinfo?(bbermes)
tracking-fennec: 48+ → ---
Flags: needinfo?(s.kaspari)
I don't have the time right now to explore this, but I'm marking this for the Taipei team. This is a bigger thing and needs some investigation (and should be prioritized). We can talk more about this during our Taipei week.
Flags: needinfo?(s.kaspari)
Whiteboard: [TPE-1]

Comment 21

(In reply to Nevin Chen [:nechen] from comment #20)
> Maybe we can check for WIFI or Cell data here?
> http://searchfox.org/mozilla-central/rev/
> 78ac0ceba97bd2deed847a8d0ae86ccf7a8887bf/mobile/android/base/java/org/
> mozilla/gecko/telemetry/schedulers/
> TelemetryUploadAllPingsImmediatelyScheduler.java#20

Sorry, the code above is for the core ping. For the main ping upload, I can't find the code in the front end.
Maybe it's in here [1]?

The only thing I found is this [2], but I think it only reads the health report.

[1]https://bugzilla.mozilla.org/show_bug.cgi?id=1156253
[2]http://searchfox.org/mozilla-central/rev/7cb75d87753de9103253e34bc85592e26378f506/mobile/android/chrome/content/aboutHealthReport.js#98
Flags: needinfo?(rnewman)
Flags: needinfo?(gfritzsche)

Comment 22

(In reply to Nevin Chen [:nechen] from comment #21)

> Sorry the code above is for core ping. For main ping upload ....I can't find
> the code in front end.

Last I was involved in this, the telemetry pings in question were managed and uploaded by Gecko, even on Android. (This is a bad thing: it means collection, composition, and upload only happen when the browser is running, and thus compete for resources with the browser itself at the most critical moments.)

If that's still true, the relevant code is e.g.,

https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetrySend.jsm#435


> The only thing I found is this [2]. But I think it only reads the health
> report....

Correct: aboutHealthReport has nothing to do with telemetry upload.
Flags: needinfo?(rnewman)

Comment 24

Is TelemetrySend.jsm actually currently used on Android to upload these pings?
I'm not sure if we use the Java uploader or TelemetrySend; we should check that first.

If this is in TelemetrySend, do we actually want to keep sending data from there or move this to the Java uploader?

If we keep this in TelemetrySend, is it possible to avoid repeated checks?
E.g. is there an observer topic that we could listen to for transitions between the states "on local network" and "not on local network"?
Then we can properly shut down all sending activity while not on a local network.
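
The event-driven approach could be sketched like this, using Node's `EventEmitter` as a stand-in for an observer service. The event name `local-network-changed` and the `PausableSender` class are invented for the example; whether Gecko offers a suitable topic is exactly the open question.

```javascript
const { EventEmitter } = require("events");

// Queues pings while off the local network and flushes them on reconnect,
// so the connection state is never polled.
class PausableSender {
  constructor(network, sendFn) {
    this.pending = [];
    this.onLocalNetwork = false;
    this.sendFn = sendFn;
    // "local-network-changed" is a made-up event name for this sketch.
    network.on("local-network-changed", (isLocal) => {
      this.onLocalNetwork = isLocal;
      if (isLocal) {
        this.flush();
      }
    });
  }

  // Send immediately when on the local network, otherwise keep for later.
  submit(ping) {
    this.pending.push(ping);
    if (this.onLocalNetwork) {
      this.flush();
    }
  }

  // Send every queued ping in submission order.
  flush() {
    while (this.pending.length > 0) {
      this.sendFn(this.pending.shift());
    }
  }
}

const exampleNetwork = new EventEmitter();
const exampleSender = new PausableSender(exampleNetwork, (ping) => console.log("sent", ping));
exampleSender.submit("ping-1");                           // queued, nothing sent yet
exampleNetwork.emit("local-network-changed", true);       // flushes ping-1
```

Pings submitted while offline stay queued rather than being dropped, which keeps this compatible with archiving.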

Last but not least:
Do we know how many pre-release users are connected to a local network at least, say, once a week?
Suppose 90% of pre-release users never or rarely connect to a local network: could we afford to lose their data?
Flags: needinfo?(gfritzsche)
FWIW Fennec uses TelemetrySend according to TELEMETRY_SEND_SUCCESS: https://mzl.la/2rUZVn2
(In reply to Georg Fritzsche [:gfritzsche] from comment #24)

> Do we know how many pre-release users are connected to a local network at
> least, say, once a week?

I think a more fundamental issue is: there are some populations who only use cellular data, and so any scheme that alters submission behavior based on network will totally obscure those populations, which will cause existence conclusions to be wrong.

It doesn't necessarily matter how big that population is, proportionally -- if we have, e.g., a bug that causes us to only run an updater on wifi, then we will see no evidence of that bug if we only report telemetry on wifi!

If we do choose whether to do an upload _right now_ based on connectivity -- which isn't a bad option for first-world users -- it shouldn't wait too long, and we shouldn't discard data.
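
One way to express "defer on cell data, but not for too long, and never discard" is a deadline on the oldest pending ping, as in the sketch below. `shouldUploadNow` and `MAX_DEFERRAL_MS` are hypothetical, and the 24-hour limit is an arbitrary placeholder rather than a value proposed in this bug.

```javascript
// Hypothetical upper bound on how long a ping may wait for Wi-Fi.
const MAX_DEFERRAL_MS = 24 * 60 * 60 * 1000;

// Upload right away on Wi-Fi. On other connections, upload only once the
// oldest pending ping has waited at least maxDeferralMs. Pings are deferred,
// never dropped.
function shouldUploadNow({ onWifi, oldestPendingAgeMs }, maxDeferralMs = MAX_DEFERRAL_MS) {
  if (onWifi) {
    return true;
  }
  return oldestPendingAgeMs >= maxDeferralMs;
}

console.log(shouldUploadNow({ onWifi: false, oldestPendingAgeMs: 60 * 1000 }));
console.log(shouldUploadNow({ onWifi: false, oldestPendingAgeMs: 2 * MAX_DEFERRAL_MS }));
```

With such a rule, cellular-only populations still report; their pings just arrive later.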

IMO the solution for this bug tends more towards "don't send 34MB of data per day", rather than "only send 34MB when on wifi". Remember that "wifi" doesn't mean "unmetered", it doesn't mean "fast", and it doesn't mean "low-power".

That might mean changing data representations or pruning what's collected.