Closed Bug 1155604 Opened 9 years ago Closed 9 years ago

Telemetry evicts too many pending pings

Categories

(Toolkit :: Telemetry, defect)

defect
Not set
normal

RESOLVED FIXED
Tracking Status
firefox40 --- fixed

People

(Reporter: gfritzsche, Assigned: gfritzsche)

Attachments

(1 file)

Telemetry data from TELEMETRY_FILES_EVICTED shows that we evict a lot more pings now.

We evict pending pings if we have more than MAX_LRU=17 pings:
https://hg.mozilla.org/mozilla-central/annotate/51e3cb11a258/toolkit/components/telemetry/TelemetryFile.jsm#l192

We potentially generate many more pings now, so we have to change this behavior.
We should either drop that eviction behavior or make the number of kept pings much higher.

To avoid issues with too many pings (due to bugs etc.), I think we should just increase the number a fair bit.
We can max out at 288+ pings per day now and we currently keep pings for two weeks, so arbitrary suggestion: 4000 (288 * 14 days = 4032).
This seems like a lot of pings, but otherwise we could lose quite a lot of data.
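
A minimal sketch of the arithmetic behind the suggested limit. The function and constant names are hypothetical and not part of TelemetryFile.jsm; the inputs are the 288 pings per day and the two-week retention mentioned above.

```javascript
// Hypothetical helper: upper bound on pending pings a client could
// accumulate if nothing is sent for the whole retention window.
function maxPendingPings(pingsPerDay, retentionDays) {
  return pingsPerDay * retentionDays;
}

const PINGS_PER_DAY = 288; // figure quoted in this comment
const RETENTION_DAYS = 14; // pings are currently kept for two weeks

console.log(maxPendingPings(PINGS_PER_DAY, RETENTION_DAYS)); // 4032
```

Rounding 4032 down gives the suggested value of 4000.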

Related, is the MAX_PING_FILE_AGE of two weeks still reasonable?
Flags: needinfo?(vdjeric)
As far as I remember, the MAX_LRU limit was introduced to deal with users being offline or on restricted networks for days or weeks at a time, and accumulating dozens of unsaved pings over that time. I think Yoric wrote that patch, so he might be able to provide more background.

4000 pings is way too much, that's 400MB of data! We definitely do not want to take up that much disk space, or try to read 400MB on startup, or send 400MB of data. I think 50 is an absolute maximum.
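
A rough sketch of this size estimate. It assumes an average ping size of about 100KB, which is only the figure implied by "4000 pings is 400MB" and not a measured value; the names are hypothetical.

```javascript
// Assumption: ~100KB per ping, implied by 4000 pings ≈ 400MB.
const ASSUMED_PING_SIZE_KB = 100;

// Hypothetical helper: estimated on-disk size of a ping backlog in MB,
// using decimal units (1MB = 1000KB).
function estimatedBacklogMB(pingCount, pingSizeKB = ASSUMED_PING_SIZE_KB) {
  return (pingCount * pingSizeKB) / 1000;
}

// Compare the current limit, the proposed maximum, and the original suggestion.
for (const count of [17, 50, 4000]) {
  console.log(`${count} pings: about ${estimatedBacklogMB(count)}MB`);
}
```

Under that assumption a limit of 50 stays around 5MB, while 4000 reaches the 400MB mentioned above.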

As a more general comment, Telemetry should not strive for 99% accurate reporting from every user. It's not a realistic goal and it severely limits Telemetry design decisions. Fundamentally, Telemetry (incl. unified Telemetry) is about providing representative measures of browser behaviour, and our priority is to avoid biases in the aggregate data (e.g. failing to report on very short sessions). We should not try to implement a data-collection system which is very accurate for every individual user.
Flags: needinfo?(vdjeric)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #2)
> As far as I remember, the MAX_LRU limit was introduced to deal with users
> being offline or on restricted networks for days or weeks at a time, and
> accumulating dozens of unsaved pings over that time. I think Yoric wrote
> that patch, so he might be able to provide more background.
> 
> 4000 pings is way too much, that's 400MB of data! We definitely do not want
> to take up that much disk space, or try to read 400MB on startup, or send
> 400MB of data. I think 50 is an absolute maximum

Ok, let's go for 50 now and track the evicted files histogram to see whether that's sufficient?
Related note: we have to really think about the archiving again (180 days of archived pings).

> As a more general comment, Telemetry should not strive for 99% accurate
> reporting from every user. It's not a realistic goal and it severely limits
> Telemetry design decisions. Fundamentally, Telemetry (incl. unified
> Telemetry) is about providing representative measures of browser behaviour,
> and our priority is to avoid biases in the aggregate data (e.g. failing to
> report on very short sessions). We should not try to implement a
> data-collection system which is very accurate for every individual user

I'm not sure where to draw the line here, but from my understanding a relevant set of the planned analysis will focus on per-user metrics over time.
I assume we don't need complete data, but a "reasonably complete" data set where possible?
Assignee: nobody → gfritzsche
Status: NEW → ASSIGNED
Attachment #8594729 - Flags: review?(vdjeric)
(In reply to Georg Fritzsche [:gfritzsche] from comment #3)
> I'm not sure where to draw the line here, but from my understanding a
> relevant set of the planned analysis will focus on per-user metrics over
> time.
> I assume we don't need complete data, but a "reasonably complete" data set
> where possible?

I think a basic question we have to answer is "when does a delayed ping become uninteresting?" In the old telemetry system, a ping had near-zero value after being delayed two weeks (so we deleted it). I'm not sure what the requirements are now that Telemetry is Telemetry+FHR, but I think we should avoid trying to report as much as possible at the expense of all other considerations (such as performance impact).
Flags: needinfo?(bcolloran)
Attachment #8594729 - Flags: review?(vdjeric) → review+
Perhaps I am misunderstanding this discussion, but for many/most FHR-based studies, *all* metrics are of value. We need/want to conduct many longitudinal studies that investigate the impact of features, marketing campaigns, releases, etc. on usage, and to gather that data we can't throw away pings after two weeks. Is there a way to prioritize certain pings over others so that the client can decide which ones to send when there is a backlog?
(In reply to John Jensen from comment #6)
> Is there a way to prioritize certain pings over others so that the client can decide which ones to send when there is a backlog?

It's possible and we should do this if we really do want to aim for near-perfect reconstructions of users' Telemetry histories.

We'll need smarter ping-expiry and ping-loading rules, prioritizing ping sending, and better disk quota + network bandwidth management. This should be reflected in the project backlog & schedule
Flags: needinfo?(gfritzsche)
Hi Vladan, 

> It's possible and we should do this

Good to hear. It would be good to know how big an issue this is...do we have measures of the distribution of backlogs among current Telemetry clients, how long these queues typically last, etc?
(In reply to John Jensen from comment #8)
> It would be good to know how big an issue this is...do we have
> measures of the distribution of backlogs among current Telemetry clients,
> how long these queues typically last, etc?

This is a distribution showing the number of persisted pings present at browser startup, on release 37: http://mzl.la/1E4bsQ1
Note that if there are more than 17 saved pings, we delete the oldest pings until only 17 are left.
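
The eviction rule described here can be sketched roughly as follows. This is an illustration only, not the actual TelemetryFile.jsm code: the ping record shape (`id`, `lastModified`) and the function name are assumptions.

```javascript
const MAX_LRU = 17; // current limit on saved pings

// Hypothetical helper: given saved pings as { id, lastModified } records,
// return the ones to delete so that only the newest `maxKept` remain.
function selectPingsToEvict(pings, maxKept = MAX_LRU) {
  // Sort a copy newest-first so the caller's array is left untouched.
  const newestFirst = [...pings].sort((a, b) => b.lastModified - a.lastModified);
  // Everything past the first `maxKept` entries is older and gets evicted.
  return newestFirst.slice(maxKept);
}

// Usage: 20 saved pings, so the 3 oldest are selected for deletion.
const saved = Array.from({ length: 20 }, (_, i) => ({ id: i, lastModified: 1000 + i }));
console.log(selectPingsToEvict(saved).map((p) => p.id));
```

Raising the limit only changes `maxKept`; the oldest pings are still the first to go.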

Pings are thrown away when ping_count > 17 during 20% of Nightly 40 startups: http://mzl.la/1D9xVY2

Pings practically never get thrown away because they're too old (older than 2 weeks), data from Release 37: http://mzl.la/1D9yzVE It's likely this is because they get deleted under the "max 17 pings" limit first.
+1 to what John has said, we certainly need to get back all the payloads we can, and old payloads may remain quite important in the FHR context.

Also, re: a comment above-- "As a more general comment, Telemetry should not strive for 99% accurate reporting from every user." -- unfortunately, approaching 99% accuracy probably is a goal for the FHR use case. But one optimization that might help at least in terms of bandwidth and user upload time: if a client has a lot of old records that you would normally discard because they are too dated, for the FHR use case we don't need all of the data that is in telemetry, just the subset that has been approved for FHR.

Georg, could this ping expiration situation be behind some of the missing subsessions I've mentioned?
Flags: needinfo?(bcolloran)
(In reply to brendan c from comment #10)
> Georg, could this ping expiration situation be behind some of the missing
> subsessions I've mentioned?

Yes, this is the most obvious issue behind bug 1154113.

(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #7)
> (In reply to John Jensen from comment #6)
> > Is there a way to prioritize certain pings over others so that the client can decide which ones to send when there is a backlog?
> 
> It's possible and we should do this if we really do want to aim for
> near-perfect reconstructions of users' Telemetry histories.
> 
> We'll need smarter ping-expiry and ping-loading rules, prioritizing ping
> sending, and better disk quota + network bandwidth management. This should
> be reflected in the project backlog & schedule

Yes, we should talk about this. I filed bug 1156712 to track this.
Flags: needinfo?(gfritzsche)
Keywords: leave-open
Blocks: 1154113
We went down from evicting pings on >16% to 5-6% now.
I think we can call this bug fixed, bug 1156712 will bring more improvements.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Keywords: leave-open