Last Comment Bug 689142 - Places telemetry gets lost due to being collected on idle-daily
: Places telemetry gets lost due to being collected on idle-daily
Status: RESOLVED FIXED
: dev-doc-complete
Product: Toolkit
Classification: Components
Component: Places (show other bugs)
: Trunk
: All All
: -- normal (vote)
: mozilla10
Assigned To: Marco Bonardo [::mak]
:
Mentors:
Depends on: 694447
Blocks:
  Show dependency treegraph
 
Reported: 2011-09-26 06:11 PDT by Marco Bonardo [::mak]
Modified: 2011-12-13 06:46 PST (History)
9 users (show)
mak77: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
-
-


Attachments
patch v1.0 (8.28 KB, patch)
2011-09-27 07:13 PDT, Marco Bonardo [::mak]
taras.mozilla: review+
Details | Diff | Splinter Review
patch v1.1 (11.16 KB, patch)
2011-10-03 07:23 PDT, Marco Bonardo [::mak]
taras.mozilla: review+
dietrich: review+
Details | Diff | Splinter Review

Description Marco Bonardo [::mak] 2011-09-26 06:11:12 PDT
there are two outstanding issues:
1. some reported data looks wrong, esp filesizes (PLACES_DATABASE_FILESIZE_MB, PLACES_DATABASE_JOURNALSIZE_MB). May be due to bad chunks choices.
2. Too few reported data. This is due to the fact Places collects most telemetry at idle-daly, asynchronously. Telemetry ping happens at idle-daily as well, so it may happen before we register our data, so most Places data is never reported.

Solution to 2 may be to serialize stats to a pref, and just unserialize and register existing serialized data for telemetry. At that point the issue may still happen though, if telemetryPing is invoked before Places as an idle-daily observer (since category observers are invoked after other observers this is most likely the case). I think we should at least make TelemetryPing enqueue once data collection so that other consumers can use idle-daily notification to report telemetry data.
Comment 1 Marco Bonardo [::mak] 2011-09-27 03:30:50 PDT
the filesize issue is due to wrong units, I wrongly reported KiB rather than MiB
Comment 2 Marco Bonardo [::mak] 2011-09-27 05:36:49 PDT
Firefox 8 is the first version with Places telemetry, I think it's worth fixing it otherwise we'll not record useful telemetry.
Comment 3 Marco Bonardo [::mak] 2011-09-27 07:13:37 PDT
Created attachment 562758 [details] [diff] [review]
patch v1.0

Taras, do we have a bug about crazy slowness of the telemetry web interface? It's just unusable, in 20 minutes I've been able to see 2 histograms, then everything hung and I have been unable to see anything other even waiting 10 minutes. Maybe it is that bad just for remoties?
Comment 4 (dormant account) 2011-09-27 10:44:50 PDT
(In reply to Marco Bonardo [:mak] from comment #3)
> Created attachment 562758 [details] [diff] [review] [diff] [details] [review]
> patch v1.0
> 
> Taras, do we have a bug about crazy slowness of the telemetry web interface?
> It's just unusable, in 20 minutes I've been able to see 2 histograms, then
> everything hung and I have been unable to see anything other even waiting 10
> minutes. Maybe it is that bad just for remoties?

It sucks for everyone, but I've never seen it as bad. I can usually get individual queries to complete in under a minute. I think they are doing some upgrades and reshuffling to optimize it, but please repeat your complaint in bug 687117. The guys working on this are in Portugal so you can complain and expect a speedy response(no timezone screwery) :)
Comment 5 (dormant account) 2011-09-27 11:00:59 PDT
Comment on attachment 562758 [details] [diff] [review]
patch v1.0


>     case "idle-daily":
>-      this.send(aTopic, server);
>+      // Enqueue on the main-thread to allow other components to use idle-daily
>+      // to report their telemetry.
>+      Services.tm.mainThread.dispatch((function() {
>+        this.send(aTopic, server);
>+      }).bind(this), Ci.nsIThread.DISPATCH_NORMAL);
>       break;
>     }
>   },
> 
>   classID: Components.ID("{55d6a5fa-130e-4ee6-a158-0133af3b86ba}"),
>   QueryInterface: XPCOMUtils.generateQI([Ci.nsIObserver]),
> };
> 

r+ for this bit. I don't completely understand the rest. Even this I'm a bit hazy on. Does this move sending to the back of the event queue? If so, wouldn't async places operations still happen too late?
Comment 6 Marco Bonardo [::mak] 2011-09-27 13:45:45 PDT
(In reply to Taras Glek (:taras) from comment #4)
> I think they are doing
> some upgrades and reshuffling to optimize it, but please repeat your
> complaint in bug 687117.

I can't access that bug :)
Comment 7 Marco Bonardo [::mak] 2011-09-27 14:21:30 PDT
(In reply to Taras Glek (:taras) from comment #5)
> r+ for this bit. I don't completely understand the rest. Even this I'm a bit
> hazy on. Does this move sending to the back of the event queue? If so,
> wouldn't async places operations still happen too late?

There is no way we can ensure some async stuff is reported for sure in telemetry, since it may run at any time, what we can ensure is that telemetry reported synchronously in idle-daily will work, that is what my telemetryPing change does.

The places changes do something to make that fine for our needs.
Practically on the first idle-daily we run async queries to collect generic telemetry, and we store data in a pref. On the next idle-daily (may be the day after or some days after) we'll synchronously report the data that we proviously stored in the pref. Since we collect data once a week this should be fine for our needs.
Comment 8 (dormant account) 2011-09-27 14:30:29 PDT
(In reply to Marco Bonardo [:mak] from comment #7)
> 
> The places changes do something to make that fine for our needs.
> Practically on the first idle-daily we run async queries to collect generic
> telemetry, and we store data in a pref. On the next idle-daily (may be the
> day after or some days after) we'll synchronously report the data that we
> proviously stored in the pref. Since we collect data once a week this should
> be fine for our needs.

Ok. Since we do not have a system for persisting histograms at this point this might be ok for now. Sounds like we will need a PERSIST_UNTIL_PING flag to deal with these once we get around to persistent histograms.
CCing Sid for privacy review on this as this is the first bit of telemetry that persists across restarts.
Comment 9 Sid Stamm [:geekboy or :sstamm] 2011-09-27 14:53:56 PDT
If I understand correctly, the problem is that the probe computation is racing the telemetry ping itself and if the computation loses the race, nothing gets reported.  This patch, persistence via a pref, is a mechanism that gets around a race condition. Have I got this right?

Taras: I'm guessing we're going to encounter this again.  Would it be a good use of time to investigate a way to precede the telemetry ping submission with subroutines provided by the probe implementors?  I'm thinking like a service where you can register callbacks that run in serial before the ping goes out.

Assuming this will land as-is, I've got a few questions:
* how do users go about clearing this state data?  Say they turn off telemetry in the prefs UI, will the probe data get cleared?  
* What happens in private browsing mode?  Does the pref still get populated, or does the code that creates and stores the probe value halt until the user exits private mode?
* Is the value stored in the pref longitudinal (does it represent more than one data point) or is reset periodically?
Comment 10 Marco Bonardo [::mak] 2011-09-27 17:39:42 PDT
(In reply to Sid Stamm [:geekboy] from comment #9)
> If I understand correctly, the problem is that the probe computation is
> racing the telemetry ping itself and if the computation loses the race,
> nothing gets reported.

Yes, part of the problem is that telemetry pings on idle-daily, that is the first 5 minutes idle in a session, so it's really easy to miss the train for any heavy or async telemetry computation.

> This patch, persistence via a pref, is a mechanism
> that gets around a race condition. Have I got this right?

yes, the values are calculated asynchronously and put in the pref, at the next ping (that may be the same day or any of the next days) it is synchronously reported.

> * how do users go about clearing this state data?  Say they turn off
> telemetry in the prefs UI, will the probe data get cleared?  

No, the data is calculated and stored once a week, after a successful ping it is deleted. But I could detect if telemetry is disabled (I assume I can check telemetry.canRecord for that?) and clear the pref in such a case.

> * What happens in private browsing mode?  Does the pref still get populated,
> or does the code that creates and stores the probe value halt until the user
> exits private mode?

private browsing doesn't make a difference, the collected data are statistics on places.sqlite database, private browsing doesn't touch the database nor these statistics that are the same regardless of it.

> * Is the value stored in the pref longitudinal (does it represent more than
> one data point) or is reset periodically?

The value is a collection of statistics on the database (stored as json string): number of pages, number of bookmarks, number of tags, number of folders, number of keywords, percentage of bookmarks organized in folders, percentage of bookmarks organized in tags, filesize of the database, filesize of the database-journal, page_size of the database, avg size of a page in the database.
Some of these may have some privacy implications, to know if the users uses most bookmarks or most tags, but the same information can be easily gathered from the database.
Which kind of privacy issues are we trying to fight exactly? Clearly both a local access and an add-on have availability of far more complete data than some number. Knowing the issue I may give some better answer to your questions.
Comment 11 Sid Stamm [:geekboy or :sstamm] 2011-09-27 18:10:37 PDT
With these questions, I'm just trying to understand how caching stuff in a pref changes the privacy risk surface for Telemetry as a whole.  I don't think it's a big deal, but if we made any promises about how telemetry operates, we should keep them, so I want to think it through.

A local attacker on this specific data isn't the focus here (as referred to at the end of comment 10) but rather I want to focus on consistency of this change with the privacy commitments we made for what Telemetry collects and how. (For more information, see the privacy review for Telemetry: https://wiki.mozilla.org/Privacy/Reviews/Telemetry).  Taras flagged this for me because when we did the initial review we made the assumption that all ping data would be stored in volatile memory and not on disk.

My opinion is that the best way forward is to put some callback plumbing into Telemetry to allow "ping prep" subroutines to be defined by consumers of the Telemetry service. Telemetry itself can then decide when to erase the ping data. That's only my opinion though, Taras is the expert here, and this might not be feasible.

In lieu of that, this change seems reasonable so long as there's an accessible way to clear this data-storage pref --  maybe clear it when the user clears browsing history or something.  Ideally, it would get reset any time the rest of the telemetry ping data was lost.  Using telemetry.canRecord when you intend to write the pref is probably a good start.  I still think it's weird to store this data in a pref, but perhaps there is not a better option.
Comment 12 Marco Bonardo [::mak] 2011-09-28 02:50:51 PDT
(In reply to Sid Stamm [:geekboy] from comment #11)
> Taras flagged this for
> me because when we did the initial review we made the assumption that all
> ping data would be stored in volatile memory and not on disk.

Makes sense, thanks for clarifying my doubts.

> My opinion is that the best way forward is to put some callback plumbing
> into Telemetry to allow "ping prep" subroutines to be defined by consumers
> of the Telemetry service. Telemetry itself can then decide when to erase the
> ping data.

The problem to solve is just that telemetry has been built with synchronous data collection in mind, but there is asynchronous data collection too.
If we come with a good approach for asynchronous data collection, I'm even fine trying to implement it.

> I still think it's weird to store this data in a pref, but perhaps
> there is not a better option.

A possible alternative to this approach, may be on idle-daily to start a further idle observer and serve telemetry data at the immediately following idle. Like a second-idle-daily. The time between the two idles should be enough for any async operation.  This still looks hackish though, but it would avoid the pref and report data later in a session.

Indeed there is also another issue with how telemetry works, since using idle-daily the reported data in most cases will be relative to the begin of a session, due to this I don't expect memory usage data to be really representative of what happens in the wild, it is probably an optimistic approximation.
Comment 13 (dormant account) 2011-09-29 15:08:10 PDT
After further discussion with Marco, we decided to not persist this data. We'll change to telemetry to fire a gather-telemetry observer notification, set a 5 minute idle-timer and then send telemetry on that. This should give places and other places ample opportunity to prepare telemetry data.
Comment 14 Marco Bonardo [::mak] 2011-09-29 15:18:57 PDT
Comment on attachment 562758 [details] [diff] [review]
patch v1.0

obsoleting patch, working on a new one.
Comment 15 Marco Bonardo [::mak] 2011-10-03 07:23:24 PDT
Created attachment 564181 [details] [diff] [review]
patch v1.1

So this makes so that on idle-daily we enqueue a gather-telemetry notification. It is enqueued since idle'daily is also a category, if a service is inited on it it won't be able to collect the notification.
on the next 5 minutes idle after idle-daily the data is sent.

Places, rather than collecting telemetry on maintenance, will collect it on gather-telemetry notification (so it will also respect telemetry on/off preference)
Comment 16 Marco Bonardo [::mak] 2011-10-03 07:26:42 PDT
clearing privacy review request, the new patch has volatile data
Comment 17 Sid Stamm [:geekboy or :sstamm] 2011-10-03 11:58:54 PDT
(In reply to Marco Bonardo [:mak] from comment #16)
> clearing privacy review request, the new patch has volatile data

Thanks, Marco and Taras, for re-architecting.  I also think this is a more elegant/reusable solution than storing data in prefs.
Comment 18 (dormant account) 2011-10-03 16:24:16 PDT
Comment on attachment 564181 [details] [diff] [review]
patch v1.1

Thanks
Comment 19 Marco Bonardo [::mak] 2011-10-04 03:27:39 PDT
For Firefox 8 and 9 we should either take the patch and get correct telemetry, or disable this Places telemetry since it returns useless data or no data at all.
Comment 21 Marco Bonardo [::mak] 2011-10-07 14:33:37 PDT
dev-doc-needed to document the gather-telemetry notification, that is sent by TelemetryPing to notify observers the ping will happen soon.
Comment 22 Marco Bonardo [::mak] 2011-10-07 14:36:38 PDT
Comment on attachment 564181 [details] [diff] [review]
patch v1.1

This bug affects both Firefox 8 and 9.  I'd like to either take this patch or disable Places telemetry (that consist in commenting out a line). I'm more prone toward the first approach because we'd find useful to get Places data in the wild for the ongoing work to improve its performances, but if drivers think the change is too scary I will accept getting only Nightly users telemetry.
Comment 23 Kyle Huey [:khuey] (Exited; not receiving bugmail, email if necessary) 2011-10-09 07:29:05 PDT
https://hg.mozilla.org/mozilla-central/rev/b16c81fc5e94
Comment 24 christian 2011-10-10 10:48:48 PDT
Didn't the places probes get backed out for Firefox 8? Erin?
Comment 25 Erin Lancaster [:elan] 2011-10-10 11:00:18 PDT
Taras backed out addons and persona probes, *not* places. See 668392.
Comment 26 christian 2011-10-10 11:21:43 PDT
Ok, thanks erin.
Comment 27 Christopher Blizzard (:blizzard) 2011-10-10 14:53:26 PDT
If this is sending you data that's really bad and causing problems, then we should back it out.  If not, it's just introducing risk and we would like to just let it ride.
Comment 28 Christopher Blizzard (:blizzard) 2011-10-10 14:53:53 PDT
Comment on attachment 564181 [details] [diff] [review]
patch v1.1

If this is sending you data that's really bad and causing problems, then we should back it out.  If not, it's just introducing risk and we would like to just let it ride.
Comment 29 christian 2011-10-10 14:56:34 PDT
We're not going to track this for 8 and 9 as whatever we do (backout, keep sending bad data, fix) doesn't really impact the release.
Comment 30 Marco Bonardo [::mak] 2011-10-10 15:23:52 PDT
Comment on attachment 564181 [details] [diff] [review]
patch v1.1

Ok, I asked Taras and once bug 690587 is fixed we will get better filtering of histograms. In the end we'll have broken values on 8 and 9, but we can filter them out from stats, that may be enough for our needs.
We'll base future work on FF10+ reported values ignoring older ones, so no more work is needed here, thank you for your patience.
Comment 31 christian 2011-10-10 15:53:24 PDT
Sorry, we just said we weren't tracking, we may want to approve (either the fix or the backout).
Comment 32 Marco Bonardo [::mak] 2011-10-10 16:00:33 PDT
It's fine, the backout does not bring tangible improvements, since we can achieve the same result filtering in the telemetry web interface (that is what I was mostly worried about, having broken un-filterable data). Fixing it has some involved risk due to the fact it changes how telemetry pings, so, unless there are other telemetry probes hitting the same issue, is problem not worth it.
Comment 33 Eric Shepherd [:sheppy] 2011-12-12 08:57:47 PST
Is this in for 10?
Comment 34 Marco Bonardo [::mak] 2011-12-12 09:11:27 PST
(In reply to Eric Shepherd [:sheppy] from comment #33)
> Is this in for 10?

yes, as stated in the TM. Btw I also searched mxr to be sure: http://mxr.mozilla.org/mozilla-aurora/search?string=gather-telemetry
Comment 35 Eric Shepherd [:sheppy] 2011-12-13 06:46:56 PST
Documented:

https://developer.mozilla.org/en/Observer_Notifications#Telemetry

Note You need to log in before you can comment on or make changes to this bug.