Closed Bug 585196 Opened 14 years ago Closed 13 years ago

telemetry infrastructure

Categories

(Core :: General, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla6

People

(Reporter: graydon, Assigned: taras.mozilla)

References

()

Details

Attachments

(3 files, 9 obsolete files)

We currently gather crash-stats to help us tell what the crashiness state of our field users is. This is good, because we often don't trip on crashes in our own testing, but our field users find all sorts of unusual combinations.

We do not, as far as I know, gather much else. In particular we don't gather performance counters of any sort. This is bad, because we are stuck with performance counters gathered by the environment of idealized test runs like those we run in the talos cluster. No offense intended -- those are good numbers to have! -- but a curated test set is going to make us believe a lot of things are "performing ok" when we just haven't seen how they are performing in the field; we're missing the "real" signal from the larger network of users.

I'd like to start gathering this signal. The counters do not have to be sophisticated or fine-grained counters. I'd like to gather a handful of coarse-grained numbers broadly from all (or a substantial portion of) our users, and organize them on a server we operate, so that we can get a high-level view of performance problem areas and regressions.

Numbers I'd like to gather (say, min/max/avg for each):

  - mapped memory / resident set / private bytes
    - possibly the full reporter-set from about:memory
  - cycle collection and js gc durations
  - number of open file descriptors / OS resource handles
  - number of threads, subprocesses, etc.
  - startup time
  - UI event latency
  - Page-load, DNS, or other network latencies

Just some very basic counters so we can tell what users are suffering with. Should probably be binned by product, platform and build-id also.

To get this working there's not *much* to do in the client (you can start with any single counter you like, doesn't matter) but probably a lot to do on a server. Figuring out which counters (if any) we can collect, whether it needs opt-in or opt-out, how much data we can digest, what kind of storage and processing business we can allocate, where to get the beefy bandwidth and cpu to digest millions of such small-pings, how to submit asynchronously and unobtrusively. That kind of thing.

(Some people have suggested test pilot for this, but I think it's not quite the right fit. Those are limited-time trials, and more commonly user-facing. I'm talking about much more pervasive reporting of much simpler and lower-level performance numbers. The "analysis" in each case should be obvious: make the bad numbers go down. We just need to be seeing those bad numbers, and seeing when they change.)
There's code in extensions/metrics/ that *might* be useful for this sort of thing (though I'm not sure what it's currently set up to measure).  Google folks developed it, and I believe it shipped or ships with Google Toolbar.  It may be designed for more test-pilot-like things, though.
Notes: /proc/self/status contains process start time in jiffies
Windows got something similar toolkit/xre/nsAppRunner.cpp
To clarify the competitive landscape here: 

http://codesearch.google.com/codesearch?q=UMA_histogram&vert=chromium

There are hundreds of counters reporting histograms back from the field there. Not even mentioning the automatic A/B testing ("field trials"), which it's also running.
Depends on: 623950
Depends on: 634124
Depends on: 634960
Attached file proof-of-concept Telemetry addon (obsolete) —
Seems like it will take a while to reach consensus, alter privacy policy, allocate a server to make telemetry happen. In the meantime I think we should do this by landing features that expose interesting information to JS. Then we can start gathering this information via an addon and once it is deemed useful, we can start considering making it part of firefox.

I'm attaching a WIP extension that reports browser uptime and memory usage once a day to demonstrate how trivial telemetry needs are.
Depends on: 636217
Depends on: 603503
Depends on: revampaboutmemory
Depends on: 649502
Blocks: 650129
Depends on: 651262
I took this addon and added some CPU usage code to the telemetry data. It's implemented in Zippity:
http://starkravingfinkle.org/blog/2011/04/zippity-update-telemetry/
No longer depends on: 651262
Depends on: 652657
Attached patch telemetry clientside (obsolete) — Splinter Review
This the clientside telemetry bit. It is disabled unless one sets toolkit.telemetry.server to a server url(such as my test server http://telemetry.allizom.org)

I would like to land this as soon as possible so potential consumers of telemetry can start testing this and writing high quality probes.

This first set of histograms depend on revamped about:memory(bug 633653), but if that doesn't get reviewed in time I can take those histograms out.

In addition to reporting memory usage, this records telemetry ping times which might be useful for testing network performance.
Attachment #514291 - Attachment is obsolete: true
Attachment #529240 - Flags: review?(dtownsend)
Attached image screenshot
Screenshot of the gathered histograms. This patch does not provide about:histograms yet, using http://hg.mozilla.org/users/tglek_mozilla.com/perf-playground/file/7b3500184967/telemetry/addon for that. Would appreciate any help with presenting this info in a pretty manner.
Attached patch cycle collector probe (obsolete) — Splinter Review
Attachment #529245 - Flags: review?(bent.mozilla)
Attached patch cycle collector probe (obsolete) — Splinter Review
forgot to qref
Attachment #529245 - Attachment is obsolete: true
Attachment #529246 - Flags: review?(bent.mozilla)
Attachment #529245 - Flags: review?(bent.mozilla)
Comment on attachment 529246 [details] [diff] [review]
cycle collector probe

If we're going to put this in released code we need to at least use mozilla::TimeStamp (which uses PR_IntervalNow as opposed to PR_Now, and doesn't overflow) since it's supposed to be faster.
Attachment #529246 - Flags: review?(bent.mozilla) → review-
Using TimeStamp now, preserved COLLECT_TIME_DEBUG functionality.
Attachment #529246 - Attachment is obsolete: true
Attachment #529548 - Flags: review?(bent.mozilla)
Comment on attachment 529240 [details] [diff] [review]
telemetry clientside

Review of attachment 529240 [details] [diff] [review]:

On the whole this is mostly style nits with a few questions thrown in. I'll want to look over this again once it is updated to make sure I didn't miss anything.

::: toolkit/components/telemetry/TelemetryPing.js
@@ +14,5 @@
+ * The Original Code is the nsTryToClose component.
+ *
+ * The Initial Developer of the Original Code is
+ * Mozilla Corporation
+ * Portions created by the Initial Developer are Copyright (C) 2011

Initial developer should be "the Mozilla Foundation."

@@ +37,5 @@
+
+const Cc = Components.classes;
+const Ci = Components.interfaces;
+const Cr = Components.results;
+const Telemetry = Cc["@mozilla.org/base/telemetry;1"].getService(Ci.nsITelemetry);

This causes us to potentially initialise that component before it is necessarily needed. Is that a problem?

@@ +50,5 @@
+// about:memory values to turn into histograms
+const MEM_HISTOGRAMS = {"heap-used/js/gc-heap":[1024, 1024*500, 10],
+                        "mapped/heap/used":[1024, 2*1024*1024, 10],
+                        "heap-used/layout/all":[1024, 50*1025, 10]
+                       }

Spaces around operators

@@ +53,5 @@
+                        "heap-used/layout/all":[1024, 50*1025, 10]
+                       }
+
+function log(msg) {
+  dump(msg+"\n");

Spaces around operators

@@ +55,5 @@
+
+function log(msg) {
+  dump(msg+"\n");
+  let c=  Components.classes["@mozilla.org/consoleservice;1"].getService(Components.interfaces.nsIConsoleService);
+  c.logStringMessage(msg);

Just use Services.console.logStringMessage

@@ +59,5 @@
+  c.logStringMessage(msg);
+}
+
+/**
+ * Gathers a list of histograms easy conversion into JSON

Can't really understand what this comment means, maybe "Returns a list of histograms as a JS object that can be converted to JSON"?

@@ +66,5 @@
+ *    histogram_type: <0 for exponential, 1 for linear>, bucketX:countX, ....} ...}
+ * where bucket[XY], count[XY] are positive integers.
+ */
+function getHistograms() {
+  let hls = Cc["@mozilla.org/base/telemetry;1"].getService(Ci.nsITelemetry).histogramSnapshots

semicolon at end of line. Why not just use Telemetry.histogramSnapshots though?

@@ +72,5 @@
+  for (let key in hls) {
+    let hgram = hls[key]
+    let r = hgram.ranges
+    let c = hgram.counts
+    let retgram = {range:[r[1], r[r.length - 1]], bucket_count:r.length, histogram_type:hgram.histogram_type}

Spaces after :s, also would probably be more readable if this were split across multiple lines. Why isn't the first in the range r[0]?

@@ +75,5 @@
+    let c = hgram.counts
+    let retgram = {range:[r[1], r[r.length - 1]], bucket_count:r.length, histogram_type:hgram.histogram_type}
+    let first = true
+    let last = 0;
+    for(let i = 0;i<c.length;i++) {

Space before ( and after ;s and around <

@@ +82,5 @@
+        continue;
+      // add a lower bound
+      if (i && first) {
+        first = false;
+        retgram[r[i-1]] = 0;

spaces around operators

@@ +85,5 @@
+        first = false;
+        retgram[r[i-1]] = 0;
+      }
+      last = i + 1;
+      retgram[r[i]] = value

I think it's a little strange to be setting properties directly on retgram. Are elements of r always guaranteed to not clobber the existing properties? If they are numeric would be be better using an array inside retgram?

@@ +103,5 @@
+  let si = Cc["@mozilla.org/toolkit/app-startup;1"].getService(Ci.nsIAppStartup).getStartupInfo()
+  var xulRuntime = Components.classes["@mozilla.org/xre/app-info;1"]
+    .getService(Components.interfaces.nsIXULRuntime);
+  var appInfo = Components.classes["@mozilla.org/xre/app-info;1"]
+    .getService(Components.interfaces.nsIXULAppInfo);

Just use Services.appinfo instead of these two.

@@ +113,5 @@
+             name: appInfo.name,
+             appBuildID: appInfo.appBuildID,
+             platformBuildID: appInfo.platformBuildID,
+             reason: reason
+            }

Style as:
let ret = {
  uptime: ...
  OS: ...
}

@@ +123,5 @@
+TelemetryPing.prototype = {
+  _histograms: {},
+
+  /**
+   * Pull values into about:memory into corresponding histograms

s/into/from/

@@ +126,5 @@
+  /**
+   * Pull values into about:memory into corresponding histograms
+   */
+  gatherMemory:function gatherMemory() {
+    let mgr = Components.classes["@mozilla.org/memory-reporter-manager;1"].getService(Components.interfaces.nsIMemoryReporterManager);

Please use Cc, Ci and wrap lines where sensible so they aren't longer than 80 characters

@@ +140,5 @@
+      }
+      let name = "Memory:" + mr.path + " (KB)";
+      let h = this._histograms[name]
+      if (!h) {
+        this._histograms[name] = h = Telemetry.newExponentialHistogram(name, specs[0],specs[1],specs[2])

spaces after ,s

@@ +162,5 @@
+    const TELEMETRY_PING = "telemetry.ping (ms)";
+    const TELEMETRY_SUCCESS = "telemetry.success (No,Yes)";
+
+    this._request = Cc["@mozilla.org/xmlextras/xmlhttprequest;1"].
+      createInstance(Ci.nsIXMLHttpRequest);

Line createInstance up with the Cc above. Not sure why you're setting on a property of this, just use a local variable instead.

@@ +164,5 @@
+
+    this._request = Cc["@mozilla.org/xmlextras/xmlhttprequest;1"].
+      createInstance(Ci.nsIXMLHttpRequest);
+    this._request._hping = Telemetry.newExponentialHistogram(TELEMETRY_PING, 1, 3000, 10);
+    this._request._hsuccess = Telemetry.newLinearHistogram(TELEMETRY_SUCCESS, 1, 2, 3);

Setting random properties on the request object doesn't seem like a good idea to me, everything is only used within the scope of this function so just use local variables.

@@ +199,5 @@
+                                     self._server = Services.prefs.getCharPref(PREF_SERVER)
+
+                                     let idleService = Cc["@mozilla.org/widget/idleservice;1"]
+                                       .getService(Ci.nsIIdleService)
+                                     idleService.addIdleObserver(self, TELEMETRY_INTERVAL/1000); 

Spaces around operators

@@ +202,5 @@
+                                       .getService(Ci.nsIIdleService)
+                                     idleService.addIdleObserver(self, TELEMETRY_INTERVAL/1000); 
+                                     self.gatherMemory();
+                                   }
+                                 }, TELEMETRY_INTERVAL, Ci.nsITimer.TYPE_ONE_SHOT);

Why is this initial timer necessary?

@@ +211,5 @@
+   */
+  observe: function (aSubject, aTopic, aData) {
+    switch (aTopic) {
+    case "app-startup":
+        this.setup()

Correct this indentation
Attachment #529240 - Flags: review?(dtownsend) → review-
Should have added, obviously this needs unit tests
> Why isn't the first in the range r[0]?

Quirk of how histograms are implemented. Each bucket represents the lower range. So if you specify you need a range from 1 to 9 and feed it 0 and 10: 10 can go into the 9&up bucket, but you can't put 0 into bucket [1,1+x). To accommodate this, every histogram has a [0, minimum) bucket at element 0 to accommodate misfits. 
Sending the range to the server is a sanity check to make sure that someone  hasn't accidentally created a new histogram with different parameters. Due to always being 0, the smallest bucket isn't useful to report.

> I think it's a little strange to be setting properties directly on retgram. Are
> elements of r always guaranteed to not clobber the existing properties? If they
> are numeric would be be better using an array inside retgram?

Ranges are numeric, but they aren't indices. They literally are 'labels' for buckets that happen to be numeric.

I thought of 3 options of representing histogram values on the wire
1) C++ way, ie serialize ranges, values into 2 arrays. This is hard to read due to having to figure out what range corresponds to what value. This wastes a lot of space since many buckets will always contain 0
2) Sparse arrays of bucket:value form. I think this is what you suggest. I think using extremely sparse arrays(for data that is essentially an associative array) are misleading and possibly under-performing  when their indexes are far apart. 
3) An object with numeric indices. This is what I selected. It has the downside of sort of looking like an array, but is very compact for sending on the wire(no 0s or tons of commas). 

I think an histogram of form {range:[x,y], bucket_count:z, histogram_type:1, "0":1, "3":4, "5":2} is a pretty good tradeoff between being compact and easy to understand.


> +                                       .getService(Ci.nsIIdleService)
> +                                     idleService.addIdleObserver(self,
> TELEMETRY_INTERVAL/1000); 
> +                                     self.gatherMemory();
> +                                   }
> +                                 }, TELEMETRY_INTERVAL,
> Ci.nsITimer.TYPE_ONE_SHOT);
> 
> Why is this initial timer necessary?

a) Incompetence. specifying a category to listen to in the manifest only seems to work for app-startup for me.

b) ~1 minute after startup seems like a good threshold to start polling stats for telemetry.

(In reply to comment #13)
> Should have added, obviously this needs unit tests

Indeed. I submitted this for review so I can get feedback while I'm working on the testcase. Thanks for digging into implementation details in this prompt review.
(In reply to comment #14)
> > Why isn't the first in the range r[0]?
> 
> Quirk of how histograms are implemented. Each bucket represents the lower
> range. So if you specify you need a range from 1 to 9 and feed it 0 and 10: 10
> can go into the 9&up bucket, but you can't put 0 into bucket [1,1+x). To
> accommodate this, every histogram has a [0, minimum) bucket at element 0 to
> accommodate misfits. 
> Sending the range to the server is a sanity check to make sure that someone 
> hasn't accidentally created a new histogram with different parameters. Due to
> always being 0, the smallest bucket isn't useful to report.
> 
> > I think it's a little strange to be setting properties directly on retgram. Are
> > elements of r always guaranteed to not clobber the existing properties? If they
> > are numeric would be be better using an array inside retgram?
> 
> Ranges are numeric, but they aren't indices. They literally are 'labels' for
> buckets that happen to be numeric.
> 
> I thought of 3 options of representing histogram values on the wire
> 1) C++ way, ie serialize ranges, values into 2 arrays. This is hard to read due
> to having to figure out what range corresponds to what value. This wastes a lot
> of space since many buckets will always contain 0
> 2) Sparse arrays of bucket:value form. I think this is what you suggest. I
> think using extremely sparse arrays(for data that is essentially an associative
> array) are misleading and possibly under-performing  when their indexes are far
> apart. 
> 3) An object with numeric indices. This is what I selected. It has the downside
> of sort of looking like an array, but is very compact for sending on the
> wire(no 0s or tons of commas). 
> 
> I think an histogram of form {range:[x,y], bucket_count:z, histogram_type:1,
> "0":1, "3":4, "5":2} is a pretty good tradeoff between being compact and easy
> to understand.

Fair enough. I wonder if it would still be a little nicer to put all the numerical named properties into their own object:

{range:[x,y], bucket_count:z, histogram_type:1, ranges: {"0":1, "3":4, "5":2} }

> > +                                       .getService(Ci.nsIIdleService)
> > +                                     idleService.addIdleObserver(self,
> > TELEMETRY_INTERVAL/1000); 
> > +                                     self.gatherMemory();
> > +                                   }
> > +                                 }, TELEMETRY_INTERVAL,
> > Ci.nsITimer.TYPE_ONE_SHOT);
> > 
> > Why is this initial timer necessary?
> 
> a) Incompetence. specifying a category to listen to in the manifest only seems
> to work for app-startup for me.

profile-after-change should work too, and prefs etc. should work then.

> b) ~1 minute after startup seems like a good threshold to start polling stats
> for telemetry.

So you always want to take the first measurement 1 minute after startup, instead of the first minute that the user is idle?
> Fair enough. I wonder if it would still be a little nicer to put all the
> numerical named properties into their own object:
> 
> {range:[x,y], bucket_count:z, histogram_type:1, ranges: {"0":1, "3":4, "5":2} }

I will do that.

> 
> > b) ~1 minute after startup seems like a good threshold to start polling stats
> > for telemetry.
> 
> So you always want to take the first measurement 1 minute after startup,
> instead of the first minute that the user is idle?

Yes. I think it's nice to get some stats soon after start, but not too soon that firefox is still lazy-initializing. This way we get some lower-bound numbers on memory usage, etc. I'm also worried that for some users idle + 1 minute may not happen often or at all (ie due to something keeping firefox or the rest of the OS busy...whether it's the user or malware).

This is my first stab at a workable heuristic, I'm sure there are better options for this.
Attached patch telemetry clientside + testcase (obsolete) — Splinter Review
Addressed review comments, added xpcshell testcase.
Attachment #529240 - Attachment is obsolete: true
Attachment #529837 - Flags: review?(dtownsend)
Attached patch telemetry clientside + testcase (obsolete) — Splinter Review
Slight update. This updates the protocol to use the url format that the metrics serverside requested. The format is:
http://<server>/submit/telemetry/<request id>
Attachment #529837 - Attachment is obsolete: true
Attachment #531110 - Flags: review?(dtownsend)
Attachment #529837 - Flags: review?(dtownsend)
Comment on attachment 531110 [details] [diff] [review]
telemetry clientside + testcase

Is this an old version of the patch? Seems to be some of my earlier comments not addressed yet.
Attached patch telemetry clientside + testcase (obsolete) — Splinter Review
I missed some spaces and a "the". Is there anything else?
Attachment #531110 - Attachment is obsolete: true
Attachment #531150 - Flags: review?(dtownsend)
Attachment #531110 - Flags: review?(dtownsend)
Comment on attachment 531150 [details] [diff] [review]
telemetry clientside + testcase

Review of attachment 531150 [details] [diff] [review]:
-----------------------------------------------------------------

Basically all nits at this point though there is something I'm not following with the test.

::: toolkit/components/telemetry/Makefile.in
@@ +19,5 @@
> +# Portions created by the Initial Developer are Copyright (C) 2011
> +# the Initial Developer. All Rights Reserved.
> +#
> +# Contributor(s):
> +#	Taras Glek <tglek@mozilla.com>

Indent by just two spaces here

::: toolkit/components/telemetry/TelemetryPing.js
@@ +43,5 @@
> +Cu.import("resource://gre/modules/Services.jsm");
> +Cu.import("resource://gre/modules/XPCOMUtils.jsm");
> +
> +const PREF_SERVER = "toolkit.telemetry.server";
> +// Do not gather data more than once a minute5B5B

5B5B?

@@ +49,5 @@
> +// about:memory values to turn into histograms
> +const MEM_HISTOGRAMS = {"heap-used/js/gc-heap": [1024, 1024 * 500, 10],
> +                        "mapped/heap/used": [1024, 2 * 1024 * 1024, 10],
> +                        "heap-used/layout/all": [1024, 50 * 1025, 10]
> +                       };

Missed this last time. Normal style is to format JS objects like so:

const MEM_HISTOGRAMS = {
  "heap-used/js/gc-heap": [...],
  ...
};

@@ +58,5 @@
> +
> +function log(msg) {
> +  dump(msg + "\n");
> +  Services.console.logStringMessage(msg);
> +}

This function doesn't seem to be used anywhere

@@ +78,5 @@
> +    let retgram = {range: [r[1], r[r.length - 1]],
> +                   bucket_count: r.length,
> +                   histogram_type: hgram.histogram_type,
> +                   values: {}
> +                  };

formatting as above

@@ +99,5 @@
> +      retgram.values[r[last]] = 0;
> +    ret[key] = retgram;
> +  }
> +  return ret;
> +}

small nit: this function would probably be a little more readable with some line breaks in it

@@ +111,5 @@
> + * @return request metadata
> + */
> +function getMetadata(reason) {
> +  let si = Cc["@mozilla.org/toolkit/app-startup;1"]
> +    .getService(Ci.nsIAppStartup).getStartupInfo();

Put the . at the end of the first line. It is against the style guide I know but for Cc..getService it is what we do everywhere else. Line the getService up with the Cc

@@ +118,5 @@
> +    uptime: (new Date() - si.process),
> +    reason: reason
> +  }
> +
> +  // The following do not work in xpcshell

You can make these work in tests by defining a custom nsIXULAppInfo: http://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/extensions/test/xpcshell/head_addons.js#22

@@ +142,5 @@
> +
> +  /**
> +   * Pull values from about:memory into corresponding histograms
> +   */
> +  gatherMemory:function gatherMemory() {

Space after the :

@@ +146,5 @@
> +  gatherMemory:function gatherMemory() {
> +    let mgr;
> +    try {
> +      mgr = Cc["@mozilla.org/memory-reporter-manager;1"]
> +            .getService(Ci.nsIMemoryReporterManager);

Same again with the . and everywhere else in the file where you've needed to split getting a component across multiple lines

@@ +170,5 @@
> +      let v = Math.floor(mr.memoryUsed / 1024);
> +      h.add(v);
> +    }
> +    return memReporters;
> +  },

Some line breaks would probably be good in here too

@@ +187,5 @@
> +
> +    // Generate a unique id once per session so the server can cope with duplicate submissions.
> +    // Use a deterministic url for testing.
> +    if (!this._path)
> +      this._path = "/submit/telemetry/" + (reason == "test-ping" ? reason : generateUUID());

What duplicates are we concerned about here? Isn't the point to get multiple sets of data as the application is running? I'm a little concerned about the privacy implications of this but I guess that should get brought up at the security review.

@@ +193,5 @@
> +    const TELEMETRY_PING = "telemetry.ping (ms)";
> +    const TELEMETRY_SUCCESS = "telemetry.success (No, Yes)";
> +
> +    let request = Cc["@mozilla.org/xmlextras/xmlhttprequest;1"]
> +                    .createInstance(Ci.nsIXMLHttpRequest);

Styling as mentioned previously

@@ +228,5 @@
> +
> +                                     self._server = Services.prefs.getCharPref(PREF_SERVER);
> +
> +                                     let idleService = Cc["@mozilla.org/widget/idleservice;1"]
> +                                        .getService(Ci.nsIIdleService);

Style as mentioned previously

@@ +232,5 @@
> +                                        .getService(Ci.nsIIdleService);
> +                                     idleService.addIdleObserver(self, TELEMETRY_INTERVAL / 1000); 
> +                                     self.gatherMemory();
> +                                   }
> +                                 }, TELEMETRY_INTERVAL, Ci.nsITimer.TYPE_ONE_SHOT);

You can actually just pass a function here, so style like this:

this._timer.initWithCallback(function() {
  if (!Services.pre....
}, TELEMETRY_INTERVAL, Ci.nsITimer.TYPE_ONE_SHOT);

I think it would be better to use a separate constant for this delay even if they end up with the same value.

::: toolkit/components/telemetry/tests/unit/test_TelemetryPing.js
@@ +7,5 @@
> +
> +function next_request ()
> +{
> +  let tp = Cc["@mozilla.org/base/telemetry-ping;1"].getService(Ci.nsIObserver);
> +  tp.observe(tp, "test-ping","http://localhost:4444");

Spaces after ,s

@@ +23,5 @@
> +function readBytesFromInputStream(inputStream, count) {
> +  let BinaryInputStream = Components.Constructor(
> +      "@mozilla.org/binaryinputstream;1",
> +      "nsIBinaryInputStream",
> +      "setInputStream");

Should just declare this globally

@@ +30,5 @@
> +  }
> +  return new BinaryInputStream(inputStream).readBytes(count);
> +}
> +
> +function dummyHandler(metadata, response)

So what is going on here, you fire off a first request that you basically ignore followed by another request that seems to be nothing more than a delay followed by the real request that you check. I'm not following why you don't just start with the real request.

@@ +52,5 @@
> +
> +function checkHistograms(request, response) {
> +  // do not need the http server anymore
> +  httpserver.stop(do_test_finished);
> +  let s= request.bodyInputStream

Spaces around =
Attachment #531150 - Flags: review?(dtownsend) → review-
Unique ids are in case a user submits multiple telemetry events (ie due to multiple idle-daily events in a multiday session). Then the server can replace the last record instead of double-counting it.

> @@ +30,5 @@
> > +  }
> > +  return new BinaryInputStream(inputStream).readBytes(count);
> > +}
> > +
> > +function dummyHandler(metadata, response)
> 
> So what is going on here, you fire off a first request that you basically
> ignore followed by another request that seems to be nothing more than a
> delay followed by the real request that you check. I'm not following why you
> don't just start with the real request.

First request is ignored because there isn't any predictable data in there that I can check. However, telemetry code keeps a histogram of past telemetry pings. This gets me a predictable histogram to check in the second request. The "nothing more than a delay" xhr is used to flush the queued up xhr response events in telemetry code so the known histogram has a chance to be generated.
(In reply to comment #22)
> >
> > So what is going on here, you fire off a first request that you basically
> > ignore followed by another request that seems to be nothing more than a
> > delay followed by the real request that you check. I'm not following why you
> > don't just start with the real request.
> 
> First request is ignored because there isn't any predictable data in there
> that I can check. However, telemetry code keeps a histogram of past
> telemetry pings. This gets me a predictable histogram to check in the second
> request. The "nothing more than a delay" xhr is used to flush the queued up
> xhr response events in telemetry code so the known histogram has a chance to
> be generated.

Sounds like that explanation should be in a comment :)
Attached patch telemetry clientside + testcase (obsolete) — Splinter Review
Hopefully this one is a keeper. Addressed review comments, added a comment explaining multiple XHRs to testcase.
Attachment #531150 - Attachment is obsolete: true
Attachment #531390 - Flags: review?(dtownsend)
(In reply to comment #24)
> Created attachment 531390 [details] [diff] [review] [review]
> telemetry clientside + testcase
> 
> Hopefully this one is a keeper. Addressed review comments, added a comment
> explaining multiple XHRs to testcase.

The notification in bug 652657 would require a knob to enable/disable telemetry. It doesn't feel right to rely on setting toolkit.telemetry.server for that. I'd suggest adding a toolkit.telemetry.enabled bool.
Comment on attachment 531390 [details] [diff] [review]
telemetry clientside + testcase

Review of attachment 531390 [details] [diff] [review]:
-----------------------------------------------------------------

I agree with comment 25, we need a way to turn this off.

Do we have a feature page written up for this yet?

::: toolkit/components/telemetry/TelemetryPing.js
@@ +190,5 @@
> +    const TELEMETRY_PING = "telemetry.ping (ms)";
> +    const TELEMETRY_SUCCESS = "telemetry.success (No, Yes)";
> +
> +    let request = Cc["@mozilla.org/xmlextras/xmlhttprequest;1"]
> +                  .createInstance(Ci.nsIXMLHttpRequest);

let request = Cc["@mozilla.org/xmlextras/xmlhttprequest;1"].
              createInstance(Ci.nsIXMLHttpRequest);

@@ +227,5 @@
> +                                   let idleService = Cc["@mozilla.org/widget/idleservice;1"].
> +                                       getService(Ci.nsIIdleService);
> +                                   idleService.addIdleObserver(self, TELEMETRY_INTERVAL); 
> +                                   self.gatherMemory();
> +                                 } , TELEMETRY_DELAY, Ci.nsITimer.TYPE_ONE_SHOT);

This is all over-indented:

this._timer.initWithCallback(function() {
  self._server = Services.prefs.getCharPref(PREF_SERVER);
  ...
} , TELEMETRY_DELAY, Ci.nsITimer.TYPE_ONE_SHOT);

@@ +243,5 @@
> +      this.gatherMemory();
> +      break;
> +    case "test-ping":
> +      // This allows for easy testing
> +      this._server = aData;

I wonder if it is a problem that once you've done a test-ping all subsequent pings are to the test server

::: toolkit/components/telemetry/tests/unit/test_TelemetryPing.js
@@ +52,5 @@
> +    // Override the path handler for the next submission
> +    httpserver.registerPathHandler(PATH, checkHistograms);
> +    next_request();
> +  };
> +  xhr.send(null);

This really doesn't feel safe to me and could end up as a source of intermittent orange. The best alternative I can think of is firing an observer notification after the telemetry ping is complete and just waiting for that.
Attachment #531390 - Flags: review?(dtownsend) → review-
(In reply to comment #26)
> Comment on attachment 531390 [details] [diff] [review] [review]
> telemetry clientside + testcase
> 
> Review of attachment 531390 [details] [diff] [review] [review]:
> -----------------------------------------------------------------
> 
> I agree with comment 25, we need a way to turn this off.
> 
> Do we have a feature page written up for this yet?

It's a p1 for ff6. 

https://wiki.mozilla.org/Platform/Features/Telemetry

> 
> this._timer.initWithCallback(function() {
>   self._server = Services.prefs.getCharPref(PREF_SERVER);
>   ...
> } , TELEMETRY_DELAY, Ci.nsITimer.TYPE_ONE_SHOT);
> 
> @@ +243,5 @@
> > +      this.gatherMemory();
> > +      break;
> > +    case "test-ping":
> > +      // This allows for easy testing
> > +      this._server = aData;
> 
> I wonder if it is a problem that once you've done a test-ping all subsequent
> pings are to the test server

Good point, easy to fix.
> 
> ::: toolkit/components/telemetry/tests/unit/test_TelemetryPing.js
> @@ +52,5 @@
> > +    // Override the path handler for the next submission
> > +    httpserver.registerPathHandler(PATH, checkHistograms);
> > +    next_request();
> > +  };
> > +  xhr.send(null);
> 
> This really doesn't feel safe to me and could end up as a source of
> intermittent orange. The best alternative I can think of is firing an
> observer notification after the telemetry ping is complete and just waiting
> for that.

I don't feel that way. The order of events is very deterministic and I don't see how even a change in implementation of xhr/event-loop would make it less so.

I'll post another patch addressing the nits in a few minutes.

Dave, I don't feel comfortable delaying landing this much longer. We just spent a 3rd of the firefox6 window on review ping-pong. We need to land this asap so people have a chance to develop and test telemetry probes. I'd rather address any further changes in follow-up patches.
(In reply to comment #27)
> (In reply to comment #26)
> > ::: toolkit/components/telemetry/tests/unit/test_TelemetryPing.js
> > @@ +52,5 @@
> > > +    // Override the path handler for the next submission
> > > +    httpserver.registerPathHandler(PATH, checkHistograms);
> > > +    next_request();
> > > +  };
> > > +  xhr.send(null);
> > 
> > This really doesn't feel safe to me and could end up as a source of
> > intermittent orange. The best alternative I can think of is firing an
> > observer notification after the telemetry ping is complete and just waiting
> > for that.
> 
> I don't feel that way. The order of events is very deterministic and I don't
> see how even a change in implementation of xhr/event-loop would make it less
> so.

I disagree and before I posted the review I double checked with sdwilsh who agreed with me. You're relying on the load even for the telemetry triggered xhr to fire before the load event for the test triggered xhr. Nothing guarantees that. Using an observer notification would give a hard guarantee.

> Dave, I don't feel comfortable delaying landing this much longer. We just
> spent a 3rd of the firefox6 window on review ping-pong. We need to land this
> asap so people have a chance to develop and test telemetry probes. I'd
> rather address any further changes in follow-up patches.

I don't feel comfortable with allowing code to land that hasn't finished its review cycle and nothing remaining here seems like anything that could be follow-up fodder. We have just under two weeks till the next merge to aurora, I don't see any reason why we wouldn't make that, but even if the merge is missed it'll just get on the next train.
(In reply to comment #28)
> > I don't feel that way. The order of events is very deterministic and I don't
> > see how even a change in implementation of xhr/event-loop would make it less
> > so.
> 
> I disagree and before I posted the review I double checked with sdwilsh who
> agreed with me. You're relying on the load even for the telemetry triggered
> xhr to fire before the load event for the test triggered xhr. Nothing
> guarantees that. Using an observer notification would give a hard guarantee.

Perhaps I should diagram my understanding:
telemetry ping --->
(dummy xhr requested async from ping response handler)
<--- ping response finish
(event loop handles ping response)
dummy xhr sent --->
<--- dummy xhr response
(event loop starts dummy xhr handler)
(telemetry ping requested)
---> temetry xhr request
...

I don't see how the dummy xhr handler would get ahead of the first telemetry-ping-handler in the event queue. Seems like event order has to depend on order of socket events. I'm reluctant to significantly change the telemetry xhr handling to accommodate testing via the observer interface unless this is a real problem.
Attached patch telemetry clientside + testcase (obsolete) — Splinter Review
* Added toolkit.telemetry.enabled, toolkit.telemetry.server to all.js.
* Cleaned up timer indentation. Sorry for having so much trouble with this
* Made the test ping not clobber the server set in the prefs
Attachment #531390 - Attachment is obsolete: true
Attachment #531774 - Flags: review?(dtownsend)
(In reply to comment #29)
> (In reply to comment #28)
> > > I don't feel that way. The order of events is very deterministic and I don't
> > > see how even a change in implementation of xhr/event-loop would make it less
> > > so.
> > 
> > I disagree and before I posted the review I double checked with sdwilsh who
> > agreed with me. You're relying on the load even for the telemetry triggered
> > xhr to fire before the load event for the test triggered xhr. Nothing
> > guarantees that. Using an observer notification would give a hard guarantee.
> 
> Perhaps I should diagram my understanding:
> telemetry ping --->
> (dummy xhr requested async from ping response handler)
> <--- ping response finish
> (event loop handles ping response)
> dummy xhr sent --->
> <--- dummy xhr response
> (event loop starts dummy xhr handler)
> (telemetry ping requested)
> ---> temetry xhr request
> ...
> 
> I don't see how the dummy xhr handler would get ahead of the first
> telemetry-ping-handler in the event queue. Seems like event order has to
> depend on order of socket events. I'm reluctant to significantly change the
> telemetry xhr handling to accommodate testing via the observer interface
> unless this is a real problem.

I'm not asking for any significant change. Just send an observer notification at the end of the onload/onerror functions. The test listens to those to know when it can proceed.
(In reply to comment #31)
> I'm not asking for any significant change. Just send an observer
> notification at the end of the onload/onerror functions. The test listens to
> those to know when it can proceed.

Sorry, nevermind my previous complaint about complicating code. I misunderstood where you wanted the observer until this clarification.
Attachment #531774 - Attachment is obsolete: true
Attachment #531774 - Flags: review?(dtownsend)
Comment on attachment 531801 [details] [diff] [review]
telemetry clientside + testcase

Review of attachment 531801 [details] [diff] [review]:
-----------------------------------------------------------------

Don't know if you meant to request review on this but it looks good to me. Just a couple of new comments but I don't need to see this again I think

::: modules/libpref/src/init/all.js
@@ +274,5 @@
>  
> +// Telemetry
> +pref("toolkit.telemetry.enabled", false);
> +// Telemetry test server to be used until the official one is public
> +pref("toolkit.telemetry.server", "http://telemetry.allizom.org");

Should we be having nightlies point to the test server? And non-ssl at that? I presume we'll move this to a real server before it hits aurora?

::: toolkit/components/telemetry/TelemetryPing.js
@@ +198,5 @@
> +    let hsuccess = Telemetry.newLinearHistogram(TELEMETRY_SUCCESS, 1, 2, 3);
> +
> +    let url = server + this._path;
> +    request.open("POST", url, true);
> +    request.overrideMimeType("text/plain");

Speaking of ssl above made me remember you should set request.mozBackgroundRequest = true to just fail on any security-related errors.

::: toolkit/components/telemetry/tests/unit/test_TelemetryPing.js
@@ +30,5 @@
> +
> +  Services.obs.addObserver(function (aSubject, aTopic, aData) {
> +    httpserver.registerPathHandler(PATH, checkHistograms);
> +    telemetry_ping();
> +   }, "telemetry-test-xhr-complete", false);

You should remove the observer listener once it's no longer needed.
Attachment #531801 - Flags: review+
(In reply to comment #33)
> 
> Don't know if you meant to request review on this but it looks good to me.
> Just a couple of new comments but I don't need to see this again I think

Yeah, thanks for catching that.

> Should we be having nightlies point to the test server? And non-ssl at that?
> I presume we'll move this to a real server before it hits aurora?

Yes, once the metrics server is public.

> Speaking of ssl above made me remember you should set
> request.mozBackgroundRequest = true to just fail on any security-related
> errors.

ok.

> You should remove the observer listener once it's no longer needed.

ok
Attachment #529548 - Flags: review?(bent.mozilla) → review+
http://hg.mozilla.org/mozilla-central/rev/57af04c3dec4
http://hg.mozilla.org/mozilla-central/rev/07a15c006d9c
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Please mark all new features with the appropriate tracking flag. It's not clear from this bug whether the feature is on yet, or if this is only supporting code, and will be turned on in a later bug. Glandium says we're collecting data by default now, but not sending it.

Did the security review for this feature get completed?

Is there server infrastructure that needs to be considered before this is widely deployed?

Have all the necessary changes to the privacy policy been made?
Target Milestone: --- → mozilla6
(In reply to comment #37)
> Please mark all new features with the appropriate tracking flag. It's not
> clear from this bug whether the feature is on yet, or if this is only
> supporting code, and will be turned on in a later bug. Glandium says we're
> collecting data by default now, but not sending it.

Yes
> 
> Did the security review for this feature get completed?

Not yet.

> 
> Is there server infrastructure that needs to be considered before this is
> widely deployed?

Metrics is handling that. bug 655746 is the security review for it.

> 
> Have all the necessary changes to the privacy policy been made?

Almost. bug 652656.
Depends on: 658735
Depends on: 663423
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: