Closed Bug 863872 Opened 11 years ago Closed 10 years ago

Force sending of Telemetry if we haven't submitted anything for a week

Categories

(Toolkit :: Telemetry, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla29
Tracking Status
firefox27 --- fixed
firefox28 - fixed
firefox29 --- fixed

People

(Reporter: vladan, Assigned: mconley)

References

Details

(Whiteboard: [Telemetry:P1][Australis:P-])

Attachments

(1 file, 12 obsolete files)

21.64 KB, patch
mconley
: review+
mconley
: feedback+
Details | Diff | Splinter Review
The impetus for this request is bug 843807: The FHR team noticed that the number of Telemetry idle-daily pings on Beta is far lower than predicted by the Telemetry opt-in rates reported by FHR.

The likely cause of this is idle-daily not firing frequently enough. 

The SIMPLE_MEASURES_SAVED_PINGS histogram shows that 13% of idle-daily submissions on Nightly over the past week are sending 10 or more saved-session pings. Additionally, the SIMPLE_MEASURES_UPTIME histogram shows that 75% of sessions last less than an hour (again, using saved-session Nightly pings from last 7 days). It's very likely that Firefox never gets the idle-daily event simply because some users consistently have short sessions or their computers have short hibernation timeouts. Furthermore, Telemetry deletes saved-session pings older than 7 days, preventing us from ever submitting pings from users with these types of sessions. Finally, Telemetry waits for an additional 5 minutes of idleness after idle-daily before submitting pings (see IDLE_TIMEOUT_SECONDS).

We should consider forcing Telemetry submission X seconds or minutes after Firefox startup without waiting for system-idle if we have a lot of backlogged pings. We will need to be mindful of negatively affecting user experience with main-thread jank or repeated send attempts when the user is offline and submission consistently fails.
I think we should also bump old ping retention to 2 weeks
Whiteboard: [Telemetry:P1]
Just a few followup comments:

* Even with a five second delay, we had user complaints about Sync not updating quickly.  Five minutes is a long time for a modern PC to be completely idle without going to standby.
* the 5 minute idle observer is five minutes of total idle, not additional idle. (idle observers set for less than the current amount of idle will fire almost immediately)  It's still 2-4 minutes longer than the idle-daily trigger.  If we want to continue using idle-daily, I would recommend triggering submission on a much shorter cutoff (perhaps a few seconds) to increase the odds that this will work.  If the user is still idle it'll fire immediately, and if they've returned from idle it's still a relatively quick request that can be done during minimal idle.
* We should make sure to label these pings so we know how often we're having to force submissions.  (I assume we'd do that, but wanted to make sure we didn't forget.)
* It's probably sufficient to say "if N saved pings or we haven't submitted in X days, set a short idle timer during startup" and assume that'll work, or we'll get it next time.
According to mconnor, fixing this bug would go a long way towards making the Australis measurements we're collecting in bug 935093 more meaningful / representative.

I'd like to tackle this bug, if I can.
I think the solution that's being proposed is as follows:

* Reduce IDLE_TIMEOUT_SECONDS from 5 minutes to X seconds so that there's less likelihood of a machine going to sleep before it fire off a ping.
* On start-up, determine if we have crossed some threshold of stored pings, and attempt to send them out.

I know a few other ideas got floated around, mconnor - did I miss any important actions in here?
Flags: needinfo?(mconnor)
I think those are the right two actions.  We talked about more blue-sky type things, but in terms of minimal viable patch, I'd say those are the two pieces.  I'm not sure you can lower IDLE_TIMEOUT_SECONDS too much, because you're narrowing a gap between 5 minutes and 3 (default idle-daily interval) so there's a limit.  I'd at least like to get under the 5 minute mark, since that's a common increment for Windows power management profiles.  Even if you just knocked 30 seconds off you'd be in better shape.

For the second one, I'd say that (given the current memory overhead from holding the strings in memory) any more than five should be submitted post-haste.  Given experience with Sync and five second idle timers, I'd say we should cap at five and be aggressive.  As Irving noted, the majority of this stuff happens with async calls, so it's not the end of the world if we occasionally push while the user is potentially active.
Flags: needinfo?(mconnor)
Assignee: nobody → mconley
Status: NEW → ASSIGNED
Dunno if this is the right place to do this, since I'm pretty new to TelemetryPing.js (and Telemetry code in general for that matter), but this is where I'm starting.

I'm going to try to write tests for this, so it's not quite done - but am I going about this The Right Way?
Attachment #8342434 - Flags: feedback?(irving)
Comment on attachment 8342434 [details] [diff] [review]
Part 2: Send off all pings if there are more than MAX_PING_THRESHOLD stored

a-nope, it looks like this patch attempts to send all pings for each saved ping (the loadSavedPings callback is fired once per saved ping).

Thank goodness for testing.
Attachment #8342434 - Flags: feedback?(irving)
(In reply to Mike Conley (:mconley) from comment #8)
> Comment on attachment 8342434 [details] [diff] [review]
> Part 2: Send off all pings if there are more than MAX_PING_THRESHOLD stored
> 
> a-nope, it looks like this patch attempts to send all pings for each saved
> ping (the loadSavedPings callback is fired once per saved ping).
> 
> Thank goodness for testing.

This is the right idea, you probably just want a separate at-the-end-of-all-loads callback passed into loadSavedPings.
I think this will do the job.

All Telemetry unit tests, including my new one, pass. Doing self review, and then I'll request the real deal.
Attachment #8342434 - Attachment is obsolete: true
Ok, I think this is worth looking at now.
Attachment #8342623 - Attachment is obsolete: true
Attachment #8342624 - Flags: review?(nfroyd)
Attachment #8342624 - Flags: feedback?(irving)
Comment on attachment 8342430 [details] [diff] [review]
Part 1: Decrease IDLE_TIMEOUT_SECONDS by 1 minute

How does 4 minutes sound?
Attachment #8342430 - Flags: review?(nfroyd)
Attachment #8342430 - Flags: feedback?(irving)
After talking with Taras, we're thinking the logic should be more like:

After Telemetry startup, load saved pings. If any are pretty old, attempt to send those off. Regardless of whether or not that succeeds, delete the file after the request is completed.

We're thinking we might want to try to send _all_ saved pings if we find an old one.
Comment on attachment 8342624 [details] [diff] [review]
Part 2 (v2.1): Send off all pings if there are more than MAX_PING_THRESHOLD stored

Based on my conversation with Taras, I'm taking this back to the lab. Back in a bit.
Attachment #8342624 - Flags: review?(nfroyd)
Attachment #8342624 - Flags: feedback?(irving)
Self-review time...
Attachment #8342624 - Attachment is obsolete: true
Ok, ready for someone to kick the tires.

All Telemetry tests pass with this patch.
Attachment #8342685 - Attachment is obsolete: true
Attachment #8342693 - Flags: review?(nfroyd)
Attachment #8342693 - Flags: feedback?(irving)
Comment on attachment 8342430 [details] [diff] [review]
Part 1: Decrease IDLE_TIMEOUT_SECONDS by 1 minute

Review of attachment 8342430 [details] [diff] [review]:
-----------------------------------------------------------------

The existing code doesn't work correctly, so if we're in here I'd like to fix it.

The problem is that if the system has been idle for at least 5 (or, after the patch, 4) minutes *before* the idle-daily timer fires, we'll send out the gather-telemetry notification and then on the next clock we'll immediately collect the telemetry packet. Rather than using idle for that, I'd rather we just waited a fixed time (30 seconds or so) and then collect the telemetry without waiting for another idle signal.
Attachment #8342430 - Flags: feedback?(irving) → feedback-
Comment on attachment 8342693 [details] [diff] [review]
Part 2 (v3.1): Send off all pings if there are more than MAX_PING_THRESHOLD stored

Review of attachment 8342693 [details] [diff] [review]:
-----------------------------------------------------------------

The ideas look good, but there are a few things that need to be addressed below.

Also, the commit message for this patch should be changed to reflect what the patch is doing now.

::: toolkit/components/telemetry/TelemetryFile.jsm
@@ +286,5 @@
>    }
> +
> +  let now = new Date();
> +  if (file.exists() && (now - file.lastModifiedTime > MAX_PING_FILE_AGE)) {
> +    loadedOverduePings = true;

This is going to set loadedOverduePings at the wrong point if:

- You wanted to check loadedOverduePings in your onDone() handler;
- The only overdue ping was the last one to load.

...since your onDone handler fires before you get here.

@@ +289,5 @@
> +  if (file.exists() && (now - file.lastModifiedTime > MAX_PING_FILE_AGE)) {
> +    loadedOverduePings = true;
> +    // We haven't had much luck in sending this file; now that
> +    // we've queued it into memory, clear it off the disk.
> +    file.remove(true);

FWIW, the ping is just going to get saved to disk again if sending fails.  You may want to mark pings like this somehow as OK to not re-save or something.

::: toolkit/components/telemetry/TelemetryPing.js
@@ +830,5 @@
> +        {
> +          // If we have any TelemetryPings lying around, we'll be aggressive
> +          // and try to send them all off.
> +          if (TelemetryFile.loadedOverduePings) {
> +            this.send("overdue-flush", this._server);

You should check with mreid that it's OK to add another reason like this.  (I think having another reason is fine; lets us see how many people have overdue pings lying around.)

::: toolkit/components/telemetry/tests/unit/head.js
@@ +53,5 @@
> +
> +function decodeRequestPayload(request) {
> +  let s = request.bodyInputStream;
> +  let payload = null;
> +  let decoder = Cc["@mozilla.org/dom/json;1"].createInstance(Ci.nsIJSON)

Cc and friends aren't defined in head.js...they probably should be so we don't wind up with problems later.

::: toolkit/components/telemetry/tests/unit/test_TelemetrySendOldPings.js
@@ +101,5 @@
> + */
> +function assertPings(aExpectedNum) {
> +  let deferred = Promise.defer();
> +
> +  do_timeout(PING_TIMEOUT_LENGTH, function() {

Just to make sure I understand this: You're requiring that all the pings be read off the disk, sent, and processed before the timeout handler goes off 500ms later?  That sounds like a recipe for intermittent oranges.

@@ +114,5 @@
> +  return deferred.promise;
> +}
> +
> +/**
> + * Quick and dirty way to clears TelemetryFile's pendingPings

Nit: "to clear"

@@ +115,5 @@
> +}
> +
> +/**
> + * Quick and dirty way to clears TelemetryFile's pendingPings
> + * collection, and put it back in it's initial state.

Nit: "its initial state"
Attachment #8342693 - Flags: review?(nfroyd) → feedback+
(In reply to :Irving Reid from comment #17)
> The existing code doesn't work correctly, so if we're in here I'd like to
> fix it.

Perhaps, but I don't think this is necessarily the right place to fix it.

> The problem is that if the system has been idle for at least 5 (or, after
> the patch, 4) minutes *before* the idle-daily timer fires, we'll send out
> the gather-telemetry notification and then on the next clock we'll

What is "the next clock" referring to here?
(In reply to Nathan Froyd (:froydnj) from comment #19)
> > The problem is that if the system has been idle for at least 5 (or, after
> > the patch, 4) minutes *before* the idle-daily timer fires, we'll send out
> > the gather-telemetry notification and then on the next clock we'll
> 
> What is "the next clock" referring to here?

You're right, that wasn't clear. I just went and refreshed my memory of the nsIdleService code. The service always adds 10ms to the requested wait time, as far as I can tell because an earlier developer was concerned about asking nsITimer for too short of a timeout.

This means that if the browser has been idle for more than 5 (4) minutes, we will collect the telemetry data packet about 10 ms after notifying gather-telemetry.
Ok, summarizing the goals of this bug now that I've had a chance to talk with vladan, irving, and froydnj about how best to move forward:

1) Modify TelemetryPing to use the blocklist notification rather than idle-daily, because idle-daily is sketchy / unreliable.
2) On Telemetry setup, when we load pings:
  a) Destroy pings older than 2 weeks, and do not load them into memory. They're lost. We may want to record this fact in Telemetry somehow, but I will not be covering that part.
  b) If we detect leftover remaining pings that are older than 1 ping, schedule to send them all in as soon as a browser-delayed-startup-finished is fired.

There were some thoughts about restricting the number of pings we send in one shot in (b), but I will not be addressing that here.

/me cracks knuckles
Attachment #8342693 - Flags: feedback?(irving)
Attachment #8342430 - Flags: review?(nfroyd)
Just checkpointing work before I go home.
Attachment #8342430 - Attachment is obsolete: true
Attachment #8342693 - Attachment is obsolete: true
Attachment #8343365 - Attachment is obsolete: true
Ok, I think I'm ready for somebody to take a look.
Attachment #8343793 - Attachment is obsolete: true
Attachment #8343796 - Flags: review?(nfroyd)
Attachment #8343796 - Flags: feedback?(irving)
I split off the idle-daily -> update-timer work to bug 947294.
Comment on attachment 8343796 [details] [diff] [review]
Send old pings out after Telemetry Startup - Patch v1.1

Review of attachment 8343796 [details] [diff] [review]:
-----------------------------------------------------------------

WFM!

::: toolkit/components/telemetry/tests/unit/test_TelemetryPing.js
@@ +141,5 @@
>    if (request.getHeader("content-encoding") == "gzip") {
>      let observer = {
>        buffer: "",
>        onStreamComplete: function(loader, context, status, length, result) {
> +        this.buffer = String.fromCharCode.apply(this, result);

Kill them tabs, kill 'em good!

::: toolkit/components/telemetry/tests/unit/test_TelemetrySendOldPings.js
@@ +21,5 @@
> +const PREF_SERVER = "toolkit.telemetry.server";
> +const PING_SAVE_FOLDER = "saved-telemetry-pings";
> +const PING_TIMEOUT_LENGTH = 1500;
> +const EXPIRED_PING_FILE_AGE = (14 * 24 * 60 * 60 * 1000) + 1; // 2 weeks, 1ms old.
> +const OVERDUE_PING_FILE_AGE = (7 * 24 * 60 * 60 * 1000) + 1; // 1 week, 1ms old

WDYT about exporting these from TelemetryFile.jsm instead of duplicating the oldness criteria here?  Or basing them off exported symbols from TelemetryFile.jsm, perhaps?

@@ +125,5 @@
> + */
> +function assertReceivedPings(aExpectedNum) {
> +  let deferred = Promise.defer();
> +
> +  do_timeout(PING_TIMEOUT_LENGTH, function() {

As previously, this still makes me nervous about intermittents, but I guess failing hard in a short timeout is better than the whole test timing out.
Attachment #8343796 - Flags: review?(nfroyd) → review+
Comment on attachment 8343796 [details] [diff] [review]
Send old pings out after Telemetry Startup - Patch v1.1

Review of attachment 8343796 [details] [diff] [review]:
-----------------------------------------------------------------

f- until we figure out where the pingsOverdue and pingsDiscarded simple measures go relative to the existing if() statement around pingsLoaded

::: toolkit/components/telemetry/TelemetryFile.jsm
@@ +154,5 @@
>     * @param {function*} onLoad A function called upon loading of each
>     * ping. It is passed |true| in case of success, |false| in case of
>     * format error.
>     */
> +  loadSavedPings: function(sync, onLoad = null, onDone = null) {

I still think now is the time to get rid of the per-file onLoad callback (and the difficult-to-analyze telemetry histogram it populates).

@@ +195,5 @@
>      if (sync) {
>        let stream = Cc["@mozilla.org/network/file-input-stream;1"]
>                     .createInstance(Ci.nsIFileInputStream);
>        stream.init(file, -1, -1, 0);
> +      addToPendingPings(file, stream, onLoad, onDone);

I don't think we ever call the sync branch of this code, and we never want to in future. Can we trash it now?

::: toolkit/components/telemetry/TelemetryPing.js
@@ +217,5 @@
>      } catch(e) {
>      }
>      if (!forSavedSession || hasPingBeenSent) {
>        ret.savedPings = TelemetryFile.pingsLoaded;
>      }

We either want the pingsOverdue and pingsDiscarded to be inside this if {}, or we want all three values to be always included in the ping. Nathan, I think you added this condition?

@@ +958,5 @@
>     */
>    observe: function (aSubject, aTopic, aData) {
>      switch (aTopic) {
>      case "profile-after-change":
> +      this.setup(aData == "testing");

Do we really want to change this to "testing"?

::: toolkit/components/telemetry/tests/unit/test_TelemetrySendOldPings.js
@@ +21,5 @@
> +const PREF_SERVER = "toolkit.telemetry.server";
> +const PING_SAVE_FOLDER = "saved-telemetry-pings";
> +const PING_TIMEOUT_LENGTH = 1500;
> +const EXPIRED_PING_FILE_AGE = (14 * 24 * 60 * 60 * 1000) + 1; // 2 weeks, 1ms old.
> +const OVERDUE_PING_FILE_AGE = (7 * 24 * 60 * 60 * 1000) + 1; // 1 week, 1ms old

Rather than exporting them just for test purposes, load the module into a scope in this test and refer to them through that.

@@ +48,5 @@
> + *             of 10 would make the ping 10ms older than now, for
> + *             example.
> + * @returns an Array with the created pings.
> + */
> +function createSavedPings(aNum, aAge) {

Is this duplicated code from another test?
Attachment #8343796 - Flags: feedback?(irving) → feedback-
(In reply to :Irving Reid from comment #27)
> Comment on attachment 8343796 [details] [diff] [review]
> Send old pings out after Telemetry Startup - Patch v1.1
> 
> Review of attachment 8343796 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> f- until we figure out where the pingsOverdue and pingsDiscarded simple
> measures go relative to the existing if() statement around pingsLoaded
> 
> ::: toolkit/components/telemetry/TelemetryFile.jsm
> @@ +154,5 @@
> >     * @param {function*} onLoad A function called upon loading of each
> >     * ping. It is passed |true| in case of success, |false| in case of
> >     * format error.
> >     */
> > +  loadSavedPings: function(sync, onLoad = null, onDone = null) {
> 
> I still think now is the time to get rid of the per-file onLoad callback
> (and the difficult-to-analyze telemetry histogram it populates).
> 

I'd really like to avoid adding more scope to this bug as it is. :/

> @@ +195,5 @@
> >      if (sync) {
> >        let stream = Cc["@mozilla.org/network/file-input-stream;1"]
> >                     .createInstance(Ci.nsIFileInputStream);
> >        stream.init(file, -1, -1, 0);
> > +      addToPendingPings(file, stream, onLoad, onDone);
> 
> I don't think we ever call the sync branch of this code, and we never want
> to in future. Can we trash it now?
> 

Again, smells like scope creep, and something we might want to do in a follow-up...

> ::: toolkit/components/telemetry/TelemetryPing.js
> @@ +217,5 @@
> >      } catch(e) {
> >      }
> >      if (!forSavedSession || hasPingBeenSent) {
> >        ret.savedPings = TelemetryFile.pingsLoaded;
> >      }
> 
> We either want the pingsOverdue and pingsDiscarded to be inside this if {},
> or we want all three values to be always included in the ping. Nathan, I
> think you added this condition?
> 

Alright, let's wait to hear back from Nathan on that.

> @@ +958,5 @@
> >     */
> >    observe: function (aSubject, aTopic, aData) {
> >      switch (aTopic) {
> >      case "profile-after-change":
> > +      this.setup(aData == "testing");
> 
> Do we really want to change this to "testing"?
> 

This is just my hacky way of not waiting for 1 minute before Telemetry loads its pings from disk. I could use another string if you'd like - just so long as we can signal to TelemetryPing.setup to shortcut the startup.

> ::: toolkit/components/telemetry/tests/unit/test_TelemetrySendOldPings.js
> @@ +21,5 @@
> > +const PREF_SERVER = "toolkit.telemetry.server";
> > +const PING_SAVE_FOLDER = "saved-telemetry-pings";
> > +const PING_TIMEOUT_LENGTH = 1500;
> > +const EXPIRED_PING_FILE_AGE = (14 * 24 * 60 * 60 * 1000) + 1; // 2 weeks, 1ms old.
> > +const OVERDUE_PING_FILE_AGE = (7 * 24 * 60 * 60 * 1000) + 1; // 1 week, 1ms old
> 
> Rather than exporting them just for test purposes, load the module into a
> scope in this test and refer to them through that.
> 

I'll give that a shot and I'll let you know how it goes!

> @@ +48,5 @@
> > + *             of 10 would make the ping 10ms older than now, for
> > + *             example.
> > + * @returns an Array with the created pings.
> > + */
> > +function createSavedPings(aNum, aAge) {
> 
> Is this duplicated code from another test?

Nope - I wrote that myself.
(In reply to Mike Conley (:mconley) from comment #28)
> (In reply to :Irving Reid from comment #27)
> > ::: toolkit/components/telemetry/TelemetryFile.jsm
> > @@ +154,5 @@
> > >     * @param {function*} onLoad A function called upon loading of each
> > >     * ping. It is passed |true| in case of success, |false| in case of
> > >     * format error.
> > >     */
> > > +  loadSavedPings: function(sync, onLoad = null, onDone = null) {
> > 
> > I still think now is the time to get rid of the per-file onLoad callback
> > (and the difficult-to-analyze telemetry histogram it populates).
> > 
> 
> I'd really like to avoid adding more scope to this bug as it is. :/

Yes please.  Somebody file a followup if we want to fix this.  Heck, I'll even fix it if somebody else files the bug.

> > @@ +195,5 @@
> > >      if (sync) {
> > >        let stream = Cc["@mozilla.org/network/file-input-stream;1"]
> > >                     .createInstance(Ci.nsIFileInputStream);
> > >        stream.init(file, -1, -1, 0);
> > > +      addToPendingPings(file, stream, onLoad, onDone);
> > 
> > I don't think we ever call the sync branch of this code, and we never want
> > to in future. Can we trash it now?
> > 
> 
> Again, smells like scope creep, and something we might want to do in a
> follow-up...

Agreed.

> > ::: toolkit/components/telemetry/TelemetryPing.js
> > @@ +217,5 @@
> > >      } catch(e) {
> > >      }
> > >      if (!forSavedSession || hasPingBeenSent) {
> > >        ret.savedPings = TelemetryFile.pingsLoaded;
> > >      }
> > 
> > We either want the pingsOverdue and pingsDiscarded to be inside this if {},
> > or we want all three values to be always included in the ping. Nathan, I
> > think you added this condition?
> > 
> 
> Alright, let's wait to hear back from Nathan on that.

Ah, Irving's right: we only want to add these measurements when we're submitting the first ping from a live session.  Please move pings{Overdue,Discarded} inside that if.

> > @@ +958,5 @@
> > >     */
> > >    observe: function (aSubject, aTopic, aData) {
> > >      switch (aTopic) {
> > >      case "profile-after-change":
> > > +      this.setup(aData == "testing");
> > 
> > Do we really want to change this to "testing"?
> > 
> 
> This is just my hacky way of not waiting for 1 minute before Telemetry loads
> its pings from disk. I could use another string if you'd like - just so long
> as we can signal to TelemetryPing.setup to shortcut the startup.

For other places in toolkit/, we've used prefs for this sort of thing.  I don't have strong preferences either way.  I think it'd be slightly better to not be passing data through the observer service here if you really want an opinion.

Adding a method to nsITelemetryPing and making TelemetryPing implement that is usually how we do testing-only sorts of things nowadays, now that I think about it.
Okie doke, updating the patch...
Comment on attachment 8343796 [details] [diff] [review]
Send old pings out after Telemetry Startup - Patch v1.1

Review of attachment 8343796 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryFile.jsm
@@ +154,5 @@
>     * @param {function*} onLoad A function called upon loading of each
>     * ping. It is passed |true| in case of success, |false| in case of
>     * format error.
>     */
> +  loadSavedPings: function(sync, onLoad = null, onDone = null) {

So, since we said we didn't want to add a histogram entry for every saved ping, I don't think we need onLoad anymore.. You can put the histogram add() call in onDone

@@ +301,5 @@
>    }
> +
> +  let now = new Date();
> +  if (file.exists() && (now - file.lastModifiedTime > OVERDUE_PING_FILE_AGE)) {
> +    pingsOverdue++;

Why not put this check next to the pingsDiscarded check?

::: toolkit/components/telemetry/TelemetryPing.js
@@ +836,5 @@
> +          if (TelemetryFile.pingsOverdue > 0) {
> +            // It doesn't really matter what we pass to this.send as a reason,
> +            // since it's never sent to the server. All that this.send does with
> +            // the reason is check to make sure it's not a test-ping.
> +            this.send("saved-session", this._server);

We should add another reason field ("trigger"?) to the "info" section of saved-session payloads to distinguish startup submissions vs daily timer submissions
Attachment #8343796 - Flags: feedback- → feedback?(irving)
(In reply to Vladan Djeric (:vladan) from comment #31)
> ::: toolkit/components/telemetry/TelemetryPing.js
> @@ +836,5 @@
> > +          if (TelemetryFile.pingsOverdue > 0) {
> > +            // It doesn't really matter what we pass to this.send as a reason,
> > +            // since it's never sent to the server. All that this.send does with
> > +            // the reason is check to make sure it's not a test-ping.
> > +            this.send("saved-session", this._server);
> 
> We should add another reason field ("trigger"?) to the "info" section of
> saved-session payloads to distinguish startup submissions vs daily timer
> submissions

The ping for the current session is going to include a simple measure showing that we loaded N overdue pings; if that number is non-zero we know that there was a send at startup.
Comment on attachment 8343796 [details] [diff] [review]
Send old pings out after Telemetry Startup - Patch v1.1

Bugzilla's doing funny things...
Also, I didn't see Nathan's latest comments about scope-creep before I posted mine. Nathan is the authority on this bug :)
Attachment #8343796 - Flags: feedback?(irving)
Attachment #8343796 - Attachment is obsolete: true
(In reply to Nathan Froyd (:froydnj) from comment #26)
> Comment on attachment 8343796 [details] [diff] [review]
> Send old pings out after Telemetry Startup - Patch v1.1
> 
> Review of attachment 8343796 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> WFM!
> 
> ::: toolkit/components/telemetry/tests/unit/test_TelemetryPing.js
> @@ +141,5 @@
> >    if (request.getHeader("content-encoding") == "gzip") {
> >      let observer = {
> >        buffer: "",
> >        onStreamComplete: function(loader, context, status, length, result) {
> > +        this.buffer = String.fromCharCode.apply(this, result);
> 
> Kill them tabs, kill 'em good!
> 

Sometimes I just can't help myself.

> ::: toolkit/components/telemetry/tests/unit/test_TelemetrySendOldPings.js
> @@ +21,5 @@
> > +const PREF_SERVER = "toolkit.telemetry.server";
> > +const PING_SAVE_FOLDER = "saved-telemetry-pings";
> > +const PING_TIMEOUT_LENGTH = 1500;
> > +const EXPIRED_PING_FILE_AGE = (14 * 24 * 60 * 60 * 1000) + 1; // 2 weeks, 1ms old.
> > +const OVERDUE_PING_FILE_AGE = (7 * 24 * 60 * 60 * 1000) + 1; // 1 week, 1ms old
> 
> WDYT about exporting these from TelemetryFile.jsm instead of duplicating the
> oldness criteria here?  Or basing them off exported symbols from
> TelemetryFile.jsm, perhaps?

Yeah, good idea. Done.

> 
> @@ +125,5 @@
> > + */
> > +function assertReceivedPings(aExpectedNum) {
> > +  let deferred = Promise.defer();
> > +
> > +  do_timeout(PING_TIMEOUT_LENGTH, function() {
> 
> As previously, this still makes me nervous about intermittents, but I guess
> failing hard in a short timeout is better than the whole test timing out.

Yeah - I bumped the timeout by a second. Hopefully our test machines don't choke on it, but if they do, we can bump it further.

(In reply to Nathan Froyd (:froydnj) from comment #29)

> Ah, Irving's right: we only want to add these measurements when we're
> submitting the first ping from a live session.  Please move
> pings{Overdue,Discarded} inside that if.
> 

Fixed.

> For other places in toolkit/, we've used prefs for this sort of thing.  I
> don't have strong preferences either way.  I think it'd be slightly better
> to not be passing data through the observer service here if you really want
> an opinion.
> 
> Adding a method to nsITelemetryPing and making TelemetryPing implement that
> is usually how we do testing-only sorts of things nowadays, now that I think
> about it.

Irving suggested a different method which I used here that does not require altering the interface - I load the TelemetryPing script directly instead of going through the XPCOM plumbing, and use it directly. It has the side-effect of loading in a bunch of consts, as well as Cc/Ci/Cu/Cr, but it saves us having to hack about with the profile-after-change notification.


(In reply to Vladan Djeric (:vladan) from comment #31)
> 
> So, since we said we didn't want to add a histogram entry for every saved
> ping, I don't think we need onLoad anymore.. You can put the histogram add()
> call in onDone
> 

As discussed - I think we'll save that for a follow-up.

> @@ +301,5 @@
> >    }
> > +
> > +  let now = new Date();
> > +  if (file.exists() && (now - file.lastModifiedTime > OVERDUE_PING_FILE_AGE)) {
> > +    pingsOverdue++;
> 
> Why not put this check next to the pingsDiscarded check?
> 

Good idea - done.
Comment on attachment 8343952 [details] [diff] [review]
Send old pings out after Telemetry Startup - Patch v1.2

I'm feeling lucky!
Attachment #8343952 - Flags: review?(nfroyd)
Attachment #8343952 - Flags: feedback?(irving)
Comment on attachment 8343952 [details] [diff] [review]
Send old pings out after Telemetry Startup - Patch v1.2

Review of attachment 8343952 [details] [diff] [review]:
-----------------------------------------------------------------

Ship it!

...after you change the commit message. ;)
Attachment #8343952 - Flags: review?(nfroyd) → review+
Comment on attachment 8343952 [details] [diff] [review]
Send old pings out after Telemetry Startup - Patch v1.2

Review of attachment 8343952 [details] [diff] [review]:
-----------------------------------------------------------------

Excellent.
Attachment #8343952 - Flags: feedback?(irving) → feedback+
(In reply to Nathan Froyd (:froydnj) from comment #37)
> Comment on attachment 8343952 [details] [diff] [review]
> Send old pings out after Telemetry Startup - Patch v1.2
> 
> Review of attachment 8343952 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Ship it!
> 
> ...after you change the commit message. ;)

Yes, good call.

Thank you both!

Landed on inbound as: https://hg.mozilla.org/integration/mozilla-inbound/rev/16628db76c3c
(In reply to Nathan Froyd (:froydnj) from comment #18)
> > +        {
> > +          // If we have any TelemetryPings lying around, we'll be aggressive
> > +          // and try to send them all off.
> > +          if (TelemetryFile.loadedOverduePings) {
> > +            this.send("overdue-flush", this._server);
> 
> You should check with mreid that it's OK to add another reason like this. 
> (I think having another reason is fine; lets us see how many people have
> overdue pings lying around.)

Yes, it's fine to add another reason like this.  Right now they would be bucketed into "OTHER" for reason in the storage directory structure (the specific reason will still be available in the payload itself).  If you don't want these bucketed into "OTHER" let me know.  Currently they would be the only reason going into "OTHER" anyways.
(In reply to Wes Kocher (:KWierso) from comment #40)
> Backed out in
> http://hg.mozilla.org/integration/mozilla-inbound/rev/769e5d448b1e for
> failing tests like this:
> https://tbpl.mozilla.org/php/getParsedLog.php?id=31587410&tree=Mozilla-
> Inbound

No luck reproducing this one locally. :/ I have a weaker 10.6 Mac Mini that I'm going to try next.
Doing two things:

1) Increasing the amount by which we artificially age our TelemetryPings from 1ms to 1m, to account for possible OS-rounding/fudging.
2) Registering a cleanup function that clears any leftover saved TelemetryPings.

Irving and I have noticed that the test times out when running in parallel, and complains about 5 remaining ping files when running sequentially. This makes sense - if the first test failed and didn't clean up after itself, it will leave 5 "expired" files behind, which might be confusing things here.

Try push baking here: https://tbpl.mozilla.org/?tree=Try&rev=38e37cd279d3
Unintentional clownshoes in that last patch.

Re-pushed to try as: https://tbpl.mozilla.org/?tree=Try&rev=4926c38b3f33
Ugh, today just isn't my day. I forgot that this bug was seemingly restricted to debug builds, and my try push was for an opt build only. *sigh*

https://tbpl.mozilla.org/?tree=Try&rev=3f3a60cb8b7c
This is also a thing that will need to be on Aurora.
OK, I think I've solved this mystery.

Basically, for debug builds, MOZ_TELEMETRY_REPORTING is left undefined, meaning that Telemetry.canSend returns false. That means that TelemetryPing.setup returns early *before* it can load saved pings and attempt to send them.

Which means that our debug builds should probably not be running this test. I'm attaching a new patch that disables the test for debug builds.
Attachment #8345497 - Flags: review+
Attachment #8345497 - Flags: feedback+
Attachment #8343952 - Attachment is obsolete: true
Ok, so there's a little bit more work to do here before we can land this. Specifically:

1) We need to modify TelemetryPing to not create / send a ping for the current session if we're doing the overdue flush.
2) We should set the pingsOverdue and pingsExpired regardless of whether or not we've sent a ping already.
Attached patch Follow-up patch (obsolete) — Splinter Review
I think this encompasses the requirements that irving, vladan and I laid out today.

irving - I know checking for the "overdue-flush" reason is probably not the graceful solution you were thinking of, but I need to get out of this code and back into my Australis UITelemetry probe work tout suite - so could a better solution be carried out in a follow-up?
Attachment #8345540 - Flags: feedback?(irving)
Comment on attachment 8345540 [details] [diff] [review]
Follow-up patch

Might as well get nfroyd's eyes on this too.
Attachment #8345540 - Flags: review?(nfroyd)
Mike, while this might be something you want to land on Aurora it seems more like an enhancement and not a release blocker.  So I'm not going to track this but by all means please nominate patches (low risk ones) for uplift while FF28 is on Aurora.
(In reply to Mike Conley (:mconley) from comment #49)
> Ok, so there's a little bit more work to do here before we can land this.
> Specifically:
> 
> 1) We need to modify TelemetryPing to not create / send a ping for the
> current session if we're doing the overdue flush.

Sounds good.

> 2) We should set the pingsOverdue and pingsExpired regardless of whether or
> not we've sent a ping already.

Could you explain the rationale for this?  If there was a discussion of this on #perf, I missed it.
Flags: needinfo?(mconley)
Comment on attachment 8345540 [details] [diff] [review]
Follow-up patch

Review of attachment 8345540 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Nathan Froyd (:froydnj) from comment #53)
> > 2) We should set the pingsOverdue and pingsExpired regardless of whether or
> > not we've sent a ping already.
> 
> Could you explain the rationale for this?  If there was a discussion of this
> on #perf, I missed it.

Vladan, mconley and I discussed this face to face. Because we discard (or ignore) idle-daily pings once we get the saved-session ping for a specific session, we want to make sure this data is always in the saved-session we send.

::: toolkit/components/telemetry/TelemetryPing.js
@@ +221,1 @@
>      }

I think we want to remove the if() statement and always send savedPings as well. That way, what we see in any received telemetry ping is "At the beginning of this session I loaded N pings, X were discarded and Y were overdue." We don't load pings at any other time, do we?

@@ +605,5 @@
>    popPayloads: function popPayloads(reason) {
>      function payloadIter() {
> +      if (reason != "overdue-flush") {
> +        yield this.getSessionPayloadAndSlug(reason);
> +      }

This approach still ends up doing some extra work, because send() calls gatherMemory() - http://telemetry.mozilla.org/#path=nightly/28/TELEMETRY_MEMORY_REPORTER_MS/saved_session shows that it's less than 3ms, so not a big deal.

That said, the way gatherMemory() calls are scattered around TelemetryPing.js, I really can't say I understand how to interpret what it collects.
Attachment #8345540 - Flags: feedback?(irving) → feedback+
Whiteboard: [Telemetry:P1] → [Telemetry:P1][Australis:P-]
Comment on attachment 8345540 [details] [diff] [review]
Follow-up patch

Review of attachment 8345540 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for slacking and not r+'ing this until now.
Attachment #8345540 - Flags: review?(nfroyd) → review+
Thanks! Folded both patches as attached, and landed on mozilla-inbound as:

https://hg.mozilla.org/integration/mozilla-inbound/rev/fa2130fc4f08
Attachment #8345497 - Attachment is obsolete: true
Attachment #8345540 - Attachment is obsolete: true
Attachment #8347395 - Flags: review+
Attachment #8347395 - Flags: feedback+
Flags: needinfo?(mconley)
Ok, I've increased the timeout time for receiving pings, and also bumped the old-file lastModifiedTime in TelemetryPing by 1 minute, and that seems to have sealed the deal on Try:

https://tbpl.mozilla.org/?tree=Try&rev=d4c846f393ba

Re-pushing. Third time's the charm. Here goes:

https://hg.mozilla.org/integration/mozilla-inbound/rev/2412b1a7b97e
Backed out in http://hg.mozilla.org/integration/mozilla-inbound/rev/47595f181b62 because Android just can't be pleased (a nice parallel with the way it can't please), https://tbpl.mozilla.org/php/getParsedLog.php?id=31989738&tree=Mozilla-Inbound
Any objections to me just disabling this test on Android?
Flags: needinfo?(nfroyd)
(In reply to Phil Ringnalda (:philor) from comment #59)
> Backed out in
> http://hg.mozilla.org/integration/mozilla-inbound/rev/47595f181b62 because
> Android just can't be pleased (a nice parallel with the way it can't
> please),
> https://tbpl.mozilla.org/php/getParsedLog.php?id=31989738&tree=Mozilla-
> Inbound

I didn't realize we've started backing things out after one orange run.  (AFAICS, things are green on that platform for several pushes before...)

(In reply to Mike Conley (:mconley) from comment #60)
> Any objections to me just disabling this test on Android?

Sure, we can do that.
Flags: needinfo?(nfroyd)
https://hg.mozilla.org/mozilla-central/rev/af4351a02070
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
I'm curious to know if we see any uptick in submissions over the next few days...

I also hope to request beta / aurora uplift approval, once this has had a few days to bake.
Depends on: 951318
Comment on attachment 8347395 [details] [diff] [review]
Patch v1.4 (r=nfroyd, feedback=irving)

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 

None. This should, in theory, increase the amount of telemetry pings we have coming in from folks with stored session pings.


User impact if declined: 

None, really. Overdue pings will just have a higher likelihood of being destroyed instead of sent.


Testing completed (on m-c, etc.): 

There's a test included, and it's been baking on mozilla-central for a week. All quiet on the western front.


Risk to taking this patch (and alternatives if risky): 

Pretty low, IMO. If drivers are feeling squeamish, we can probably not have this land on mozilla-beta - but it could mean less data in for the UITelemetry stuff (as well as telemetry data in general) on that channel until Aurora uplifts.


String or IDL/UUID changes made by this patch:

None.
Attachment #8347395 - Flags: approval-mozilla-beta?
Attachment #8347395 - Flags: approval-mozilla-aurora?
Comment on attachment 8347395 [details] [diff] [review]
Patch v1.4 (r=nfroyd, feedback=irving)

Given the low risk and since we have couple of beta's in hand, lets take this to help with metrics data.

Please make sure to verify that the telemetry data is as expected once this lands in beta. Our next beta is going to goto build on Monday and released the following day, so next week beta metrics would be the right time to check if this is working as expected.
Attachment #8347395 - Flags: approval-mozilla-beta?
Attachment #8347395 - Flags: approval-mozilla-beta+
Attachment #8347395 - Flags: approval-mozilla-aurora?
Attachment #8347395 - Flags: approval-mozilla-aurora+
Flags: in-testsuite?
Blocks: 1041616
You need to log in before you can comment on or make changes to this bug.