Closed Bug 748520 Opened 12 years ago Closed 12 years ago

resend failed telemetry pings

Categories

(Toolkit :: Telemetry, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla20

People

(Reporter: froydnj, Unassigned)

References

Details

Attachments

(1 file, 1 obsolete file)

So any metrics server downtime doesn't cause us to lose useful field data.

We also want to resend pings across sessions if need be.  This feature will require privacy review, as similar bugs (bug 715299) have similar issues.
Will there be a maximum lifetime for pings stored on disk due to failure?  For example, if my computer is on a network that always fails to ping, will I just continue to accumulate ping data or will it expire at some point?
(In reply to Sid Stamm [:geekboy] from comment #1)
> Will there be a maximum lifetime for pings stored on disk due to failure? 
> For example, if my computer is on a network that always fails to ping, will
> I just continue to accumulate ping data or will it expire at some point?

I guess some scope clarification is needed here.  We have bug 715299 for persisting telemetry data over multiple sessions, which sounds like what you are trying to address in the above comment.

My intention for filing this bug was handling failed pings in a single session (e.g. browser session up for multiple days, but cannot ping the telemetry server for part of that).  In that case, everything lives in memory, so there's no persistent data (which I assume is OK, but am happy to be corrected).

I don't have strong opinions as to what we decide this bug is for, but we should do that so that the above questions have a reasonable answer.
Ok, great, that makes sense.  The only privacy implications seem to be in bug 715299 since this is more about re-trying and that bug is more about saving until we can re-try.  Clearing the p-r-n flag since this in itself probably isn't risky.
(In reply to Nathan Froyd (:froydnj) from comment #2)
> 
> My intention for filing this bug was handling failed pings in a single
> session (e.g. browser session up for multiple days, but cannot ping the
> telemetry server for part of that).  In that case, everything lives in
> memory, so there's no persistent data (which I assume is OK, but am happy to
> be corrected).

What happens if the browser crashes before the ping is sent? Is the data lost?
(In reply to Lawrence Mandel [:lmandel] from comment #4)
> (In reply to Nathan Froyd (:froydnj) from comment #2)
> > My intention for filing this bug was handling failed pings in a single
> > session
>
> What happens if the browser crashes before the ping is sent? Is the data
> lost?

Yes.  Persisted pings should be handled in bug 715299.
I understood bug 715299 to address a normal scenario in which data is persisted when the browser is shut down. Is data persisted in the case where the browser doesn't properly shut down but instead crashes?
(In reply to Lawrence Mandel [:lmandel] from comment #6)
> I understood bug 715299 to address a normal scenario in which data is
> persisted when the browser is shut down. Is data persisted in the case where
> the browser doesn't properly shut down but instead crashes?

No.  We have bug 719167 for that scenario.

I am a fan of fine-grained bugs; are these bugs too fine-grained?
(In reply to Nathan Froyd (:froydnj) from comment #7)
> (In reply to Lawrence Mandel [:lmandel] from comment #6)
> > I understood bug 715299 to address a normal scenario in which data is
> > persisted when the browser is shut down. Is data persisted in the case where
> > the browser doesn't properly shut down but instead crashes?
> 
> No.  We have bug 719167 for that scenario.
> 
> I am a fan of fine-grained bugs; are these bugs too fine-grained?

I'm also a fan of fine-grained bugs. I think these bugs all make sense as all of these items are distinct pieces of work. Thanks for the clarification. I think all the bases are covered.
Attached patch patch (obsolete) — Splinter Review
It is awfully convenient that we already have a failed ping to test with...saved quite a bit of test-writing.
Attachment #633206 - Flags: review?(taras.mozilla)
Comment on attachment 633206 [details] [diff] [review]
patch

I think the issue is more complex than is implemented in the patch.
As far as I can tell the implementation in the patch will just causes us to retry infinitely.
What needs to happen if 
1) if we are sending a persisted ping, then do not remove the persisted data in case of failure. I believe this is what currently happens since file deletion is tied to success.
1.a) If sending from persisted ping and we encounter an error should rename the file it does not get overwritten when the current session is serialized. This will come for free once we use session-id as part of the filename.

2) When sending a current session and the send fails, instead of trying again, we should serialize the data to disk. This gets us retry behavior on next-idle daily and hopefully keeps us from wasting bandwidth while having server problems.
Attachment #633206 - Flags: review?(taras.mozilla) → review-
(In reply to Taras Glek (:taras) from comment #10)
> 1) if we are sending a persisted ping, then do not remove the persisted data
> in case of failure. I believe this is what currently happens since file
> deletion is tied to success.

This doesn't make any sense.  We don't remove the file in case of failure already.

But the other points are reasonable points.  I'll just make this depend on the multi-ping save bug; it will be easier to write the desired logic once that's done.
Depends on: 763524
(In reply to Taras Glek (:taras) from comment #10)
> 2) When sending a current session and the send fails, instead of trying
> again, we should serialize the data to disk. This gets us retry behavior on
> next-idle daily and hopefully keeps us from wasting bandwidth while having
> server problems.

This suggestions is reasonable (though I don't understand the "gives us retry behavior"...what are we retrying?), but this comes for free once we implement persisting multiple pings to disk: just persist everything in TelemetryPing._pendingPings.  Admittedly, it would be more interesting to save a ping at first failure to avoid doing a bunch of I/O at shutdown...
(In reply to Nathan Froyd (:froydnj) from comment #12)
> (In reply to Taras Glek (:taras) from comment #10)
> > 2) When sending a current session and the send fails, instead of trying
> > again, we should serialize the data to disk. This gets us retry behavior on
> > next-idle daily and hopefully keeps us from wasting bandwidth while having
> > server problems.
> 
> This suggestions is reasonable (though I don't understand the "gives us
> retry behavior"...what are we retrying?), but this comes for free once we
> implement persisting multiple pings to disk: just persist everything in
> TelemetryPing._pendingPings.  Admittedly, it would be more interesting to
> save a ping at first failure to avoid doing a bunch of I/O at shutdown...

by retry behavior, I mean we will try to send the data on the next idle-daily.

This also wont save you io on shutdown, because data we write on shutdown is a superset of what happened at idle-daily. It reduces chances of not writing any telemetry data to disk due to an unclean shutdown.
(In reply to Taras Glek (:taras) from comment #13)
> (In reply to Nathan Froyd (:froydnj) from comment #12)
> > (In reply to Taras Glek (:taras) from comment #10)
> > > 2) When sending a current session and the send fails, instead of trying
> > > again, we should serialize the data to disk. This gets us retry behavior on
> > > next-idle daily and hopefully keeps us from wasting bandwidth while having
> > > server problems.
> > 
> > This suggestions is reasonable (though I don't understand the "gives us
> > retry behavior"...what are we retrying?), but this comes for free once we
> > implement persisting multiple pings to disk: just persist everything in
> > TelemetryPing._pendingPings.  Admittedly, it would be more interesting to
> > save a ping at first failure to avoid doing a bunch of I/O at shutdown...
> 
> by retry behavior, I mean we will try to send the data on the next
> idle-daily.

It is not clear to me what this has to do with persisting data to disk.  And we now save bandwidth by stopping at the first failed ping, so I'm also not seeing what saves us bandwidth here...

> This also wont save you io on shutdown, because data we write on shutdown is
> a superset of what happened at idle-daily. It reduces chances of not writing
> any telemetry data to disk due to an unclean shutdown.

It sounds like the algorithm you are proposing is something like:

  for each ping that we need to send:
    if the ping does not succeed:
      write data to disk
      shove ping back into the pendingPings queue
      break;

  at shutdown:
    for each ping in pendingPings:
      write ping to disk

I am saying that the first loop could be:

  for each ping that we need to send:
    if the ping does not succeed:
      write data to disk
      break;

which would then minimize the amount of data that you'd write to disk at shutdown.  It's possible you could do shutdown in the middle of writing that data if you wrote it async (?), so maybe that's not desirable.
> It is not clear to me what this has to do with persisting data to disk.  And
> we now save bandwidth by stopping at the first failed ping, so I'm also not
> seeing what saves us bandwidth here...

Good point. Forgot the little detail of us stopping the sending on failure.

> 
> > This also wont save you io on shutdown, because data we write on shutdown is
> > a superset of what happened at idle-daily. It reduces chances of not writing
> > any telemetry data to disk due to an unclean shutdown.
> 
> It sounds like the algorithm you are proposing is something like:
> 
>   for each ping that we need to send:
>     if the ping does not succeed:
>       write data to disk
>       shove ping back into the pendingPings queue
>       break;
> 
>   at shutdown:
>     for each ping in pendingPings:
>       write ping to disk

not what I meant.
> 
> I am saying that the first loop could be:

This is what I was trying to propose.

> 
>   for each ping that we need to send:
>     if the ping does not succeed:
>       write data to disk
>       break;


> 
> which would then minimize the amount of data that you'd write to disk at
> shutdown.  It's possible you could do shutdown in the middle of writing that
> data if you wrote it async (?), so maybe that's not desirable.

We'll see how it goes.
Comment on attachment 633206 [details] [diff] [review]
patch

After looking at this again today, I remember what was confusing me so much about this bug.  There's two different schools of thought here:

1) If we fail to send a ping, then just save the data in memory and try to send it later.  It's OK for said data to not actually get persisted; that's a different bug (bug 719167, essentially).  This is what I tried to implemented in the patch that was attached and got r-'d because the second approach is...
2) If we fail to send a ping, then we need to serialize that data to disk and try again later.  This is more robust, but there are privacy issues and it's quite a bit more complicated because our current saved ping implementation assumes there's only one ping for any given session stored on disk at any given time.

I'd like to avoid the privacy issues in (2); Sid already OK'd the solution in (1), so I'd like to ask for re-review on this.
Attachment #633206 - Flags: review- → review?(vdjeric)
Vladan and I talked about this on IRC.  Just resending pings without saving them to disk means that we miss data from no-network sessions that end in crashes.  I had assumed, based on discussion in bug 719167 that we didn't want to serialize anything to disk.  But re-reading through that bug, it sounds like serializing to disk is OK so long as the data is deleted once it gets sent (which we already do).

If we can persist failed pings (and then, by extension, intermittently save pings in bug 719167), that would make fixing this bug much easier and make Telemetry more robust as well.  Tom, do you think that's reasonable vis-a-vis privacy concerns?
Flags: needinfo?(tom)
I understand that all of the saved info is sent at the next available ping, in sequence. If there's a problem, the pending info is queued up for the next 24h slot, and sent then. This works for me.
Flags: needinfo?(tom)
Comment on attachment 633206 [details] [diff] [review]
patch

Clearing review flag because it sounds like you're going to have a patch with a different approach
Attachment #633206 - Flags: review?(vdjeric)
This was easy enough.  This saves the ping for later sending:

* If we send another ping during the same session, then:
  - The second ping will succeed.  We'll automatically remove the saved
    ping file through the normal mechanism of removing saved pings.
  - The second ping will fail.  In which case we'll overwrite the saved
    information with the second ping.
* If we don't send another ping during the same session, then:
  - The session will shutdown normally, and we'll overwrite the saved
    ping file by the normal, save-a-ping-at-shutdown means.
  - The session will crash, and we'll load the ping at the next session.
Attachment #633206 - Attachment is obsolete: true
Attachment #690561 - Flags: review?(vdjeric)
I folks, I've cc:ed Daniel and
Harsha to flesh out whether this will have an impact on validations.
I am seeing quite a few telemetry submission missing info field. Is that caused by this bug?
(In reply to schintalapani from comment #23)
> I am seeing quite a few telemetry submission missing info field. Is that
> caused by this bug?

No, nothing's been committed from this bug yet.  Can you be more specific about what fields you see missing?
some documents have no info field at all and some don't have appVersion in them.
On other note I am seeing inconsistent no.of telemetry submissions per day. 
Here are counts for few days
yyyymmdd total docs
20121118 5579384
20121119 6226229
20121120 6865824
20121121 7806403
20121126 9975334
20121127 9728292
20121128 9943121
20121129 9575144
20121130 7586829
20121201 5155001
20121202 4446206
20121203 3236707

and for the last two days
20121210 15155607
20121211 17242419
Please file a new bug on that; make it depend on bug 816166, which looks like the most probable candidate for causing problems.
Comment on attachment 690561 [details] [diff] [review]
persist failed telemetry pings for possible re-sending later

>diff --git a/toolkit/components/telemetry/TelemetryPing.js b/toolkit/components/telemetry/TelemetryPing.js
>index 8741a89..be1bea6 100644
>--- a/toolkit/components/telemetry/TelemetryPing.js
>+++ b/toolkit/components/telemetry/TelemetryPing.js
>@@ -629,16 +629,17 @@ TelemetryPing.prototype = {
>     function onSuccess() {
>       if (data.slug == this._uuid) {
>         this._lastPingDate = new Date();
>         this._lastAppBuildID = Services.appinfo.appBuildID;
>       }
>       this.sendPingsFromIterator(server, reason, i);
>     }
>     function onError() {
>+      this.savePing(data, true);
>       // Notify that testing is complete, even if we didn't send everything.
>       finishPings(reason);
>     }
>     this.doPing(server, data,
>                 onSuccess.bind(this), onError.bind(this));
>   },

So let's say the following happens:
- sending a ping fails because we are offline 
- Firefox then saves the "failed ping" to disk
- Firefox successfully sends a ping 24 hours later from the same session
- Firefox crashes
- Firefox is restarted and sends out the saved "failed ping"

Won't that cause us to replace the more recent Telemetry ping on the Metrics' servers with the older "failed ping" from disk? Shouldn't we delete the "failed ping" from disk after a successful send during the same session?
Attachment #690561 - Flags: review?(vdjeric) → review-
(In reply to Vladan Djeric (:vladan) from comment #27)
> So let's say the following happens:
> - sending a ping fails because we are offline 
> - Firefox then saves the "failed ping" to disk
> - Firefox successfully sends a ping 24 hours later from the same session
> - Firefox crashes
> - Firefox is restarted and sends out the saved "failed ping"
> 
> Won't that cause us to replace the more recent Telemetry ping on the
> Metrics' servers with the older "failed ping" from disk? Shouldn't we delete
> the "failed ping" from disk after a successful send during the same session?

That won't happen because of this code which kicks in at step 3:

http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryPing.js#631
Comment on attachment 690561 [details] [diff] [review]
persist failed telemetry pings for possible re-sending later

(In reply to Nathan Froyd (:froydnj) from comment #28)
> That won't happen because of this code which kicks in at step 3:

You're right, I should have traced this code deeper
Attachment #690561 - Flags: review- → review+
https://hg.mozilla.org/mozilla-central/rev/cc544979aeaa
https://hg.mozilla.org/mozilla-central/rev/6c224e817390
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: