748520 - resend failed telemetry pings

Reporter

Description

•

13 years ago

So any metrics server downtime doesn't cause us to lose useful field data. We also want to resend pings across sessions if need be. This feature will require privacy review, as similar bugs (bug 715299) have similar issues.

Lawrence Mandel [:lmandel] (use needinfo)

Updated

•

13 years ago

Keywords: privacy-review-needed

Sid Stamm [:geekboy or :sstamm]

Comment 1

•

13 years ago

Will there be a maximum lifetime for pings stored on disk due to failure? For example, if my computer is on a network that always fails to ping, will I just continue to accumulate ping data or will it expire at some point?

Nathan Froyd [:froydnj]

Reporter

Comment 2

•

13 years ago

(In reply to Sid Stamm [:geekboy] from comment #1) > Will there be a maximum lifetime for pings stored on disk due to failure? > For example, if my computer is on a network that always fails to ping, will > I just continue to accumulate ping data or will it expire at some point? I guess some scope clarification is needed here. We have bug 715299 for persisting telemetry data over multiple sessions, which sounds like what you are trying to address in the above comment. My intention for filing this bug was handling failed pings in a single session (e.g. browser session up for multiple days, but cannot ping the telemetry server for part of that). In that case, everything lives in memory, so there's no persistent data (which I assume is OK, but am happy to be corrected). I don't have strong opinions as to what we decide this bug is for, but we should do that so that the above questions have a reasonable answer.

Sid Stamm [:geekboy or :sstamm]

Comment 3

•

13 years ago

Ok, great, that makes sense. The only privacy implications seem to be in bug 715299 since this is more about re-trying and that bug is more about saving until we can re-try. Clearing the p-r-n flag since this in itself probably isn't risky.

Keywords: privacy-review-needed

Lawrence Mandel [:lmandel] (use needinfo)

Comment 4

•

13 years ago

(In reply to Nathan Froyd (:froydnj) from comment #2) > > My intention for filing this bug was handling failed pings in a single > session (e.g. browser session up for multiple days, but cannot ping the > telemetry server for part of that). In that case, everything lives in > memory, so there's no persistent data (which I assume is OK, but am happy to > be corrected). What happens if the browser crashes before the ping is sent? Is the data lost?

Nathan Froyd [:froydnj]

Reporter

Comment 5

•

13 years ago

(In reply to Lawrence Mandel [:lmandel] from comment #4) > (In reply to Nathan Froyd (:froydnj) from comment #2) > > My intention for filing this bug was handling failed pings in a single > > session > > What happens if the browser crashes before the ping is sent? Is the data > lost? Yes. Persisted pings should be handled in bug 715299.

Lawrence Mandel [:lmandel] (use needinfo)

Comment 6

•

13 years ago

I understood bug 715299 to address a normal scenario in which data is persisted when the browser is shut down. Is data persisted in the case where the browser doesn't properly shut down but instead crashes?

Nathan Froyd [:froydnj]

Reporter

Comment 7

•

13 years ago

(In reply to Lawrence Mandel [:lmandel] from comment #6) > I understood bug 715299 to address a normal scenario in which data is > persisted when the browser is shut down. Is data persisted in the case where > the browser doesn't properly shut down but instead crashes? No. We have bug 719167 for that scenario. I am a fan of fine-grained bugs; are these bugs too fine-grained?

Lawrence Mandel [:lmandel] (use needinfo)

Comment 8

•

13 years ago

(In reply to Nathan Froyd (:froydnj) from comment #7) > (In reply to Lawrence Mandel [:lmandel] from comment #6) > > I understood bug 715299 to address a normal scenario in which data is > > persisted when the browser is shut down. Is data persisted in the case where > > the browser doesn't properly shut down but instead crashes? > > No. We have bug 719167 for that scenario. > > I am a fan of fine-grained bugs; are these bugs too fine-grained? I'm also a fan of fine-grained bugs. I think these bugs all make sense as all of these items are distinct pieces of work. Thanks for the clarification. I think all the bases are covered.

Nathan Froyd [:froydnj]

Reporter

Comment 9

•

13 years ago

Attached patch patch (obsolete) — Details — Splinter Review

It is awfully convenient that we already have a failed ping to test with...saved quite a bit of test-writing.

Attachment #633206 - Flags: review?(taras.mozilla)

(dormant account)

Comment 10

•

13 years ago

Comment on attachment 633206 [details] [diff] [review] patch I think the issue is more complex than is implemented in the patch. As far as I can tell the implementation in the patch will just causes us to retry infinitely. What needs to happen if 1) if we are sending a persisted ping, then do not remove the persisted data in case of failure. I believe this is what currently happens since file deletion is tied to success. 1.a) If sending from persisted ping and we encounter an error should rename the file it does not get overwritten when the current session is serialized. This will come for free once we use session-id as part of the filename. 2) When sending a current session and the send fails, instead of trying again, we should serialize the data to disk. This gets us retry behavior on next-idle daily and hopefully keeps us from wasting bandwidth while having server problems.

Attachment #633206 - Flags: review?(taras.mozilla) → review-

Nathan Froyd [:froydnj]

Reporter

Comment 11

•

13 years ago

(In reply to Taras Glek (:taras) from comment #10) > 1) if we are sending a persisted ping, then do not remove the persisted data > in case of failure. I believe this is what currently happens since file > deletion is tied to success. This doesn't make any sense. We don't remove the file in case of failure already. But the other points are reasonable points. I'll just make this depend on the multi-ping save bug; it will be easier to write the desired logic once that's done.

Depends on: 763524

Nathan Froyd [:froydnj]

Reporter

Comment 12

•

13 years ago

(In reply to Taras Glek (:taras) from comment #10) > 2) When sending a current session and the send fails, instead of trying > again, we should serialize the data to disk. This gets us retry behavior on > next-idle daily and hopefully keeps us from wasting bandwidth while having > server problems. This suggestions is reasonable (though I don't understand the "gives us retry behavior"...what are we retrying?), but this comes for free once we implement persisting multiple pings to disk: just persist everything in TelemetryPing._pendingPings. Admittedly, it would be more interesting to save a ping at first failure to avoid doing a bunch of I/O at shutdown...

(dormant account)

Comment 13

•

13 years ago

(In reply to Nathan Froyd (:froydnj) from comment #12) > (In reply to Taras Glek (:taras) from comment #10) > > 2) When sending a current session and the send fails, instead of trying > > again, we should serialize the data to disk. This gets us retry behavior on > > next-idle daily and hopefully keeps us from wasting bandwidth while having > > server problems. > > This suggestions is reasonable (though I don't understand the "gives us > retry behavior"...what are we retrying?), but this comes for free once we > implement persisting multiple pings to disk: just persist everything in > TelemetryPing._pendingPings. Admittedly, it would be more interesting to > save a ping at first failure to avoid doing a bunch of I/O at shutdown... by retry behavior, I mean we will try to send the data on the next idle-daily. This also wont save you io on shutdown, because data we write on shutdown is a superset of what happened at idle-daily. It reduces chances of not writing any telemetry data to disk due to an unclean shutdown.

Nathan Froyd [:froydnj]

Reporter

Comment 14

•

13 years ago

(In reply to Taras Glek (:taras) from comment #13) > (In reply to Nathan Froyd (:froydnj) from comment #12) > > (In reply to Taras Glek (:taras) from comment #10) > > > 2) When sending a current session and the send fails, instead of trying > > > again, we should serialize the data to disk. This gets us retry behavior on > > > next-idle daily and hopefully keeps us from wasting bandwidth while having > > > server problems. > > > > This suggestions is reasonable (though I don't understand the "gives us > > retry behavior"...what are we retrying?), but this comes for free once we > > implement persisting multiple pings to disk: just persist everything in > > TelemetryPing._pendingPings. Admittedly, it would be more interesting to > > save a ping at first failure to avoid doing a bunch of I/O at shutdown... > > by retry behavior, I mean we will try to send the data on the next > idle-daily. It is not clear to me what this has to do with persisting data to disk. And we now save bandwidth by stopping at the first failed ping, so I'm also not seeing what saves us bandwidth here... > This also wont save you io on shutdown, because data we write on shutdown is > a superset of what happened at idle-daily. It reduces chances of not writing > any telemetry data to disk due to an unclean shutdown. It sounds like the algorithm you are proposing is something like: for each ping that we need to send: if the ping does not succeed: write data to disk shove ping back into the pendingPings queue break; at shutdown: for each ping in pendingPings: write ping to disk I am saying that the first loop could be: for each ping that we need to send: if the ping does not succeed: write data to disk break; which would then minimize the amount of data that you'd write to disk at shutdown. It's possible you could do shutdown in the middle of writing that data if you wrote it async (?), so maybe that's not desirable.

(dormant account)

Comment 15

•

13 years ago

> It is not clear to me what this has to do with persisting data to disk. And > we now save bandwidth by stopping at the first failed ping, so I'm also not > seeing what saves us bandwidth here... Good point. Forgot the little detail of us stopping the sending on failure. > > > This also wont save you io on shutdown, because data we write on shutdown is > > a superset of what happened at idle-daily. It reduces chances of not writing > > any telemetry data to disk due to an unclean shutdown. > > It sounds like the algorithm you are proposing is something like: > > for each ping that we need to send: > if the ping does not succeed: > write data to disk > shove ping back into the pendingPings queue > break; > > at shutdown: > for each ping in pendingPings: > write ping to disk not what I meant. > > I am saying that the first loop could be: This is what I was trying to propose. > > for each ping that we need to send: > if the ping does not succeed: > write data to disk > break; > > which would then minimize the amount of data that you'd write to disk at > shutdown. It's possible you could do shutdown in the middle of writing that > data if you wrote it async (?), so maybe that's not desirable. We'll see how it goes.

Nathan Froyd [:froydnj]

Reporter

Comment 16

•

12 years ago

Comment on attachment 633206 [details] [diff] [review] patch After looking at this again today, I remember what was confusing me so much about this bug. There's two different schools of thought here: 1) If we fail to send a ping, then just save the data in memory and try to send it later. It's OK for said data to not actually get persisted; that's a different bug (bug 719167, essentially). This is what I tried to implemented in the patch that was attached and got r-'d because the second approach is... 2) If we fail to send a ping, then we need to serialize that data to disk and try again later. This is more robust, but there are privacy issues and it's quite a bit more complicated because our current saved ping implementation assumes there's only one ping for any given session stored on disk at any given time. I'd like to avoid the privacy issues in (2); Sid already OK'd the solution in (1), so I'd like to ask for re-review on this.

Attachment #633206 - Flags: review- → review?(vdjeric)

Nathan Froyd [:froydnj]

Reporter

Comment 17

•

12 years ago

Vladan and I talked about this on IRC. Just resending pings without saving them to disk means that we miss data from no-network sessions that end in crashes. I had assumed, based on discussion in bug 719167 that we didn't want to serialize anything to disk. But re-reading through that bug, it sounds like serializing to disk is OK so long as the data is deleted once it gets sent (which we already do). If we can persist failed pings (and then, by extension, intermittently save pings in bug 719167), that would make fixing this bug much easier and make Telemetry more robust as well. Tom, do you think that's reasonable vis-a-vis privacy concerns?

Flags: needinfo?(tom)

Tom Lowenthal

Comment 18

•

12 years ago

I understand that all of the saved info is sent at the next available ping, in sequence. If there's a problem, the pending info is queued up for the next 24h slot, and sent then. This works for me.

Flags: needinfo?(tom)

Vladan Djeric (:vladan)

Comment 19

•

12 years ago

Comment on attachment 633206 [details] [diff] [review] patch Clearing review flag because it sounds like you're going to have a patch with a different approach

Attachment #633206 - Flags: review?(vdjeric)

Nathan Froyd [:froydnj]

Reporter

Comment 20

•

12 years ago

Attached patch persist failed telemetry pings for possible re-sending later — Details — Splinter Review

This was easy enough. This saves the ping for later sending: * If we send another ping during the same session, then: - The second ping will succeed. We'll automatically remove the saved ping file through the normal mechanism of removing saved pings. - The second ping will fail. In which case we'll overwrite the saved information with the second ping. * If we don't send another ping during the same session, then: - The session will shutdown normally, and we'll overwrite the saved ping file by the normal, save-a-ping-at-shutdown means. - The session will crash, and we'll load the ping at the next session.

Attachment #633206 - Attachment is obsolete: true

Attachment #690561 - Flags: review?(vdjeric)

Annie Elliott

Comment 21

•

12 years ago

I folks, I've cc:ed Daniel and

Annie Elliott

Comment 22

•

12 years ago

Harsha to flesh out whether this will have an impact on validations.

Harsha [:harsha]

Comment 23

•

12 years ago

I am seeing quite a few telemetry submission missing info field. Is that caused by this bug?

Nathan Froyd [:froydnj]

Reporter

Comment 24

•

12 years ago

(In reply to schintalapani from comment #23) > I am seeing quite a few telemetry submission missing info field. Is that > caused by this bug? No, nothing's been committed from this bug yet. Can you be more specific about what fields you see missing?

Harsha [:harsha]

Comment 25

•

12 years ago

some documents have no info field at all and some don't have appVersion in them. On other note I am seeing inconsistent no.of telemetry submissions per day. Here are counts for few days yyyymmdd total docs 20121118 5579384 20121119 6226229 20121120 6865824 20121121 7806403 20121126 9975334 20121127 9728292 20121128 9943121 20121129 9575144 20121130 7586829 20121201 5155001 20121202 4446206 20121203 3236707 and for the last two days 20121210 15155607 20121211 17242419

Nathan Froyd [:froydnj]

Reporter

Comment 26

•

12 years ago

Please file a new bug on that; make it depend on bug 816166, which looks like the most probable candidate for causing problems.

Vladan Djeric (:vladan)

Comment 27

•

12 years ago

Comment on attachment 690561 [details] [diff] [review] persist failed telemetry pings for possible re-sending later >diff --git a/toolkit/components/telemetry/TelemetryPing.js b/toolkit/components/telemetry/TelemetryPing.js >index 8741a89..be1bea6 100644 >--- a/toolkit/components/telemetry/TelemetryPing.js >+++ b/toolkit/components/telemetry/TelemetryPing.js >@@ -629,16 +629,17 @@ TelemetryPing.prototype = { > function onSuccess() { > if (data.slug == this._uuid) { > this._lastPingDate = new Date(); > this._lastAppBuildID = Services.appinfo.appBuildID; > } > this.sendPingsFromIterator(server, reason, i); > } > function onError() { >+ this.savePing(data, true); > // Notify that testing is complete, even if we didn't send everything. > finishPings(reason); > } > this.doPing(server, data, > onSuccess.bind(this), onError.bind(this)); > }, So let's say the following happens: - sending a ping fails because we are offline - Firefox then saves the "failed ping" to disk - Firefox successfully sends a ping 24 hours later from the same session - Firefox crashes - Firefox is restarted and sends out the saved "failed ping" Won't that cause us to replace the more recent Telemetry ping on the Metrics' servers with the older "failed ping" from disk? Shouldn't we delete the "failed ping" from disk after a successful send during the same session?

Attachment #690561 - Flags: review?(vdjeric) → review-

Nathan Froyd [:froydnj]

Reporter

Comment 28

•

12 years ago

(In reply to Vladan Djeric (:vladan) from comment #27) > So let's say the following happens: > - sending a ping fails because we are offline > - Firefox then saves the "failed ping" to disk > - Firefox successfully sends a ping 24 hours later from the same session > - Firefox crashes > - Firefox is restarted and sends out the saved "failed ping" > > Won't that cause us to replace the more recent Telemetry ping on the > Metrics' servers with the older "failed ping" from disk? Shouldn't we delete > the "failed ping" from disk after a successful send during the same session? That won't happen because of this code which kicks in at step 3: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryPing.js#631

Vladan Djeric (:vladan)

Comment 29

•

12 years ago

Comment on attachment 690561 [details] [diff] [review] persist failed telemetry pings for possible re-sending later (In reply to Nathan Froyd (:froydnj) from comment #28) > That won't happen because of this code which kicks in at step 3: You're right, I should have traced this code deeper

Attachment #690561 - Flags: review- → review+

Nathan Froyd [:froydnj]

Reporter

Comment 30

•

12 years ago

remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/cc544979aeaa remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/6c224e817390 Two commits because mercurial is odd.

Ed Morley [:emorley]

Comment 31

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/cc544979aeaa https://hg.mozilla.org/mozilla-central/rev/6c224e817390

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla20

patch 13 years ago Nathan Froyd [:froydnj] 4.81 KB, patch		Details \| Diff \| Splinter Review
persist failed telemetry pings for possible re-sending later 12 years ago Nathan Froyd [:froydnj] 1003 bytes, patch	vladan : review+	Details \| Diff \| Splinter Review