Closed Bug 1319026 Opened 8 years ago Closed 7 years ago

Investigate downwards trend in Telemetry upload success rates

Categories

(Toolkit :: Telemetry, defect, P1)

Points:
3

Tracking


RESOLVED FIXED
mozilla55
Tracking Status
firefox53 --- wontfix
firefox54 --- wontfix
firefox55 --- fixed

People

(Reporter: gfritzsche, Assigned: chutten)

References

Details

(Whiteboard: [measurement:client])

Attachments

(3 files)

I see a downwards trend in Telemetry upload success rates from beta 50 on: https://mzl.la/2fTBOwz
Beta 50 had overall ~61% success: https://mzl.la/2fTAQjR
vs. Beta 49 with overall ~71% success: https://mzl.la/2fTyAsL
OS breakdown:
        | b50 | b49
windows | 61% | 71%
osx     | 77% | 83%
linux   | 80% | 78%
Mark, were there any potentially related Edge changes around Oct 22 (or in a ~2 week window before)?
Flags: needinfo?(mreid)
There might have been changes relating to the new pipeline infra. :whd, any ideas? If it were an ingestion-related change, I would expect the issue to apply to all versions starting on a given date. Can we compare the rate for all versions before/after that date?
Flags: needinfo?(mreid) → needinfo?(whd)
To me it looks like it could be something client-side that rode the trains with 51 and was then uplifted to 50.
It appears to show up around Sept 3 on Nightly: https://mzl.la/2fUyTng
It gets picked up on Aurora 51 on merge day, though it is later fixed: https://mzl.la/2fUzDc8
Beta got it right away on that same merge day to 50: https://mzl.la/2fUyq4q
This should mean it is caused by something that was backed out in this pushlog: https://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?startdate=oct+1+2016&enddate=oct+16+2016
I don't see anything standing out to me at first glance. Also, I can't view the secure bug 1263665.
(In reply to Mark Reid [:mreid] from comment #3)
> There might have been changes relating to the new pipeline infra. :whd, any
> ideas?
>
> If it were an ingestion-related change, I would expect the issue to apply to
> all versions starting on a given date. Can we compare the rate for all
> versions before/after that date?

The only changes of note in infrastructure during Oct 8-22 were:

current infra:
2016-10-20: Add "tls-13-study" docType.

new infra:
2016-10-07: Switch to uploading to the real prod bucket
2016-10-17: remove loop derived data set
2016-10-20: hindsight-based CEP deployed

The only one of these I can see having any effect on submissions would potentially be "Switch to uploading to the real prod bucket". This effectively doubled the number of objects being uploaded to S3 for the net-mozaws-prod-us-west-2-pipeline-data bucket (the new infra was previously sending data to net-mozaws-prod-us-west-2-pipeline-data-test), which might have increased S3 upload error rates. However, we have automatic recovery for when S3 failures occur, both in the current and new infra. I checked a few DWL boxes and the upload error rate does not appear to have changed.

As Mark mentions, any ingestion issue would probably be affecting all versions.
Flags: needinfo?(whd)
Priority: P2 → P1
See Also: → 1353003
It seems to me what we need is a more thorough understanding of what kinds of failures we're seeing. The boolean "success" we drain everything down to is a little reductive.

The only specific case we seem to call out is TELEMETRY_PING_EVICTED_FOR_SERVER_ERRORS (which is when we have a successful HTTP request/response, but get a 4xx back), and there are vanishingly few of those on Beta over the period in question (and into modernity): https://mzl.la/2nGQSRB

It seems from my own about:telemetry on Windows that I have a fairly frequent (about 1 in 3) problem with TELEMETRY_SUCCESS, so maybe through judicious logging I can find a couple of places for a few new probes that can identify a little more clearly what kinds of problems we're facing.
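For context, a minimal sketch of the kind of logging being proposed here. The key name and logged fields come from the patch diffs quoted in the review comments below; the import path and the attachFailureLogging() helper are illustrative assumptions, not the actual patch:

  // Sketch only - the real change is the attached TelemetrySend.jsm patch.
  const { utils: Cu } = Components;
  const { TelemetryLog } = Cu.import("resource://gre/modules/TelemetryLog.jsm", {});

  const LOG_FAILURE_KEY = "TelemetrySend_FAILURE";

  // Hypothetical helper: attach one shared failure logger to a request.
  function attachFailureLogging(request) {
    let errorhandler = (event) => {
      // For XHR failure events, event.type is "error", "abort" or "timeout".
      TelemetryLog.log(LOG_FAILURE_KEY, ["errorhandler", event.type]);
    };
    request.onerror = errorhandler;
    request.ontimeout = errorhandler;
    request.onabort = errorhandler;
  }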
Assignee: nobody → chutten
Status: NEW → ASSIGNED
Comment on attachment 8855441 [details]
bug 1319026 - Log failed Telemetry send requests.

https://reviewboard.mozilla.org/r/127292/#review130406

Let's estimate from Telemetry data what the payload size impact of this and the other events will be, including the worst-case scenarios. We don't want to have the side-effect of killing upload for a subset of users.
Does TelemetryLog have any sanity size limits?

What kind of analysis do you expect here? A custom Spark job?

Is it enough to have just the failure data here?
Is it worth adding logging for:
- successful uploads
- what ping types are affected
- startup event(s)
- wakeup event
- shutdown event(s)

::: toolkit/components/telemetry/TelemetrySend.jsm:95
(Diff revision 1)
>
> // The age of a pending ping to be considered overdue (in milliseconds).
> const OVERDUE_PING_FILE_AGE = 7 * 24 * 60 * MS_IN_A_MINUTE; // 1 week
>
> +// TelemetryLog Key for logging failures to send Telemetry
> +const LOG_FAILURE_KEY = "TelemetrySend_FAILURE";

In Experiments we use all-caps - `TELEMETRY_SEND_FAILURE`?

::: toolkit/components/telemetry/TelemetrySend.jsm:1060
(Diff revision 1)
> onCompletion();
> });
> };
>
> let errorhandler = (event) => {
> +  TelemetryLog.log(LOG_FAILURE_KEY, [event.type]);

Can we use different keys here? Or add something like "errorhandler" as the first argument? This seems like a different log type from the one below.
Does `event.type` tell us whether it's a timeout, error, or abort?
Attachment #8855441 - Flags: review?(gfritzsche)
Some other questions:

Does a low TELEMETRY_SUCCESS rate usually mean a client has a significant number of missing subsessions?

Are all clients affected equally, or is there e.g. a strong OS correlation?

If we can see this in our own Firefox installations (I see some higher ratios), can we add very verbose logging (to files or so) to our machines and analyze that?
That might not cover all error types the broader population sees, but it could uncover any bugs we hit faster and with more data.
Comment on attachment 8855441 [details]
bug 1319026 - Log failed Telemetry send requests.

https://reviewboard.mozilla.org/r/127292/#review131600

No problems with what we're doing here, but I want better docs.

::: toolkit/components/telemetry/TelemetrySend.jsm:1084
(Diff revision 1)
> this._log.error("_doPing - error submitting to " + url + ", status: " + status
>                 + " - ping request broken?");
> Telemetry.getHistogramById("TELEMETRY_PING_EVICTED_FOR_SERVER_ERRORS").add();
> // TODO: we should handle this better, but for now we should avoid resubmitting
> // broken requests by pretending success.
> +TelemetryLog.log(LOG_FAILURE_KEY, ["4xx failure we pretend is a success", status]);

Where can we document these log entries so that they show up at gecko.readthedocs.org, rather than just being buried in the code?

::: toolkit/components/telemetry/TelemetrySend.jsm:1084 (same hunk as above)

I wish we weren't using TelemetryLog for anything new, but I don't have any other good suggestions right now, so it's not a blocker.
Attachment #8855441 - Flags: review?(benjamin) → review-
(In reply to Georg Fritzsche [:gfritzsche] [away Apr 13 - 18] from comment #10)
> Some other questions:
>
> Does a low TELEMETRY_SUCCESS rate usually mean a client has a significant
> number of missing subsessions?

Oh man, trying to figure that out would be a nightmare. Trying to figure out from `previousSubsessionId`s what pairs of client_id and subsession_id are somewhere in the corpus, then trying to estimate how many of them are missing?

Also, I'm not sure it would help too much. If users successfully tell us that they're failing, they're likely successfully sending us their backlog at the same time.

> Are all clients affected equally or is there e.g. a strong OS correlation?

Good intuition. Windows users report a harder time sending: https://mzl.la/2nFojrQ
32-bit users also have a harder time: https://mzl.la/2nFgxOH

32-bit is correlated highly with Windows, and I'm willing to assume that Linux and Mac are correlated with good Internet connections (and wealth). So I'm not sure it says as much about the code as it does about the likely environment it is running within.

The OS:success proportion holds across channels (Aurora https://mzl.la/2nFhMNy, Beta https://mzl.la/2nFeJoU), which makes me wonder if the original per-channel difference I noticed could be ascribed to a difference in the proportion of Windows users on that channel.

> If we can see this in our own Firefox installations (i see some higher
> ratios), can we add very verbose logging (to files or so) to our machines
> and analyze that?
> That might not cover all error types the broader population sees, but could
> uncover any bugs we hit faster and with more data.

Possibly, but I don't see it being any easier than going straight for the broader population in the first place. And then we'd have no guarantees that our volunteers would be representative of the errors in the wild.
(In reply to Georg Fritzsche [:gfritzsche] [away Apr 13 - 18] from comment #9)
> Comment on attachment 8855441 [details]
> bug 1319026 - Log failed Telemetry send requests.
>
> https://reviewboard.mozilla.org/r/127292/#review130406
>
> Let's estimate from Telemetry data what the payload size impact of this and
> the other events will be, including the worst-case scenarios.

The size of this is [size of the log line] * [number of TELEMETRY_SUCCESS samples that are false]. For the last month of Nightly 55, there were 2.6M "false" samples for TELEMETRY_SUCCESS. So I think we're fine for size here.

> Does TelemetryLog have any sanity size limits?

To my knowledge there are no sanity limits for TelemetryLog.

> What kind of analysis do you expect here?

I'll run a custom Spark job after a week or so to see what the more common failures are, and what information we can glean from that. This is exploratory, so I'm afraid I'm a little light on the details.

> Is it enough to have just the failure data here?

Yup. Successes are already counted by TELEMETRY_SUCCESS.

> Is it worth adding logging for:
> - successful uploads
> - what ping types are affected
> - startup event(s)
> - wakeup event
> - shutdown event(s)

- nope
- maybe
-
-
-

This is just finding out what kinds of failures they are, not when they happen (which I suppose could be related, but I assume most failures (like most successes) will happen 57 seconds after profile load), so I'm fine with going forward without them.
Flags: needinfo?(gfritzsche)
(In reply to Chris H-C :chutten from comment #14)
> (In reply to Georg Fritzsche [:gfritzsche] [away Apr 13 - 18] from comment #9)
> > Does TelemetryLog have any sanity size limits?
>
> To my knowledge there are no sanity limits for TelemetryLog.

Ok, let's land a trivial upper limit for the number of TelemetryLog entries here.

> > What kind of analysis do you expect here?
>
> I'll run a custom Spark job after a week or so to see what the more common
> failures are, and what information we can glean from that. This is
> exploratory, so I'm afraid I'm a little light on the details.
>
> > Is it enough to have just the failure data here?
>
> Yup. Successes are already counted by TELEMETRY_SUCCESS.

For exploratory analysis, having the successes in the log too might help? E.g. to see if the failures happen right after successful uploads, etc.? Either way, this should be your call.

> > Is it worth adding logging for:
> > - successful uploads
> > - what ping types are affected
> > - startup event(s)
> > - wakeup event
> > - shutdown event(s)
>
> - nope
> - maybe
> -
> -
> -
> This is just finding out what kinds of failures they are, not when they
> happen (which I suppose could be related, but I assume most failures (like
> most successes) will happen 57 seconds after profile load), so I'm fine with
> going forward without them.

I see, ok. We can still re-iterate on that then if needed.

(In reply to Chris H-C :chutten from comment #13)
> (In reply to Georg Fritzsche [:gfritzsche] [away Apr 13 - 18] from comment #10)
> > Some other questions:
> >
> > Does a low TELEMETRY_SUCCESS rate usually mean a client has a significant
> > number of missing subsessions?
>
> Oh man, trying to figure that out would be a nightmare. Trying to figure out
> from `previousSubsessionId`s what pairs of client_id and subsession_id are
> somewhere in the corpus, then trying to estimate how many of them are
> missing?
>
> Also, I'm not sure it would help too much. If users successfully tell us
> that they're failing, they're likely successfully sending us their backlog
> at the same time.

Accepting lower precision, we could look at per-client profileSubsessionCounter gaps? For now, just to see if there is a link at all?
Good point about the "survivor" issue here, though.

> > Are all clients affected equally or is there e.g. a strong OS correlation?
>
> Good intuition. Windows users report a harder time sending:
> https://mzl.la/2nFojrQ
> 32-bit users also have a harder time: https://mzl.la/2nFgxOH
>
> 32-bit is correlated highly with Windows, and I'm willing to assume that
> Linux and Mac are correlated with good Internet connections (and wealth). So
> I'm not sure it says as much about the code as it does the likely
> environment it is running within.
>
> The OS:success proportion holds across channels (Aurora
> https://mzl.la/2nFhMNy, Beta https://mzl.la/2nFeJoU), which makes me wonder
> if the original per-channel difference I noticed could be ascribed to a
> difference in the proportion of Windows users on that channel.

This is interesting. At least, focusing on Windows initially looks most promising.
The OS:success ratios holding up on Nightly means starting from pre-release log data looks useful.

Thanks for the run-down, f+ on the plan here.
Flags: needinfo?(gfritzsche)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #11)
> ::: toolkit/components/telemetry/TelemetrySend.jsm:1084
> (Diff revision 1)
> > this._log.error("_doPing - error submitting to " + url + ", status: " + status
> >                 + " - ping request broken?");
> > Telemetry.getHistogramById("TELEMETRY_PING_EVICTED_FOR_SERVER_ERRORS").add();
> > // TODO: we should handle this better, but for now we should avoid resubmitting
> > // broken requests by pretending success.
> > +TelemetryLog.log(LOG_FAILURE_KEY, ["4xx failure we pretend is a success", status]);
>
> Where can we document these log entries so that they show up at
> gecko.readthedocs.org, rather than just being buried in the code?

I agree with that concern, but for this bug's temporary measurement we can just document them in main-ping.rst, in the "log" section.
Longer-term, instead of improving TelemetryLog, I would like to figure out how to use Event Telemetry for these use-cases.
Comment on attachment 8862087 [details]
bug 1319026 - Put an arbitrary upper-limit on TelemetryLog

https://reviewboard.mozilla.org/r/134038/#review137212

I left some inline comments below. We also should:
- add simple test coverage to test_TelemetryLog.js for hitting the limit.
- update the docs to mention the limit.

::: toolkit/components/telemetry/TelemetryLog.jsm:12
(Diff revision 1)
> const Cc = Components.classes;
> const Ci = Components.interfaces;
>
> const Telemetry = Cc["@mozilla.org/base/telemetry;1"].getService(Ci.nsITelemetry);
> +
> +const LOG_ENTRY_MAX_COUNT = 5000;

5000 sounds a bit high. How about 1-2000?
Using an example entry like:
`["TelemetrySend_FAILURE", 86481348, ["4xx failure we pretend is a success","401"]]`
... we have a length of 80 characters. A rather crude estimation of uncompressed size makes this:
80 * 5000 = ~400000 B = ~0.38 MB
That sounds risky for payload impact - see `TelemetryStorage.MAXIMUM_PING_SIZE` in TelemetrySend.jsm.

::: toolkit/components/telemetry/TelemetryLog.jsm:32
(Diff revision 1)
> +  while (gLogEntries.length >= LOG_ENTRY_MAX_COUNT) {
> +    gLogEntries.shift();

We can keep (1) the "N most recent" or (2) the "first N" log entries in a session. Any reason for choosing (1) specifically?
I went with (2) for Event Telemetry, as it gives us a view from (sub)session start until "too much data".

::: toolkit/components/telemetry/TelemetryLog.jsm:34
(Diff revision 1)
> if (data !== undefined) {
>   entry = entry.concat(Array.prototype.map.call(data, String));
> }
> +  while (gLogEntries.length >= LOG_ENTRY_MAX_COUNT) {
> +    gLogEntries.shift();
> +  }

Should we also put a marker in the data for "hit the limit"?
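For reference, a rough sketch of the capped log under review, pieced together from the TelemetryLog.jsm hunks quoted above; the timestamp source and the exact constant value are simplifications, not the landed code:

  const LOG_ENTRY_MAX_COUNT = 5000; // value under discussion above, may end up lower

  var gLogEntries = [];

  function log(id, data) {
    // Entry layout follows the example entry above: [key, timestamp, ...stringified data].
    let entry = [id, Date.now()];
    if (data !== undefined) {
      entry = entry.concat(Array.prototype.map.call(data, String));
    }
    // Drop the oldest entries once the cap is hit, i.e. keep the "N most recent".
    while (gLogEntries.length >= LOG_ENTRY_MAX_COUNT) {
      gLogEntries.shift();
    }
    gLogEntries.push(entry);
  }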
Attachment #8862087 - Flags: review?(gfritzsche)
Points: --- → 3
Attachment #8862087 - Flags: review?(gfritzsche) → review+
Comment on attachment 8862087 [details]
bug 1319026 - Put an arbitrary upper-limit on TelemetryLog

https://reviewboard.mozilla.org/r/134038/#review137212

> Should we also put a marker in the data for "hit the limit"?

This was dropped?
Comment on attachment 8855441 [details]
bug 1319026 - Log failed Telemetry send requests.

https://reviewboard.mozilla.org/r/127292/#review137228

r=me with the `type` question cleared up.

::: toolkit/components/telemetry/TelemetrySend.jsm:1062
(Diff revision 3)
> onCompletion();
> });
> };
>
> let errorhandler = (event) => {
> +  TelemetryLog.log(LOG_FAILURE_KEY, ["errorhandler", event.type]);

Does that actually give us the error type ("error", "timeout", "abort")? Or do we need to add that explicitly?

::: toolkit/components/telemetry/TelemetrySend.jsm:1086
(Diff revision 3)
> this._log.error("_doPing - error submitting to " + url + ", status: " + status
>                 + " - ping request broken?");
> Telemetry.getHistogramById("TELEMETRY_PING_EVICTED_FOR_SERVER_ERRORS").add();
> // TODO: we should handle this better, but for now we should avoid resubmitting
> // broken requests by pretending success.
> +TelemetryLog.log(LOG_FAILURE_KEY, ["4xx failure we pretend is a success", status]);

Nit: for analysis, a shorter string like "4xx failure" should be enough?
Attachment #8855441 - Flags: review?(gfritzsche) → review+
For analysis, note that TelemetryLog doesn't reset between subsessions currently (bug 1360518).
Comment on attachment 8862087 [details]
bug 1319026 - Put an arbitrary upper-limit on TelemetryLog

https://reviewboard.mozilla.org/r/134038/#review137212

> This was dropped?

len(log) == LIMIT is the marker.
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/86a1fc3677d7
Log failed Telemetry send requests. r=gfritzsche
https://hg.mozilla.org/integration/autoland/rev/d3b10d6dacac
Put an arbitrary upper-limit on TelemetryLog r=gfritzsche
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Georg pointed out that I neglected to seek appropriate data-r+ from you before proceeding. I treated the r- as r+ with comments, which was incorrect. Are there any changes needed for the final form of this code?
Flags: needinfo?(benjamin)
Comment on attachment 8855441 [details]
bug 1319026 - Log failed Telemetry send requests.

https://reviewboard.mozilla.org/r/127292/#review138090

data-r=me. We should probably document this better later with the exact log entries.
Attachment #8855441 - Flags: review?(benjamin) → review+
Flags: needinfo?(benjamin)
Analysis is reviewed: http://reports.telemetry.mozilla.org/post/projects/telemetry_send_failures.kp

In short: about 1/4 of failures are timeouts, the other 3/4 are "error".

:jduell, is there a way to break down what "error" means in the context of a ServiceRequest's errorhandler's event's `type`?
(Relevant code: http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/TelemetrySend.jsm#1083 )
Status: RESOLVED → REOPENED
Flags: needinfo?(jduell.mcbugs)
Resolution: FIXED → ---
ServiceRequest here is just an XMLHttpRequest that sets the channel's .beConservative flag. http://searchfox.org/mozilla-central/rev/224cc663d54085994a4871ef464b7662e0721e83/toolkit/modules/ServiceRequest.jsm#29
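For readers following along, a rough paraphrase of what ServiceRequest.jsm does (simplified, not the actual source; the class name below is illustrative):

  const { interfaces: Ci, utils: Cu } = Components;
  Cu.importGlobalProperties(["XMLHttpRequest"]);

  // An XMLHttpRequest whose underlying channel is marked "conservative", so
  // critical requests like Telemetry uploads avoid riskier TLS features.
  class ServiceRequestSketch extends XMLHttpRequest {
    open(...args) {
      super.open(...args);
      this.channel.QueryInterface(Ci.nsIHttpChannelInternal).beConservative = true;
    }
  }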
Redirecting... :mayhemer, is there a way to break down what "error" means when that is set as an error handler's event's `type`?
Flags: needinfo?(jduell.mcbugs) → needinfo?(honzab.moz)
Redirecting... I'm not a service worker person.
Flags: needinfo?(honzab.moz) → needinfo?(josh)
I don't see ServiceWorker mentioned anywhere in this bug so far.
Flags: needinfo?(josh) → needinfo?(honzab.moz)
(In reply to Josh Matthews [:jdm] from comment #37)
> I don't see ServiceWorker mentioned anywhere in this bug so far.

Ah, sorry, ServiceRequest has nothing to do with SW. It's XHR! Then this should head up to DOM.
Ben, are you responsible for XHR? If not, do you know who it should go to?
Flags: needinfo?(honzab.moz) → needinfo?(bkelly)
I think Andrea knows more about XHR.
Flags: needinfo?(bkelly) → needinfo?(amarchesini)
First of all, I would split these 3 callbacks:
http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/TelemetrySend.jsm#1086

  request.onerror = errorhandler;
  request.ontimeout = errorhandler;
  request.onabort = errorhandler;

Let's assume we have the onerror case: currently we don't have any extra information about what went wrong. We can add a chromeonly method if needed where it is possible to retrieve the error code as an nsresult.
Flags: needinfo?(amarchesini)
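A minimal sketch of the split being suggested, reusing the LOG_FAILURE_KEY logging from this bug; handleFailure() is a hypothetical stand-in for the shared failure path the existing errorhandler runs:

  // Drop-in for the three assignments quoted above: separate handlers so the
  // log distinguishes error / timeout / abort without relying on event.type.
  function attachSplitHandlers(request, handleFailure) {
    request.onerror = (event) => {
      TelemetryLog.log(LOG_FAILURE_KEY, ["onerror"]);
      handleFailure(event);
    };
    request.ontimeout = (event) => {
      TelemetryLog.log(LOG_FAILURE_KEY, ["ontimeout"]);
      handleFailure(event);
    };
    request.onabort = (event) => {
      TelemetryLog.log(LOG_FAILURE_KEY, ["onabort"]);
      handleFailure(event);
    };
  }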
That would be wonderful, provided the nsresult has more information than just "an error occurred somewhere on, in, around, or near the network" :) Whereabouts should I file a bug for that?
Flags: needinfo?(amarchesini)
Remove the logging. We have the data we need until we get the more detailed information from :baku
Attachment #8869477 - Flags: review?(gfritzsche)
Blocks: 1367094
From conversation with :baku on IRC, the short form of how to get detailed error information is:

1) Retype XMLHttpRequestMainThread's mErrorLoad from bool to an enum
2) Whenever mErrorLoad was set to true, set it to a unique error enum value
3) Add a [ChromeOnly] attribute int errorCode to its webidl (and implement ErrorCode) to expose it to JS (need to introduce a stub version for the WorkerThread version)
4) Use it in JS.
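As a hypothetical illustration of step 4, assuming the proposed [ChromeOnly] errorCode attribute from steps 1-3 lands (it does not exist at this point):

  // Purely illustrative: errorCode below is the *proposed* attribute, not a
  // shipping API. It would expose the enum value derived from mErrorLoad.
  request.onerror = (event) => {
    let code = event.target.errorCode;
    TelemetryLog.log(LOG_FAILURE_KEY, ["onerror", String(code)]);
  };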
Blocks: 1367110
Flags: needinfo?(amarchesini)
Comment on attachment 8869477 [details] [diff] [review]
0001-bug-1319026-Remove-logging-for-failed-Telemetry-send.patch

Review of attachment 8869477 [details] [diff] [review]:
-----------------------------------------------------------------

Simple removal, this looks good to me.
Attachment #8869477 - Flags: review?(gfritzsche) → review+
Alrighty, let's get this in-tree and proceed in the follow-up bugs.
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b39905246ab8
Remove logging for failed Telemetry send requests. r=gfritzsche
Keywords: checkin-needed
Status: REOPENED → RESOLVED
Closed: 8 years ago7 years ago
Resolution: --- → FIXED
Is this something you may want in beta 54? Or is 55 good enough?
Flags: needinfo?(chutten)
This was a temporary change that was put into and taken out of Nightly 55, so none of it should make it to Beta :)
Flags: needinfo?(chutten)