Closed Bug 1332780 Opened 3 years ago Closed 3 years ago
Telemetry probes not recognizing 4xx and 5xx server status codes
59 bytes, text/x-review-board-request
There are about 3% of update requests that return a status code we don't understand: https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-01-19&keys=google!other!mozilla!__none__&max_channel_version=nightly%252F53&measure=URLCLASSIFIER_UPDATE_REMOTE_STATUS2&min_channel_version=null&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2016-12-21&table=0&trim=1&use_submission_date=0 This is unusual and we should investigate why we can't get the real status code for these. At the very least, these failed requests should end up in the 4xx or 5xx buckets. On the shavar.services.mozilla.com side, we see a lot of 502s but there are none in telemetry.
Assignee: nobody → dlee
Status: NEW → ASSIGNED
Right now the "other" category of URLCLASSIFIER_UPDATE_REMOTE_STATUS2 actually contains the result of v2 update(both google and mozilla) and v4 update (See Bug 1332770 Comment 1). I see the old telemetry(URLCLASSIFIER_UPDATE_REMOTE_STATUS) also has unrecognized status code, so to fully understand the unrecognized status code, we must fix Bug 1332770 first so we can know which provider return those.
Depends on: 1332770
(In reply to François Marier [:francois] from comment #0) > On the shavar.services.mozilla.com side, we see a lot of 502s but there are > none in telemetry. This is because lots of errors from shavar.services.mozilla.com were recorded to "other" category.
Something else that ckolos found is that there's a known bug in the Amazon ELB that says if the client aborts a connection when the Transfer-Encoding header is set, it will count as a 4xx.
Just checked the telemetry dashboard after bug 1332770 was landed, there is still about 3% unknown error code: https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-02-14&keys=google!google4!mozilla!__none__&max_channel_version=nightly%252F54&measure=URLCLASSIFIER_UPDATE_REMOTE_STATUS2&min_channel_version=null&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-02-02&table=0&trim=1&use_submission_date=0 so I will dig into this.
Hi francois, I suspect that the error may because of connection error to server, for example, tcp timeout, which we will treat it as unknown error. So i am thinking if we should add another telemetry to record these, how do you think ?  https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierStreamUpdater.cpp#653  https://dxr.mozilla.org/mozilla-central/source/xpcom/base/ErrorList.h#205
(In reply to Dimi Lee[:dimi][:dlee] from comment #5) > I suspect that the error may because of connection error to server, for > example, tcp timeout, which we will treat it as unknown error. > So i am thinking if we should add another telemetry to record these, how > do you think ? I think you're right, it would be good to see any kind of connection error reported separately from genuine "unknown errrors". We can't do much about the connection errors, but they could be hiding logic errors in our code.
Attachment #8839381 - Flags: review?(francois)
An alternative approach will be something like uint16_t errorCode = NS_ERROR_GET_MODULE(status) == NS_ERROR_MODULE_NETWORK ? NS_ERROR_GET_CODE(status) : 0; mozilla::Telemetry::Accumulate( mozilla::Telemetry::URLCLASSIFIER_UPDATE_REMOTE_NETWORK_ERROR, mTelemetryProvider, errorCode); This way we will get complete result but it also requires very large bucket for telemetry. I saw error message "Histograms with large numbers of buckets use disproportionately high amounts of resources." when trying to add a large bucket to telemetry, so I just check connection errors and DNS errors in the patch.
Comment on attachment 8839381 [details] Bug 1332780 - Add telemetry to record network error while SafeBrowsing update. https://reviewboard.mozilla.org/r/114064/#review116510 I like your patch and your selection of network errors to report. The only thing that's missing IMHO is the "no error" case. I'd like to be able to know the error rate per provider. So let's make `0 = NS_OK` then `1 = unknown`, and then the rest of what you have. This way, we can see the % of updates that don't have network errors (the `0` case) and therefore the relative importance of these errors. datareview+ ::: toolkit/components/telemetry/Histograms.json:4121 (Diff revision 2) > + "expires_in_version": "60", > + "kind": "enumerated", > + "keyed": true, > + "n_values": 30, > + "bug_numbers": , > + "description": "Network error from SafeBrowsing database updates. (0=others, 1=already connected, 2=not connected, 3=connection refused,4=net timeout, 5=offline, 6=port access not allowed, 7=net reset, 8=net interrupt, 9=proxy connection refused,10=partial transfer,11=inadequate security,12=unknown host,13=dns lookup queue full,14=unknown proxy host" nit: "other" instead of "others"
Attachment #8839381 - Flags: review?(francois) → review+
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/integration/autoland/rev/9d135a51856f Add telemetry to record network error while SafeBrowsing update. r=francois
You need to log in before you can comment on or make changes to this bug.