Closed Bug 1332780 Opened 9 years ago Closed 9 years ago

Telemetry probes not recognizing 4xx and 5xx server status codes

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla54

Tracking Flags:

Tracking

Status

firefox54

---

fixed

People

(Reporter: francois, Assigned: dimi)

References

Details

(Whiteboard: #sbv4-m5)

Attachments

(1 file)

Bug 1332780 - Add telemetry to record network error while SafeBrowsing update. 9 years ago Dimi Lee [:dimi] 59 bytes, text/x-review-board-request	francois : review+	Details

François Marier [:francois]

Reporter

Description

•

9 years ago

There are about 3% of update requests that return a status code we don't understand: https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-01-19&keys=google!other!mozilla!__none__&max_channel_version=nightly%252F53&measure=URLCLASSIFIER_UPDATE_REMOTE_STATUS2&min_channel_version=null&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2016-12-21&table=0&trim=1&use_submission_date=0 This is unusual and we should investigate why we can't get the real status code for these. At the very least, these failed requests should end up in the 4xx or 5xx buckets. On the shavar.services.mozilla.com side, we see a lot of 502s but there are none in telemetry.

Dimi Lee [:dimi]

Assignee

Updated

•

9 years ago

Assignee: nobody → dlee

Status: NEW → ASSIGNED

Dimi Lee [:dimi]

Assignee

Comment 1

•

9 years ago

Right now the "other" category of URLCLASSIFIER_UPDATE_REMOTE_STATUS2 actually contains the result of v2 update(both google and mozilla) and v4 update (See Bug 1332770 Comment 1). I see the old telemetry(URLCLASSIFIER_UPDATE_REMOTE_STATUS) also has unrecognized status code, so to fully understand the unrecognized status code, we must fix Bug 1332770 first so we can know which provider return those.

Depends on: 1332770

Dimi Lee [:dimi]

Assignee

Comment 2

•

9 years ago

(In reply to François Marier [:francois] from comment #0) > On the shavar.services.mozilla.com side, we see a lot of 502s but there are > none in telemetry. This is because lots of errors from shavar.services.mozilla.com were recorded to "other" category.

François Marier [:francois]

Reporter

Comment 3

•

9 years ago

Something else that ckolos found is that there's a known bug in the Amazon ELB that says if the client aborts a connection when the Transfer-Encoding header is set, it will count as a 4xx.

Dimi Lee [:dimi]

Assignee

Updated

•

9 years ago

Whiteboard: #sbv4-m6

Dimi Lee [:dimi]

Assignee

Comment 4

•

9 years ago

Just checked the telemetry dashboard after bug 1332770 was landed, there is still about 3% unknown error code: https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-02-14&keys=google!google4!mozilla!__none__&max_channel_version=nightly%252F54&measure=URLCLASSIFIER_UPDATE_REMOTE_STATUS2&min_channel_version=null&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-02-02&table=0&trim=1&use_submission_date=0 so I will dig into this.

Dimi Lee [:dimi]

Assignee

Comment 5

•

9 years ago

Hi francois, I suspect that the error may because of connection error to server, for example, tcp timeout, which we will treat it as unknown error[1]. So i am thinking if we should add another telemetry to record these[2], how do you think ? [1] https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierStreamUpdater.cpp#653 [2] https://dxr.mozilla.org/mozilla-central/source/xpcom/base/ErrorList.h#205

Flags: needinfo?(francois)

François Marier [:francois]

Reporter

Comment 6

•

9 years ago

(In reply to Dimi Lee[:dimi][:dlee] from comment #5) > I suspect that the error may because of connection error to server, for > example, tcp timeout, which we will treat it as unknown error[1]. > So i am thinking if we should add another telemetry to record these[2], how > do you think ? I think you're right, it would be good to see any kind of connection error reported separately from genuine "unknown errrors". We can't do much about the connection errors, but they could be hiding logic errors in our code.

Flags: needinfo?(francois)

Dimi Lee [:dimi]

Assignee

Updated

•

9 years ago

Whiteboard: #sbv4-m6 → #sbv4-m5

Comment hidden (mozreview-request)

Dimi Lee [:dimi]

Assignee

Updated

•

9 years ago

Attachment #8839381 - Flags: review?(francois)

Comment hidden (mozreview-request)

Dimi Lee [:dimi]

Assignee

Comment 10

•

9 years ago

An alternative approach will be something like uint16_t errorCode = NS_ERROR_GET_MODULE(status) == NS_ERROR_MODULE_NETWORK ? NS_ERROR_GET_CODE(status) : 0; mozilla::Telemetry::Accumulate( mozilla::Telemetry::URLCLASSIFIER_UPDATE_REMOTE_NETWORK_ERROR, mTelemetryProvider, errorCode); This way we will get complete result but it also requires very large bucket for telemetry. I saw error message "Histograms with large numbers of buckets use disproportionately high amounts of resources." when trying to add a large bucket to telemetry, so I just check connection errors and DNS errors in the patch.

François Marier [:francois]

Reporter

Comment 11

•

9 years ago

mozreview-review

Comment on attachment 8839381 [details] Bug 1332780 - Add telemetry to record network error while SafeBrowsing update. https://reviewboard.mozilla.org/r/114064/#review116510 I like your patch and your selection of network errors to report. The only thing that's missing IMHO is the "no error" case. I'd like to be able to know the error rate per provider. So let's make `0 = NS_OK` then `1 = unknown`, and then the rest of what you have. This way, we can see the % of updates that don't have network errors (the `0` case) and therefore the relative importance of these errors. datareview+ ::: toolkit/components/telemetry/Histograms.json:4121 (Diff revision 2) > + "expires_in_version": "60", > + "kind": "enumerated", > + "keyed": true, > + "n_values": 30, > + "bug_numbers": [1332780], > + "description": "Network error from SafeBrowsing database updates. (0=others, 1=already connected, 2=not connected, 3=connection refused,4=net timeout, 5=offline, 6=port access not allowed, 7=net reset, 8=net interrupt, 9=proxy connection refused,10=partial transfer,11=inadequate security,12=unknown host,13=dns lookup queue full,14=unknown proxy host" nit: "other" instead of "others"

Attachment #8839381 - Flags: review?(francois) → review+

Comment hidden (mozreview-request)

Dimi Lee [:dimi]

Assignee

Updated

•

9 years ago

Keywords: checkin-needed

Pulsebot

Comment 13

•

9 years ago

Pushed by cbook@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9d135a51856f Add telemetry to record network error while SafeBrowsing update. r=francois

Keywords: checkin-needed

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 14

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/9d135a51856f

Status: ASSIGNED → RESOLVED

Closed: 9 years ago

status-firefox54: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla54

You need to log in before you can comment on or make changes to this bug.