Closed Bug 1172688 Opened 9 years ago Closed 8 years ago

Add telemetry for when gethash calls timeout

Categories

(Toolkit :: Safe Browsing, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
firefox48 --- fixed

People

(Reporter: francois, Assigned: dlee)

References

Details

(Whiteboard: tpe-seceng)

Attachments

(1 file, 2 obsolete files)

It would be useful to know how often these completions timeout when trying to reach Google's servers.
Blocks: 1149867
No longer blocks: 1024555
See Also: → 1024555
Component: DOM: Security → Safe Browsing
Product: Core → Toolkit
This is probably going to be similar to bug 1150921.
Whiteboard: tpe-seceng
I would like to work on this bug.
Assignee: nobody → dlee
Status: NEW → ASSIGNED
Hi François,
Do we measure this by how many success gethash request with a timeout or the time between each timeout ?
Flags: needinfo?(francois)
I don't think we have a need for the actual time it took. We just want to make sure that requests don't time out very often.

So I would suggest: # timeouts / (# timeouts + # completed requests)

which means we also need to add telemetry for the requests that succeeded.
Flags: needinfo?(francois)
Attached patch Add telemetry for gethash (obsolete) — Splinter Review
Attachment #8732696 - Flags: review?(francois)
Comment on attachment 8732696 [details] [diff] [review]
Add telemetry for gethash

Review of attachment 8732696 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/nsUrlClassifierHashCompleter.js
@@ +553,5 @@
>      // Notify the RequestBackoff once a response is received.
>      this._completer.finishRequest(this.gethashUrl, httpStatus);
>  
>      if (success) {
> +      Services.telemetry.getHistogramById("URLCLASSIFIER_COMPLETE_TIMEOUT").add(0);

What happens when the request doesn't timeout but still fails? For example, if it returns a 503 instead of a 200.

I would say that as far as the telemetry goes, we should consider that a non-timeout event and mark it as such. We're only interested in seeing how often the gethash endpoint fails to answer at all.
Attachment #8732696 - Flags: review?(francois)
Attached patch Add telemetry for gethash v2 (obsolete) — Splinter Review
Modification of this patch:
- Record all gethash requests in telemetry instead of just record success requests
Attachment #8732696 - Attachment is obsolete: true
Attachment #8733220 - Flags: review?(francois)
Comment on attachment 8733220 [details] [diff] [review]
Add telemetry for gethash v2

Review of attachment 8733220 [details] [diff] [review]:
-----------------------------------------------------------------

That looks fine to me. I think Benjamin needs to review all new telemetry probes.
Attachment #8733220 - Flags: review?(francois) → review?(benjamin)
Comment on attachment 8733220 [details] [diff] [review]
Add telemetry for gethash v2

If this is an exploratory measurement, it should expire (typically in 6 months).

If this is an operational alerting metric, I think we should be collecting from the release population, and you should talk a bit about how you plan on monitoring this with either daily or realtime alerting.
Attachment #8733220 - Flags: review?(benjamin) → feedback-
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #9)
> If this is an exploratory measurement, it should expire (typically in 6
> months).
> 
> If this is an operational alerting metric, I think we should be collecting
> from the release population, and you should talk a bit about how you plan on
> monitoring this with either daily or realtime alerting.

The question we're trying to answer is: Is the 10s timeout reasonable and (especially) does this happen often in practice?

Is that what you'd consider an exploratory measurement? If so, can we run it for 1 year instead of just 6 months?
Flags: needinfo?(benjamin)
Please run it for 6 months first and if you still need more data you can extend it.

You could instead add a metric that measures the total time a gethash call takes, which would give you a distribution and not just a boolean.
Flags: needinfo?(benjamin)
Dimi, I think we should go with the 6 months that Benjamin suggested.

https://bugzilla.mozilla.org/page.cgi?id=splinter.html&bug=1237198&attachment=8735634 suggests that all we need to do is add "expires_in_version" with whatever release will come 6 months after Fx 48.
Attachment #8733220 - Attachment is obsolete: true
Comment on attachment 8737773 [details]
MozReview Request: Bug 1172688 - Add telemetry for when gethash calls timeout. r=francois, bsmedberg

https://reviewboard.mozilla.org/r/44105/#review40815
Attachment #8737773 - Flags: review?(francois) → review+
(In reply to Dimi Lee[:dimi][:dlee] from comment #13)
> Created attachment 8737773 [details]
> MozReview Request: Bug 1172688 - Add telemetry for when gethash calls
> timeout. r?francois, bsmedberg
> 
> Review commit: https://reviewboard.mozilla.org/r/44105/diff/#index_header
> See other reviews: https://reviewboard.mozilla.org/r/44105/

Forget to say, based on comment 4, i still use boolean for now instead of metric suggested in comment 11
I think the boolean is fine for this initial exploration.
Attachment #8737773 - Flags: review?(benjamin) → review+
Comment on attachment 8737773 [details]
MozReview Request: Bug 1172688 - Add telemetry for when gethash calls timeout. r=francois, bsmedberg

https://reviewboard.mozilla.org/r/44105/#review41015

data-review=me with that change or something equivalent

::: toolkit/components/telemetry/Histograms.json:3364
(Diff revision 1)
> +  "URLCLASSIFIER_COMPLETE_TIMEOUT": {
> +    "alert_emails": ["gcp@mozilla.com", "francois@mozilla.com"],
> +    "expires_in_version": "52",
> +    "kind": "boolean",
> +    "bug_numbers": [1172688],
> +    "description": "True if gethash lookup timeout"

Suggest more detail: "this metric is recorded every time a gethash lookup is performed. `true` is recorded if the lookup times out."
Comment on attachment 8737773 [details]
MozReview Request: Bug 1172688 - Add telemetry for when gethash calls timeout. r=francois, bsmedberg

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/44105/diff/1-2/
Attachment #8737773 - Attachment description: MozReview Request: Bug 1172688 - Add telemetry for when gethash calls timeout. r?francois, bsmedberg → MozReview Request: Bug 1172688 - Add telemetry for when gethash calls timeout. r=francois, bsmedberg
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/51688b3611d4
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: