Closed Bug 1150921 Opened 5 years ago Closed 4 years ago

Add telemetry for response codes to SafeBrowsing requests

Categories

(Toolkit :: Safe Browsing, defect)

x86_64
Linux
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla47
Tracking Status
firefox40 --- affected
firefox47 --- fixed

People

(Reporter: gcp, Assigned: gcp)

References

Details

Attachments

(1 file, 1 obsolete file)

We currently don't have any telemetry that can spot things like:
https://bugzilla.mozilla.org/show_bug.cgi?id=1150334

Checking for strange things in server response codes to our requests sounds pretty useful.
Add telemetry for the reponses from the remote server.
I grouped the status code into buckets, a bit of a judgement
call which distinctions might be useful and which not.
Attachment #8715350 - Flags: review?(francois)
Comment on attachment 8715350 [details] [diff] [review]
Add telemetry for response codes to SafeBrowsing requests

Review of attachment 8715350 [details] [diff] [review]:
-----------------------------------------------------------------

According to the spec these return codes are valid for list updates:

    200: OK — Data is available in the HTTP response body.
    400: Bad Request — The HTTP request was not correctly formed. The client did not provide all required CGI parameters or the body did not contain any meaningful entries.
    403: Forbidden — The client id is invalid.
    503: Service Unavailable — The server cannot handle the request. Clients MUST follow the backoff behavior specified in the Request Frequency section.
    505: HTTP Version Not Supported — The server CANNOT handle the requested protocol major version.

and for gethash:

    200: OK — Data is available in the HTTP response body.
    204: No Content — There are no full-length hashes with the requested prefix.
    400: Bad Request — The HTTP request was not correctly formed. The client did not provide all required CGI parameters.
    403: Forbidden — The client id is invalid.
    503: Service Unavailable — The server cannot handle the request. Clients MUST follow the backoff behavior specified in the Request Frequency section.
    505: HTTP Version Not Supported — The server CANNOT handle the requested protocol major version.

So in other words, updates is missing: 505 and gethash is missing: 204, 505.

I would suggest having more buckets:

- one bucket per documented return code
- one bucket for all 1xx undocumented error codes, one for 2xx, etc.
- one bucket for return codes we don't care about
- one bucket for anything else we'd like to single out

So, here's a concrete proposal for gethash:

case 100:
case 101:
    // Unexpected 1xx return code
    statusBucket = TODO;
    break;

case 200:
    // OK - Data is available in the HTTP response body.
    statusBucket = TODO;
    break;

case 201:
case 202:
case 203:
case 205:
case 206:
    // Unexpected 2xx return code
    statusBucket = TODO;
    break;

case 204:
    // No Content - There are no full-length hashes with the requested prefix.
    statusBucket = TODO;
    break;

case 300:
case 301:
case 302:
case 303:
case 304:
case 305:
case 307:
case 308:
    // Unexpected 3xx return code
    statusBucket = TODO;
    break;

case 400:
    // Bad Request - The HTTP request was not correctly formed. The client did not provide all required CGI parameters.
    statusBucket = TODO;
    break;

case 401:
case 402:
case 405:
case 406:
case 407:
case 409:
case 410:
case 411:
case 412:
case 413:
case 414:
case 415:
case 416:
case 417:
case 421:
case 426:
case 428:
case 429:
case 431:
case 451:
    // Unexpected 4xx return code
    statusBucket = TODO;
    break;

case 403:
    // Forbidden - The client id is invalid.
    statusBucket = TODO;
    break;

case 404:
    // Not Found
    statusBucket = TODO;
    break;

case 408:
    // Request Timeout
    statusBucket = TODO;
    break;

case 500:
case 501:
case 510:
    // Unexpected 5xx return code
    statusBucket = TODO;
    break;

case 502:
case 504:
case 511:
    // Local network errors, we'll ignore these.
    statusBucket = TODO;
    break;

case 503:
    // Service Unavailable - The server cannot handle the request. Clients MUST follow the backoff behavior specified in the Request Frequency section.
    statusBucket = TODO;
    break;

case 505:
    // HTTP Version Not Supported - The server CANNOT handle the requested protocol major version.
    statusBucket = TODO;
    break;


List updates would be the same except with these changes:

case 201:
case 202:
case 203:
case 204:  <-- no longer in a separate bucket
case 205:
case 206:
    // Unexpected 2xx return code
    statusBucket = TODO;
    break;

case 400: <-- different comment to match SB spec
    // Bad Request - The HTTP request was not correctly formed. The client did not provide all required CGI parameters or the body did not contain any meaningful entries.
    statusBucket = TODO;
    break;
Attachment #8715350 - Flags: review?(francois) → review-
Assignee: nobody → gpascutto
Status: NEW → ASSIGNED
Implemented your suggestion, but with same code->bucket map for both
cases, and taking 413 as a seperate error.
Attachment #8716277 - Flags: review?(francois)
Attachment #8715350 - Attachment is obsolete: true
Comment on attachment 8716277 [details] [diff] [review]
Add telemetry for response codes to SafeBrowsing requests

Review of attachment 8716277 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for making these changes, that should cover all of the possible failure cases.

::: toolkit/components/telemetry/Histograms.json
@@ +3242,5 @@
> +    "expires_in_version":"never",
> +    "kind": "enumerated",
> +    "n_values": 16,
> +    "bug_numbers": [1150921],
> +    "description": "Server response code from remote SafeBrowsing lookups"

Can you change this to "remote SafeBrowsing gethash lookups"? Just so it's clearly totally clear it's not talking about Application Reputation remote lookups.

::: toolkit/components/url-classifier/nsUrlClassifierHashCompleter.js
@@ +123,5 @@
> +    // Request Timeout
> +    statusBucket = 9;
> +    break;
> +  case 413:
> +    // Request Entity Too Large - Bug 1150334

Nice :)

::: toolkit/components/url-classifier/nsUrlClassifierStreamUpdater.cpp
@@ +467,5 @@
> +    // Unexpected 2xx return code
> +    statusBucket = 2;
> +    break;
> +  case 204:
> +    // No Content - There are no full-length hashes with the requested prefix.

The "full-length hashes" bit is not revelant here. If you want to have exactly the same code+comments as the other one, maybe just use "No Content" without the rest?
Attachment #8716277 - Flags: review?(francois) → review+
Comment on attachment 8716277 [details] [diff] [review]
Add telemetry for response codes to SafeBrowsing requests

Add bsmedberg for privacy/telemetry policy.
Attachment #8716277 - Flags: review?(benjamin)
Comment on attachment 8716277 [details] [diff] [review]
Add telemetry for response codes to SafeBrowsing requests

If this is "never", who (person or team) is committing to monitor this metric forever? Will the existing telemetry dashboards or automated alerts be sufficient, or will you be building a custom dashboard for this monitoring?

Please document what the enumerated values are: either list them in the "description" of the histogram, or reference some existing enumeration that will contain them. That way people browsing telemetry.mozilla.org will be able to understand what the values represent.

Is the goal eventually to make this an opt-out operational metric? I would encourage that, after you've tested it. If that's the case, I'd also encourage you to file a separate future bug for real operational monitoring (nagios or the equivalent).
Flags: needinfo?(gpascutto)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #6)
> If this is "never", who (person or team) is committing to monitor this
> metric forever? Will the existing telemetry dashboards or automated alerts
> be sufficient, or will you be building a custom dashboard for this
> monitoring?

Will be the SafeBrowsing team/module owner. The histogram-based Telemetry Alerts should suffice for this (though I wish those had an option to send email as we've missed warnings before).
 
> Please document what the enumerated values are: either list them in the
> "description" of the histogram, or reference some existing enumeration that
> will contain them.

I think 16 values is too much for the description? And not sure what you mean by the second sentence. Is linking the source code that defines the buckets ok? There is already a link to this bug.

> Is the goal eventually to make this an opt-out operational metric? I would
> encourage that, after you've tested it. If that's the case, I'd also
> encourage you to file a separate future bug for real operational monitoring
> (nagios or the equivalent).

The goal is to monitor this permanently, and coordinate with the server operator (Google) if we see Firefox clients are having issues.
Flags: needinfo?(gpascutto) → needinfo?(benjamin)
(In reply to Gian-Carlo Pascutto [:gcp] from comment #7)
> (In reply to Benjamin Smedberg  [:bsmedberg] from comment #6)
> > If this is "never", who (person or team) is committing to monitor this
> > metric forever? Will the existing telemetry dashboards or automated alerts
> > be sufficient, or will you be building a custom dashboard for this
> > monitoring?
> 
> Will be the SafeBrowsing team/module owner. The histogram-based Telemetry
> Alerts should suffice for this (though I wish those had an option to send
> email as we've missed warnings before).

Please use the "alert_emails" field to set this up. However, I am confident that the current alerting mechanism is not sufficient to monitor this. It was built to monitor performance metrics, and it will not be good at monitoring enumerated success/error codes without some modifications. I encourage you to file a separate bug about monitoring and ask Roberto Vitillo for suggestions.

>  
> > Please document what the enumerated values are: either list them in the
> > "description" of the histogram, or reference some existing enumeration that
> > will contain them.
> 
> I think 16 values is too much for the description?

It's not. It's fine for the description to be long, and then the values will show up in the various documentation and dashboards automatically.

> > Is the goal eventually to make this an opt-out operational metric? I would
> > encourage that, after you've tested it. If that's the case, I'd also
> > encourage you to file a separate future bug for real operational monitoring
> > (nagios or the equivalent).
> 
> The goal is to monitor this permanently, and coordinate with the server
> operator (Google) if we see Firefox clients are having issues.

After you've verified that the data is correct and useful, please make this an opt-out metric.
Flags: needinfo?(benjamin)
Comment on attachment 8716277 [details] [diff] [review]
Add telemetry for response codes to SafeBrowsing requests

data-review+ with an updated and more detailed description field and a list or individual emails in the alert_emails field.
Attachment #8716277 - Flags: review?(benjamin) → feedback+
https://hg.mozilla.org/integration/mozilla-inbound/rev/d93a009fcc9f14bf4ce1f006c8c4b703f7d27e7a
Bug 1150921 - Add telemetry for response codes to SafeBrowsing requests. r=francois f=bsmedberg
https://hg.mozilla.org/mozilla-central/rev/d93a009fcc9f
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Duplicate of this bug: 960000
You need to log in before you can comment on or make changes to this bug.