Closed Bug 1311910 Opened 8 years ago Closed 7 years ago

Add telemetry to measure update error rate for V2 and V4

Categories

(Toolkit :: Safe Browsing, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla53
Tracking Status
firefox53 --- fixed

People

(Reporter: dlee, Assigned: tnguyen)

References

Details

(Whiteboard: #sbv4-m3)

Attachments

(1 file, 2 obsolete files)

As discussed in WW, we should record the update fail rate for v2 and v4 to get better understanding about if we could ship v4.
Assignee: nobody → dlee
Status: NEW → ASSIGNED
Priority: -- → P2
Summary: Add telemetry to measure update error rate for V2 and V4 → Add telemetry to measure update error and update timeout rate for V2 and V4
Comment on attachment 8805032 [details]
Bug 1311910 - Add telemetry to measure update error and update timeout rate for V2 and V4.

Hi Henry,
I still can't set you as a reviewer by mozreview, i am not sure if the reviewer "henry" is you.
Attachment #8805032 - Flags: review?(hchang)
Attachment #8805032 - Flags: review?(hchang)
Attachment #8805032 - Flags: review?(hchang)
Comment on attachment 8805032 [details]
Bug 1311910 - Add telemetry to measure update error and update timeout rate for V2 and V4.

https://reviewboard.mozilla.org/r/88862/#review88256

The change should be very useful to v4 development :) BTW, the suffix matching ('-proto') seems to have scattered everywhere. Maybe it's now worth having a utility function for v2/v4. I would file a bug for this.
Attachment #8805032 - Flags: review?(hchang) → review+
(In reply to Henry Chang [:henry][:hchang] from comment #3)
> 
> The change should be very useful to v4 development :) BTW, the suffix
> matching ('-proto') seems to have scattered everywhere. Maybe it's now worth
> having a utility function for v2/v4. I would file a bug for this.

Agree! Thanks for doing that :)
Attachment #8805032 - Flags: review?(francois)
Comment on attachment 8805032 [details]
Bug 1311910 - Add telemetry to measure update error and update timeout rate for V2 and V4.

https://reviewboard.mozilla.org/r/88862/#review89076

While thinking more about this patch and what we are trying to measure with this patch, I realized that we're not currently bucketing our existing server-related telemetry in the right way.

I think the data should be keyed by provider instead of protocol. That would allow us, for example, to separate the TP updates from the Google SB ones, as well as separating google from google4 all with a single `URLCLASSIFIER_UPDATE_REMOTE_STATUSCODE` telemetry probe (we can't change existing probes, we need to add new ones).

So we probably need a keyed histogram like this:
https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2016-10-30&keys=All!AV%252C241-480!AV%252C481-720!V%252C241-480&max_channel_version=nightly%252F52&measure=VIDEO_SUSPEND_RECOVERY_TIME_MS&min_channel_version=null&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2016-09-19&table=0&trim=1&use_submission_date=0

We should also replace these histograms with keyed ones:

- `URLCLASSIFIER_COMPLETE_TIMEOUT`
- `URLCLASSIFIER_COMPLETE_REMOTE_STATUS`

The other thing I'm thinking about is that this telemetry we already have isn't as useful as it could be:

```
"URLCLASSIFIER_UPDATE_ERROR_TYPE": {
    "alert_emails": ["safebrowsing-telemetry@mozilla.org"],
    "expires_in_version": "58",
    "kind": "enumerated",
    "n_values": 10,
    "bug_numbers": [1305801],
    "description": "An error was encountered while parsing a partial update returned by a Safe Browsing V4 server (0 = addition of an already existing prefix, 1 = parser got into an infinite loop, 2 = removal index out of bounds)"
}
```

since the denominator is unclear. If we kept track of successes as well, we would get the error/success rate here instead of just an error count with nothing to compare it against.

So we could replace `URLCLASSIFIER_UPDATE_V2_SUCCESS_RATE` and `URLCLASSIFIER_UPDATE_V4_SUCCESS_RATE` with `URLCLASSIFIER_UPDATE_ERROR`, a keyed version of `URLCLASSIFIER_UPDATE_ERROR_TYPE` that would keep track of sucessess too.

This would allow us to remove `URLCLASSIFIER_UPDATE_ERROR_TYPE` since we just added it and never actually used it in a release.

Do you think the above changes will allow us to answer the questions we are interested in?
Attachment #8805032 - Flags: review?(francois) → review-
(In reply to François Marier [:francois] from comment #5)
> Comment on attachment 8805032 [details]
> Bug 1311910 - Add telemetry to measure update error and update timeout rate
> for V2 and V4.
> 
> https://reviewboard.mozilla.org/r/88862/#review89076
> 
> Do you think the above changes will allow us to answer the questions we are
> interested in?

Yes, i think keyed by provider instead of protocol version is a good idea.
I'll update my patch, thanks for suggestion !
(In reply to François Marier [:francois] from comment #5)
> 
> So we could replace `URLCLASSIFIER_UPDATE_V2_SUCCESS_RATE` and
> `URLCLASSIFIER_UPDATE_V4_SUCCESS_RATE` with `URLCLASSIFIER_UPDATE_ERROR`, a
> keyed version of `URLCLASSIFIER_UPDATE_ERROR_TYPE` that would keep track of
> sucessess too.
> 
> This would allow us to remove `URLCLASSIFIER_UPDATE_ERROR_TYPE` since we
> just added it and never actually used it in a release.
> 

Right now URLCLASSIFIER_UPDATE_ERROR_TYPE is an enum and it is used to record different type of Full/Partial update "algorithm" error(not include update network error, protocol parser error ... etc). So if we want to merge it with keyed URLCLASSIFIER_UPDATE_SUCCESS_RATE here, then how should we treat all the other update errors ? do we just set those errors with a fixed value in histogram ? or we should expand our enumeration to include all different kind of errors ?
Flags: needinfo?(francois)
(In reply to Dimi Lee[:dimi][:dlee] from comment #7)
> Right now URLCLASSIFIER_UPDATE_ERROR_TYPE is an enum and it is used to
> record different type of Full/Partial update "algorithm" error(not include
> update network error, protocol parser error ... etc). So if we want to merge
> it with keyed URLCLASSIFIER_UPDATE_SUCCESS_RATE here, then how should we
> treat all the other update errors ? do we just set those errors with a fixed
> value in histogram ? or we should expand our enumeration to include all
> different kind of errors ?

Using an enum is fine, but we only need to keep track of the errors we care about. I would suggest using 0 for "success", 1 for "unspecified error" and then have 2, 3, 4, etc. for the specific errors we already have.

Of course, if you see other errors that we probably want to know about separately, feel free to move them out of the "1" bucket and give them their own error number.

One thing to note if you don't already know is that we can't bump the "n_values" in Histograms.json once we set it. So maybe we should make that 16 to give ourselves some room.
Flags: needinfo?(francois)
Assignee: dlee → tnguyen
MozReview-Commit-ID: 5xFbV33l8Q7
Attachment #8805032 - Attachment is obsolete: true
Attachment #8813112 - Attachment is obsolete: true
Comment on attachment 8814789 [details]
Bug 1311910 - Add telemetry to measure update error and update timeout rate for V2 and V4

https://reviewboard.mozilla.org/r/94628/#review96952

This patch looks good and it seems like you've updated everything correctly.

I also think we can avoid adding a new member variable to hold the update errors. See my inline comments.

::: toolkit/components/telemetry/Histograms.json:4138
(Diff revision 1)
>      "kind": "exponential",
>      "high": 200,
>      "n_buckets": 10,
>      "description": "Size of the completion cache in entries"
>    },
> -  "URLCLASSIFIER_UPDATE_REMOTE_STATUS": {
> +  "URLCLASSIFIER_UPDATE_REMOTE_KEYED_STATUS": {

nit: `URLCLASSIFIER_UPDATE_REMOTE_STATUS2` seems to be the common pattern when we just want to change the detaisl of a probe

::: toolkit/components/url-classifier/Classifier.cpp:610
(Diff revision 1)
>      LOG(("update took %dms\n",
>           PR_IntervalToMilliseconds(clockEnd - clockStart)));
>    }
>  
> +  mLastUpdateResult = UPDATE_SUCCESS;
> +

nit: we can remove this blank line

::: toolkit/components/url-classifier/Classifier.cpp:1086
(Diff revision 1)
>      NS_ENSURE_TRUE(updateV4, NS_ERROR_FAILURE);
>  
>      if (updateV4->IsFullUpdate()) {
>        input->Clear();
>        output->Clear();
> -      rv = lookupCache->ApplyUpdate(updateV4, *input, *output);
> +      rv = lookupCache->ApplyUpdate(updateV4, *input, *output, mLastUpdateResult);

This is a bit weird. We return two different error codes.

Maybe what we need is simply to defined a new `NS_ERROR_MODULE_URL_CLASSIFIER` in `xpcom/base/ErrorList.h` just like http://searchfox.org/mozilla-central/rev/ef3cede9a5b050fffade9055ca274654f2bc719e/xpcom/base/ErrorList.h#684-718

That way, we can simply return `nsresult`s like `NS_ERROR_UC_UPDATE_SUCCESS` and `NS_ERROR_UC_UPDATE_DUPLICATE_PREFIX`.

::: toolkit/components/url-classifier/SafeBrowsingError.h:6
(Diff revision 1)
> +//* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
> +/* This Source Code Form is subject to the terms of the Mozilla Public
> + * License, v. 2.0. If a copy of the MPL was not distributed with this
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +#ifndef SafeBrowsingError_h__

The old code we have doesn't following the coding guidelines, but let's make sure we name the guards properly:

https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style#CC_practices

i.e. `SafeBrowsingError_h`
Attachment #8814789 - Flags: review?(francois) → review-
Comment on attachment 8814789 [details]
Bug 1311910 - Add telemetry to measure update error and update timeout rate for V2 and V4

https://reviewboard.mozilla.org/r/94628/#review98254

::: toolkit/components/telemetry/Histograms.json:4169
(Diff revision 3)
> -    "expires_in_version": "58",
> +    "expires_in_version": "59",
>      "kind": "enumerated",
> -    "n_values": 10,
> -    "bug_numbers": [1305801],
> -    "description": "An error was encountered while parsing a partial update returned by a Safe Browsing V4 server (0 = addition of an already existing prefix, 1 = parser got into an infinite loop, 2 = removal index out of bounds, 3 = checksum mismatch, 4 = missing checksum)"
> +    "keyed": true,
> +    "n_values": 16,
> +    "bug_numbers": [1311910],
> +    "description": "An error was encountered while update (0 = success, 1 = unspecified error, 2 = addition of an already existing prefix, 3 = parser got into an infinite loop, 4 = removal index out of bounds, 5 = checksum mismatch, 6 = missing checksum). Keyed by provider"

"Whether or not an error was encountered while processing a Safe Browsing update (0 = success..."

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:630
(Diff revision 3)
> +  nsCString provider;
> +  // Assume that all the tables in update should have the same provider.
> +  urlUtil->GetTelemetryProvider(mUpdateTables.SafeElementAt(0, EmptyCString()), provider);
> +
> +  nsresult updateStatus =
> +    NS_ERROR_GET_MODULE(mUpdateStatus) == NS_ERROR_MODULE_URL_CLASSIFIER ? mUpdateStatus :

It's a very good idea to check this!

::: xpcom/base/ErrorList.h:995
(Diff revision 3)
>  
>    /* ======================================================================= */
> +  /* 42: NS_ERROR_MODULE_URL_CLASSIFIER */
> +  /* ======================================================================= */
> +#define MODULE NS_ERROR_MODULE_URL_CLASSIFIER
> +  /* Update-related codes */

"Update-related error codes"

::: xpcom/base/ErrorList.h:996
(Diff revision 3)
>    /* ======================================================================= */
> +  /* 42: NS_ERROR_MODULE_URL_CLASSIFIER */
> +  /* ======================================================================= */
> +#define MODULE NS_ERROR_MODULE_URL_CLASSIFIER
> +  /* Update-related codes */
> +  /* A success code that indicates the update is successfully applied */

I think this comment is redundant.

::: xpcom/base/ErrorList.h:997
(Diff revision 3)
> +  /* 42: NS_ERROR_MODULE_URL_CLASSIFIER */
> +  /* ======================================================================= */
> +#define MODULE NS_ERROR_MODULE_URL_CLASSIFIER
> +  /* Update-related codes */
> +  /* A success code that indicates the update is successfully applied */
> +  ERROR(NS_SUCCESS_UC_UPDATE,                       SUCCESS(0)),

Let's rely on `NS_OK` for this case so that we don't have to define a new return code for the case we don't care about.

::: xpcom/base/ErrorList.h:1000
(Diff revision 3)
> +  /* Update-related codes */
> +  /* A success code that indicates the update is successfully applied */
> +  ERROR(NS_SUCCESS_UC_UPDATE,                       SUCCESS(0)),
> +
> +  ERROR(NS_ERROR_UC_UPDATE_UNKNOWN,                 FAILURE(1)),
> +  /* Error codes of applying V4 update. */

We could also remove this one.
Attachment #8814789 - Flags: review?(francois) → review-
Comment on attachment 8814789 [details]
Bug 1311910 - Add telemetry to measure update error and update timeout rate for V2 and V4

https://reviewboard.mozilla.org/r/94628/#review98262

A few minor things to fix and then this is good to go.

::: toolkit/components/telemetry/Histograms.json:4145
(Diff revisions 4 - 5)
>      "expires_in_version": "never",
>      "keyed": true,
>      "kind": "enumerated",
>      "n_values": 16,
>      "bug_numbers": [1311910],
> -    "description": "Server HTTP status code from SafeBrowsing database updates. (0=1xx, 1=200, 2=2xx, 3=204, 4=3xx, 5=400, 6=4xx, 7=403, 8=404, 9=408, 10=413, 11=5xx, 12=502|504|511, 13=503, 14=505, 15=Other). Keyed by provider"
> +    "description": "Whether or not an error was encountered while processing a Safe Browsing update. (0=1xx, 1=200, 2=2xx, 3=204, 4=3xx, 5=400, 6=4xx, 7=403, 8=404, 9=408, 10=413, 11=5xx, 12=502|504|511, 13=503, 14=505, 15=Other). Keyed by provider"

That's the wording I suggested for `URLCLASSIFIER_UPDATE_ERROR`, not `URLCLASSIFIER_UPDATE_REMOTE_STATUS2`.

::: toolkit/components/telemetry/Histograms.json:4169
(Diff revision 4)
> -    "expires_in_version": "58",
> +    "expires_in_version": "59",
>      "kind": "enumerated",
> -    "n_values": 10,
> -    "bug_numbers": [1305801],
> -    "description": "An error was encountered while parsing a partial update returned by a Safe Browsing V4 server (0 = addition of an already existing prefix, 1 = parser got into an infinite loop, 2 = removal index out of bounds, 3 = checksum mismatch, 4 = missing checksum)"
> +    "keyed": true,
> +    "n_values": 16,
> +    "bug_numbers": [1311910],
> +    "description": "An error was encountered while update (0 = success, 1 = unspecified error, 2 = addition of an already existing prefix, 3 = parser got into an infinite loop, 4 = removal index out of bounds, 5 = checksum mismatch, 6 = missing checksum). Keyed by provider"

"during a Safe Browsing update"

::: toolkit/components/url-classifier/nsUrlClassifierDBService.h:144
(Diff revision 4)
>    // Thread that we do the updates on.
>    static nsIThread* gDbBackgroundThread;
>  };
>  
> -class nsUrlClassifierDBServiceWorker final : public nsIUrlClassifierDBService
> +class nsUrlClassifierDBServiceWorker final : public nsIUrlClassifierDBService,
> +                                              public nsIURIClassifierCallback

nit: indentation

::: toolkit/components/url-classifier/nsUrlClassifierDBService.h:151
(Diff revision 4)
>  public:
>    nsUrlClassifierDBServiceWorker();
>  
>    NS_DECL_THREADSAFE_ISUPPORTS
>    NS_DECL_NSIURLCLASSIFIERDBSERVICE
> +NS_DECL_NSIURICLASSIFIERCALLBACK

nit: indentation

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:359
(Diff revision 4)
>  }
> -
> +NS_IMETHODIMP
> +nsUrlClassifierDBServiceWorker::OnClassifyComplete(nsresult aErrorCode)
> +{
> +  return NS_OK;
> +  

nit: unnecessary blank line

::: toolkit/components/url-classifier/LookupCacheV4.cpp:295
(Diff revision 5)
>      // checksum in .metadata
>      std::string stdChecksum(checksum.BeginReading(), checksum.Length());
>      aTableUpdate->NewChecksum(stdChecksum);
> -
>    } else if (aTableUpdate->Checksum() != checksum){
> -    LOG(("Checksum mismatch after applying partial update"));
> +    return NS_ERROR_UC_UPDATE_CHECKSUM_MISMATCH;

We should retain this LOG line.
Attachment #8814789 - Flags: review?(francois) → review-
since thomas is PTO today, i will help fix nits addressed in Comment 17
ah sorry i should push the last commit by thomas first then push fixes of Comment 17 again so we won't loss diff...
Comment on attachment 8805032 [details]
Bug 1311910 - Add telemetry to measure update error and update timeout rate for V2 and V4.

https://reviewboard.mozilla.org/r/88862/#review99276

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:1859
(Diff revision 2)
>        gFreshnessGuarantee = Preferences::GetInt(CONFIRM_AGE_PREF,
>          CONFIRM_AGE_DEFAULT_SEC);
>      }
>    } else if (!strcmp(aTopic, "quit-application")) {
>      Shutdown();
> +    nsresult rv;

I thought this was an unrelated change displayed because of a rebase, but this is not on mozilla-central yet: http://searchfox.org/mozilla-central/rev/09ae31fddfbc229ef1b8f1a3ccd87ad05680e695/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#1834

Any idea what this is?
Attachment #8805032 - Flags: review?(francois) → review-
(In reply to François Marier [:francois] from comment #21)
> 
> I thought this was an unrelated change displayed because of a rebase, but
> this is not on mozilla-central yet:
> http://searchfox.org/mozilla-central/rev/
> 09ae31fddfbc229ef1b8f1a3ccd87ad05680e695/toolkit/components/url-classifier/
> nsUrlClassifierDBService.cpp#1834
> 
> Any idea what this is?

Just checked the review history of this bug and recently changes in nsUrlClassifierDBService.cpp.
I think this is just accidentally added during rebase and maybe messed with local changes since thomas was working on shutdown-aware bug?

I don't think we want this and will remove this in next patch. Thanks for finding this!
Comment on attachment 8805032 [details]
Bug 1311910 - Add telemetry to measure update error and update timeout rate for V2 and V4.

https://reviewboard.mozilla.org/r/88862/#review99552

Ok thanks for removing the stray change.

I've also flagged a couple of other suspicious changes here. Can you double-check whether or not they are necessary?

If they are, then feel free to land this without further review. If they're not necessary, just take them out and consider it already r+.

datareview+

::: toolkit/components/url-classifier/nsUrlClassifierDBService.h:144
(Diff revision 3)
>    // Thread that we do the updates on.
>    static nsIThread* gDbBackgroundThread;
>  };
>  
> -class nsUrlClassifierDBServiceWorker final : public nsIUrlClassifierDBService
> +class nsUrlClassifierDBServiceWorker final : public nsIUrlClassifierDBService,
> +                                             public nsIURIClassifierCallback

Is this necessary?

::: toolkit/components/url-classifier/nsUrlClassifierDBService.h:151
(Diff revision 3)
>  public:
>    nsUrlClassifierDBServiceWorker();
>  
>    NS_DECL_THREADSAFE_ISUPPORTS
>    NS_DECL_NSIURLCLASSIFIERDBSERVICE
> +  NS_DECL_NSIURICLASSIFIERCALLBACK

Is this necessary?

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:140
(Diff revision 3)
>  
>  static mozilla::Atomic<int32_t> gFreshnessGuarantee(CONFIRM_AGE_DEFAULT_SEC);
>  
>  NS_IMPL_ISUPPORTS(nsUrlClassifierDBServiceWorker,
> -                  nsIUrlClassifierDBService)
> +                  nsIUrlClassifierDBService,
> +                  nsIURIClassifierCallback)

Is this necessary?

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:356
(Diff revision 3)
>    }
>  
>    return NS_OK;
>  }
>  
> +NS_IMETHODIMP

Is this necessary?
Attachment #8805032 - Flags: review?(francois) → review+
(In reply to Dimi Lee[:dimi][:dlee] from comment #22)
> (In reply to François Marier [:francois] from comment #21)
> > 
> > I thought this was an unrelated change displayed because of a rebase, but
> > this is not on mozilla-central yet:
> > http://searchfox.org/mozilla-central/rev/
> > 09ae31fddfbc229ef1b8f1a3ccd87ad05680e695/toolkit/components/url-classifier/
> > nsUrlClassifierDBService.cpp#1834
> > 
> > Any idea what this is?
> 
> Just checked the review history of this bug and recently changes in
> nsUrlClassifierDBService.cpp.
> I think this is just accidentally added during rebase and maybe messed with
> local changes since thomas was working on shutdown-aware bug?
> 
> I don't think we want this and will remove this in next patch. Thanks for
> finding this!

Oops, the my hg code base was corrupted when rebasing and I did not notice that. Thanks for helping me to fix that :(
(In reply to François Marier [:francois] from comment #24)
> I've also flagged a couple of other suspicious changes here. Can you
> double-check whether or not they are necessary?
> 

Checked with thomas, those changes should be removed.
Attachment #8814789 - Attachment is obsolete: true
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/77b7f79852e0
Add telemetry to measure update error and update timeout rate for V2 and V4. r=francois,henry
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/77b7f79852e0
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
Summary: Add telemetry to measure update error and update timeout rate for V2 and V4 → Add telemetry to measure update error rate for V2 and V4
Depends on: 1346196
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: