1737923 - Add quick suggest probe that records fallbacks from Merino to remote settings

Assignee

Description

4 years ago

Basically, the desired measure is the fraction of Merino requests that fall back, i.e., (# of Merino fallbacks) / (# of Merino requests). I believe this can be achieved by adding a keyed scalar probe with two keys, success and fallback, each holding a counter.
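As a rough illustration of the measure, here is a minimal Python sketch that computes the fallback fraction from one snapshot of such a keyed scalar. The function name and the sample counts are hypothetical, and it assumes every Merino request is recorded under exactly one of the two keys, so the total number of requests is success + fallback.

```python
from typing import Mapping


def fallback_fraction(counts: Mapping[str, int]) -> float:
    """Return fallbacks / total requests, or 0.0 when nothing was recorded."""
    success = counts.get("success", 0)
    fallback = counts.get("fallback", 0)
    total = success + fallback  # assumption: each request lands in exactly one key
    return fallback / total if total else 0.0


# Hypothetical snapshot of the keyed scalar for one client.
example = {"success": 90, "fallback": 10}
print(fallback_fraction(example))  # 10 / (90 + 10)
```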

We will want to know how many times (count) Merino was successful and how many times it fell back.

There is a discussion around how to implement it. Ideally we would let Merino finish and have the probe capture the time, though the fallback to remote settings will happen at a set time.

mythmon says "fallback" means when we request suggestions from Merino for a query but end up using remote settings for that query.

Drew Willcoxon :adw

Assignee

Comment 1

4 years ago

Corey, I'm going to need a tighter definition of "fallback" in order to implement this. CC'ing mythmon too. There are several cases where we would fall back from Merino to remote settings:

1. The Merino request times out
2. There's a network error connecting to Merino, e.g., the user's internet is down or the Merino server is down
3. The Merino request completes successfully but the server returns an HTTP error, e.g., the server is misconfigured or buggy
4. The Merino request completes successfully without an HTTP error but it doesn't return a suggestion (i.e., there's no matching suggestion) -- but in this case, does "fallback" depend on whether remote settings returns a suggestion?
5. The Merino request completes successfully without an HTTP error and returns a suggestion but the suggestion's score is lower than the remote settings suggestion score

One other question: Bug 1737928 adds the timeout mechanism, and as part of that I'm adding a keyed scalar probe to record the number of timeouts. I guess if "fallback" just means timeouts, then that would be enough? But it doesn't record the number of successes, which we would also want? If fallback does not mean timeouts, then would a separate timeout probe still be useful?

Flags: needinfo?(cdowhygelund)

Drew Willcoxon :adw

Assignee

Updated

4 years ago

Iteration: 95.2 - Oct 18 - Oct 31 → 96.1 - Nov 1 - Nov 14

Depends on: 1737928

Corey Dow-Hygelund [:ccd]

Comment 2

4 years ago

Drew, would it be too difficult to add keys that correspond to these different cases? This would allow for more granular analyses.

However, if this is not possible, I believe cases 1-3 should be lumped together as "fallback" insofar as telemetry is concerned. Cases 4 and 5 aren't really fallbacks, as they don't correspond to a Merino failure but rather to the nature of the suggestions returned.

To answer your second question, having the # of successes is important for comparison purposes. I believe it would be more useful to have all of this information in a single probe, with different keys:

success
timeout
network_error
http_error

In this case a separate timeout probe is redundant.
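To make the proposed mapping concrete, here is a hedged Python sketch of how a single fetch outcome could be assigned to one of the four keys. The enum, the function, and its parameters are hypothetical and are not taken from the Firefox code. It assumes a timeout takes precedence over other failures and that any 2xx status counts as success. Cases 4 and 5 (no suggestion, or a lower-scoring suggestion) end up under success, because the request itself did not fail.

```python
from enum import Enum
from typing import Optional


class MerinoResponse(Enum):
    SUCCESS = "success"
    TIMEOUT = "timeout"
    NETWORK_ERROR = "network_error"
    HTTP_ERROR = "http_error"


def classify(timed_out: bool, network_failed: bool,
             http_status: Optional[int]) -> MerinoResponse:
    """Map one fetch outcome to exactly one key (hypothetical helper)."""
    if timed_out:
        return MerinoResponse.TIMEOUT
    if network_failed or http_status is None:
        # No HTTP response was received at all.
        return MerinoResponse.NETWORK_ERROR
    if not 200 <= http_status < 300:
        # Assumption: only 2xx responses count as success.
        return MerinoResponse.HTTP_ERROR
    # Covers cases 4 and 5: the request succeeded regardless of
    # whether a usable suggestion came back.
    return MerinoResponse.SUCCESS
```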

Flags: needinfo?(cdowhygelund)

Drew Willcoxon :adw

Assignee

Comment 3

4 years ago

That sounds good, we can do that, thanks Corey. I'll go with the four keys you suggested.

Drew Willcoxon :adw

Assignee

Comment 4

4 years ago

Attached file Bug 1737923 - Add a telemetry histogram for recording Merino response categories.

This adds a new categorical histogram called FX_URLBAR_MERINO_RESPONSE. There
are four categories per the discussion in the bug:

0: success
1: timeout
2: network_error
3: http_error

Only one value is recorded per fetch, so for example if Merino times out but
then later finishes successfully, we only record the timeout.
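
The "one value per fetch" rule can be pictured with a small Python sketch. This is not the patch itself; the class name and its API are hypothetical. It only shows the first recorded category winning and later ones being dropped.

```python
from typing import Optional


class FetchOutcome:
    """Holds at most one recorded category for a single fetch."""

    def __init__(self) -> None:
        self.category: Optional[str] = None

    def record(self, category: str) -> bool:
        """Store the category if none is set yet; return True when stored."""
        if self.category is not None:
            return False  # something was already recorded for this fetch
        self.category = category
        return True


outcome = FetchOutcome()
outcome.record("timeout")  # the timeout fires first
outcome.record("success")  # the late response is ignored
print(outcome.category)    # prints "timeout"
```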

Depends on D129772

Drew Willcoxon :adw

Assignee

Comment 5

4 years ago

Attached file request.md

Data review request for the FX_URLBAR_MERINO_RESPONSE categorical histogram

Attachment #9249825 - Flags: data-review?(cdowhygelund)

Corey Dow-Hygelund [:ccd]

Updated

4 years ago

Attachment #9249825 - Flags: data-review?(cdowhygelund) → data-review+

Corey Dow-Hygelund [:ccd]

Comment 6

4 years ago

request.md

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes, it will be available with other telemetry on DTMO.

Is there a control mechanism that allows the user to turn the data collection on and off?

Clients may use the Firefox telemetry opt-out mechanism.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Drew Willcoxon, Contextual Services team.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 2, Interaction data

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes

Does the data collection use a third-party collection tool?

No

Result: datareview+

Pulsebot

Comment 7

4 years ago

Pushed by dwillcoxon@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ef51860f3668 Add a telemetry histogram for recording Merino response categories. r=nanj

Iulian Moraru

Comment 8

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/ef51860f3668

Status: ASSIGNED → RESOLVED

Closed: 4 years ago

status-firefox96: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 96 Branch

Cosmin Muntean [:cmuntean], Ecosystem QA

Comment 9

4 years ago

@Drew, I have started to create test cases for Merino and also wanted to cover this new histogram. In order to cover and verify this bug, I have the following scenarios:

0: success

Enable Merino and trigger a Sponsored/Non-Sponsored result. The value of "0" column increases.

1: timeout

Enable Merino and change the "browser.urlbar.merino.timeoutMs" pref to "1". After triggering a Sponsored/Non-Sponsored result, the value of the "1" column increases.

2: network_error

Enable Merino then disable the internet connection. After triggering a Sponsored/Non-Sponsored result, the value of the "2" column increases.

3: http_error

Enable Merino then change the "browser.urlbar.merino.endpointURL" pref to an invalid endpoint. After triggering a Sponsored/Non-Sponsored result, the value of the "3" column increases.

Can you please let me know if these are the correct scenarios in order to verify this histogram? Are there any other scenarios that we should cover for this?
I have noticed that the histogram shows an extra "4" column in some cases, but I never saw a value for it (see this screenshot). Is there another scenario for this, and when is it triggered?

Flags: needinfo?(adw)

Drew Willcoxon :adw

Assignee

Comment 10

4 years ago

Thanks Cosmin.

(In reply to Cosmin Muntean [:cmuntean], Ecosystem QA from comment #9)

@Drew, I have started to create test cases for Merino and also wanted to cover this new histogram. In order to cover and verify this bug, I have the following scenarios:

0: success

Enable Merino and trigger a Sponsored/Non-Sponsored result. The value of "0" column increases.

Yes

1: timeout

Enable Merino and change the "browser.urlbar.merino.timeoutMs" pref to "1". After triggering a Sponsored/Non-Sponsored result, the value of the "1" column increases.

Yes

2: network_error

Enable Merino then disable the internet connection. After triggering a Sponsored/Non-Sponsored result, the value of the "2" column increases.

This one might be hard to trigger. Your STR here are good, but the problem is the timeout might happen before the network error. If you set browser.urlbar.merino.timeoutMs to a very large value like 30000 (30 seconds) then use your STR, that should work.

3: http_error

Enable Merino then change the "browser.urlbar.merino.endpointURL" pref to an invalid endpoint. After triggering a Sponsored/Non-Sponsored result, the value of the "3" column increases.

That would trigger network_error, not http_error. In order to test this, you need the Merino server to return a non-200 (non-success) response. In other words you would need to modify the server somehow, or you could set up any local web server that returns a 500 response and set its URL to browser.urlbar.merino.endpointURL. TBH I'm not sure it's worth your time to do that, so IMO you can skip verifying this value.
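
For anyone who does want to exercise http_error locally, a minimal sketch of such a server follows, assuming Python's standard http.server module. The names AlwaysError and make_server are made up for illustration, and this has not been checked against the real Merino client.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class AlwaysError(BaseHTTPRequestHandler):
    """Answers every GET with a 500 so the client sees an HTTP error."""

    def do_GET(self) -> None:
        body = b"simulated server error"
        self.send_response(500)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        pass  # keep the console quiet


def make_server(port: int = 0) -> HTTPServer:
    """Bind to localhost; port 0 lets the OS choose a free port."""
    return HTTPServer(("127.0.0.1", port), AlwaysError)
```

Calling make_server(8080).serve_forever() would keep it running on port 8080 (the port is arbitrary), after which browser.urlbar.merino.endpointURL can be pointed at that local URL.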

Are there any other scenarios that we should cover for this?

That's it, thanks.

I have noticed on the Histogram that there is a fourth "4" column in some cases, but I never saw a value for it

There isn't a 4th value, only 0-3, so you can ignore it. I think the extra value is just a consequence of how histograms are implemented, but I'm not sure.

Flags: qe-verify+

Flags: needinfo?(adw)

Flags: in-testsuite+

Drew Willcoxon :adw

Assignee

Comment 11

4 years ago

Comment on attachment 9249525 [details]
Bug 1737923 - Add a telemetry histogram for recording Merino response categories.

Beta/Release Uplift Approval Request

User impact if declined: We need this for the Firefox Suggest preferences redesign targeting 95/94.
Is this code covered by automated tests?: Yes
Has the fix been verified in Nightly?: No
Needs manual test from QE?: Yes
If yes, steps to reproduce: Please see comment 9 and 10
List of other uplifts needed: Please see uplift spreadsheet: https://docs.google.com/spreadsheets/d/1LavihS-VOPFYEyum7mrx6FKXmuQeHi9xQHfGNSxjnoY/edit?usp=sharing
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): This only adds some new telemetry related to Merino client integration, which is disabled for all users and will only be enabled in a future Merino rollout.
String changes made/needed:

Attachment #9249525 - Flags: approval-mozilla-beta?

Ryan VanderMeulen [:RyanVM]

Comment 12

4 years ago

Comment on attachment 9249525 [details]
Bug 1737923 - Add a telemetry histogram for recording Merino response categories.

Approved for 95.0b5.

Attachment #9249525 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Ryan VanderMeulen [:RyanVM]

Comment 13

4 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/37b181233bc7

status-firefox95: --- → fixed

Cornel Ionce [:noni]

Updated

4 years ago

QA Whiteboard: [qa-triaged]

Cosmin Muntean [:cmuntean], Ecosystem QA

Comment 14

4 years ago

3: http_error

Enable Merino then change the "browser.urlbar.merino.endpointURL" pref to an invalid endpoint. After triggering a Sponsored/Non-Sponsored result, the value of the "3" column increases.

That would trigger network_error, not http_error. In order to test this, you need the Merino server to return a non-200 (non-success) response. In other words you would need to modify the server somehow, or you could set up any local web server that returns a 500 response and set its URL to browser.urlbar.merino.endpointURL. TBH I'm not sure it's worth your time to do that, so IMO you can skip verifying this value.

@Drew, I have managed to trigger "3: http_error" by setting "browser.urlbar.merino.endpointURL" to "https://stage.merino.nonprod.cloudops.mozgcp.net/api/v1/suggest1"; I only added the number "1" at the end of the endpoint. Indeed, if I change the endpoint to something invalid like "https://www.test.com", "http_error" is not triggered.
Is it OK if we use this scenario to verify this?

Flags: needinfo?(adw)

Drew Willcoxon :adw

Assignee

Comment 15

4 years ago

Oh, good idea, yeah that's good. Adding a 1 at the end causes the response to be a 404, which triggers http_error. Thanks Cosmin!

Flags: needinfo?(adw)

Cosmin Muntean [:cmuntean], Ecosystem QA

Comment 16

4 years ago

We have verified this bug on the latest Nightly 96.0a1 build (Build ID: 20211109190508) and the latest Beta 95.0b5 (Build ID: 20211109194756) on Windows 10 x64, macOS 10.15.7 and Ubuntu 20.04 x64.

In order to verify this issue we have used the scenarios from comment 10 and comment 14.

Status: RESOLVED → VERIFIED

status-firefox95: fixed → verified

status-firefox96: fixed → verified

Flags: qe-verify+

Drew Willcoxon :adw

Assignee

Comment 17

4 years ago

[Tracking Requested - why for this release]: We need this for the Firefox Suggest preferences redesign targeting 95/94.

tracking-firefox94: --- → ?

Drew Willcoxon :adw

Assignee

Comment 18

4 years ago

Comment on attachment 9249525 [details]
Bug 1737923 - Add a telemetry histogram for recording Merino response categories.

Beta/Release Uplift Approval Request

User impact if declined: We need this for the Firefox Suggest preferences redesign targeting 95/94.
Is this code covered by automated tests?: Yes
Has the fix been verified in Nightly?: Yes
Needs manual test from QE?: Yes
If yes, steps to reproduce: Please see comment 9 and 10
List of other uplifts needed: Please see uplift spreadsheet: https://docs.google.com/spreadsheets/d/1LavihS-VOPFYEyum7mrx6FKXmuQeHi9xQHfGNSxjnoY/edit?usp=sharing
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): This only adds some new telemetry related to Merino client integration, which is disabled for all users and will only be enabled in a future Merino rollout.
String changes made/needed:

Attachment #9249525 - Flags: approval-mozilla-release?

Drew Willcoxon :adw

Assignee

Updated

4 years ago

Flags: qe-verify+

Cosmin Muntean [:cmuntean], Ecosystem QA

Comment 19

4 years ago

We have verified this bug on Firefox 94.0.2 try build on Windows 10 x64, macOS 10.15.7 and Ubuntu 20.04 x64.

In order to verify this issue we have used the scenarios from comment 10 and comment 14.

Ryan VanderMeulen [:RyanVM]

Updated

4 years ago

tracking-firefox94: ? → +

Ryan VanderMeulen [:RyanVM]

Comment 20

4 years ago

Comment on attachment 9249525 [details]
Bug 1737923 - Add a telemetry histogram for recording Merino response categories.

Approved for 94.0.2.

Attachment #9249525 - Flags: approval-mozilla-release? → approval-mozilla-release+

Ryan VanderMeulen [:RyanVM]

Comment 21

4 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-release/rev/4d442d44e98f

status-firefox94: --- → fixed

Cosmin Muntean [:cmuntean], Ecosystem QA

Comment 22

4 years ago

We have verified this bug on Firefox 94.0.2 candidate build (Build ID: 20211117154346) on Windows 10 x64, macOS 10.15.7 and Ubuntu 20.04 x64.

In order to verify this issue we have used the scenarios from comment 10 and comment 14.

status-firefox94: fixed → verified

Flags: qe-verify+

Bug 1737923 - Add a telemetry histogram for recording Merino response categories. (4 years ago, Drew Willcoxon :adw; 48 bytes, text/x-phabricator-request; flags: RyanVM approval-mozilla-beta+, RyanVM approval-mozilla-release+)
request.md (4 years ago, Drew Willcoxon :adw; 3.29 KB, text/plain; flags: ccd data-review+)