Add quick suggest probe that records fallbacks from Merino to remote settings
Categories
(Firefox :: Address Bar, task, P1)
People
(Reporter: adw, Assigned: adw)
Attachments
(2 files)
48 bytes, text/x-phabricator-request | RyanVM: approval-mozilla-beta+, RyanVM: approval-mozilla-release+
3.29 KB, text/plain | ccd: data-review+
Basically, the desired measure is the fraction of Merino fallbacks, that is, # of Merino fallbacks / # of Merino requests. I believe this can be achieved by adding a keyed scalar probe that has the keys (success, fallback), with counter values.
We will want to know the number of times (count) Merino was successful and the number of times it fell back.
There is a discussion around how to implement it. Ideally we would let Merino finish and have the probe capture the time, though the fallback to remote settings will happen at a set time.
mythmon says "fallback" means when we request suggestions from Merino for a query but end up using remote settings for that query.
Comment 1 • 4 years ago (Assignee)
Corey, I'm going to need a tighter definition of "fallback" in order to implement this. CC'ing mythmon too. There are several cases where we would fall back from Merino to remote settings:
1. The Merino request times out
2. There's a network error connecting to Merino, e.g., the user's internet is down or the Merino server is down
3. The Merino request completes successfully but the server returns an HTTP error, e.g., the server is misconfigured or buggy
4. The Merino request completes successfully without an HTTP error but it doesn't return a suggestion (i.e., there's no matching suggestion) -- but in this case, does "fallback" depend on whether remote settings returns a suggestion?
5. The Merino request completes successfully without an HTTP error and returns a suggestion but the suggestion's score is lower than the remote settings suggestion score
One other question: Bug 1737928 adds the timeout mechanism, and as part of that I'm adding a keyed scalar probe to record the number of timeouts. I guess if "fallback" just means timeouts, then that would be enough? But it doesn't record the number of successes, which we would also want? If fallback does not mean timeouts, then would a separate timeout probe still be useful?
Updated • 4 years ago (Assignee)
Comment 2 • 4 years ago
Drew, would it be too difficult to add keys corresponding to these different cases? This would allow for more granular analyses.
However, if this is not possible, I believe 1-3 should be lumped together as "fallback" insofar as telemetry is concerned. Cases 4 and 5 aren't really a fallback, as they don't correspond to a Merino failure but rather to the nature of the suggestions returned.
To answer your second question, having the # of successes is important for comparison purposes. I believe it would be more useful to have all of this information in a single probe, with different keys:
- success
- timeout
- network_error
- http_error
In this case a separate timeout probe is redundant.
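The measure requested in the description can be derived from these four keys. The following is a minimal sketch, assuming the per-key counts have already been aggregated into a plain object; the helper name `merinoFallbackFraction` and the sample counts are hypothetical and are not part of the patch:

```javascript
// Hypothetical helper: computes the fraction of Merino requests that fell
// back to remote settings. It treats timeout, network_error, and http_error
// as fallbacks (cases 1-3 in this discussion) and success as the only
// non-fallback key.
const FALLBACK_KEYS = ["timeout", "network_error", "http_error"];

function merinoFallbackFraction(counts) {
  const fallbacks = FALLBACK_KEYS.reduce(
    (sum, key) => sum + (counts[key] ?? 0),
    0
  );
  const requests = fallbacks + (counts.success ?? 0);
  // Avoid dividing by zero when no requests were recorded.
  return requests === 0 ? 0 : fallbacks / requests;
}

// Made-up sample counts, for illustration only.
const sample = { success: 90, timeout: 6, network_error: 3, http_error: 1 };
console.log(merinoFallbackFraction(sample)); // 0.1
```

Keeping the three failure keys separate still allows the lumped "fallback" figure to be computed after the fact, while preserving the more granular breakdown.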
Comment 3 • 4 years ago (Assignee)
That sounds good, we can do that, thanks Corey. I'll go with the four keys you suggested.
Comment 4 • 4 years ago (Assignee)
This adds a new categorical histogram called FX_URLBAR_MERINO_RESPONSE. There
are four categories per the discussion in the bug:
0: success
1: timeout
2: network_error
3: http_error
Only one value is recorded per fetch, so for example if Merino times out but
then later finishes successfully, we only record the timeout.
Depends on D129772
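The record-once behavior described above can be sketched as follows. This is an illustration under stated assumptions rather than the patch itself: `FakeHistogram` and `MerinoResponseRecorder` are hypothetical stand-ins, and only the category labels and their order come from this bug.

```javascript
// Category labels in the order listed above; the index is the recorded value.
const CATEGORIES = ["success", "timeout", "network_error", "http_error"];

// Hypothetical stand-in for a categorical histogram: one bucket per category.
class FakeHistogram {
  constructor() {
    this.buckets = new Array(CATEGORIES.length).fill(0);
  }
  add(label) {
    const index = CATEGORIES.indexOf(label);
    if (index < 0) {
      throw new Error(`Unknown category: ${label}`);
    }
    this.buckets[index]++;
  }
}

// Hypothetical per-fetch recorder: the first outcome wins, and any later
// outcome reported for the same fetch is ignored.
class MerinoResponseRecorder {
  constructor(histogram) {
    this.histogram = histogram;
    this.recorded = false;
  }
  record(label) {
    if (this.recorded) {
      return false;
    }
    this.histogram.add(label);
    this.recorded = true;
    return true;
  }
}

const histogram = new FakeHistogram();
const fetchRecorder = new MerinoResponseRecorder(histogram);
fetchRecorder.record("timeout"); // The timeout fires first and is recorded.
fetchRecorder.record("success"); // The late response is ignored.
console.log(histogram.buckets); // [ 0, 1, 0, 0 ]
```

One recorder per fetch keeps the sum of all buckets equal to the number of Merino requests, which is what makes the fallback fraction meaningful.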
Comment 5 • 4 years ago (Assignee)
Data review request for the FX_URLBAR_MERINO_RESPONSE categorical histogram
Updated • 4 years ago
Comment 6 • 4 years ago
request.md
DATA COLLECTION REVIEW RESPONSE:
Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?
Yes, it will be available with other telemetry on DTMO.
Is there a control mechanism that allows the user to turn the data collection on and off?
Clients may use the Firefox telemetry opt-out mechanism.
If the request is for permanent data collection, is there someone who will monitor the data over time?
Yes, Drew Willcoxon, Contextual Services team.
Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
Category 2, Interaction data
Is the data collection request for default-on or default-off?
Default on for all channels.
Does the instrumentation include the addition of any new identifiers?
No.
Is the data collection covered by the existing Firefox privacy notice?
Yes
Does the data collection use a third-party collection tool?
No
Result: datareview+
Comment 8 • 4 years ago (bugherder)
Comment 9 • 4 years ago
@Drew, I have started to create test cases for Merino and also wanted to cover this new histogram. In order to cover and verify this bug, I have the following scenarios:
0: success
- Enable Merino and trigger a Sponsored/Non-Sponsored result. The value of the "0" column increases.
1: timeout
- Enable Merino and change the "browser.urlbar.merino.timeoutMs" pref to "1". After triggering a Sponsored/Non-Sponsored result, the value of the "1" column increases.
2: network_error
- Enable Merino then disable the internet connection. After triggering a Sponsored/Non-Sponsored result, the value of the "2" column increases.
3: http_error
- Enable Merino then change the "browser.urlbar.merino.endpointURL" pref to an invalid endpoint. After triggering a Sponsored/Non-Sponsored result, the value of the "3" column increases.
Can you please let me know if these are the correct scenarios in order to verify this histogram? Are there any other scenarios that we should cover for this?
I have noticed on the Histogram that there is a fourth "4" column in some cases, but I never saw a value for it (see this screenshot). Is there another scenario for this, and when is it triggered?
Comment 10 • 4 years ago (Assignee)
Thanks Cosmin.
(In reply to Cosmin Muntean [:cmuntean], Ecosystem QA from comment #9)
> @Drew, I have started to create test cases for Merino and also wanted to cover this new histogram. In order to cover and verify this bug, I have the following scenarios:
> 0: success
> - Enable Merino and trigger a Sponsored/Non-Sponsored result. The value of the "0" column increases.
Yes
> 1: timeout
> - Enable Merino and change the "browser.urlbar.merino.timeoutMs" pref to "1". After triggering a Sponsored/Non-Sponsored result, the value of the "1" column increases.
Yes
> 2: network_error
> - Enable Merino then disable the internet connection. After triggering a Sponsored/Non-Sponsored result, the value of the "2" column increases.
This one might be hard to trigger. Your STR here are good, but the problem is the timeout might happen before the network error. If you set browser.urlbar.merino.timeoutMs to a very large value like 30000 (30 seconds) then use your STR, that should work.
> 3: http_error
> - Enable Merino then change the "browser.urlbar.merino.endpointURL" pref to an invalid endpoint. After triggering a Sponsored/Non-Sponsored result, the value of the "3" column increases.
That would trigger network_error, not http_error. In order to test this, you need the Merino server to return a non-200 (non-success) response. In other words you would need to modify the server somehow, or you could set up any local web server that returns a 500 response and set its URL to browser.urlbar.merino.endpointURL. TBH I'm not sure it's worth your time to do that, so IMO you can skip verifying this value.
> Are there any other scenarios that we should cover for this?
That's it, thanks.
> I have noticed on the Histogram that there is a fourth "4" column in some cases, but I never saw a value for it
There isn't a 4th value, only 0-3, so you can ignore it. I think the extra value is just a consequence of how histograms are implemented, but I'm not sure.
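The expected category for each of the scenarios discussed in this comment can be summarized in a small sketch. The outcome object and `classifyMerinoOutcome` are hypothetical and only model the behavior described here; they are not the client's actual code, and treating every status outside 200-299 as an HTTP error is an assumption.

```javascript
// Hypothetical model of one finished Merino fetch attempt:
//   timedOut     - the client-side timeout fired before anything else
//   networkError - the connection failed (offline, unreachable endpoint)
//   status       - HTTP status code, or null when no response was received
function classifyMerinoOutcome({
  timedOut = false,
  networkError = false,
  status = null,
} = {}) {
  if (timedOut) {
    // The timeout wins when it fires before a network error is reported,
    // which is why a large timeoutMs helps when testing network_error.
    return "timeout";
  }
  if (networkError || status === null) {
    return "network_error";
  }
  // Assumption: any status outside 200-299 counts as an HTTP error.
  if (status < 200 || status >= 300) {
    return "http_error";
  }
  return "success";
}

console.log(classifyMerinoOutcome({ status: 200 })); // success
console.log(classifyMerinoOutcome({ timedOut: true, networkError: true })); // timeout
console.log(classifyMerinoOutcome({ networkError: true })); // network_error
console.log(classifyMerinoOutcome({ status: 500 })); // http_error
```

In this model an unreachable endpoint never produces an HTTP status, so it lands in network_error rather than http_error, matching the correction to scenario 3.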
Comment 11 • 4 years ago (Assignee)
Comment on attachment 9249525 [details]
Bug 1737923 - Add a telemetry histogram for recording Merino response categories.
Beta/Release Uplift Approval Request
- User impact if declined: We need this for the Firefox Suggest preferences redesign targeting 95/94.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: Yes
- If yes, steps to reproduce: Please see comment 9 and 10
- List of other uplifts needed: Please see uplift spreadsheet: https://docs.google.com/spreadsheets/d/1LavihS-VOPFYEyum7mrx6FKXmuQeHi9xQHfGNSxjnoY/edit?usp=sharing
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This only adds some new telemetry related to Merino client integration, which is disabled for all users and will only be enabled in a future Merino rollout.
- String changes made/needed:
Comment 12 • 4 years ago
Comment on attachment 9249525 [details]
Bug 1737923 - Add a telemetry histogram for recording Merino response categories.
Approved for 95.0b5.
Comment 13 • 4 years ago (bugherder uplift)
Updated • 4 years ago
Comment 14 • 4 years ago
>> 3: http_error
>> - Enable Merino then change the "browser.urlbar.merino.endpointURL" pref to an invalid endpoint. After triggering a Sponsored/Non-Sponsored result, the value of the "3" column increases.
> That would trigger network_error, not http_error. In order to test this, you need the Merino server to return a non-200 (non-success) response. In other words you would need to modify the server somehow, or you could set up any local web server that returns a 500 response and set its URL to browser.urlbar.merino.endpointURL. TBH I'm not sure it's worth your time to do that, so IMO you can skip verifying this value.
@Drew, I have managed to trigger the "3: http_error" by setting the "browser.urlbar.merino.endpointURL" to "https://stage.merino.nonprod.cloudops.mozgcp.net/api/v1/suggest1". I have added only the "1" number at the end of the endpoint. Indeed if I change the endpoint to something invalid like "https://www.test.com" the "http_error" is not triggered.
Is it OK if we use this scenario to verify this?
Comment 15 • 4 years ago (Assignee)
Oh, good idea, yeah that's good. Adding a 1 at the end causes the response to be a 404, which triggers http_error. Thanks Cosmin!
Comment 16 • 4 years ago
We have verified this bug on the latest Nightly 96.0a1 build (Build ID: 20211109190508) and the latest Beta 95.0b5 (Build ID: 20211109194756) on Windows 10 x64, macOS 10.15.7 and Ubuntu 20.04 x64.
- In order to verify this issue we have used the scenarios from comment 10 and comment 14.
Comment 17 • 4 years ago (Assignee)
[Tracking Requested - why for this release]: We need this for the Firefox Suggest preferences redesign targeting 95/94.
Comment 18 • 4 years ago (Assignee)
Comment on attachment 9249525 [details]
Bug 1737923 - Add a telemetry histogram for recording Merino response categories.
Beta/Release Uplift Approval Request
- User impact if declined: We need this for the Firefox Suggest preferences redesign targeting 95/94.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: Yes
- If yes, steps to reproduce: Please see comment 9 and 10
- List of other uplifts needed: Please see uplift spreadsheet: https://docs.google.com/spreadsheets/d/1LavihS-VOPFYEyum7mrx6FKXmuQeHi9xQHfGNSxjnoY/edit?usp=sharing
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This only adds some new telemetry related to Merino client integration, which is disabled for all users and will only be enabled in a future Merino rollout.
- String changes made/needed:
Updated • 4 years ago (Assignee)
Comment 19 • 4 years ago
We have verified this bug on a Firefox 94.0.2 try build on Windows 10 x64, macOS 10.15.7 and Ubuntu 20.04 x64.
- In order to verify this issue we have used the scenarios from comment 10 and comment 14.
Updated • 4 years ago
Comment 20 • 4 years ago
Comment on attachment 9249525 [details]
Bug 1737923 - Add a telemetry histogram for recording Merino response categories.
Approved for 94.0.2.
Comment 21 • 4 years ago (bugherder uplift)
Comment 22 • 4 years ago
We have verified this bug on the Firefox 94.0.2 candidate build (Build ID: 20211117154346) on Windows 10 x64, macOS 10.15.7 and Ubuntu 20.04 x64.
- In order to verify this issue we have used the scenarios from comment 10 and comment 14.