Closed Bug 1592934 Opened 4 months ago Closed 2 months ago

QM: Add new telemetry probes for tracking initialization success rates

Categories

(Core :: Storage: Quota Manager, task, P1)

task

Tracking

()

RESOLVED FIXED
mozilla73
Tracking Status
firefox72 --- fixed
firefox73 --- fixed

People

(Reporter: janv, Assigned: tt)

References

(Blocks 1 open bug)

Details

Attachments

(12 files, 4 obsolete files)

47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
2.65 KB, text/plain
chutten
: data-review+
Details
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
2.97 KB, text/plain
mmccorquodale
: data-review+
Details
47 bytes, text/x-phabricator-request
Details | Review

We need a simple report that can be used to evaluate temporary storage initialization success rate after we land a fix that is supposed to improve it. That's current requirement and it needs to be done ASAP. Long term, the telemetry should probably stay active so we will be able to detect possible regressions.
Ideally, we would have this on all channels (Nightly, Beta, Release) and we should also try to get this into 71 Release (71 is still in relatively early beta cycle, so there's a chance for an approval there).

Status: NEW → ASSIGNED

From try, it looks like the added test pass on all platform

I'm looking at the nightly only telemetry for specific issues and I found out that we don't cover EnsureStorageIsInitialized. If that fails all origins will fail to initialize too, but that would be a different bug. Should we cover it here at least ?

(In reply to Jan Varga [:janv] from comment #4)

I'm looking at the nightly only telemetry for specific issues and I found out that we don't cover EnsureStorageIsInitialized. If that fails all origins will fail to initialize too, but that would be a different bug. Should we cover it here at least ?

Maybe having two telemetry probes for these two different success rates? We can discuss this in the meeting today afternoon.

If we have one for EnsureStorageIsInitialized and the other one for EnsureTemporaryStorageIsInitialized, then we can cover more cases. (e.g. Maintance only call EnsureStorageIsInitialized). And ideally, these two success rates should close to a really high number (e.g. 95%).

Yeah, maybe, but I'm afraid we need to aim for even higher success rate. As we found out, this is really essential for LSNG.

There are still places in EnsureOriginIsInitializedInternal that can fail (for example creation of the metadata file for a new origin directory) which wouldn't be covered by these new telemetry probes. I have to think about it more.

Ok, so if we decided to have separate telemetry probes for EnsureStorageIsInitialized and EnsureTemporaryStorageIsInitialized, we need another one for EnsureOriginIsInitialized that will track success/failure not included in QM_STORAGE_INIT_RESULT and QM_TEMPORARY_STORAGE_INIT_RESULT.

I also wonder why these telemetries need to be scalars and not histograms.

(In reply to Jan Varga [:janv] from comment #10)

Ok, so if we decided to have separate telemetry probes for EnsureStorageIsInitialized and EnsureTemporaryStorageIsInitialized, we need another one for EnsureOriginIsInitialized that will track success/failure not included in QM_STORAGE_INIT_RESULT and QM_TEMPORARY_STORAGE_INIT_RESULT.

Will add a patch for that.

I also wonder why these telemetries need to be scalars and not histograms.

I did that based on the documentation here:
https://docs.telemetry.mozilla.org/concepts/pipeline/data_pipeline.html?highlight=scalar#firefox

And I quote that here: "scalars – for recording single values;".
Since the telemetry (success rate) we want to collect here is quite simple (a boolean), I choose "scalars"

I checked some scalars and it seems it's useful for example for tracking user preferences which take effect after restart, so it doesn't make sense to record all values. In our case, the initialization can be repeated and may succeed next time. Especially the additional probe for EnsureOriginIsInitialized may succeed for some origins and fail for other origins. We need to do it right and consider all the downsides and upsides.

There is a downside for Scalars:

There is a warning message in the document of the Scalars:
"Scalar operations are designed to be cheap, not free. If you wish to manipulate Scalars in a performance-sensitive piece of code, store the operations locally and change the Scalar only after the performance-sensitive piece (“hot path”) has completed."

It probably indicates the Scalars are not really suitable for the performance-sensitive piece of code.

Upside:
However, note that there is no similar type of collection in Histogram for Scalar::Set(flag). (There was flag in Histogram, but it was deprecated.) In Histogram, the value would be updated every time. (that means if we are going to use the same way to record, we might get a lot for failures for a client (because at most each client would only have one success))

For the QM_STORAGE_INIT_RESULT and QM_TEMPORARY_STORAGE_INIT_RESULT, I think the Scalar is the most suitable tool for collecting the data.

However, for EnsureOriginIsInitialized, histogram might be more suitable since we want to record the success rate for each origin? (We can avoid leaking origin name by hashing it into some un-readable string)

I think there's some confusion, Histogram doesn't support a flag (bool) explicitelly, but why we can't use uint32_t for that instead?
Yes, flag was deprecated, but it was meant for something else anyway:
"This histogram type allows you to record a single value (0 or 1, default 0). This type is useful if you need to track whether a feature was ever used during a Firefox session.
Flag histograms will ignore any changes after the flag is set, so once the flag is set, it cannot be unset."

Tom, please wait a bit. I'm investigating all the options, I'll let you know soon.

(In reply to Jan Varga [:janv] from comment #15)

Tom, please wait a bit. I'm investigating all the options, I'll let you know soon.

Sure

(In reply to Jan Varga [:janv] from comment #14)
Just to elaborate more, if we take this option, we probably want to query data in https://sql.telemetry.mozilla.org rather than just checking the https://telemetry.mozilla.org
Consider the case below:
1st 2nd 3rd
Failure, failure, success

We would get {0: 2, 1: 1} (0 for failure cases, 1 for success cases).
We can easily filter the data by checking the number of successes.

Ok, so I think we want "permanent boolean keyed histogram" here. We would record only the first result because that matters the most. It's not so important to know if some of the initialization methods succeed later (after failing to complete first time).
The name of telemetry probe would be QM_FIRST_INITIALIZATION_RESULT.
Keys: Storage (for EnsureStorageIsInitialized), TemporaryStorage (for EnsureTemporaryStorageIsInitialized)
Another key would be Origin (for EnsureOriginIsInitialized), but that is a bit more complicated, so it can be done in a different bug.
We can later add for example "UpgradeStorageFrom2_3To3_0" key to track the new upgrade success rate.

Attached image TemporaryStorage init success (obsolete) —
Attached image TemporaryStorage init fail (obsolete) —

As I said in comment 18, we want to record only the first result. The second screenshot for the failure case shows there were 11 failures, but we don't want to record all of them, only the first result (either success or failure).

Attachment #9108758 - Attachment is obsolete: true
Attachment #9108759 - Attachment is obsolete: true
Attached file request-bug1592934.md

I will ask for a data review once one of this patch is r+

Summary: QM: Add a new telemetry for tracking temporary storage initialization success rate → QM: Add new telemetry probes for tracking initialization success rates
Attachment #9109965 - Flags: data-review?(jrediger)
Comment on attachment 9109965 [details]
request-bug1592934.md

I'm not a data steward ([see the list in the wiki](https://wiki.mozilla.org/Firefox/Data_Collection)), but I redirect to :chutten.
Attachment #9109965 - Flags: data-review?(jrediger) → data-review?(chutten)

Note that we intend to land D51267 D51845 first once the data-review is r+. D53392 will be landed once it's r+ && data-r+. Ideally, we want to uplift them to 71 after they are verified on m-c for a few days.

Comment on attachment 9109965 [details]
request-bug1592934.md

DATA COLLECTION REVIEW RESPONSE:

    Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes. This collection is Telemetry so is documented in its definitions file [Histograms.json](https://hg.mozilla.org/mozilla-central/file/tip/toolkit/components/telemetry/Histograms.json) and the [Probe Dictionary](https://telemetry.mozilla.org/probe-dictionary/).

    Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

    If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Tom Tung is responsible.

    Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

    Is the data collection request for default-on or default-off?

Default on for all channels.

    Does the instrumentation include the addition of any new identifiers?

No.

    Is the data collection covered by the existing Firefox privacy notice?

Yes.

    Does there need to be a check-in in the future to determine whether to renew the data?

No. This collection is permanent.

---
Result: datareview+
Attachment #9109965 - Flags: data-review?(chutten) → data-review+
Pushed by ttung@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e8dd15fbffd0
Add a telemetry probe to track temporary storage initialization success rate; r=janv,janerik
https://hg.mozilla.org/integration/autoland/rev/fa7c7f462993
Add a telemetry probe to track storage initialization success rate; r=janv
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---

The test test_qm_first_initialization_attempt.js fails in Thunderbird, which doesn't have the new probe being tested. I'll post a patch to skip it.

Pushed by geoff@darktrojan.net:
https://hg.mozilla.org/integration/autoland/rev/40b16ced3f44
Disable test_qm_first_initialization_attempt.js on Thunderbird, which doesn't have the Telemetry probe tested. r=ttung

Comment on attachment 9105539 [details]
Bug 1592934 - Add a telemetry probe to track temporary storage initialization success rate;

Beta/Release Uplift Approval Request

  • User impact if declined: we cannot collect the data (the success rates of functions for the storage initialization). And these data will be used to decide if it's okay to ship LSNG or Storage related feature to Release or not.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It only tracks the result of specific functions and calls Telemetry's API, so it should be not risky.
  • String changes made/needed:
Attachment #9105539 - Flags: approval-mozilla-beta?
Attachment #9106571 - Flags: approval-mozilla-beta?
Attachment #9110728 - Flags: approval-mozilla-beta?

Jan, have you checked that the telemetry data is flowing in since it landed in Nightly? Also I see that 72 is still marked as affected and that there is another patch attached to the bug after the uplift. Is the issue actually fixed on nightly or not? Thanks

Flags: needinfo?(jvarga)

Due to time constraints, we split the work into several patches. The first two patches that landed on Nightly are the most important ones and it would be great to have them in 71.
We are working on other patches, but they are rather complementary and it' ok if they are not uplifted to 71.

Flags: needinfo?(jvarga)

(In reply to Jan Varga [:janv] from comment #36)

Due to time constraints, we split the work into several patches. The first two patches that landed on Nightly are the most important ones and it would be great to have them in 71.
We are working on other patches, but they are rather complementary and it' ok if they are not uplifted to 71.

Are you receiving telemetry data on Nightly from these new probes?

Comment on attachment 9105539 [details]
Bug 1592934 - Add a telemetry probe to track temporary storage initialization success rate;

Telemetry probe to be able to decide on LSNG in 72, uplift approved for the beta branch before RC.

Attachment #9105539 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9106571 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9110728 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9110862 - Attachment description: Bug 1592934 - Refactor some storage initialization methods before adding new telemetry probes → Bug 1592934 - Refactor some storage initialization methods before adding new telemetry probes; r=ttung
Pushed by jvarga@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/72344e4b7737
Refactor some storage initialization methods before adding new telemetry probes; r=ttung
Attachment #9109368 - Attachment description: Bug 1592934 - Add a telemetry probe to track origin initialization success rate; → Bug 1592934 - Add telemetry probes to track persistent origins and temporary origins initialization success rate;

We will have to needinfo a data steward before patches with new keys can land on m-c. So don't land them for now.
I also plan to add more new keys in this bug, so it would be better to needinfo a data steward once everything is ready.

(In reply to Jan Varga [:janv] from comment #45)

We will have to needinfo a data steward before patches with new keys can land on m-c. So don't land them for now.
I also plan to add more new keys in this bug, so it would be better to needinfo a data steward once everything is ready.

I see. Thanks for the heads up.

Tom, I refactored the code a bit. It should be now really easy to add new keys.

I propose to add additional telemetry (keys) for all storage upgrade methods:
UpgradeStorageFrom0_0To1_0
UpgradeStorageFrom1_0To2_0
UpgradeStorageFrom2_0To2_1
UpgradeStorageFrom2_1To2_2
UpgradeStorageFrom2_2To2_3

And also for InitializeRepository method:
"DefaultRepository" key for PERSISTENCE_TYPE_DEFAULT
"TemporaryRepository" key for PERSISTENCE_TYPE_TEMPORARY

That's enough for now I think, but if you have some other stuff in mind that should be tracked by this telemetry, let me know.

Do you want to work on these new keys ?
(I have other work to do).

The test will have to use the installPackagedProfile stuff now (I hesitated to add it before, but the upgrade methods need it I think).

(In reply to Jan Varga [:janv] from comment #49)

Tom, I refactored the code a bit. It should be now really easy to add new keys.

I propose to add additional telemetry (keys) for all storage upgrade methods:
UpgradeStorageFrom0_0To1_0
UpgradeStorageFrom1_0To2_0
UpgradeStorageFrom2_0To2_1
UpgradeStorageFrom2_1To2_2
UpgradeStorageFrom2_2To2_3

And also for InitializeRepository method:
"DefaultRepository" key for PERSISTENCE_TYPE_DEFAULT
"TemporaryRepository" key for PERSISTENCE_TYPE_TEMPORARY

That's enough for now I think, but if you have some other stuff in mind that should be tracked by this telemetry, let me know.

Do you want to work on these new keys ?
(I have other work to do).

Sure! I will work on this.

Depends on D55834

Attachment #9113731 - Attachment is obsolete: true
Attached file request-bug1592934-v2.md (obsolete) —

Hi Chris,

We want measure the success rate for more functions:
D53392:

  • PersistentOrigin for EnsurePersistentOriginIsInitialized,
  • TemporaryOrigin for EnsureTemporaryOriginIsInitialized
    D55834:
  • UpgradeStorageFrom0_0To1_0,
  • UpgradeStorageFrom1_0To2_0,
  • UpgradeStorageFrom2_0To2_1,
  • UpgradeStorageFrom2_1To2_2,
  • UpgradeStorageFrom2_2To2_3,
  • DefaultRepository for InitializeRepository(PERSISTENCE_TYPE_DEFAULT),
  • TemporaryRepository for InitializeRepository(PERSISTENCE_TYPE_TEMPORARY)

Would it be okay if we just add them as new keys to the existing telemetry (QM_FIRST_INITIALIZATION_ATTEMPT)? Thanks in advance!

If the data-reivew'ed document needs to be updated to with these keys, I have updated the request-bug1592934-v2.md as a new document.

Please note that the content of D55834 may change, but I am assuming these keys won't be changed.

Flags: needinfo?(chutten)

(In reply to Tom Tung [:tt, :ttung] from comment #55)

Would it be okay if we just add them as new keys to the existing telemetry (QM_FIRST_INITIALIZATION_ATTEMPT)? Thanks in advance!

It's fine from a technical POV. From a Data POV you need a Data Review to cover this, so if the original didn't cover it, then you'll need to file an expanded request and get it reviewed. From an Analysis POV, be sure that adding new keys will answer the questions you have and that you have dedicated the resources to analyse this now slightly-more complex collection.

If the data-reivew'ed document needs to be updated to with these keys, I have updated the request-bug1592934-v2.md as a new document.

Well, it looks as though the original data collection already went through (I see commits landing in mozilla-central). Please file the review for the expansion as its own review (and data-review? it to a Steward).

Flags: needinfo?(chutten)
Attached file expansion request

(I assume the expanding request use the same template as the data-review template. Please feel free to point me out the correct template if it's not)

Attachment #9114588 - Attachment is obsolete: true
Attachment #9114618 - Flags: data-review?(chutten)
Comment on attachment 9114618 [details]
expansion request

Load-balance to Megan.
Attachment #9114618 - Flags: data-review?(chutten) → data-review?(mmccorquodale)
Comment on attachment 9114618 [details]
expansion request

	1.	Is there or will there be documentation that describes the schema for the ultimate data set in a public, complete, and accurate way?
This will be documented in the probe dictionary. 

	2.	Is there a control mechanism that allows the user to turn the data collection on and off? 
Data collection can be turned off by opting out of telemetry. 

	3.	If the request is for permanent data collection, is there someone who will monitor the data over time?

Data will be monitored by Tom Tung. 

	4.	Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, technical data. 

	5.	Is the data collection request for default-on or default-off?

Default on. 

	6.	Does the instrumentation include the addition of any new identifiers?

No new identifiers. 

	7.	Is the data collection covered by the existing Firefox privacy notice? 

Yes. 

	8.	Does there need to be a check-in in the future to determine whether to renew the data? 
No. 

	9.	Does the data collection use a third-party collection tool?
No. 

-----------------
data-review +
Attachment #9114618 - Flags: data-review?(mmccorquodale) → data-review+
Pushed by ttung@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/bae11e823511
Add telemetry probes to track persistent origins and temporary origins initialization success rate; r=janv
https://hg.mozilla.org/integration/autoland/rev/b6c0fc6e074b
Change some bools to bitfields before adding new telemetry probes; r=ttung
https://hg.mozilla.org/integration/autoland/rev/7df01e595d24
Use spread syntax to reduce the code in test_qm_first_initialization_attempt.js; r=janv,dom-workers-and-storage-reviewers
https://hg.mozilla.org/integration/autoland/rev/9b214965658d
Abstract first initialization attempt recording into a standalone class; r=ttung,dom-workers-and-storage-reviewers,sg
https://hg.mozilla.org/integration/autoland/rev/127122b887d8
Measure the success rate for the UpgradeStorageFrom* functions and the InitializeRepository function per repository; r=janv,dom-workers-and-storage-reviewers

I will consider whether to uplift these patches to beta next week

Keywords: leave-open

One more patch is needed.

Keywords: leave-open

I will also submit a patch for some additional cleanup.

Comment on attachment 9109368 [details]
Bug 1592934 - Add telemetry probes to track persistent origins and temporary origins initialization success rate;

Beta/Release Uplift Approval Request

  • User impact if declined: We cannot gather the telemetry data (success rates for a few more functions during storage initialization).
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It only tracks the result of specific functions and calls Telemetry's API, so it should be not risky.
  • String changes made/needed:
Attachment #9109368 - Flags: approval-mozilla-beta?
Attachment #9110862 - Flags: approval-mozilla-beta?
Attachment #9111892 - Flags: approval-mozilla-beta?
Attachment #9112224 - Flags: approval-mozilla-beta?
Attachment #9113195 - Flags: approval-mozilla-beta?
Attachment #9113515 - Flags: approval-mozilla-beta?
Attachment #9115136 - Flags: approval-mozilla-beta?

Why is this needed in 72, vs riding the trains?
And, can the cleanup happen in a separate bug to make tracking easier? leave-open and uplifts don't interact well.

Flags: needinfo?(ttung)

Yes, the cleanup can be done separately, sorry about that.

Status: REOPENED → RESOLVED
Closed: 3 months ago2 months ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: mozilla72 → mozilla73

(In reply to Julien Cristau [:jcristau] from comment #69)

Why is this needed in 72, vs riding the trains?
And, can the cleanup happen in a separate bug to make tracking easier? leave-open and uplifts don't interact well.

(In reply to Julien Cristau [:jcristau] from comment #69)

Why is this needed in 72, vs riding the trains?

On Beta, we can gather the data for the success rates for two functions. We want to get more in detail for these Channel as soon as possible. So that we can decide which part of the storage initialization do we want to improve to ship LSNG or other features in the feature on Release.

We want to understand the situation on Beta and Release as soon as possible. Before this bug, most of our telemetry was collected on Nightly only so that we are lacking in understanding the situation other channels.

And, can the cleanup happen in a separate bug to make tracking easier? leave-open and uplifts don't interact well.

Just a side note the cleanup won't be uplifted (Jan and I spoke about that on our team channel)

Flags: needinfo?(ttung)

Comment on attachment 9110862 [details]
Bug 1592934 - Refactor some storage initialization methods before adding new telemetry probes; r=ttung

extra telemetry for quota manager, approved for 72.0b9

Attachment #9110862 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9111892 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9112224 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9113195 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9113515 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9115136 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9109368 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9110862 [details]
Bug 1592934 - Refactor some storage initialization methods before adding new telemetry probes; r=ttung

This one's actually already in 72.

Attachment #9110862 - Flags: approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.