Closed Bug 1609625 Opened 6 years ago Closed 5 years ago

Add a temporary telemetry for QM's temporary storage initialization time

Categories

(Core :: Storage: Quota Manager, task, P2)

task

Tracking

()

RESOLVED FIXED
81 Branch
Tracking Status
firefox74 --- fixed
firefox75 --- fixed
firefox81 --- fixed

People

(Reporter: tt, Assigned: tt)

References

Details

Attachments

(6 files, 2 obsolete files)

Now, we have QM_REPOSITORIES_INITIALIZATION_TIME_V2 and QM_REPOSITORIES_INITIALIZATION_TIME (will be dropped recently)

The goal is to have a permanet telemetry for measuring the initialization time of QM temporary storage.

In this bug, we should:

  1. good estimation of upper bound
  2. persistent telemetry (so it won't ever expire)
  3. probably new name with respect to QM v4

My thoughts for now is that we can add another temporary telemetry to catch the proper upper bound. After that, we can add the final permanent telemetry to our code.

I will ask someone in the telemtry team to see if there is a better way to handle this.

Assignee: nobody → ttung
Status: NEW → ASSIGNED

Currently, we have QM_REPOSITORIES_INITIALIZATION_TIME_V2 to track the initialization time of QM temporary storage. However, we want to catch the highest initialization time and make it permanent. For now, there are around 0.38% of sample greater than exisiting upper bound.

Our goal is to have a permanent telemetry for that and we can find approixmate highest number in the graph.

To do that, my impression is to add another temporary telemetry to catch that (e.g. QM_REPOSITORIES_INITIALIZATION_TIME_V3; "high": 600000). After we catch the upper bound, we can add a permanent telemetry for the initilaization time.

Jan-Erik, is there a better way to handle this? Like, we can somehow get the highest number by qurey data at sql.telemetry.mozilla.org. (Although, IIUC, telemetry collects data by sending the which bucket it is. That means we probably cannot get the highest value through existing data) Or, do you have any suggestion? Thanks in advance!

Flags: needinfo?(jrediger)

Bigger values land in an overflow bucket and we loose any precision about what exact values those were. So no, there's no way to get the highest value based on that.

As histograms are essentially immutable once defined and in use, having a renamed one is the only possibility right now.
Why is having the exact upper bound so important? 0.38% in the overflow bucket already seems low to me, what will a more precise measurement give you?

Flags: needinfo?(jrediger)

(In reply to Jan-Erik Rediger [:janerik] from comment #3)

Bigger values land in an overflow bucket and we loose any precision about what exact values those were. So no, there's no way to get the highest value based on that.

I see. Thanks for verifying that!

As histograms are essentially immutable once defined and in use, having a renamed one is the only possibility right now.
Why is having the exact upper bound so important? 0.38% in the overflow bucket already seems low to me, what will a more precise measurement give you?

Ideally, it'd be great to catch the worst time by the second right-most bucket which means the right-most one should be zero.

One of our goals is to ensure initialization time under a certain number including the worst-case since that might cause a shutdown to hang. Also, it's important for us to establish the baseline so that we can always know how much impact on initialization time does a new feature.

(In reply to Tom Tung [:tt, :ttung] from comment #4)

One of our goals is to ensure initialization time under a certain number including the worst-case since that might cause a shutdown to hang. Also, it's important for us to establish the baseline so that we can always know how much impact on initialization time does a new feature.

To elaborate more, if we, unfortunately, cannot limit the initialization time under the current time limitation for shutting down, than that value could be a reference for deciding a new time limitation for shutting down.

An example of recording unit by scalar: https://mzl.la/2U5jpmQ

If you want to know the exact maximum time then you should record that.
uint scalars can record this for you and have a usable API for that: setMaximum (see JS API)

(In reply to Jan-Erik Rediger [:janerik] from comment #8)

If you want to know the exact maximum time then you should record that.
uint scalars can record this for you and have a usable API for that: setMaximum (see JS API)

Thanks! That's the exact approach that P1 takes :)

Attached file request-bug1609625.md
Attachment #9125412 - Flags: data-review?(chutten)
Comment on attachment 9125412 [details] request-bug1609625.md DATA COLLECTION REVIEW RESPONSE: Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate? Yes. This collection is Telemetry so is documented in its definitions file [Scalars.yaml](https://hg.mozilla.org/mozilla-central/file/tip/toolkit/components/telemetry/Scalars.yaml) and the [Probe Dictionary](https://telemetry.mozilla.org/probe-dictionary/). Is there a control mechanism that allows the user to turn the data collection on and off? Yes. This collection is Telemetry so can be controlled through Firefox's Preferences. If the request is for permanent data collection, is there someone who will monitor the data over time? No. This collection will expire in Firefox 76. Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under? Category 1, Technical. Is the data collection request for default-on or default-off? Default on for all channels. Does the instrumentation include the addition of any new identifiers? No. Is the data collection covered by the existing Firefox privacy notice? Yes. Does there need to be a check-in in the future to determine whether to renew the data? Yes. :tt is responsible for renewing or removing the collection before it expires in Firefox 76. --- Result: datareview+
Attachment #9125412 - Flags: data-review?(chutten) → data-review+

We will add another (permanent) Histogram for initialization time

Keywords: leave-open
Pushed by ttung@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b6e660ab21ee P1 - Add a temporary scalar probes to get the longest initializattion time; r=janv,dom-workers-and-storage-reviewers
Blocks: 1614397

Comment on attachment 9123517 [details]
Bug 1609625 - P1 - Add a temporary scalar probes to get the longest initializattion time;

Beta/Release Uplift Approval Request

  • User impact if declined: We want to get the boundary on all channels as soon as possible. If it's declined, then we need to wait extra one month to achieve thing.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch only records data, so the risk should be low.
  • String changes made/needed:
Attachment #9123517 - Flags: approval-mozilla-beta?

Comment on attachment 9123517 [details]
Bug 1609625 - P1 - Add a temporary scalar probes to get the longest initializattion time;

Telemetry only, uplift approved for 74.0b6, thanks.

Attachment #9123517 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Priority: P2 → P3
Severity: normal → N/A

We want to renew the telemetry probe, but we also want to filter the data on cpp/our side (e.g. end time > start time && filter some incredible number (>= 1 hr))

Priority: P3 → P2

Ok, I found out that we have these notifications:
NS_WIDGET_SLEEP_OBSERVER_TOPIC
NS_WIDGET_WAKE_OBSERVER_TOPIC

https://searchfox.org/mozilla-central/source/widget/nsIWidget.h#247

So, we could use that in QM to adjust the measurement of the initialization (quota loading).
If the adjustment is not too complicated to do, we could add another temporary scalar before introducing permanent telemetry.

It would be interesting to find out that the very high values happen when the computer/OS goes to sleep :)

I hope you'll forgive me a drive-by context link to bug 1204823 -- depending on the platform, your timers across sleep may behave differently on different platforms. We've run into problems caused by this in the past, so hopefully I can save you similar grief by linking it here.

Ok, in that case, can we just ignore storage initializations which were interrupted by a computer sleep ?

I'd still count the ones that are interrupted so you know how many you're excluding (and from which clients) in your analyses.

Ok, thank you very much for this information, it will save us a lot of time!

Thanks for the information and feedback!

Would it make sense if we:

  • Have a permanent telemetry probe to track the execution time for LoadQuota and excluded the cases that over an hour and the cases were interrupted by a computer sleep.
  • Have a temporary telemetry probe to track the times that are excluded with keys/categories to track their executing times and the reasons.

And this should be only recorded once per session.

(In reply to Tom Tung [:tt, :ttung] from comment #26)

Just had a short meeting with Jan and I summarize the conclusion below:

  • Have a permanent keyed telemetry probe to track the execution time for LoadQuota
  • We plan to have three keys at the moment. One is for tracking interrupted time. Another is for tracking system error time (time that has a negative duration). The other is for tracking non-interrupted time.
  • Have a debug assertion to ensure the time should only be recorded once.
Attachment #9163173 - Attachment description: Bug 1609625 - Use NowLoRes for getting the current time stamp and don't record initialization time in some cases; → Bug 1609625 - Remove the probe for tracking time duration on initializing repositories and add a probe for tracking time duration on loading quota;
Attachment #9163174 - Attachment is obsolete: true
Attachment #9163518 - Attachment description: Bug 1609625 - Count the number of time that OS sleeps; → Bug 1609625 - Record time duration for LoadQuota and deal with cases that if it across sleep/wake;
Attachment #9163518 - Attachment description: Bug 1609625 - Record time duration for LoadQuota and deal with cases that if it across sleep/wake; → Bug 1609625 - Deal with recorded time duration if it across sleep/wake;
Summary: Add a permanent telemetry for QM's temporary storage initialization time → Add a temporary telemetry for QM's temporary storage initialization time
Attachment #9163518 - Attachment is obsolete: true
Attachment #9163518 - Attachment is obsolete: false
Attached file request-bug1609625.md (obsolete) —

Related patches are 9163173 (D83305) and 9163518 (D83524)

Note that 9125412 is requested for temporary Scalar probe and it has already expiried. This one is requested for temporary Histogram probe.

Attachment #9170608 - Flags: data-review?(chutten)

Comment on attachment 9170608 [details]
request-bug1609625.md

The data review request form has a new question in it (Q6, about where documentation can be found). Please re-request review with the new form.

Also, Q8 should probably note that this collection is Telemetry so the Firefox data preference is the opt-out mechanism.

Attachment #9170608 - Flags: data-review?(chutten)

Thanks for the reminder and I have updated the file!

Attachment #9170608 - Attachment is obsolete: true
Attachment #9170870 - Flags: data-review?(chutten)

Comment on attachment 9170870 [details]
data-review for QM_QUOTA_INFO_LOAD_TIME_V0

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

No. This collection will expire in Firefox 87.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

Yes. :tt is responsible for renewing or removing the collection before it expires in Firefox 87.


Result: datareview+

Attachment #9170870 - Flags: data-review?(chutten) → data-review+
Pushed by ttung@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/bf8f95a7074f Remove the probe for tracking time duration on initializing repositories and add a probe for tracking time duration on loading quota; r=dom-workers-and-storage-reviewers,sg,janv https://hg.mozilla.org/integration/autoland/rev/8e56c44aa954 Deal with recorded time duration if it across sleep/wake; r=dom-workers-and-storage-reviewers,sg,janv
Blocks: 1660027
Keywords: leave-open
Pushed by ttung@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b59e653c2808 Align the name of key for TimeStamp errors; r=dom-workers-and-storage-reviewers,janv
See Also: → 1683102
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: