Add a temporary telemetry for QM's temporary storage initialization time
Categories
(Core :: Storage: Quota Manager, task, P2)
Tracking
()
People
(Reporter: tt, Assigned: tt)
References
Details
Attachments
(6 files, 2 obsolete files)
|
47 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
|
Details | Review |
|
2.10 KB,
text/plain
|
chutten
:
data-review+
|
Details |
|
47 bytes,
text/x-phabricator-request
|
Details | Review | |
|
47 bytes,
text/x-phabricator-request
|
Details | Review | |
|
2.52 KB,
text/plain
|
chutten
:
data-review+
|
Details |
|
47 bytes,
text/x-phabricator-request
|
Details | Review |
Now, we have QM_REPOSITORIES_INITIALIZATION_TIME_V2 and QM_REPOSITORIES_INITIALIZATION_TIME (will be dropped recently)
The goal is to have a permanet telemetry for measuring the initialization time of QM temporary storage.
In this bug, we should:
- good estimation of upper bound
- persistent telemetry (so it won't ever expire)
- probably new name with respect to QM v4
My thoughts for now is that we can add another temporary telemetry to catch the proper upper bound. After that, we can add the final permanent telemetry to our code.
| Assignee | ||
Comment 1•6 years ago
|
||
I will ask someone in the telemtry team to see if there is a better way to handle this.
| Assignee | ||
Comment 2•6 years ago
|
||
Currently, we have QM_REPOSITORIES_INITIALIZATION_TIME_V2 to track the initialization time of QM temporary storage. However, we want to catch the highest initialization time and make it permanent. For now, there are around 0.38% of sample greater than exisiting upper bound.
Our goal is to have a permanent telemetry for that and we can find approixmate highest number in the graph.
To do that, my impression is to add another temporary telemetry to catch that (e.g. QM_REPOSITORIES_INITIALIZATION_TIME_V3; "high": 600000). After we catch the upper bound, we can add a permanent telemetry for the initilaization time.
Jan-Erik, is there a better way to handle this? Like, we can somehow get the highest number by qurey data at sql.telemetry.mozilla.org. (Although, IIUC, telemetry collects data by sending the which bucket it is. That means we probably cannot get the highest value through existing data) Or, do you have any suggestion? Thanks in advance!
Bigger values land in an overflow bucket and we loose any precision about what exact values those were. So no, there's no way to get the highest value based on that.
As histograms are essentially immutable once defined and in use, having a renamed one is the only possibility right now.
Why is having the exact upper bound so important? 0.38% in the overflow bucket already seems low to me, what will a more precise measurement give you?
| Assignee | ||
Comment 4•6 years ago
|
||
(In reply to Jan-Erik Rediger [:janerik] from comment #3)
Bigger values land in an overflow bucket and we loose any precision about what exact values those were. So no, there's no way to get the highest value based on that.
I see. Thanks for verifying that!
As histograms are essentially immutable once defined and in use, having a renamed one is the only possibility right now.
Why is having the exact upper bound so important? 0.38% in the overflow bucket already seems low to me, what will a more precise measurement give you?
Ideally, it'd be great to catch the worst time by the second right-most bucket which means the right-most one should be zero.
One of our goals is to ensure initialization time under a certain number including the worst-case since that might cause a shutdown to hang. Also, it's important for us to establish the baseline so that we can always know how much impact on initialization time does a new feature.
| Assignee | ||
Comment 5•6 years ago
|
||
(In reply to Tom Tung [:tt, :ttung] from comment #4)
One of our goals is to ensure initialization time under a certain number including the worst-case since that might cause a shutdown to hang. Also, it's important for us to establish the baseline so that we can always know how much impact on initialization time does a new feature.
To elaborate more, if we, unfortunately, cannot limit the initialization time under the current time limitation for shutting down, than that value could be a reference for deciding a new time limitation for shutting down.
Updated•6 years ago
|
| Assignee | ||
Comment 6•6 years ago
|
||
An example of recording unit by scalar: https://mzl.la/2U5jpmQ
| Assignee | ||
Comment 7•6 years ago
|
||
If you want to know the exact maximum time then you should record that.
uint scalars can record this for you and have a usable API for that: setMaximum (see JS API)
| Assignee | ||
Comment 9•6 years ago
|
||
(In reply to Jan-Erik Rediger [:janerik] from comment #8)
If you want to know the exact maximum time then you should record that.
uint scalars can record this for you and have a usable API for that:setMaximum(see JS API)
Thanks! That's the exact approach that P1 takes :)
| Assignee | ||
Comment 10•6 years ago
|
||
Comment 11•6 years ago
|
||
| Assignee | ||
Comment 12•6 years ago
|
||
We will add another (permanent) Histogram for initialization time
Comment 13•6 years ago
|
||
Comment 14•6 years ago
|
||
| bugherder | ||
| Assignee | ||
Comment 15•6 years ago
|
||
Comment on attachment 9123517 [details]
Bug 1609625 - P1 - Add a temporary scalar probes to get the longest initializattion time;
Beta/Release Uplift Approval Request
- User impact if declined: We want to get the boundary on all channels as soon as possible. If it's declined, then we need to wait extra one month to achieve thing.
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This patch only records data, so the risk should be low.
- String changes made/needed:
Updated•6 years ago
|
Comment 16•6 years ago
|
||
Comment on attachment 9123517 [details]
Bug 1609625 - P1 - Add a temporary scalar probes to get the longest initializattion time;
Telemetry only, uplift approved for 74.0b6, thanks.
Comment 17•6 years ago
|
||
| bugherder uplift | ||
Updated•6 years ago
|
| Assignee | ||
Updated•5 years ago
|
| Assignee | ||
Comment 18•5 years ago
|
||
We want to renew the telemetry probe, but we also want to filter the data on cpp/our side (e.g. end time > start time && filter some incredible number (>= 1 hr))
| Assignee | ||
Comment 19•5 years ago
|
||
Depends on D83304
| Assignee | ||
Comment 20•5 years ago
|
||
Depends on D83305
Comment 21•5 years ago
|
||
Ok, I found out that we have these notifications:
NS_WIDGET_SLEEP_OBSERVER_TOPIC
NS_WIDGET_WAKE_OBSERVER_TOPIC
https://searchfox.org/mozilla-central/source/widget/nsIWidget.h#247
So, we could use that in QM to adjust the measurement of the initialization (quota loading).
If the adjustment is not too complicated to do, we could add another temporary scalar before introducing permanent telemetry.
It would be interesting to find out that the very high values happen when the computer/OS goes to sleep :)
Comment 22•5 years ago
|
||
I hope you'll forgive me a drive-by context link to bug 1204823 -- depending on the platform, your timers across sleep may behave differently on different platforms. We've run into problems caused by this in the past, so hopefully I can save you similar grief by linking it here.
Comment 23•5 years ago
|
||
Ok, in that case, can we just ignore storage initializations which were interrupted by a computer sleep ?
Comment 24•5 years ago
|
||
I'd still count the ones that are interrupted so you know how many you're excluding (and from which clients) in your analyses.
Comment 25•5 years ago
|
||
Ok, thank you very much for this information, it will save us a lot of time!
| Assignee | ||
Comment 26•5 years ago
|
||
Thanks for the information and feedback!
Would it make sense if we:
- Have a permanent telemetry probe to track the execution time for
LoadQuotaand excluded the cases that over an hour and the cases were interrupted by a computer sleep. - Have a temporary telemetry probe to track the times that are excluded with keys/categories to track their executing times and the reasons.
And this should be only recorded once per session.
| Assignee | ||
Comment 27•5 years ago
|
||
(In reply to Tom Tung [:tt, :ttung] from comment #26)
Just had a short meeting with Jan and I summarize the conclusion below:
- Have a permanent keyed telemetry probe to track the execution time for
LoadQuota - We plan to have three keys at the moment. One is for tracking interrupted time. Another is for tracking system error time (time that has a negative duration). The other is for tracking non-interrupted time.
- Have a debug assertion to ensure the time should only be recorded once.
Updated•5 years ago
|
| Assignee | ||
Comment 28•5 years ago
|
||
Depends on D83305
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
| Assignee | ||
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
| Assignee | ||
Comment 29•5 years ago
|
||
Related patches are 9163173 (D83305) and 9163518 (D83524)
Note that 9125412 is requested for temporary Scalar probe and it has already expiried. This one is requested for temporary Histogram probe.
| Assignee | ||
Comment 30•5 years ago
|
||
Comment 31•5 years ago
|
||
Comment on attachment 9170608 [details]
request-bug1609625.md
The data review request form has a new question in it (Q6, about where documentation can be found). Please re-request review with the new form.
Also, Q8 should probably note that this collection is Telemetry so the Firefox data preference is the opt-out mechanism.
| Assignee | ||
Comment 32•5 years ago
|
||
Thanks for the reminder and I have updated the file!
Comment 33•5 years ago
|
||
Comment on attachment 9170870 [details]
data-review for QM_QUOTA_INFO_LOAD_TIME_V0
DATA COLLECTION REVIEW RESPONSE:
Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?
Yes.
Is there a control mechanism that allows the user to turn the data collection on and off?
Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.
If the request is for permanent data collection, is there someone who will monitor the data over time?
No. This collection will expire in Firefox 87.
Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
Category 1, Technical.
Is the data collection request for default-on or default-off?
Default on for all channels.
Does the instrumentation include the addition of any new identifiers?
No.
Is the data collection covered by the existing Firefox privacy notice?
Yes.
Does there need to be a check-in in the future to determine whether to renew the data?
Yes. :tt is responsible for renewing or removing the collection before it expires in Firefox 87.
Result: datareview+
Comment 34•5 years ago
|
||
| Assignee | ||
Updated•5 years ago
|
| Assignee | ||
Comment 35•5 years ago
|
||
Comment 36•5 years ago
|
||
Comment 37•5 years ago
|
||
| bugherder | ||
https://hg.mozilla.org/mozilla-central/rev/bf8f95a7074f
https://hg.mozilla.org/mozilla-central/rev/8e56c44aa954
https://hg.mozilla.org/mozilla-central/rev/b59e653c2808
Description
•