Add telemetry to see how often we end up in broken Quota Manager storage situations

RESOLVED FIXED in Firefox 66

Status

enhancement, P2, major
RESOLVED FIXED
Opened: a year ago
Last modified: 4 months ago
People

(Reporter: overholt, Assigned: tt)

Tracking

(Blocks 2 bugs)
Version: unspecified
Target Milestone: mozilla66
Points: ---
Dependency tree / graph

Firefox Tracking Flags

(firefox66 fixed)
Details

Attachments

(2 attachments, 18 obsolete attachments)

47 bytes, text/x-phabricator-request (Details | Review)
2.40 KB, text/plain, chutten: review+ (Details)
Reporter

Description

a year ago
It would be interesting to see data on how often we end up in the broken states detected on http://firefox-storage-test.glitch.me/.
Severity: normal → major
Priority: -- → P2
Agreed; :janv, what do you think?
Flags: needinfo?(jvarga)
Giving to :asuth
Assignee: nobody → bugmail
Assignee

Comment 3

10 months ago
(In reply to Andrew Overholt [:overholt] from comment #0)
> It would be interesting to see data on how often we end up in the broken
> states detected on http://firefox-storage-test.glitch.me/.

Should be "https"? Otherwise operations in both DOM Cache and QM would fail => https://firefox-storage-test.glitch.me/
Yes, https.

Updated

10 months ago
Blocks: 1482662

Comment 5

10 months ago
Tom might want to take this too.
Flags: needinfo?(jvarga) → needinfo?(shes050117)
Assignee

Comment 6

9 months ago
Taking this bug with Andrew's permission! \o/
Flags: needinfo?(shes050117)
Assignee

Comment 7

9 months ago
Since bug 1481716 already tests whether the QM sweep succeeds or not, I will try to add telemetry probes to see whether IDB's and Cache's databases work correctly.
Assignee: bugmail → shes050117
Depends on: 1481716
Assignee

Comment 8

9 months ago
Hi Andrew, and Jan,

I'd like to add a few probes in DOM Cache and IndexedDB to get the information requested in comment 0. But I'm slightly unsure whether I understand correctly, so I'll write my understanding down. Please correct me if I am wrong.

So, if we do exactly what comment #0 asks, we would have a rate like:
  number_of_access_{idb or cache}_successfully / number_of_access_{idb or cache} for users

However, because number_of_access_{idb or cache} would be rather large, we might want to break the original idea into smaller pieces.

There are two main causes of failure when accessing IDB or Cache. One is failures in QM. The other is failures in the quota clients, which are mostly IDB or Cache.

QM fails mostly because of failures during initialization (either sweeping repositories or upgrading QM's version).
We aimed to add probes in QM in bug 1481716, but note that we expect the upgrade data to be very small, since most users should already have upgraded to the newest QM version. Thus, the patch in bug 1481716 currently only tracks the success rate of sweeping repositories.

As for tracking failures in IDB or Cache, I believe most failures happen during initialization and while upgrading the (db schema) version. Of these, failures during initialization should be covered by bug 1481716. Thus, in this bug I plan to add probes tracking the success rate of upgrading the database schema.

Therefore, I intend to add a boolean telemetry probe in both IDB and Cache to track their success rate for upgrading to the newest schema.

To achieve that in IDB, I propose to add probes at [1], [2], and [3] to track the failure cases, and at [4] for the success case.

[1] https://searchfox.org/mozilla-central/rev/71ef4447db179639be9eff4471f32a95423962d7/dom/indexedDB/ActorsParent.cpp#4742-4746
[2] https://searchfox.org/mozilla-central/rev/71ef4447db179639be9eff4471f32a95423962d7/dom/indexedDB/ActorsParent.cpp#4748-4750
[3] https://searchfox.org/mozilla-central/rev/71ef4447db179639be9eff4471f32a95423962d7/dom/indexedDB/ActorsParent.cpp#4753-4755
[4] https://searchfox.org/mozilla-central/rev/71ef4447db179639be9eff4471f32a95423962d7/dom/indexedDB/ActorsParent.cpp#4759

To achieve that in Cache, I propose to add probes at [5] to track the failure case, and at [6] for the success case.

[5] https://searchfox.org/mozilla-central/rev/71ef4447db179639be9eff4471f32a95423962d7/dom/cache/DBSchema.cpp#479
[6] https://searchfox.org/mozilla-central/rev/71ef4447db179639be9eff4471f32a95423962d7/dom/cache/DBSchema.cpp#480

That would give us a rate like:
  number_of_successful_upgrades / number_of_upgrade_attempts for users.

Note that:
- We will probably only get a small amount of data with a high failure rate, because most users have already upgraded their IDB and Cache to the newest version; the rest are those who failed to upgrade. However, I think this data is still useful to give us some idea, and it can serve as a baseline for the next IDB or Cache upgrade.

Could you give me some feedback, or correct me if I'm not heading in the right direction?
Flags: needinfo?(jvarga)
Flags: needinfo?(bugmail)
Assignee

Comment 9

9 months ago
At the Storage meeting, Andrew told me that scalars can avoid presenting the same result repeatedly, based on each user's client_id.

If what I proposed in comment 8 is fine, I imagine we might collect data such as whether the upgrade of a database under the IDB directory succeeded, for each origin directory of a user. The same applies to bug 1481716.

I still need to figure out whether it's possible for a user to have multiple entries for different origins/directories (e.g. for one user: data1{foo.com: success}, data2{boo.com: fail}, etc.). But at least filtering duplicate data by client_id seems like a good idea to me, and maybe I can apply it to filter the results.

If that turns out to be hard, we can simply record this data via a histogram, as the patch in bug 1481716 did.

I'll recheck this tomorrow.
Assignee

Comment 10

9 months ago
(In reply to Tom Tung [:tt] from comment #9)
After rethinking, I think we might want to simply add scalar probes.

We can have two counters each for IDB and DOM Cache: one tracking how many times it upgrades its DB schema, and the other tracking how many times the upgrade fails.

So, in IDB, we would count how many times it upgrades its DB schema here:

[1] https://searchfox.org/mozilla-central/rev/71ef4447db179639be9eff4471f32a95423962d7/dom/indexedDB/ActorsParent.cpp#4694

And how many times the upgrade fails here:

[2] https://searchfox.org/mozilla-central/rev/71ef4447db179639be9eff4471f32a95423962d7/dom/indexedDB/ActorsParent.cpp#4742-4746
[3] https://searchfox.org/mozilla-central/rev/71ef4447db179639be9eff4471f32a95423962d7/dom/indexedDB/ActorsParent.cpp#4748-4750
[4] https://searchfox.org/mozilla-central/rev/71ef4447db179639be9eff4471f32a95423962d7/dom/indexedDB/ActorsParent.cpp#4753-4755

For Cache:
How many times does it upgrade its DB schema:
[5] https://searchfox.org/mozilla-central/rev/71ef4447db179639be9eff4471f32a95423962d7/dom/cache/DBSchema.cpp#477

How many times the upgrade fails:
[6] https://searchfox.org/mozilla-central/rev/71ef4447db179639be9eff4471f32a95423962d7/dom/cache/DBSchema.cpp#479

The results are expected to be the same as what I described in comment #8. The failure rate (#_fail/#_upgrades) may be rather high, but we can use it as a baseline to improve on, which will be useful when we next upgrade DOM Cache or IDB.
Thank you for the detailed descriptions of your rationale and evolving thought process here!  I've refreshed my understanding of how Telemetry's scalars and histograms work to try and make sure I understand what's going on.

Summarizing how both types of telemetry probes work:
- Both histograms and scalars are submitted as part of the "main" ping documented at https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/data/main-ping.html
- The main ping's contents get submitted and reset every 24h at local midnight, clearing out the contents of the telemetry values.  The main ping also gets submitted and reset if the browser is quit.  https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/concepts/sessions.html has a nice diagram.
- Both scalars and (boolean) histograms are only submitted if they are non-zero/non-empty.  However, it appears that the telemetry.longitudinal dataset normalizes them so it's as if they were present.
  - It does seem to be the case that the omitted boolean histogram will manifest as a [0, 0] entry in the per-main-ping array for the value (per experimental query against `shutdown_ok`) versus {value=null} for the omitted scalar.  SQL queries can compensate for both[1].
- This means that in a functional QM installation, we'd expect to see telemetry data in every main ping indicating a single QM startup with no failures.
- In a broken QM installation, we'd also see telemetry data every ping, but in the cases where QM keeps trying to initialize itself, we'd potentially see histograms with bins with values >1 and similarly we'd see scalars with values >1.
- The difference between a boolean histogram and using 2 scalars seems to not exist where we use ScalarAdd to accumulate boolean failures with a 0/1 value instead of using ScalarSet/ScalarSetMaximum.
  - For "measurement dashboard" purposes, both are computed into histograms for presentation purposes.
  - For sql.telemetry.mozilla.org we're able to re-derive the boolean status of failure/success on a per-client/per-ping basis.
- The difference between scalars and histograms only becomes relevant with non-boolean values. In that case, a scalar allows us to report a uint32 value that will be aggregated into a histogram by the "measurement dashboard" but is transported to the telemetry server much more efficiently.

Summarizing the plan:
- Bug 1481716:
  - QM_REPOSITORIES_INITIALIZATION_TIME is an exponential histogram maxing out at 5 seconds (5000 milliseconds) that tells us how long QuotaManager::EnsureTemporaryStorageIsInitialized takes.  At first I thought 5 seconds was low given the realities of I/O, but that's a good value since 5 seconds is WAY too long to block the startup of things like the newtab page.  So 5 seconds is a good value and this provides a good scale as we cache the information to accelerate startup in the future.
  - QM_DEFAULT_REPOSITORY_INITIALIZATION_SUCCEEDED is a boolean histogram that tracks success/failure.
  - QM_TEMPORARY_REPOSITORY_INITIALIZATION_SUCCEEDED is the same setup for the different storage class.
- This bug adds 2 scalars for both the IDB and Cache Quota clients to track: total # of upgrade attempts and number of failed upgrade attempts, which can provide us with a proportion.

Combining the two, I think:
- My original suggestion of using scalars for this is really no different from using a boolean histogram.
- However, we can potentially leverage the advantages of scalars if we don't use them to store a count of boolean failures.  Instead, we could count the number of origins for which each quota client experienced errors and then use ScalarSetMaximum() with the per-sweep value.  This would require that QuotaManager::InitializeRepository continue through errors in its inner while loop, counting the errors but not immediately aborting with an error code.  I think we want to be doing this anyways, and this would allow us to count the number of distinct origins that have problems.  (It would, however, also make failing sweeps take much longer to run.)
  - We would still have 2 scalars per client type.  The other scalar would be the total number of origins for which we had directories.  So we would know for each of [temporary, default] the number of origins with data for the given quota client stored plus the number which were broken for the given quota client.
  - The previous point does somewhat imply we'd instead want something like a keyed scalar, but I think those have encoding efficiency issues, and at this point really the only quota clients that matter are IDB and DOM Cache.  So no need to do that currently.


1: I ran the query `SELECT client_id, shutdown_ok, scalar_parent_sw_synthesized_res_count FROM telemetry.longitudinal LIMIT 20` against the "Athena" "telemetry.longitudinal" dataset.
Flags: needinfo?(bugmail)
Assignee

Comment 12

9 months ago
(In reply to Andrew Sutherland [:asuth] from comment #11)
> Summarizing the plan:
> - Bug 1481716:
>   - QM_REPOSITORIES_INITIALIZATION_TIME is an exponential histogram maxing
> out at 5 seconds (5000 milliseconds) that tells us how long
> QuotaManager::EnsureTemporaryStorageIsInitialized takes.  At first I thought
> 5 seconds was low given the realities of I/O, but that's a good value since
> 5 seconds is WAY too long to block the startup of things like the newtab
> page.  So 5 seconds is a good value and this provides a good scale as we
> cache the information to accelerate startup in the future.
>   - QM_DEFAULT_REPOSITORY_INITIALIZATION_SUCCEEDED is a boolean histogram
> that tracks success/failure.
>   - QM_TEMPORARY_REPOSITORY_INITIALIZATION_SUCCEEDED is the same setup for
> the different storage class.
> - This bug adds 2 scalars for both the IDB and Cache Quota clients to track:
> total # of upgrade attempts and number of failed upgrade attempts, which can
> provide us with a proportion.
> 
> Combining the two, I think:
> - My original suggestion of using scalars for this is really no different
> from using a boolean histogram.
> - However, we can potentially leverage the advantages of scalars if we don't
> use them to store a count of boolean failures.  Instead, we could count the
> number of origins for which each quota client experienced errors and then
> use ScalarSetMaximum() with the per-sweep value.  This would require that
> QuotaManager::InitializeRepository continue through errors in its inner
> while loop, counting the errors but not immediately aborting with an error
> code.  I think we want to be doing this anyways, and this would allow us to
> count the number of distinct origins that have problems.  (It would,
> however, also make failing sweeps take much longer to run.)
>   - We would still have 2 scalars per client type.  The other scalar would
> be the total number of origins for which we had directories.  So we would
> know for each of [temporary, default] the number of origins with data for
> the given quota client stored plus the number which were broken for the
> given quota client.
>   - The previous point does somewhat imply we'd instead want something like
> a keyed scalar, but I think those have encoding efficiency issues, and at
> this point really the only quota clients that matter are IDB and DOM Cache. 
> So no need to do that currently.

Thanks for the detailed feedback and explanations! And sorry for the late reply...

Removing the ni flag for :janv, since I think it's now clear to me what to do next.
Please correct me if I understand something wrong below:

So, we would like to track the number of failures in each origin directory of a repository during initialization.
We don't need to break the initialization failures down into finer-grained reasons (e.g. upgrade failure, etc.) here.

I put a brief summary of what I reckon I should do for the next step:
- Add two scalar integer counters each for Cache and IndexedDB in this bug: one tracking the number of failures during initialization, the other the total number of directories for that quota client under a repository.

- Remove the boolean flags in bug 1481716, since we can derive the success/failure rate by computing (number_of_broken_dirs/number_of_dirs) here and checking whether it is non-zero. We won't know exactly what percentage of users' profiles are broken, since we track Cache and IDB separately, but we can get the percentage of origin directories that are broken for each client type (Cache and IndexedDB for now). I think those are roughly equivalent for us at this stage.

- Since in this bug we want QM to sweep the whole repository even when a failure happens in one directory, I would like the change and the probes to live only on Nightly.
Flags: needinfo?(jvarga)
(In reply to Tom Tung [:tt] from comment #12)
> - Since we want to make the QM sweep the whole repository even when a
> failure happens in a directory in this bug, I would like to make the change
> and the probes only live in Nightly.

I agree this is a good first step because of the potential for massively increased user I/O.  Over the longer run I do think we'd want to get the numbers from release users... but indeed it does make sense to wait to do that until we're actually able to fix the problems for release users rather than merely report them.  Having the numbers in nightly at least helps us know if there's a big up-tick in problems very quickly as well as ballparking how systemic the problem is.
Assignee

Comment 14

9 months ago
(In reply to Andrew Sutherland [:asuth] from comment #13)
> I agree this is a good first step because of the potential for massively
> increased user I/O.  Over the longer run I do think we'd want to get the
> numbers from release users... but indeed it does make sense to wait to do
> that until we're actually able to fix the problems for release users rather
> than merely report them.  Having the numbers in nightly at least helps us
> know if there's a big up-tick in problems very quickly as well as
> ballparking how systemic the problem is.

I see where you are coming from. Indeed, we should know how many release users have initialization issues, but I guess it's better to do that in the long run, as you mentioned. I wanted to see whether we can gather enough information from Nightly users as a first step, and then decide whether to extend this to Beta and Release.

I'll file a follow-up bug as a reminder to do that in the longer run.

To do that on Release, I have a naive thought. Perhaps we could have a diagnostic minor upgrade in Quota Manager that traverses all the directories, so the I/O overhead is paid only once. We could also consider returning the error asynchronously (notify the caller immediately to avoid blocking any task, and queue another task to finish the rest of the traversal on the QM I/O thread or the background thread as a background task).
Assignee

Updated

9 months ago
Blocks: 1487779
(In reply to Tom Tung [:tt] from comment #14)
> To do that in the Release, I have a naive thought. Perhaps, we could have a
> diagnosis minor upgrade in Quota Manager to traverse all the directories so
> that we could minimize the IO overhead to only one time. Also, we could
> consider returning an error asynchronously (notify the caller to avoid from
> blocking any task, and queue another task to finish the rest of traversing
> on QM IO thread or Background thread as a background task).

Sounds like a very good and clever idea to me!
Assignee

Comment 16

9 months ago
Posted patch p1 (obsolete) — Splinter Review
Assignee

Comment 17

9 months ago
Posted patch p2 (obsolete) — Splinter Review
Assignee

Comment 18

9 months ago
Hi Jan,

This patch defers the failure return while initializing quota clients' origins on Nightly. It also ensures the same failure code is returned.

Expected behavioral differences:
The only difference is that it might generate more warnings when more than one origin is broken, and spend more time initializing origins when QM finds a broken origin.
(Before: QM returned immediately; after: QM keeps traversing until it has visited all origins.)

Note:
I think we can return the error code asynchronously by having QuotaManager queue another task to complete the rest of the sweep, in bug 1487779. But for now, since this patch only affects Nightly and only matters in the failure cases, I reckon we don't need to do that in this patch.
Attachment #9006045 - Attachment is obsolete: true
Attachment #9006175 - Flags: review?(jvarga)
Assignee

Comment 19

9 months ago
Comment on attachment 9006175 [details] [diff] [review]
Bug 1436188 - P1 - Keep traversing oirgins to track the number of all the broken origins on Nightly;

I found I could improve something, so drop the review request. Sorry about that.
Attachment #9006175 - Flags: review?(jvarga)
Assignee

Comment 20

9 months ago
Hi Jan,

This patch defers the failure return while initializing the repository and quota clients' origins on Nightly. It also ensures the same failure code is returned.

Expected behavioral differences:
It spends more time initializing origins when QM finds a broken origin or directory.
(Before: QM returned the error code immediately;
 after: QM keeps traversing until it has visited all origins.)

Note:
- I'm not sure whether we want all the warning messages when multiple origins are broken. For now, this patch only reports the first error code, as before.

- I think we can return the error code asynchronously by having QuotaManager queue another task to complete the rest of the sweep, in bug 1487779. But for now, since this patch only affects Nightly and only matters in the failure cases, I reckon we don't need to do that in this patch.
Attachment #9006175 - Attachment is obsolete: true
Attachment #9006252 - Flags: review?(jvarga)

Comment 21

9 months ago
Ok, I have more time this week, will look at the patch, thanks.
Assignee

Comment 22

9 months ago
This patch creates a helper class to record and submit the scalar probes. With four probes for IDB and two probes for Cache, we can know how many origins are broken per repository type {default or temporary} and per client {Cache or IDB}.

Note that Cache shouldn't have temporary-type origins, so this patch creates only the two default-type probes for it.
Attachment #9006046 - Attachment is obsolete: true
Attachment #9006254 - Flags: review?(jvarga)
What's our status on this?
Flags: needinfo?(shes050117)
Assignee

Comment 24

8 months ago
I planned to answer the original question (how many QM installations are broken, and how many IDB and Cache origins are broken) by adding probes in this bug and in the minor-upgrade bug (bug 1423917).

Patches were ready for review, but I think I need to rename the telemetry probes, since the names are not accurate enough and even a bit confusing. The probes here actually measure how many broken origin directories block QM initialization. (Note that this patch tracks broken origin directories for IDB and Cache separately.)

I'm going to rename the probes and maybe revise patches a bit.
Flags: needinfo?(shes050117)
Assignee

Comment 25

8 months ago
Comment on attachment 9006254 [details] [diff] [review]
Bug 1436188 - P2 - Add telemtry probes to track the number of origins which are unbroken and the number of them are broken during storage initialization per DOM Cache and IDB;

Dropping the review while I rename the probes. I'll request review again after that.
Attachment #9006254 - Flags: review?(jvarga)
Assignee

Comment 26

8 months ago
I'll check this patch again tomorrow morning before sending it to review again.

I guess the only open question is whether I should submit the telemetry probes only when initialization fails. For now, this patch submits them regardless of success or failure.
Attachment #9006254 - Attachment is obsolete: true
Assignee

Comment 27

8 months ago
Jan, this patch adds telemetry probes to track how many IDB or Cache origins are broken during storage initialization. It only collects data on the Nightly channel. Could you review this patch?

To restate the question we want to answer in this bug:
How many origins in users' profiles are broken? We also want to know at which step these origins break QM (e.g. during initialization or after it) and what kind of origins they are (e.g. indexedDB or Cache).

To answer that, I plan to measure the percentage of {IDB/Cache} origins that are broken during QM initialization in this bug. I also plan to measure the percentage of {IDB/Cache} origins that are broken after initialization (meaning their databases are broken) in the minor-upgrade bug.
Attachment #9012233 - Attachment is obsolete: true
Attachment #9012495 - Flags: review?(jvarga)

Comment 28

8 months ago
Comment on attachment 9006252 [details] [diff] [review]
Bug 1436188 - P1 - Keep traversing oirgins to track the number of all the broken origins on Nightly;

Review of attachment 9006252 [details] [diff] [review]:
-----------------------------------------------------------------

Some initial observations.

::: dom/quota/ActorsParent.cpp
@@ +4234,5 @@
>  
>      rv = InitializeOrigin(aPersistenceType, group, origin, timestamp, persisted,
>                            childDirectory);
> +#ifdef NIGHTLY_BUILD
> +    if (NS_FAILED(rv)) {

I think this should warn too:
if (NS_WARN_IF(NS_FAILED(rv))) {

@@ +4235,5 @@
>      rv = InitializeOrigin(aPersistenceType, group, origin, timestamp, persisted,
>                            childDirectory);
> +#ifdef NIGHTLY_BUILD
> +    if (NS_FAILED(rv)) {
> +      repositoryStatus = NS_SUCCEEDED(repositoryStatus) ? rv : repositoryStatus;

I don't see a reason to use "? :" here. Why not just:
if (NS_SUCCEEDED(repositoryStatus)) {
  repositoryStatus = rv;
}

otherwise following iterations would do:
repositoryStatus = repositoryStatus;
if an origin failed to initialize in a previous iteration

Comment 29

8 months ago
I have to look at this more deeply, but if I'm reading it right, the patch only records failures in IDB and DOM Cache subdirectory initialization, yet one of the biggest issues is metadata recovery, which is mostly handled by the origin parser.
I think the goal is to have a tool that measures the performance and functionality of storage initialization. So if we don't cover all initialization failures, we might think that everything is ok.
Assignee

Comment 30

8 months ago
Per Jan's suggestions on the IRC channel:

3:56 PM at the top level we have initialization of temporary storage, and we may want to know if it succeeded or not
3:56 PM then we have particular origin directories, and again we may want to know if it succeeded or not
3:56 PM then we have subdirs for quota clients, and again ...

I'll change the patch to that direction.
Assignee

Updated

8 months ago
Attachment #9012495 - Flags: review?(jvarga)
Assignee

Comment 31

8 months ago
(In reply to Jan Varga [:janv] from comment #28)
I wanted to keep the error messages aligned with current behavior, which is why I implemented this patch in a slightly strange way.

> ::: dom/quota/ActorsParent.cpp
> @@ +4234,5 @@
> >  
> >      rv = InitializeOrigin(aPersistenceType, group, origin, timestamp, persisted,
> >                            childDirectory);
> > +#ifdef NIGHTLY_BUILD
> > +    if (NS_FAILED(rv)) {
> 
> I think this should warn too:
> if (NS_WARN_IF(NS_FAILED(rv))) {

If we do that, it might increase the number of warnings. I thought it was better to keep the number of warnings the same, so I only emit a warning outside of the loop if an origin was broken.

> @@ +4235,5 @@
> >      rv = InitializeOrigin(aPersistenceType, group, origin, timestamp, persisted,
> >                            childDirectory);
> > +#ifdef NIGHTLY_BUILD
> > +    if (NS_FAILED(rv)) {
> > +      repositoryStatus = NS_SUCCEEDED(repositoryStatus) ? rv : repositoryStatus;
> 
> I don't see a reason to use "? :" here. Why not just:
> if (NS_SUCCEEDED(repositoryStatus)) {
>   repositoryStatus = rv;
> }
> 
> otherwise following iterations would do:
> repositoryStatus = repositoryStatus;
> if an origin origin failed to initialize in a previous iteration

I wanted to keep the error message aligned with current behavior. For instance, if there are two different broken origins, the error code would otherwise be overwritten.

So, I made "repositoryStatus" change only when the first error appears.

Comment 32

8 months ago
(In reply to Tom Tung [:tt] from comment #31)
> If we do that, it might increase the number of warnings. I thought it's
> better to have the same number of warning so that I did that. Thus, I only
> send a warning if an origin was broken outside of the loop.

Well, usually only one or a few origin directories are broken.
Anyway, even if all are broken, these warnings show up in debug builds only.


> I wanted to keep the error message align with what it is right now. For
> instance, if there are two different broken origins, then the error message
> would be overwritten.
> 
> So, I made "repositoryStatus" only be changed while the first error is
> popping out.

So you think this:
repositoryStatus = NS_SUCCEEDED(repositoryStatus) ? rv : repositoryStatus;

is better than:
if (NS_SUCCEEDED(repositoryStatus)) {
  repositoryStatus = rv;
}

?
Assignee

Comment 33

8 months ago
(In reply to Jan Varga [:janv] from comment #32)
> Well, usually only one or a few origin directories are broken.
> Anyway, even if all are broken, these warnings show up in debug builds only.

I see and will correct that.
 
> So you think this:
> repositoryStatus = NS_SUCCEEDED(repositoryStatus) ? rv: repositoryStatus;
> 
> is better than:
> if (NS_SUCCEEDED(repositoryStatus)) {
>   repositoryStatus = rv;
> }
> 
> ?

Oh, I see where you are coming from. I misinterpreted your comment.
When I implemented it, I thought the two were equivalent, because |repositoryStatus = repositoryStatus| should be optimized away and the ternary shortens the code.

However, I now think it's better to do what you suggested, since it's clearer.

Comment 34

8 months ago
Comment on attachment 9006252 [details] [diff] [review]
Bug 1436188 - P1 - Keep traversing oirgins to track the number of all the broken origins on Nightly;

Review of attachment 9006252 [details] [diff] [review]:
-----------------------------------------------------------------

I need to see an updated patch.
Attachment #9006252 - Flags: review?(jvarga)
Assignee

Comment 35

8 months ago
Posted patch bug1436188-p1.patch (obsolete) — Splinter Review
Attachment #9006252 - Attachment is obsolete: true
Assignee

Comment 36

8 months ago
Posted patch bug1436188-p1-interdiff.patch (obsolete) — Splinter Review
Interdiff for p1.
Assignee

Comment 37

8 months ago
Posted patch P1 (obsolete) — Splinter Review
Defer returning failure when initializing repositories, origins, or quota clients fails.
Attachment #9015445 - Attachment is obsolete: true
Attachment #9015446 - Attachment is obsolete: true
Assignee

Comment 38

8 months ago
Posted patch P2 (obsolete) — Splinter Review
Record the number of origins that are initialized successfully or unsuccessfully, so we know how many, and what percentage of, persistence-type directories, origin directories, and quota clients' directories are broken.

After consideration, this patch uses Scalar::Set rather than Scalar::SetMaximum because:
A: Telemetry only sends results to the server when the main ping is triggered [1].

B: Assuming most users don't touch their profiles manually, if Firefox initializes its QM multiple times with different results, I think we want the latest result rather than the result with the most failures or successes. (For instance, if Firefox fails to initialize QM the first time but succeeds the second time, the failures were self-recoverable, so it's okay to collect the newest data.)

[1] https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/data/main-ping.html
Attachment #9012495 - Attachment is obsolete: true
Assignee

Comment 39

8 months ago
Comment on attachment 9015790 [details] [diff] [review]
P1

- Changed the "?:" pattern to the "if" pattern to avoid useless assignments.
- Since we want the collected data to cover the success/fail rate of InitializeRepository (metadata recovery), this patch defers the error return in EnsureTemporaryStorageIsInitialized() to ensure QM always initializes both the temporary and default repositories on Nightly.
Attachment #9015790 - Flags: review?(jvarga)
Assignee

Comment 40

8 months ago
Comment on attachment 9015791 [details] [diff] [review]
P2

Jan, this patch intends to answer:
- How many repositories are initialized successfully/unsuccessfully?
- How many origins are initialized successfully/unsuccessfully?
- How many quota clients are initialized successfully/unsuccessfully?

It uses boolean flags (init_default_repository_broken and init_temporary_repository_broken) to answer the first question; because a profile should have only one default repository and one temporary repository, boolean flags are enough.

To answer the remaining questions, it uses a pair of counters for origins and another for clients.
Attachment #9015791 - Flags: review?(jvarga)

Comment 41

8 months ago
Comment on attachment 9015790 [details] [diff] [review]
P1

Review of attachment 9015790 [details] [diff] [review]:
-----------------------------------------------------------------

So, the commit message says that we are going to do it on Nightly only, but we also want to do it in Beta and Release eventually.
The main concern is decreased performance. Well, does it actually affect performance ?
Normally, when nothing is broken, all repositories (except the permanent one) and all origin directories must be scanned anyway (on all channels).
If there's something broken, we interrupt the initialization process, so initialization takes less time, but that's an error case. I can imagine only one case where this would be a problem: a user complaining that performance decreased because he/she was ok with a non-functional storage system, but that's absurd.
I think we can just keep traversing origin directories on all channels.

However, I'm not sure how all that telemetry stuff would affect performance.
Given that we want to have it in the long run anyway, we just have to live with that, right ?

Comment 42

8 months ago
Andrew, what do you think about it (comment 41) ?
Flags: needinfo?(bugmail)
I think my thoughts are still along the lines of comment 13.  Namely:
- By definition, if we keep going when previously we stopped, that's more I/O.
- Given the current implementation as I understand it, a failure to initialize QM means we'll _keep_ trying to initialize QM.  Every time a new origin tries to open a quota directory, we'll re-init.  The concern is that this is already a potentially bad situation for the user and potentially adding a ton of additional I/O is not an improvement.
- It's important to keep in mind the difference between actionable information and interesting information.  We already know from about:newtab's and WebExt's storage.local telemetry that there's a huge problem with QM init on profiles.  We don't need this telemetry to know that we need to fix the problems; we really want it to make sure that the numbers go down and that once they hit zero we can make sure they stay at zero.

As such, I think it's reasonable to only ship the "keep going" logic if either:
A) We believe we're concurrently shipping a fix that will correct QM errors rather than report them.
B) We change the initialization logic to not keep re-running initialization sweeps by either latching the failure state and skipping re-initialization if we've already failed or using something like a time-based back-off to temporarily latch the results for a minute.  (The idea being that the user could potentially clean up their profile while active and it would be nice to recover then.)

Because there's always the risk of 'B' causing additional breakage, I thought it prudent for us to have the telemetry land first before activating 'B'.  (Or preferably just have landed a bunch of fixes so we're also in the 'A' side of things.)
Flags: needinfo?(bugmail)

Comment 44

7 months ago
(In reply to Andrew Sutherland [:asuth] from comment #43)
> - It's important to keep in mind the difference been actionable information
> and interesting information.  We already know from telemetry from
> about:newtab and WebExt's storage.local telemetry that there's a huge
> problem with QM init on profiles.  We don't need this telemetry to know that
> we need to fix the problems; we really want it to make sure that the numbers
> go down and that once they hit zero we can make sure they stay at zero.

Well, I would rather have own telemetry for QM and its clients, long term.

> B) We change the initialization logic to not keep re-running initialization
> sweeps by either latching the failure state and skipping re-initialization
> if we've already failed or using something like a time-based back-off to
> temporarily latch the results for a minute.  (The idea being that the user
> could potentially clean up their profile while active and it would be nice
> to recover then.)

Yeah, this will be the real fix, but I confess we should do it in smaller steps.

Comment 45

7 months ago
Comment on attachment 9015790 [details] [diff] [review]
P1

Review of attachment 9015790 [details] [diff] [review]:
-----------------------------------------------------------------

I think this patch needs to be a bit more complex if we want to get good data from telemetry.
My main concern is that this patch doesn't differentiate between external and internal errors.
By external errors I mean errors coming from failed hardware, network or OS failures, maybe db corruption, etc.
Internal errors are those that come from our origin parser or when IDB finds an unknown file.
In other words, we can't affect/fix external errors, but we can fix our code to eliminate internal errors.
The goal is to eliminate the internal errors as much as possible (ideally 0% in the telemetry), so if we don't differentiate
them from external errors, we won't be able to tell whether we met the goal.
Anyway, we should still track the external errors. If we see a lot of external errors it probably means
that we treat an internal error as an external error, so we need to fix that.
Does that make sense to you ?

::: dom/quota/ActorsParent.cpp
@@ +4251,5 @@
>                                            origin);
>      if (NS_WARN_IF(NS_FAILED(rv))) {
>        return rv;
>      }
>  

So what happens if the GetDirectoryMetadata2WithRestore() call fails ?
Do we still keep initializing directories ?
Attachment #9015790 - Flags: review?(jvarga)
Assignee

Comment 46

7 months ago
(In reply to Jan Varga [:janv] from comment #45)

Thanks for the feedback!

> I think this patch needs to be a bit more complex if we want to get good
> data from telemetry.
> My main concern is that this patch doesn't differentiate between external
> and internal errors.
> By external errors I mean errors coming from failed hardware, network or OS
> failures, maybe db corruption, etc.
> Internal errors are those that come from our origin parser or when IDB finds
> an unknown file.
> In other words, we can't affect/fix external errors, but we can fix our code
> to eliminate internal errors.
> The goal is to eliminate the interal errors as much as possible (ideally 0%
> in the telemetry), so if don't differentiate
> them from external errors, we wouldn't be able to say if we met the goal.
> Anyway, we should still track the external errors. If we see a lot of
> external errors it probably means
> that we treat an internal error as an external error, so we need to fix that.
> Does that make sense to you ?

That sounds reasonable.

So, basically, the patches here need to track either the number of external errors in addition to the overall errors in the current patch, or the numbers of external and internal errors separately. That way we can understand how much room for improvement initialization has by reviewing the number of internal errors, and maybe revise the external-error category if we somehow misclassify an internal error as external.

Will revise the direction.

> ::: dom/quota/ActorsParent.cpp
> @@ +4251,5 @@
> >                                            origin);
> >      if (NS_WARN_IF(NS_FAILED(rv))) {
> >        return rv;
> >      }
> >  
> 
> So what happens if the GetDirectoryMetadata2WithRestore() call fails ?
> Do we still keep initializing directories ?

The current patch stops initializing when that happens. I'll make it keep initializing so that we can get complete data.
Assignee

Comment 47

7 months ago
Comment on attachment 9015791 [details] [diff] [review]
P2

Based on comment 45, this patch definitely needs to be modified, so I'm dropping the review request.
Attachment #9015791 - Flags: review?(jvarga)
Assignee

Comment 48

7 months ago
Posted patch P1 (v1) draft (obsolete) — Splinter Review
Attachment #9015790 - Attachment is obsolete: true
Assignee

Comment 49

7 months ago
Posted patch P2 (v1) temp (obsolete) — Splinter Review
Attachment #9015791 - Attachment is obsolete: true
Assignee

Comment 50

7 months ago
Posted patch P1 (v1) (obsolete) — Splinter Review
Jan, this patch defers the error returns in InitializeRepository() and InitializeOrigin(). Could you review it? Thanks!
Attachment #9019006 - Attachment is obsolete: true
Attachment #9019338 - Flags: review?(jvarga)
Assignee

Comment 51

7 months ago
Posted patch P2 (v2) (obsolete) — Splinter Review
Jan, this patch tries to answer:
- What percentage of profiles fail to initialize their repositories, origins, asmjscache, cache, and idb?
- Among the errors in InitializeRepository() and InitializeOrigin(), how many are external errors and how many are internal?

I'm not sure whether this patch is headed in the right direction. Could you take a glance and give me some feedback? Thanks!
Attachment #9019333 - Attachment is obsolete: true
Attachment #9019340 - Flags: feedback?(jvarga)
Assignee

Comment 52

7 months ago
Comment on attachment 9019338 [details] [diff] [review]
P1 (v1)

I have an idea to make the patch nicer and also it might be better to fold P1 and P2 together. Let me try to do that.

Drop the review request for now
Attachment #9019338 - Flags: review?(jvarga)
Assignee

Comment 53

7 months ago
Posted patch bug1436188.patch (obsolete) — Splinter Review
- Folded the two patches (P1, P2) together.
- Renamed the variables.
- Check for more errors.

This patch tries to answer:
- The success rate of initialization temporary storage:
  A. The success rate of initializing repositories
  B. The success rate of initializing origins
  C. The success rate of initializing asmjscache, indexedDB, cache.

For the failure cases in A and B, it also collects the number of internal and external errors. That means a single failure might have many errors behind it.

Jan, I want to make sure that the direction of this patch is correct. Could you give me some feedback on this patch? Thanks!
Attachment #9019338 - Attachment is obsolete: true
Attachment #9019340 - Attachment is obsolete: true
Attachment #9019340 - Flags: feedback?(jvarga)
Attachment #9021853 - Flags: feedback?(jvarga)
Assignee

Comment 54

6 months ago
Comment on attachment 9021853 [details] [diff] [review]
bug1436188.patch

Andrew, I wanted to land this patch before the minor upgrade, but there is no hard dependency between the two. Would you mind giving me feedback on this patch if you have time? Comment 53 is still valid. Thanks!

Note that Jan had concerns in comment 45 and I tried to address them in this patch. Please let me know if there is anything we wanted to collect in this telemetry that isn't covered here. Also, please let me know if the way this patch handles telemetry needs to be improved.
Attachment #9021853 - Flags: feedback?(bugmail)
Assignee

Comment 55

6 months ago
Jan suggested that we might need a more specific identifier, for example a line number indicating exactly where the failure happens. Currently, my patches can only distinguish the function name.

Aside from that, I reckon I should also do this in the quota clients. I might need to take care of the new client (LocalStorage).
Assignee

Comment 56

5 months ago
Hey Andrew, just for a gentle ping, would you mind skimming attachment 9021853 [details] [diff] [review] when you have time? Thanks!

I want to revise it per comment 55, but it would be great to fold in your feedback at the same time, both on whether the current direction is okay and on anything I've neglected.
Flags: needinfo?(bugmail)
Assignee

Comment 57

5 months ago
(In reply to Tom Tung [:tt, :ttung] from comment #56)

Per private discussion, Andrew is fine with the update. Will modify the patch and change the feedback flag to review tomorrow.
Flags: needinfo?(bugmail)
Assignee

Comment 58

5 months ago
Posted patch patch (obsolete) — Splinter Review
Attachment #9021853 - Attachment is obsolete: true
Attachment #9021853 - Flags: feedback?(jvarga)
Attachment #9021853 - Flags: feedback?(bugmail)
Assignee

Comment 59

5 months ago
Comment on attachment 9032406 [details] [diff] [review]
patch

Andrew, after further consideration, I think the origin success percentage is not really important. The number of failures, and where they fail, matter much more to us.

So, this patch basically does:
- Creates a TelemetryDataCollector class to collect errors and submit them after traversing the directories.
- Defers the error returns on Nightly in InitializeRepository() and InitializeOrigin().

I'm planning to track LocalStorage initialization in another patch or a follow-up bug. If you think that's not necessary, I'll put it in this patch as well.

Could you review it, especially:
- Whether the way the data is collected is appropriate, i.e. the function names and the keyed scalar.
- The way I handle the hashtable and strings; I'm a bit afraid they might slow down initialization.
- Whether the external/internal error distinction I make is correct.
- The way I store the counters of internal and external errors; I'm not sure it's okay.
Thanks!
Attachment #9032406 - Flags: review?(bugmail)
(In reply to Tom Tung [:tt, :ttung] from comment #59)
> Could you help me to review it, especially:
> - Whether the way to collect the data is appropriate or not. I mean the
> function name and the keyed Scalar.

My understanding from the docs at https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/collection/scalars.html#keyed-scalars is that if we know the set of string keys ahead of time, it's strongly preferred to use a categorical or enumerated histogram instead: https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/collection/histograms.html#categorical

From a storage perspective, I think they're basically equivalent, it's just a question of whether the telemetry dashboards show the labels (categorical) or numbers (enumerated) for us.  As quick example links to the dashboard:
- categorical: https://mzl.la/2Vpg4gF
- enumerated: https://mzl.la/2Vp6j1U

And inspecting `about:telemetry` locally, I have the following data points for the categorical histogram example (created via clicking the "copy" button).
CONTENT_FRAME_TIME_REASON
2370534 samples, average = 0.9, sum = 2049793

0 |#########################  1607639  68%
1 |  16681  1%
2 |###  205530  9%
3 |########  540684  23%
4 |  0  0%


I think this works as to how we've evolved to desire the data.  Restating my understanding of the patch and intent, our goal is to track internal and external errors for initializing:
- each type of repository
- each origin (Quota Manager general, not client specific)
- each client (under each origin)

We both want to know how many times errors happened, as well as know where the error happened, which is why you've added string descriptors for each potential point-of-failure in the methods.

I think this means a move from keyed scalars to categorical histograms, where each of those strings becomes a categorical label.  (And we set n_buckets to be more than we need so we can add more later.)
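For reference, a categorical histogram definition in Histograms.json looks roughly like the following. The probe name, labels, and alert address here are illustrative placeholders, not the definition that eventually landed:

```json
"QM_INIT_ORIGIN_ERROR": {
  "record_in_processes": ["main"],
  "alert_emails": ["example@mozilla.com"],
  "bug_numbers": [1436188],
  "expires_in_version": "70",
  "kind": "categorical",
  "keyed": true,
  "labels": ["NewLocalFile", "Append", "GetDirEntries"],
  "n_buckets": 64,
  "description": "Errors hit during origin initialization, keyed by external/internal."
}
```

Here `"keyed": true` would give the external/internal split as keys, and `n_buckets` reserves headroom so more labels can be added later without breaking the bucket layout.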

> - The way I handle the hashtable, and string. I'm a bit afraid they might
> slow down the initialization.
> - The way I store the counter of internal and external errors. I'm not so
> sure if it's okay or not.

I don't think we actually need TelemetryDataCollector.  Given the way the data is captured and submitted, I think we can just directly use the telemetry API via conditionally-defined macros that turn into no-ops when we're not on Nightly.  This eliminates the need to pass pointers around and such.

For example:
```
#ifdef NIGHTLY_BUILD
    TelemetryDataCollector::SetTelemetryInfo(telemetryInfo, true,
        NS_LITERAL_STRING("NS_NewLocalFile"));
    aCollector->RecordError(TelemetryDataCollector::eInitRepository,
        telemetryInfo);
#endif
```

might become:
```
    REPORT_INIT_EXTERNAL_ERROR(NewLocalFile);
```
with supporting defines that may need some work:
```
#ifdef NIGHTLY_BUILD
  #define REPORT_INIT_TELEMETRY_ERROR(theLabel) \
    Telemetry::Accumulate(Telemetry::LABELS_QM_INIT_REPOSITORY_EXTERNAL_ERR:: ## theLabel )
#else
  #define REPORT_INIT_TELEMETRY_ERROR(theLabel)  {}
#endif
```

> - The external and internal errors I distinguish is correct or not.

I'll comment on these specifically in the patch.
Er and REPORT_INIT_TELEMETRY_ERROR was meant to be REPORT_INIT_EXTERNAL_ERROR in the macro example.
Comment on attachment 9032406 [details] [diff] [review]
patch

Review of attachment 9032406 [details] [diff] [review]:
-----------------------------------------------------------------

The internal/external choices make sense to me.  One thing I might suggest is that when converting to categorical labels, perhaps allocate distinct labels for each repeated operation.  Like right now you're just using "Append" for all the appends, but if we have a series of operations like:
1. Append
2. Clone
3. Append
4. Append

It could make sense to have the labels be: Append1, Clone1, Append2, Append3.  Then we can treat the enum value as tracking how far we got through the method unambiguously.  Since the labels are distinct per categorical histogram, this should be somewhat reasonable and visually understandable on the dashboard.


Please let me know if there was anything else from the review that you wanted feedback on.  I think there may have been more intent behind the deferral of the telemetry data recording, but in practice I don't think we need it because:
- Although telemetry uses a global mutex for logging the data, we only log in error cases and on Nightly, so we don't need the deferral to be good thread citizens.
- Submitting the ping data coherently isn't likely to be a big issue, but we could do something like have a separate scalar that we only set to 1 once we've completed the full initialization sweep/etc.

But in general, I think maintaining our own data accumulation only really makes sense if we need to perform some mathematical transform on the data before handing it off to telemetry, and I don't think we want to do that (anymore).
Attachment #9032406 - Flags: review?(bugmail) → feedback+
Assignee

Comment 63

5 months ago
(In reply to Andrew Sutherland [:asuth] from comment #60)
Thanks for the review and feedback!

> (In reply to Tom Tung [:tt, :ttung] from comment #59)
> My understanding from the docs at
> https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/
> telemetry/collection/scalars.html#keyed-scalars are that if we know the set
> of string keys ahead of time, it's strongly preferred that we use a
> categorical or enumerated histogram instead
> https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/
> telemetry/collection/histograms.html#categorical

You are right. I forgot the document suggests that. 

> From a storage perspective, I think they're basically equivalent, it's just
> a question of whether the telemetry dashboards show the labels (categorical)
> or numbers (enumerated) for us.  As quick example links to the dashboard:
> - categorical: https://mzl.la/2Vpg4gF
> - enumerated: https://mzl.la/2Vp6j1U
> 
> And inspecting `about:telemetry` locally, I have the following data points
> for the categorical histogram example (created via clicking the "copy"
> button).
> CONTENT_FRAME_TIME_REASON
> 2370534 samples, average = 0.9, sum = 2049793
> 
> 0 |#########################  1607639  68%
> 1 |  16681  1%
> 2 |###  205530  9%
> 3 |########  540684  23%
> 4 |  0  0%
> 
> 
> I think this works as to how we've evolved to desire the data.  Restating my
> understanding of the patch and intent, our goal is to track internal and
> external errors for initializing:
> - each type of repository
> - each origin (Quota Manager general, not client specific
> - each client (under each origin)
> 
> We both want to know how many times errors happened, as well as know where
> the error happened, which is why you've added string descriptors for each
> potential point-of-failure in the methods.
> 
> I think this means a move from keyed scalars to categorical histograms, and
> then each of those strings becomes a categorical labels.  (And we set
> n_buckets to be more than we need so we can add more later.)

Thanks for explaining in detail! That really makes sense to me. I'll switch to categorical histograms.

> I don't think we actually need TelemetryDataCollector.  Given how we're not
> capturing and submitting the data, I think we can just directly use the
> telemetry API via conditionally-defined macros that turn into no-ops if
> we're not on nightly.  This eliminates the need to pass pointers around and
> such.
> 
> For example:
> ```
> #ifdef NIGHTLY_BUILD
>     TelemetryDataCollector::SetTelemetryInfo(telemetryInfo, true,
>         NS_LITERAL_STRING("NS_NewLocalFile"));
>     aCollector->RecordError(TelemetryDataCollector::eInitRepository,
>         telemetryInfo);
> #endif
> ```
> 
> might become:
> ```
>     REPORT_INIT_EXTERNAL_ERROR(NewLocalFile);
> ```
> with supporting defines that may need some work:
> ```
> #ifdef NIGHTLY_BUILD
>   #define REPORT_INIT_TELEMETRY_ERROR(theLabel) \
>    
> Telemetry::Accumulate(Telemetry::LABELS_QM_INIT_REPOSITORY_EXTERNAL_ERR:: ##
> theLabel )
> #else
>   #define REPORT_INIT_TELEMETRY_ERROR(theLabel)  {}
> #endif
> ```

The main reason for using a collector is that I wanted to submit the data via the set method after traversing the whole directory. If we used the add method each time we found an error, a broken profile might submit multiple times (because it never flips the initialized flag in QuotaManager).

However, I agree this approach makes the patch really convoluted. If that reason is not strong enough, I'd prefer using the add method and a macro to update the data.

(In reply to Andrew Sutherland [:asuth] from comment #62)
> The internal/external choices make sense to me.  One thing I might suggest
> is that when converting to categorical labels, perhaps allocate distinct
> labels for each repeated operation.  Like right now you're just using
> "Append" for all the appends, but if we have a series of operations like:
> 1. Append
> 2. Clone
> 3. Append
> 4. Append
> 
> It could make sense to have the labels be: Append1, Clone1, Append2,
> Append3.  Then we can treat the enum value as tracking how far we got
> through the method unambiguously.  Since the labels are distinct per
> categorical histogram, this should be somewhat reasonable and visually
> understandable on the dashboard.

I'm probably wrong, but I think the implementation of Append() is the same no matter which call site fails. Therefore, errors coming from (1), (3), and (4) should be counted together.

There is a case I can think of that might matter: if an Append() appends an unexpected string or something like that and fails, we might want to locate exactly where the problematic call is. However, I assume those cases are taken care of by our unit tests. (I might still be assuming too much, though.)
 
> 
> Please let me know if there was anything else from the review that you
> wanted feedback on.  I think there may be have been more intent to the
> deferral of the telemetry data recording, but in practice I don't think we
> need it because:
> - Although telemetry uses a global mutex for logging the data, we are only
> logging in error cases and on nightly.  So we don't need to do it to be good
> thread citizens.
> - Submitting the ping data coherently isn't likely to be a big issue, but we
> could do something like have a separate scalar that we only set to 1 once
> we've completed the full initialization sweep/etc.
> 
> But in general, I think us maintaining our own data accumulation really
> makes sense if we need to perform some mathematical transform on the data
> before handing it off to telemetry, and I don't think we want to do that
> (anymore).

I guess the only question for me is: are the relative numbers of failures we collect important to us?
The count for each label would probably depend on how many times a broken profile is initialized. For instance, if my profile always fails to initialize at a certain point, the data my profile submits through the main ping would reflect the number of times QM was initialized for my profile.

If the magnitude of the number is not important, then it's better to use the add method.
(In reply to Tom Tung [:tt, :ttung] from comment #63)
> There is a case, I can think about, might matter: if there is an Append()
> appends an unexpected string or something like that so that it fails, we
> might want to locate where exactly the problematic place it is. However, I
> assume they are taken care by our unit tests. (I might still assume too
> much, though)

This is one of the cases I was thinking about when I made my suggestion.  But it's also because there's really two pieces of information about the failure: what method call failed, and how far had we made it through the function before we ran into a problem.  This is all hand-waving and the exact method doesn't matter, but if there's an Append() at the top of a method and then there's an Append() at the bottom of the method, it's very notable whether we got to the bottom of the method successfully or not.

> I guess the only question for me is that: Is the number difference of
> failures we collected important to us?
> Because the number of each label would probably be depending on how many
> time the broken profiles are initialized. For instance, if my profile always
> fails to be initialized at a certain point. The data which my profile submit
> through the main ping would be the number of times QM initialized for my
> profile. 

Yes, this was a nice attribute of the approach since the comment 38 revision.  I see a few options here:
> 1. Have QuotaManager set per-repository flags after their first initialization that disable gathering any new histograms until the next restart of Firefox/QuotaManager.  This keeps our histograms to a single set of initialization sweeps.  This gives us a more accurate picture of how many individual errors are in a given profile.  We'd ideally try and clear the flags when telemetry submits the ping and clears its histograms, although it might not really matter.
2. Don't bother to limit the multiple sweeps and accept that some numbers will be very high.  The nice thing about this is that if there are users who are experiencing incredible I/O churn because of repeated QM init sweeps, it will be obvious.  The downside is we don't really know how many sweeps have happened.
3. Same as the 2nd option but we also add a scalar that tracks how many times we have tried to initialize the given repository.  So we invoke ScalarAdd every time and we can normalize the histograms from #2 by dividing by this scalar.

I'd lean towards options 2 or 3.
Assignee

Comment 65

5 months ago
(In reply to Andrew Sutherland [:asuth] from comment #64)
Thanks for explaining! 

> This is one of the cases I was thinking about when I made my suggestion. 
> But it's also because there's really two pieces of information about the
> failure: what method call failed, and how far had we made it through the
> function before we ran into a problem.  This is all hand-waving and the
> exact method doesn't matter, but if there's an Append() at the top of a
> method and then there's an Append() at the bottom of the method, it's very
> notable whether we got to the bottom of the method successfully or not.

I see

> Yes, this was a nice attribute of the approach since the comment 38
> revision.  I see a few options here:
> 1. Have QuotaManager set per-repository flags after their first
> initialization that disables from gathering any new histograms until the
> next restart of Firefox/QuotaManager.  This keeps our histograms to a single
> set of initialization sweeps.  This gives us a more accurate picture of how
> many individual errors are in a given profile.  We'd ideally try and clear
> the flags when telemetry submits the ping and clears its histograms,
> although it might not really matter.
> 2. Don't bother to limit the multiple sweeps and accept that some numbers
> will be very high.  The nice thing about this is that if there are users who
> are experiencing incredible I/O churn because of repeated QM init sweeps, it
> will be obvious.  The downside is we don't really know how many sweeps have
> happened.
> 3. Same as the 2nd option but we also add a scalar that tracks how many
> times we have tried to initialize the given repository.  So we invoke
> ScalarAdd every time and we can normalize the histograms from #2 by dividing
> by this scalar.
> 
> I'd lean towards options 2 or 3.

I think option 2 is good enough for now. 3 is also good, but maybe we can do that when we want the normalized value.
Assignee

Comment 66

5 months ago
This patch uses a keyed categorical histogram to collect the data. The key
indicates whether an error is external or internal (note: an external error
refers to a low-level failure, e.g. database corruption or OS API errors; an
internal error refers to our own mistakes, e.g. not handling a file properly or
unexpected filenames), while the categorical labels indicate where the error
happens.
Furthermore, this patch makes QuotaManager keep traversing the profile even if
an error happens, so that we can get more information in the telemetry data.
Please note that these things should only happen on the Nightly channel.
Assignee

Comment 67

4 months ago

Comment on attachment 9032406 [details] [diff] [review]
patch

Remove this patch since I've updated it to Phabricator (D15908)

Attachment #9032406 - Attachment is obsolete: true
Assignee

Comment 68

4 months ago
Posted file data-request.txt
Assignee

Comment 69

4 months ago

Comment on attachment 9036330 [details]
data-request.txt

Hi :chutten,

I'm not sure whether it's better to put this form on Phabricator, so I'm leaving it here.

We want to know how many failures happen during users' storage initialization. Please refer to the code on Phabricator. Could you help me with a data review? Thanks!

Attachment #9036330 - Flags: review?(chutten)

Comment on attachment 9036330 [details]
data-request.txt

Preliminary notes:

For question 8 the opt-out for users is to not use Nightly or the standard Telemetry opt-out in the preferences.

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes. This collection is Telemetry so is documented in its definitions file (Histograms.json), the Probe Dictionary, and on telemetry.mozilla.org's Measurement Dashboards.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

N/A this collection expires in Firefox 70.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default-on, prerelease channels only. There is an additional filter that this code (including the collection) should only execute on Nightly builds.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

Yes. :tt is responsible for removing or renewing the collection before it expires in Firefox 70.


Result: datareview+

Attachment #9036330 - Flags: review?(chutten) → review+
Assignee

Comment 71

4 months ago
Posted patch interdiff.patch (obsolete) — Splinter Review

Andrew, I got errors on try, and here is the inter-diff patch to fix them. Since they are subtle, I would like to fold them into the current patch on Phabricator if it gets r+.

Would you mind reviewing it? Thanks!

Attachment #9036585 - Flags: review?(bugmail)
Attachment #9036585 - Flags: review?(bugmail) → review+
Assignee

Comment 72

4 months ago

Thanks for the extra review and the previous review! I learned a lot here.

Assignee

Comment 73

4 months ago

Comment on attachment 9036585 [details] [diff] [review]
interdiff.patch

Update this patch to P1

Attachment #9036585 - Attachment is obsolete: true
Attachment #9034944 - Attachment description: Bug 1436188 - Add telemetry probes to collect the number of failures during the QuotaManager initialization; → Bug 1436188 - Add telemetry probes to collect the number of failures during the QuotaManager initialization; r=asuth, data-review=chutten

Comment 74

4 months ago
Pushed by shes050117@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/c0db84bf3f57
Add telemetry probes to collect the number of failures during the QuotaManager initialization; r=asuth, data-review=chutten

Comment 75

4 months ago
bugherder
Status: NEW → RESOLVED
Last Resolved: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla66
You need to log in before you can comment on or make changes to this bug.