Ignore errors under the granularity of origin for initialization
Categories
(Core :: Storage: Quota Manager, defect, P1)
Tracking
()
People
(Reporter: tt, Assigned: tt)
References
(Blocks 1 open bug)
Details
This might require marker files for origin directories to be fixed
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 1•3 years ago
|
||
We probably want to have a hash table and mark which origin is broken during initialization.
Comment 2•3 years ago
|
||
We probably need more coordination here and actually have a plan what to do next after fixing unexpected/unknown stuff and adding new telemetry.
Assignee | ||
Comment 3•3 years ago
|
||
(In reply to Jan Varga [:janv] from comment #2)
We probably need more coordination here and actually have a plan what to do next after fixing unexpected/unknown stuff and adding new telemetry.
Yes, in general, I think it'd be better to track errors that we want to fix in the future (e.g. origin parser's error) in some ways. (Even though we decide to ignore them during storage initialization)
I put my initial thoughts at this moment below:
For debugging/diagnosing in the future:
- we might want to show these ignored errors during initialization on an about page. So that users can show the result of that page when they file a storage-related bug
- we might want to let user know the storage for the visiting website is probably broken. So that they might file bugs for us. (Now, I guess most users probably even aren't aware the storage for the visiting website is broken)
For recovering in the future:
- we might want to have a place (e.g. about:preference) for users to get away from these errors by a click or some simple way.
(now general users probably get away of these by resetting their profile)
Note that these thoughts definitely need to be considered more. My intention is to put them here so that we could figure out a better way to fix/address these problems.
Comment 4•3 years ago
|
||
Ok, that can be done in future if there are still some hidden issues. Our goal should be to provide a robust storage system, so users won't have to deal with errors at all.
One source of problems is OriginParser and it would be good to get rid of it completely.
Assignee | ||
Comment 5•3 years ago
|
||
(In reply to Jan Varga [:janv] from comment #4)
Ok, that can be done in future if there are still some hidden issues. Our goal should be to provide a robust storage system, so users won't have to deal with errors at all.
I see, so I was thinking in the wrong direction.
To provide a robust system, I agree with getting rid of OriginParser should help a lot. And we probably want to somehow traverse the files/directories through a list (e.g. an allowList or store a list of accessible files in stroage.sqlite) rather than iterating all of them under profiles to avoid unnecessary IOs for these unknown files/directories.
(In reply to Jan Varga [:janv] from comment #2)
We probably need more coordination here and actually have a plan what to do next after fixing unexpected/unknown stuff and adding new telemetry.
I list the major issues (get > 300 failures from a week) based on telemetry data we collected in the past:
-
Rep_RestoreDirMeta
Bug 1594075 should reduce the number of it. But we might still need to track the number once we reenable the telemetry.
Getting rid of OriginParser should sort of cover the ignored cases in Bug 1594075. -
Ori_UnexpectedFile
Bug 1592204 helps us to get away from this issue by ignoring them. -
LS_CreateConnection
We need to collect more data from it to get more information. -
IDB_GetDBFilenames
A part of it is probably also related to the unknown files/directories. I will handle that. -
Rep_UnexpectedFile
Bug 1592204 helps us to get away from this issue by ignoring them.
Overall, I think we need to add more telemetry on LS_CreateConnection, IDB_GetDBFilenames to have more ideas on what's going on. We might need to also do that for Rep_RestoreDirMeta, but we could also check the data and then make the decision once we renew the telemetry.
One source of problems is OriginParser and it would be good to get rid of it completely.
The only well answer I can think of is that replacing the origin directory name to UUID. So that we don't need to use OriginParser to parse the sanitized origin string name. Another way is to get the origin from the metadata file first instead of from the name of the origin directory, but this is probably not a good idea.
The biggest downside of this is that this might need a major upgrade to change all of the existing origin directories' names to UUIDs. And we might also consider whether to provide a way to downgrade for that.
Comment 6•3 years ago
•
|
||
Ok, so it seems you basically agree with my proposal in bug 1593365 (QuotaManager storage v3) and with the solution in the telemetry interpretation document. And that it should be done after we finish work on adding new telemetry and also after we finish work on ignoring unknown stuff in current storage implementation.
Comment 7•3 years ago
•
|
||
And I think we should try to address all internal and external issues (not sure how you've chosen the number 300) or at least have a fallback mechanism for them. For example, Cache_GetPaddingSize, Ori_IsDirectory and other failures also showed up in the telemetry we collected. And it wasn't negligible amount.
Assignee | ||
Comment 8•3 years ago
|
||
The patches in bug 1521541 might be able to reuse for this issue.
Comment 9•3 years ago
|
||
(In reply to Tom Tung [:tt, :ttung] from comment #8)
The patches in bug 1521541 might be able to reuse for this issue.
Are they able? Can this then be closed?
Updated•3 years ago
|
Comment 10•3 years ago
|
||
If the patches can be reused, why we have two bugs for this ? Why this new bug was created ?
Like Jens, I'm also asking if this can be closed then.
Assignee | ||
Comment 11•3 years ago
|
||
I think they are able to reuse for this bug, but I'm not so sure if there is a better solution for existing issues (because the fallback solution in it is only to estimate the size of the broken origins rather than recovering these origins; these origin cannot be accessed because they are broken).
I will add more patches in bug 1521541 if I find a better way to handle these existing issues (e.g. Cache_GetPaddingSize, Ori_IsDirectory).
Also, maybe we want to check the newer result after bug 1600352 because there a few changes related after 71 (e.g. I change the code around Cache_GetPaddingSize, so maybe there are fewer failures for that now). Besides, we want to check the failures on Rep_RestoreDirMeta after bug 1521541 since it should be affected.
(In reply to Jan Varga [:janv] from comment #10)
If the patches can be reused, why we have two bugs for this ? Why this new bug was created ?
I didn't think too much when I created this. I created this because this should be the last step to reduce the storage initialization failures (so that broken origin won't affect others).
Comment 12•3 years ago
|
||
Ok, I have one recommendation, I would focus on fixing particular issues in client directories, because that shouldn't be too much affected by QM v4 (except the directory scan that is IDB doing when collecting a list of databases).
For example, it's good to focus on Cache_GetPaddingSize because that is a client specific problem or any internal IDB_* or LS_*
And as you mentioned some stuff has changed in the meantime, so after we finish work on ignoring unknown directories, we need new (or renewed) detailed telemetry (especially for client directories).
As I said in bug 1521541 comment 24, we need work on it systematically.
Description
•