Closed Bug 1594080 Opened 5 years ago Closed 4 years ago

Ignore errors under the granularity of origin for initialization

Categories

(Core :: Storage: Quota Manager, defect, P1)

defect

Tracking

()

RESOLVED DUPLICATE of bug 1521541

People

(Reporter: tt, Assigned: tt)

References

(Blocks 1 open bug)

Details

This might require marker files for origin directories to be fixed

Assignee: nobody → ttung
Status: NEW → ASSIGNED

We probably want to have a hash table and mark which origin is broken during initialization.

We probably need more coordination here and actually have a plan what to do next after fixing unexpected/unknown stuff and adding new telemetry.

(In reply to Jan Varga [:janv] from comment #2)

We probably need more coordination here and actually have a plan what to do next after fixing unexpected/unknown stuff and adding new telemetry.

Yes, in general, I think it'd be better to track errors that we want to fix in the future (e.g. origin parser's error) in some ways. (Even though we decide to ignore them during storage initialization)

I put my initial thoughts at this moment below:

For debugging/diagnosing in the future:

  • we might want to show these ignored errors during initialization on an about page. So that users can show the result of that page when they file a storage-related bug
  • we might want to let user know the storage for the visiting website is probably broken. So that they might file bugs for us. (Now, I guess most users probably even aren't aware the storage for the visiting website is broken)

For recovering in the future:

  • we might want to have a place (e.g. about:preference) for users to get away from these errors by a click or some simple way.
    (now general users probably get away of these by resetting their profile)

Note that these thoughts definitely need to be considered more. My intention is to put them here so that we could figure out a better way to fix/address these problems.

Ok, that can be done in future if there are still some hidden issues. Our goal should be to provide a robust storage system, so users won't have to deal with errors at all.

One source of problems is OriginParser and it would be good to get rid of it completely.

(In reply to Jan Varga [:janv] from comment #4)

Ok, that can be done in future if there are still some hidden issues. Our goal should be to provide a robust storage system, so users won't have to deal with errors at all.

I see, so I was thinking in the wrong direction.

To provide a robust system, I agree with getting rid of OriginParser should help a lot. And we probably want to somehow traverse the files/directories through a list (e.g. an allowList or store a list of accessible files in stroage.sqlite) rather than iterating all of them under profiles to avoid unnecessary IOs for these unknown files/directories.

(In reply to Jan Varga [:janv] from comment #2)

We probably need more coordination here and actually have a plan what to do next after fixing unexpected/unknown stuff and adding new telemetry.

I list the major issues (get > 300 failures from a week) based on telemetry data we collected in the past:

  • Rep_RestoreDirMeta
    Bug 1594075 should reduce the number of it. But we might still need to track the number once we reenable the telemetry.
    Getting rid of OriginParser should sort of cover the ignored cases in Bug 1594075.

  • Ori_UnexpectedFile
    Bug 1592204 helps us to get away from this issue by ignoring them.

  • LS_CreateConnection
    We need to collect more data from it to get more information.

  • IDB_GetDBFilenames
    A part of it is probably also related to the unknown files/directories. I will handle that.

  • Rep_UnexpectedFile
    Bug 1592204 helps us to get away from this issue by ignoring them.

Overall, I think we need to add more telemetry on LS_CreateConnection, IDB_GetDBFilenames to have more ideas on what's going on. We might need to also do that for Rep_RestoreDirMeta, but we could also check the data and then make the decision once we renew the telemetry.

One source of problems is OriginParser and it would be good to get rid of it completely.

The only well answer I can think of is that replacing the origin directory name to UUID. So that we don't need to use OriginParser to parse the sanitized origin string name. Another way is to get the origin from the metadata file first instead of from the name of the origin directory, but this is probably not a good idea.

The biggest downside of this is that this might need a major upgrade to change all of the existing origin directories' names to UUIDs. And we might also consider whether to provide a way to downgrade for that.

Ok, so it seems you basically agree with my proposal in bug 1593365 (QuotaManager storage v3) and with the solution in the telemetry interpretation document. And that it should be done after we finish work on adding new telemetry and also after we finish work on ignoring unknown stuff in current storage implementation.

And I think we should try to address all internal and external issues (not sure how you've chosen the number 300) or at least have a fallback mechanism for them. For example, Cache_GetPaddingSize, Ori_IsDirectory and other failures also showed up in the telemetry we collected. And it wasn't negligible amount.

Depends on: 1598555

The patches in bug 1521541 might be able to reuse for this issue.

See Also: → 1521541

(In reply to Tom Tung [:tt, :ttung] from comment #8)

The patches in bug 1521541 might be able to reuse for this issue.

Are they able? Can this then be closed?

Flags: needinfo?(ttung)

If the patches can be reused, why we have two bugs for this ? Why this new bug was created ?
Like Jens, I'm also asking if this can be closed then.

I think they are able to reuse for this bug, but I'm not so sure if there is a better solution for existing issues (because the fallback solution in it is only to estimate the size of the broken origins rather than recovering these origins; these origin cannot be accessed because they are broken).

I will add more patches in bug 1521541 if I find a better way to handle these existing issues (e.g. Cache_GetPaddingSize, Ori_IsDirectory).

Also, maybe we want to check the newer result after bug 1600352 because there a few changes related after 71 (e.g. I change the code around Cache_GetPaddingSize, so maybe there are fewer failures for that now). Besides, we want to check the failures on Rep_RestoreDirMeta after bug 1521541 since it should be affected.

(In reply to Jan Varga [:janv] from comment #10)

If the patches can be reused, why we have two bugs for this ? Why this new bug was created ?

I didn't think too much when I created this. I created this because this should be the last step to reduce the storage initialization failures (so that broken origin won't affect others).

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Flags: needinfo?(ttung)
Resolution: --- → DUPLICATE

Ok, I have one recommendation, I would focus on fixing particular issues in client directories, because that shouldn't be too much affected by QM v4 (except the directory scan that is IDB doing when collecting a list of databases).
For example, it's good to focus on Cache_GetPaddingSize because that is a client specific problem or any internal IDB_* or LS_*
And as you mentioned some stuff has changed in the meantime, so after we finish work on ignoring unknown directories, we need new (or renewed) detailed telemetry (especially for client directories).
As I said in bug 1521541 comment 24, we need work on it systematically.

You need to log in before you can comment on or make changes to this bug.