Closed Bug 1521541 Opened 5 years ago Closed 3 years ago

Continue the initialization even when a file or directory is broken after failing to recover it

Categories

(Core :: Storage: Quota Manager, enhancement, P3)


RESOLVED DUPLICATE of bug 1671932

People

(Reporter: tt, Assigned: tt)

References

(Depends on 1 open bug, Blocks 1 open bug)

Attachments

(6 files)

This is a follow-up to bug 1504535 and to the discussion in the previous team meeting.

The goal here is to keep initialization going by ignoring broken files when we find them.
(Right now, we just abort the whole initialization.)

I think we want to do:
QM

  • Add a marker/blacklist that indicates whether a client directory is broken, so that we won't access the broken client directory.
    • This indicator needs to be removed when the broken directory is removed.

IDB

  • A blacklist to indicate that the base name of a set of files cannot be accessed because we failed to remove them.

Note that

  • We should have this only when we have tried to recover but failed due to an external error, e.g. failing to remove files.

  • The decision between a marker file and an in-memory blacklist depends on whether we want to check the broken items again in the next initialization. If we don't want to do that, then we can have a marker file as an indicator and somehow fix it in the next upgrade.

Depends on: 1504535

To continue QM initialization when there is a broken client/origin, below is what I want to do in this bug:

  • Initialization
    When initialization finds a broken origin/client, estimate the size of the files in the client directory (excluding the whitelisted cases, e.g. dot-files, OS files, ...). If that succeeds, construct the OriginInfo as usual and record the origin in the blacklist hashtable. If it fails, abort the initialization as before.

  • Next access to the broken origin
    The next time EnsureOriginIsInitialized() is called, check whether the origin is on the list. If it is, try to re-initialize the origin. If re-initializing fails, return an error (NS_ERROR_ABORT, maybe?).

  • GetUsage
    When computing the Firefox usage shown in Preferences, if the client is on the list, call the size-estimation function instead of the general GetUsage.

  • Eviction
    When evicting such an origin, we might need some hacks to avoid the OriginInfo's size going negative. Furthermore, we could perhaps raise the eviction priority of those origins. (A downside is that we might have less chance to find the underlying problem, since the evidence might be destroyed by the eviction.)

In this way, we can avoid the size-mismatch problem (Firefox and the OS usage viewer reporting different sizes) and prevent a broken origin/client from blocking the whole quota system. Also, if the origin somehow suddenly recovers, I believe we don't need to trigger another eviction, since we already have an estimated size for that origin.

We might still have a problem if there is a virtual size, recorded in the database, for the broken client/origin. However, I think we can neglect this case as a first step.
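
A minimal sketch of the blacklist-and-retry plan above, in plain C++ rather than actual QuotaManager code (the hashtable, the method shape, and the TryInitializeOrigin() helper are all hypothetical):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

struct QuotaManagerSketch {
  // Origin -> estimated on-disk usage in bytes, filled in by the fallback
  // scan when initialization found the origin broken.
  std::unordered_map<std::string, uint64_t> mBrokenOrigins;

  // Placeholder for the real per-origin initialization work.
  bool TryInitializeOrigin(const std::string& /*aOrigin*/) { return false; }

  // Returns true if the origin is usable after this call.
  bool EnsureOriginIsInitialized(const std::string& aOrigin) {
    auto it = mBrokenOrigins.find(aOrigin);
    if (it == mBrokenOrigins.end()) {
      return true;  // never marked broken
    }
    if (TryInitializeOrigin(aOrigin)) {
      mBrokenOrigins.erase(it);  // recovered: drop the fallback estimate
      return true;
    }
    return false;  // still broken; caller reports an error (NS_ERROR_ABORT?)
  }
};
```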

Andrew and Jan, do you have any suggestion about this? And, is there a case I miss or might block the implementation? Thank you in advance!

Flags: needinfo?(jvarga)
Flags: needinfo?(bugmail)

This sounds great! Thank you for the excellent description of your plans.

Restating my understanding, each origin can be in one of the following conceptual states:

  • (Conceptual only) Uninitialized: We have not yet attempted to initialize the repository. And so we're not even aware of the existence of the origin.
  • Initialized: The origin successfully initialized.
    • Our usage numbers for the origin are those reported by the clients.
    • Attempts to access the origin will succeed.
  • Broken: The origin failed to initialize because of problems at the root of the origin directory or because one or more clients experienced an error during initialization.
    • A fallback usage calculation is run that tracks the on-disk size of all the files in the origin, while ignoring any of the files on our ignore list (dot-files, OS metadata files, etc.); see the sketch after this list.
    • Each attempt to access the origin will result in us attempting to re-initialize only this origin. If the attempt succeeds, the origin will move to the "Initialized" state and the access attempt will succeed. If the attempt fails, the origin will remain in the broken state.
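
A sketch of that fallback usage calculation, using std::filesystem for illustration (the real code would use nsIFile; the ignore list here is a hypothetical stand-in for QM's real one):

```cpp
#include <cstdint>
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Illustrative stand-in for QM's real ignore list.
static bool IsIgnoredFile(const fs::path& aPath) {
  const std::string name = aPath.filename().string();
  return (!name.empty() && name[0] == '.') ||          // dot-files
         name == "Thumbs.db" || name == "desktop.ini"; // OS metadata
}

// Best-effort on-disk size of an origin directory, skipping ignored files.
uint64_t EstimateOriginUsage(const fs::path& aOriginDir) {
  uint64_t total = 0;
  std::error_code ec;
  for (const auto& entry : fs::recursive_directory_iterator(aOriginDir, ec)) {
    if (!entry.is_regular_file(ec) || ec || IsIgnoredFile(entry.path())) {
      continue;
    }
    const std::uintmax_t size = entry.file_size(ec);
    if (!ec) {
      total += size;  // skip entries whose size we couldn't read
    }
  }
  return total;
}
```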

The overall changes in behavior from the proposed changes are:

  • Repository initialization can succeed despite broken origins.
  • Broken origins will have best-effort disk usages assigned to them so they can be evicted. (Otherwise, we'd keep broken origins around forever if we acted like they didn't exist, which could be a real problem!)

The decisions we are making in terms of behavior right now are:

  • We will keep broken origins around for the time being. Motivating concerns are:
    • It's possible for origin clearing to fail due to weird behavior from anti-virus software or other software on the system holding locks on files for some reason. It's important for this scenario to not break the QuotaManager. By having a concept of a broken origin we can contain the error so that QuotaManager as a whole doesn't break.
    • We are taking baby-steps towards wiping out corrupted origins. Per the proverb "measure twice and cut once", we want to make sure we understand why origins are broken and attempt to fix the underlying problems rather than falling into a cycle of continually wiping out origins and losing user data, never fixing the underlying problem.
  • We will keep re-trying to initialize an origin when an attempt is explicitly made to access the origin. This is really only useful if we expect that either a) our failure to initialize the origin was due to temporary file-locks held by other processes blocking our successful initialization that will go away or b) we expect the user to manually fix the origin.

I think it might be worth splitting the "broken" state into 2 states: "broken" and "zombie" and further altering our behavior. The basic delineation would be:

  • Broken: Something is wrong with the origin's contents on disk and we couldn't initialize it, but we're still keeping the origin around for now in the hopes we can fix the problem in the future.
    • Usage is based on the fallback estimate.
    • We will attempt to re-initialize the origin each time.
    • As a future enhancement, we can consider keeping a counter of the number of times an attempt is made to access the broken origin and we fail to move to "initialized". If the counter crosses some (low) threshold, we could automatically decide to clear the origin based on the assumption that the user really wants to use the origin and having QM broken for the origin is not helping.
  • Zombie: We are either currently trying to clear this origin or have already tried and failed to clear the origin.
    • Usage is based on the fallback estimate (which would be re-calculated after each attempt to delete files in the origin).
    • No attempts will be made to re-initialize the origin.
    • Maybe we put a marker file in the directory to mark it as a zombie, and if we detect that during initialization we:
      • Attempt to re-delete the directory.
      • If we fail to completely erase the directory, mark it as a zombie in memory.

We could defer the marker file stuff for zombie directories, but it seems worthwhile to already have distinct states for broken/zombie. (Although they need not have those names. The one thing I think for sure is that in general we are trying to move away from the terms "blacklist" and "whitelist", with "blocklist" and "safelist" replacing them for direct use, or more specific terms if we can come up with them.)
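
A minimal sketch of these conceptual states as a C++ enum, with the transitions implied above (hypothetical names, not actual QuotaManager code):

```cpp
// Per-origin states as described above; "Uninitialized" is conceptual only.
enum class OriginState {
  Uninitialized,  // repository not yet scanned; origin unknown to us
  Initialized,    // usage reported by clients; access succeeds
  Broken,         // init failed; fallback usage; re-init attempted on access
  Zombie,         // clearing attempted and failed; fallback usage; no re-init
};

// Which transitions the discussion above would allow.
inline bool IsTransitionAllowed(OriginState aFrom, OriginState aTo) {
  switch (aFrom) {
    case OriginState::Uninitialized:
      return aTo == OriginState::Initialized || aTo == OriginState::Broken;
    case OriginState::Broken:
      // Either a re-initialization succeeded, or we gave up and started
      // clearing the origin (possibly via the access-failure counter).
      return aTo == OriginState::Initialized || aTo == OriginState::Zombie;
    case OriginState::Zombie:
      // A successful clear removes the origin entirely.
      return aTo == OriginState::Uninitialized;
    default:
      return false;
  }
}
```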

Flags: needinfo?(bugmail)

(In reply to Andrew Sutherland [:asuth] from comment #2)

Thanks for the feedback and detailed explanation! I think we can have initialized/broken/zombie states on OriginInfo, or a table in QM, to help us distinguish them.

> This sounds great! Thank you for the excellent description of your plans.
>
> Restating my understanding, each origin can be in one of the following conceptual states:
>
>   • (Conceptual only) Uninitialized: We have not yet attempted to initialize the repository. And so we're not even aware of the existence of the origin.
>   • Initialized: The origin successfully initialized.
>     • Our usage numbers for the origin are those reported by the clients.
>     • Attempts to access the origin will succeed.
>   • Broken: The origin failed to initialize because of problems at the root of the origin directory or because one or more clients experienced an error during initialization.
>     • A fallback usage calculation is run that tracks the on-disk size of all the files in the origin, while ignoring any of the files on our ignore list (dot-files, OS metadata files, etc.)
>     • Each attempt to access the origin will result in us attempting to re-initialize only this origin. If the attempt succeeds, the origin will move to the "Initialized" state and the access attempt will succeed. If the attempt fails, the origin will remain in the broken state.

The conceptual states are great! Before asking, I was thinking of having a flag to indicate whether the origin is broken or not. Now, I believe we could have a state that tells whether the origin is initialized (normal), broken, or zombie.

> The overall changes in behavior from the proposed changes are:
>
>   • Repository initialization can succeed despite broken origins.
>   • Broken origins will have best-effort disk usages assigned to them so they can be evicted. (Otherwise, we'd keep broken origins around forever if we acted like they didn't exist, which could be a real problem!)
>
> The decisions we are making in terms of behavior right now are:
>
>   • We will keep broken origins around for the time being. Motivating concerns are:
>     • It's possible for origin clearing to fail due to weird behavior from anti-virus software or other software on the system holding locks on files for some reason. It's important for this scenario to not break the QuotaManager. By having a concept of a broken origin we can contain the error so that QuotaManager as a whole doesn't break.
>     • We are taking baby-steps towards wiping out corrupted origins. Per the proverb "measure twice and cut once", we want to make sure we understand why origins are broken and attempt to fix the underlying problems rather than falling into a cycle of continually wiping out origins and losing user data, never fixing the underlying problem.
>   • We will keep re-trying to initialize an origin when an attempt is explicitly made to access the origin. This is really only useful if we expect that either a) our failure to initialize the origin was due to temporary file-locks held by other processes blocking our successful initialization that will go away or b) we expect the user to manually fix the origin.
>
> I think it might be worth splitting the "broken" state into 2 states: "broken" and "zombie" and further altering our behavior. The basic delineation would be:
>
>   • Broken: Something is wrong with the origin's contents on disk and we couldn't initialize it, but we're still keeping the origin around for now in the hopes we can fix the problem in the future.
>     • Usage is based on the fallback estimate.
>     • We will attempt to re-initialize the origin each time.
>     • As a future enhancement, we can consider keeping a counter of the number of times an attempt is made to access the broken origin and we fail to move to "initialized". If the counter crosses some (low) threshold, we could automatically decide to clear the origin based on the assumption that the user really wants to use the origin and having QM broken for the origin is not helping.
>   • Zombie: We are either currently trying to clear this origin or have already tried and failed to clear the origin.
>     • Usage is based on the fallback estimate (which would be re-calculated after each attempt to delete files in the origin).
>     • No attempts will be made to re-initialize the origin.
>     • Maybe we put a marker file in the directory to mark it as a zombie, and if we detect that during initialization we:
>       • Attempt to re-delete the directory.
>       • If we fail to completely erase the directory, mark it as a zombie in memory.

For marker files on QM, I am worried that putting the marker files in the origin directories (storage/default/) might increase initialization time, since a profile might have thousands of origins in the "default" directory. An alternative would be putting them somewhere else (e.g. the storage directory [storage/] or a specific directory for markers), but the downside is that this is less intuitive (the markers wouldn't live in the same directory as the zombie files).

> We could defer the marker file stuff for zombie directories, but it seems worthwhile to already have distinct states for broken/zombie. (Although they need not have those names. The one thing I think for sure is that in general we are trying to move away from the terms "blacklist" and "whitelist", with "blocklist" and "safelist" replacing them for direct use, or more specific terms if we can come up with them.)

Indeed.

Ah, sorry for not noticing that these two terms are not okay. Will use blocklist/denylist and safelist/allowlist.

(In reply to Tom Tung [:tt, :ttung] from comment #3)

> For marker files on QM, I am worried that putting the marker files in the origin directories (storage/default/) might increase initialization time, since a profile might have thousands of origins in the "default" directory. An alternative would be putting them somewhere else (e.g. the storage directory [storage/] or a specific directory for markers), but the downside is that this is less intuitive (the markers wouldn't live in the same directory as the zombie files).

I was thinking we could put them in the root of the origin directory. For example, PROFILE/storage/default/https+++example.com/idb-zombie-origin. The rationale for that location is that if we're trying to recursively remove that directory and its children and that operation fails, then by definition the "https+++example.com" origin directory must still exist, since that's the last file/directory deleted. And usually our problems are going to be related to files/directories being locked so that they can't be deleted, rather than an inability to create a file.

(In reply to Andrew Sutherland [:asuth] from comment #4)

> I was thinking we could put them in the root of the origin directory. For example, PROFILE/storage/default/https+++example.com/idb-zombie-origin. The rationale for that location is that if we're trying to recursively remove that directory and its children and that operation fails, then by definition the "https+++example.com" origin directory must still exist, since that's the last file/directory deleted. And usually our problems are going to be related to files/directories being locked so that they can't be deleted, rather than an inability to create a file.

I had a similar idea before, but the reason I gave up on it is that QM deletes an origin via originDir->Remove(true), and we cannot guarantee the last file deleted is "idb-zombie-origin".

However, maybe we could have our own remove function so that we can control the deletion order (making sure the last file deleted is idb-zombie-origin), as sketched below.
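
A sketch of that custom removal, using std::filesystem for illustration only (the real code would use nsIFile; "idb-zombie-origin" is the marker name proposed above):

```cpp
#include <filesystem>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Remove everything in the origin directory, keeping the marker file for
// last so that a partial failure always leaves the marker behind.
bool RemoveOriginDirectory(const fs::path& aOriginDir) {
  const fs::path marker = aOriginDir / "idb-zombie-origin";

  std::error_code ec;
  std::vector<fs::path> children;
  for (const auto& entry : fs::directory_iterator(aOriginDir, ec)) {
    if (entry.path() != marker) {
      children.push_back(entry.path());
    }
  }
  if (ec) {
    return false;  // couldn't even list the directory
  }

  for (const auto& child : children) {
    fs::remove_all(child, ec);
    if (ec) {
      return false;  // marker still present: directory stays flagged
    }
  }

  // Only now remove the marker and the (now empty) directory itself.
  fs::remove(marker, ec);
  if (ec) {
    return false;
  }
  fs::remove(aOriginDir, ec);
  return !ec;
}
```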

Ah, a very good point. I was thinking we'd only create the zombie file if the Remove(true) call failed, but that leaves more potentially ambiguous states in the event of process/system crash.

Perhaps we could move towards treating the lack of a ".metadata-v2" file as an indication that the origin is flagged for removal? So our first step would be to remove the metadata file, and the second would be to recursively remove the origin directory.

We would still run into a problem if the .metadata-v2 file is unable to be deleted. We could try and deal with this by also creating the sibling "idb-zombie-origin" marker... but that's a lot of complexity for a failure mode that really shouldn't happen. (That is, a file lock shouldn't exist on the very small .metadata-v2 for more than a very short duration.)

The custom deletion function could also work, especially if we end up wanting some level of retries, if the nsIFile calls aren't already retrying deletions for us on Windows. (I think at least for temporary files the removal process does a retry with a timer?)
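
A sketch of the ".metadata-v2"-as-flag protocol proposed above (std::filesystem for illustration; the real code would use nsIFile):

```cpp
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Two-step clearing: removing ".metadata-v2" first flags the origin for
// removal, so a crash between the two steps is detectable on the next
// initialization instead of leaving an ambiguous half-deleted directory.
bool ClearOrigin(const fs::path& aOriginDir) {
  std::error_code ec;
  fs::remove(aOriginDir / ".metadata-v2", ec);  // step 1: flag
  if (ec) {
    return false;  // couldn't even flag it; origin stays "broken"
  }
  fs::remove_all(aOriginDir, ec);               // step 2: erase
  return !ec;                                   // failure => "zombie"
}

// During initialization: an origin directory without ".metadata-v2" was
// interrupted mid-clear, so retry the removal instead of initializing it.
bool IsFlaggedForRemoval(const fs::path& aOriginDir) {
  return !fs::exists(aOriginDir / ".metadata-v2");
}
```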

(In reply to Andrew Sutherland [:asuth] from comment #6)

> Ah, a very good point. I was thinking we'd only create the zombie file if the Remove(true) call failed, but that leaves more potentially ambiguous states in the event of process/system crash.

Yes, I have considered that and gave up because it wouldn't cover the crash case.

> Perhaps we could move towards treating the lack of a ".metadata-v2" file as an indication that the origin is flagged for removal? So our first step would be to remove the metadata file, and the second would be to recursively remove the origin directory.

That sounds good! It reduces the need for creating the marker file. Thanks!

Though I was a bit worried that it might be ambiguous if the file is somehow deleted by the OS or during a restore, I cannot find such a case now. Thus, I think this approach is okay.

> We would still run into a problem if the .metadata-v2 file is unable to be deleted. We could try and deal with this by also creating the sibling "idb-zombie-origin" marker... but that's a lot of complexity for a failure mode that really shouldn't happen. (That is, a file lock shouldn't exist on the very small .metadata-v2 for more than a very short duration.)

This might also happen on other marker file cases. I believe this can be a follow-up enhancement in the future.

> The custom deletion function could also work, especially if we end up wanting some level of retries, if the nsIFile calls aren't already retrying deletions for us on Windows. (I think at least for temporary files the removal process does a retry with a timer?)

Based on the current results of the telemetry probes, unexpected files are the
biggest reason the initialization gets blocked. This patch intends to prevent a
broken origin from blocking the initialization.

This patch unbundles the initialization state of origins. In the past, if QM
was initialized, that indicated all origins were initialized and all of them
could be accessed. After this patch, we only initialize the unbroken origins
and try to estimate the size of the broken ones. If all the estimations
succeed, the QM initialization succeeds. Otherwise, QM isn't initialized, as
before.

Note that these origins are not accessible while they are not initialized. For
subsequent EnsureOriginIsInitialized() calls, QM will keep trying to initialize
the given origin if it isn't initialized yet. Besides, to avoid being bothered
by these origins all the time, they get a higher eviction priority (see the
sketch after this commit message).

Depends on D19156
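
A minimal sketch of the "higher eviction priority" idea from the commit message above (hypothetical names; the real OriginInfo has more fields):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct OriginInfoLite {
  bool mBroken;         // failed to initialize; usage is only an estimate
  int64_t mAccessTime;  // last access time, microseconds since epoch
};

// Order eviction candidates: broken origins first, then least recently used.
void SortForEviction(std::vector<OriginInfoLite>& aOrigins) {
  std::sort(aOrigins.begin(), aOrigins.end(),
            [](const OriginInfoLite& a, const OriginInfoLite& b) {
              if (a.mBroken != b.mBroken) {
                return a.mBroken;  // broken sorts before healthy
              }
              return a.mAccessTime < b.mAccessTime;  // oldest first
            });
}
```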

Attachment #9042488 - Attachment description: Bug 1521541 - Continue initializing even when finding an unexpected files; → Bug 1521541 - P1 - Continue initializing even when finding an unexpected files;

Andrew, I couldn't find an operation on Phabricator to request a review again for P1, so I'm ni'ing you here in case the updated patches on Phabricator were accidentally missed. (Sorry for bothering you...)

I also updated P2, but just to address your comment and to test estimating the size of an unexpected client.

Besides, I want to ask for a review of P3 again, since I changed it so that it only increases the counter for limiting initialization attempts while initializing all the repositories. I believe this would be enough to mitigate the shutdown hang.

Flags: needinfo?(jvarga) → needinfo?(bugmail)

(In reply to Tom Tung [:tt, :ttung] from comment #11)
I removed the ni for Jan, since I already have patches, have collected feedback from Andrew, and have the basic concepts for my question.

There are some r+ patches which didn't land and no activity in this bug for 2 weeks.
:tt, could you have a look please?

Flags: needinfo?(shes050117)

(In reply to Release mgmt bot [:sylvestre / :calixte] from comment #13)

> There are some r+ patches which didn't land and no activity in this bug for 2 weeks.
> :tt, could you have a look please?

Patches here might need to be rebased after the patches in bug 1423917 (which I just pushed to autoland). I will push this to try tomorrow and, if the results are okay, I will push the patches to autoland. Thanks for the notice!

Flags: needinfo?(shes050117)

I need to add more patches since this conflicts with the patches in the bug about getting usage taking too much time.

I'll rebase the patches and probably add a few more patches tomorrow. I plan to fix this at the All Hands.

Status: NEW → ASSIGNED
Flags: needinfo?(bugmail)

(In reply to Tom Tung [:tt, :ttung] from comment #16)

> I'll rebase the patches and probably add a few more patches tomorrow. I plan to fix this at the All Hands.

Removing the ni for now because I intend to ni/send review requests when I complete the work.

In the past, IsTemporaryStorageInitialized() guaranteed there were no broken
origins once it returned true. After these patches, returning true from that
function means the in-memory objects like OriginInfo are constructed and it's
possible to have broken origins (but the usages of all of them can at least be
estimated). This patch adds another checking function,
IsBestEffortOriginInitialized(); returning true from it means the origin is
not broken and has been initialized successfully.

Depends on D19158
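
A hypothetical shape for the two checks described in this commit message (simplified; the real methods live on QuotaManager):

```cpp
#include <string>
#include <unordered_set>

struct QuotaManagerChecksSketch {
  bool mTemporaryStorageInitialized = false;
  std::unordered_set<std::string> mBrokenOrigins;

  // True once in-memory objects (OriginInfo etc.) are constructed,
  // even if some origins are broken.
  bool IsTemporaryStorageInitialized() const {
    return mTemporaryStorageInitialized;
  }

  // Stricter: true only if this particular origin is not broken and
  // was initialized successfully.
  bool IsBestEffortOriginInitialized(const std::string& aOrigin) const {
    return mTemporaryStorageInitialized && !mBrokenOrigins.count(aOrigin);
  }
};
```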

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=72a86d6a077c16663b230972946723eaa44dd187&selectedJob=252827459

Note: these patches might be affected by the decision on how to handle unexpected files (e.g. ignoring those files).

See Also: → 1594080

:ttung, is this still something we want to do? There are reviewed but unlanded patches here as well as patches without review. You might want to assign those to our reviewers group?

Flags: needinfo?(ttung)

(In reply to Jens Stutte [:jstutte] from comment #22)

> :ttung, is this still something we want to do? There are reviewed but unlanded patches here as well as patches without review. You might want to assign those to our reviewers group?

Yes, I will rebase these patches and maybe extract out the parts we want to keep when I work on bug 1594080, because I believe the ideas are similar.

Flags: needinfo?(ttung)

As I said in bug 1594080, this needs more coordination. Trying to do everything at once can cause more harm than good. We are finishing work on new telemetry in bug 1592934; then we want to land patches for ignoring unknown directories in bug 1594075; then there's bug 1536796 for long file paths, and bug 1600352 for renewing the detailed telemetry and adding detailed telemetry for the upgrade methods.
I would prefer a systematic approach to "The War on Storage Initialization Failures" (bug 1482662), and I also believe that many of the patches for this are quite complex; the reviewer needs to have a broad understanding of the problem and also of the ultimate goal. I think there are only 3 people who have enough experience with this currently.
Our next projects in 2020, like private browsing for IndexedDB and SessionStorage NG, depend on QM v3, and initialization failures can be handled better with QM v3, so I think we should base new complex changes (like the ones in this bug) on that.
QM v3 is not such a large change; I can provide some initial patches relatively soon, and at the same time Tom and I can finish the work I described in the first paragraph, because that is mostly independent of the planned changes for v3.

I will rebase patches here next week if I happen to have time.

I will revisit this if the failure rates of the initialization functions don't get lower after fixing the existing suspicious issues that cause initialization failures (bug 1594075, bug 1536796).

Depends on: 1594075, 1536796

I suspect that most initialization failures in client directories are caused by unknown files or directories. I have to take a closer look, but it looks like we added support for ignoring unknown files and directories in repositories and origin directories (actually, ignoring unknown directories in repositories hasn't landed yet because it depends on some major cleanup and test improvements).
We need to make sure unknown files and directories are ignored in all client directories too. IIRC the main problem is in the IndexedDB client directory, especially with unknown files.
So I believe this needs to be done first, before marking entire origin directories as unusable. And it can be done in parallel to bug 1594075 and bug 1536796.

Severity: normal → S3

We will work on improving the initialization rate for both Storage and TemporaryStorage, and on telemetry, in the short term, so we probably won't work on this soon.

However, we will still need this, or some parts of it, in the long term because we can benefit from it.

Priority: P2 → P3
Flags: needinfo?(jstutte)

The existing patches here do:

  1. Tolerate broken origins during initialization.
  2. Restrict the number of initialization attempts.
  3. Adjust the behavior of GetUsageOp when there is a broken origin.

Other than (2), the ideas here will be addressed in bug 1671932 (in a better way).
Since (2) is not that important and can be added once it's needed, I think we can close this bug.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → DUPLICATE
Flags: needinfo?(jstutte)