Closed Bug 1689636 Opened 3 years ago Closed 3 years ago

Failure to invalidate the cache shouldn't break (temporary) storage initialization

Categories

(Core :: Storage: Quota Manager, defect, P1)

defect

Tracking

()

RESOLVED FIXED
87 Branch
Tracking Status
firefox87 --- fixed

People

(Reporter: sg, Assigned: sg)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

During both EnsureStorageIsInitialized and EnsureTemporaryStorageIsInitialized, a failure of InvalidateCache will break the initialization completely. This happens when running the DELETE FROM origin query at https://searchfox.org/mozilla-central/rev/f9ad45c76ba50bdee54bebd14e6625ae14d4d085/dom/quota/ActorsParent.cpp#462 with a NS_ERROR_FILE_CORRUPTED error.

Wow, if we get that error for storage.sqlite we are in quite bad situation.

As discussed on Slack, we can't recover from this situation currently, because the QM cache is held in the same database as the QM storage metadata, in particular the QM storage version, and if that's corrupted we can only nuke the entire storage directory.

The fact that the comparatively large QM cache is held in the same database as the comparatively small QM storage metadata increases the chance of database corruption significantly (probably due to uncontrollable external influences such as hardware defects). Therefore, the QM cache should better be stored in a separate database.

Since the QM cache is invalidated regularly anyway, we might apply this change without introducing a new (minor) QM storage version, and just remove a cache in the existing database when it's encountered there.

Jan, did I summarize this accurately? I can go ahead then and implement this.

Flags: needinfo?(jvarga)

Yes, you did summarize it well. I'm still not sure if a storage upgrade would be needed. For now, assume it's not needed. Even if we decide later that a minor upgrade is needed. It shouldn't affect the main patch/patches for this.

Flags: needinfo?(jvarga)

As I understand comment 0, we've already successfully extracted the QM storage version (which is stored in the database header) though? And it's only the subsequent query which ends up looking at the cache pages that indicates corruption?

Assuming we have other low-churn data that is complex (or impossible) to re-derive, perhaps we should be storing that in an atomically written JSON file at the root of the storage directory as a means of trying to avoid correlated failures and keeping resource usage low. The intent would be to assist in the decision tree of:

  • Clear all QM storage.
  • Attempt to rebuild storage.sqlite and QM storage/state from what's on disk, using the contents of the JSON file to help fill in any other gaps.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #4)

As I understand comment 0, we've already successfully extracted the QM storage version (which is stored in the database header) though? And it's only the subsequent query which ends up looking at the cache pages that indicates corruption?

Yes, but we want to lower the chance of corrupting storage.sqlite by moving the cache into a separate location. That way the database will be used only for reading in most of the cases. Only a new storage upgrade would need to update the storage version.
You mentioned JSON below, the cache probably doesn't have to be stored in a SQLite database. We need to read the data in one shot during temporary storage initialization (after startup) and then serialize it back during shutdown. So it seems a (compressed) JSON file would be good for that (no incremental changes need to be saved to disk).

Assuming we have other low-churn data that is complex (or impossible) to re-derive, perhaps we should be storing that in an atomically written JSON file at the root of the storage directory as a means of trying to avoid correlated failures and keeping resource usage low. The intent would be to assist in the decision tree of:

  • Clear all QM storage.
  • Attempt to rebuild storage.sqlite and QM storage/state from what's on disk, using the contents of the JSON file to help fill in any other gaps.

Do you mean to create a backup of storage.sqlite ?
When we discussed about QMv4, we concluded that storage.sqlite would become very important database because all the mappings to directories would be held in the database and when that's lost (corrupted), entire storage gets lost as well. So we proposed to put small JSON files into directories for QMv4 which would be then used for database.sqlite reconstruction.
So, do I understand you correctly that you propose to create a backup file for current storage.sqlite, and initially store only the major and minor storage version there ?

I was a bit on a narrow track here, and Andrew's comment 4 actually opened up another immediate solution: He correctly said that we could successfully read the storage version when we fail to invalidate the cache. So we could recreate a new storage.sqlite file from the information we have, right? That might be even more straightforward to implement. (We can still consider using a separate database or a JSON file for the cache in a second step).

(In reply to Simon Giesecke [:sg] [he/him] from comment #6)

I was a bit on a narrow track here, and Andrew's comment 4 actually opened up another immediate solution: He correctly said that we could successfully read the storage version when we fail to invalidate the cache. So we could recreate a new storage.sqlite file from the information we have, right? That might be even more straightforward to implement. (We can still consider using a separate database or a JSON file for the cache in a second step).

Actually, the summary of this bug is only about corrupted cache. But when I discussed with you on Slack I was trying to suggest to do more. Yes, we were able to load the storage version, but the database is corrupted otherwise and that is quite concerning. So long term, it would be better to isolate the cache because it's completely different kind of data, something which is only used to improve performance. Feel free to file a new bug for that.

The simplest and safest option (at least for now) regarding the corrupted storage.sqlite in this bug would be just stop loading the cache from it (and also unloading of course), but for that we need to isolate the cache. That brings us back to isolating the cache in this bug :)

We can later introduce an upgrade that drops the tables for the cache and if there are any problems with that, we would fallback to a reconstruction of storage.sqlite which would use the storage version from the old database (if it is still accessible).

(In reply to Jan Varga [:janv] from comment #5)

So, do I understand you correctly that you propose to create a backup file for current storage.sqlite, and initially store only the major and minor storage version there ?

Sorta. My main arguments are:

  • We should be treating SQLite as reliable on reliable hardware. We shouldn't be letting the existence of flaky machines out there cause us to erroneously model probabilities such that we're assuming there's a sizable chance of corruption every time we mutate a database and introduce additional databases as an attempt to avoid the problem. That wouldn't accurately reflect the problem space.
  • The current quota cache mechanism which is only written out at successful shutdown and assumes that every origin touched during the last run is suspect was an appropriate stepping stone, but should not be our end goal. Our goal for "storage.sqlite" should be that it accurately reflects the state of quota at all times with the concept of dirty origins reflected in the database in the event of a crash. It also needs to provide important metadata to QM clients and for privacy related use-cases (like the StorageActivityService).
  • It is appropriate to handle "storage.sqlite" being corrupted without loss of all QM-managed data.
    • It makes sense that there may exist global data that either would otherwise be lost in the event of catastrophic corruption of "storage.sqlite" or be much more complicated to re-derive and therefore is appropriate to store in a file that is written to very rarely and whose management is exceedingly simple.
      • For example, when beginning an upgrade process of the QM directory, it could make sense to capture in a JSON file that an upgrade was in process, or that a reconstruction was in process.
      • In terms of dealing with unreliable hardware, we could use the file to track the number of times we've had to regenerate storage.sqlite and/or specific origins. One of the most complicated problems in dealing with bad hardware is realizing you're in that situation. Being able to know that a profile seems doomed could be helpful for telemetry and/or actual UX interventions like telling a user we think there's a problem with their computer.
      • That said, I don't think there's a need for this right now. But it could be appropriate to create this with the introduction of the QMv4 schema where it could serve as a sentinel and disambiguator that the profile has been the process of migration to QMv4.

So, for this bug we will just reduce the cache invalidation failure to a warning. I will file new bugs for separating the cache and a new minor storage upgrade that drops/reconstructs the storage file.

Severity: -- → S2
Priority: -- → P1
Blocks: 1690547
Blocks: 1690548
Assignee: nobody → sgiesecke
Status: NEW → ASSIGNED

I filed Bug 1690547 and Bug 1690548 for the follow-up steps immediately related to the cache. I am not so sure what to do about the more general ideas from comment 9. Should these be carried over to another new bug?

Flags: needinfo?(bugmail)

I think comment 10 is definitely the right course of action for this bug. I think the comment 11 spin-offs are in the opposite direction of where we want to be going for having persistently tracked quota usage, as the changes would need to be explicitly reverted, but I think that's really something where it would make sense for us to have a QuotaManager road map / design document for v4 that we discuss. I've marked the comment in my mail client so I can consult it for any discussions we might have around that.

Flags: needinfo?(bugmail)
Attachment #9200887 - Attachment description: Bug 1689636 - Ignore failure to invalidate the QM cache. r=#dom-workers-and-storage → Bug 1689636 - Ignore failure to delete the QM cache. r=#dom-workers-and-storage
Pushed by sgiesecke@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/aa0bea350f28
Ignore failure to delete the QM cache. r=dom-workers-and-storage-reviewers,janv
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 87 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: