Open Bug 1590640 Opened 5 years ago Updated 2 years ago

LSNG: Database corruption is not handled during data loading

Categories

(Core :: Storage: localStorage & sessionStorage, defect, P2)

People

(Reporter: janv, Unassigned)

References

(Blocks 2 open bugs)

Details

The error code NS_ERROR_FILE_CORRUPTED is currently checked only after opening a database. However, the same error can also be returned later, when the query for data loading runs.

I think we should just throw away a corrupted database and create a new one when this happens.

Note that PrepareDatastoreOp::VerifyDatabaseInformation should check the error code too.
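To illustrate the proposed behavior outside of Gecko, here is a minimal sketch using Python's standard `sqlite3` module. It is not the LSNG code: the table name `data` and the helpers `_read_all` and `load_items` are hypothetical, and the sketch assumes that corruption surfaces as a plain `sqlite3.DatabaseError` (which is how CPython reports SQLITE_CORRUPT and SQLITE_NOTADB). The point it shows is that the check wraps the load query itself, not only the open.

```python
import os
import sqlite3


def _read_all(path):
    """Open the database and load every key/value pair (hypothetical schema)."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS data (key TEXT PRIMARY KEY, value TEXT)"
        )
        conn.commit()
        # Corruption may only be reported here, while pages are being read,
        # even though connect() itself succeeded.
        return dict(conn.execute("SELECT key, value FROM data"))
    finally:
        conn.close()


def load_items(path):
    """Load data; if the file is corrupted, discard it and start empty."""
    try:
        return _read_all(path)
    except sqlite3.DatabaseError as exc:
        # Assumption: subclasses such as OperationalError (locked, I/O error)
        # are not corruption, so they are re-raised unchanged.
        if isinstance(
            exc,
            (sqlite3.OperationalError, sqlite3.IntegrityError,
             sqlite3.ProgrammingError),
        ):
            raise
        # Throw away the corrupted database and any leftover rollback journal.
        for suffix in ("", "-journal"):
            try:
                os.remove(path + suffix)
            except FileNotFoundError:
                pass
        return _read_all(path)
```

A real implementation would also record the event (for example via telemetry) before deleting the file; that part is omitted here.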

Note that in mozStorage bug 1125157 we've discussed adding a mechanism for errors like corruption to be handled via a listener/observer mechanism so that we don't necessarily have to special-case corruption handling at every mozStorage-using call-site. We'll still need LSNG-specific logic, but it would be ideal to avoid making an entirely LSNG-specific solution to the problem.

Agreed that clearing the corrupted database is the only viable course of action and it should be tracked via telemetry. Precedent is that we wouldn't wipe the origin for this scenario, just the LS storage, and that continues to make sense.

See Also: → 1125157

That sounds good.

I believe this now needs to be P1.

Assignee: nobody → jvarga
Status: NEW → ASSIGNED
Priority: P3 → P1

:drh, according to https://www.sqlite.org/lang_attach.html, transactions involving multiple attached databases are atomic. However, we seem to be seeing an unusual form of database corruption, probably when Firefox crashes unexpectedly before a transaction involving multiple databases has been committed. In our case, we have two SQLite databases and neither of them uses WAL. Have you heard of corruption that is more likely to happen when multiple databases are involved? Is there anything we should avoid doing in this setup?

The corrupted database has these pragmas:
PRAGMA synchronous = FULL;
PRAGMA page_size = 1024;
PRAGMA auto_vacuum = INCREMENTAL;
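For readers unfamiliar with this setup, the following is a rough sketch of a single transaction that spans two non-WAL databases, using Python's standard `sqlite3` module. It is only an illustration: the function `write_both`, the schema alias `other`, and the tables `data` and `usage` are hypothetical and are not taken from the Firefox code. The pragmas are the ones listed above, applied to the main database only.

```python
import sqlite3


def write_both(main_path, other_path, key, value):
    """Write to two database files inside one explicit transaction."""
    # isolation_level=None puts the module in autocommit mode, so the
    # BEGIN/COMMIT statements below are the only transaction boundaries.
    conn = sqlite3.connect(main_path, isolation_level=None)
    try:
        # page_size and auto_vacuum only take effect on a database that has
        # no tables yet, so they are set before anything is created.
        conn.execute("PRAGMA page_size = 1024")
        conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
        conn.execute("PRAGMA synchronous = FULL")
        # ATTACH must run outside of a transaction.
        conn.execute("ATTACH DATABASE ? AS other", (other_path,))
        conn.execute(
            "CREATE TABLE IF NOT EXISTS main.data "
            "(key TEXT PRIMARY KEY, value TEXT)"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS other.usage "
            "(key TEXT PRIMARY KEY, size INTEGER)"
        )
        conn.execute("BEGIN IMMEDIATE")
        try:
            conn.execute(
                "INSERT OR REPLACE INTO main.data VALUES (?, ?)", (key, value)
            )
            conn.execute(
                "INSERT OR REPLACE INTO other.usage VALUES (?, ?)",
                (key, len(value)),
            )
            # Both files are committed together.
            conn.execute("COMMIT")
        except sqlite3.Error:
            if conn.in_transaction:
                conn.execute("ROLLBACK")
            raise
    finally:
        conn.close()
```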

Flags: needinfo?(drh)

We do not know of any corruption problems in SQLite, involving multiple databases or otherwise. There is nothing special you need to do to avoid corruption when using multiple databases. It should just work.

See also https://www.sqlite.org/howtocorrupt.html for a discussion of the various out-of-band ways that SQLite database files have gone corrupt in the past. I don't think any of these issues apply to FF, but it never hurts to review the list from time to time.

Flags: needinfo?(drh)

So I found a discussion that mentions database corruption like this: http://sqlite.1065341.n5.nabble.com/btreeInitPage-returns-error-code-11-td87095.html

Integrity check on the corrupted database gives me:
Page 4: btreeInitPage() returns error code 11
Page 5: btreeInitPage() returns error code 11

I looked at the hex dump of those pages and they are zeroed.
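The same check can be scripted instead of reading a hex dump. The sketch below, in Python, reads the page size from the SQLite file header (the big-endian 16-bit value at byte offset 16, where 1 means 65536) and lists the pages that contain only zero bytes. The function names `read_page_size` and `zeroed_pages` are made up for this example. A zeroed page is only a hint: it suggests lost writes but does not prove corruption by itself.

```python
import struct


def read_page_size(path):
    """Return the page size recorded in the SQLite file header."""
    with open(path, "rb") as f:
        header = f.read(100)
    if len(header) < 18:
        raise ValueError("file is too short to contain a SQLite header")
    (size,) = struct.unpack(">H", header[16:18])
    # The header stores 65536 as the value 1.
    return 65536 if size == 1 else size


def zeroed_pages(path, page_size=None):
    """Return the 1-based numbers of pages that are entirely zero bytes."""
    if page_size is None:
        page_size = read_page_size(path)
    found = []
    with open(path, "rb") as f:
        page_no = 1
        while True:
            page = f.read(page_size)
            if not page:
                break
            # A trailing partial page is compared against its own length.
            if page == bytes(len(page)):
                found.append(page_no)
            page_no += 1
    return found
```

For the database described here, `zeroed_pages(path)` would be expected to include pages 4 and 5, matching the pages that the integrity check complains about.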

SQLite writes some content to disk, then invokes fsync() (or the Windows equivalent FlushFileBuffers()) and waits for the OS to guarantee that the content is safely on disk. The OS passes this task off to the SSD controller. But the SSD controller lies and says the information is saved, even though it is still in a volatile cache waiting to be written. Then SQLite does some other changes that depend on the first bits being saved, and the newer changes actually reach non-volatile storage first. Then the power goes out. When the machine comes back up, the first content written comes up as all zeros.

We don't know that this is what happened in your case, but it is a frequent hypothesis among people who study this kind of thing.

Priority: P1 → P2
Assignee: jvarga → nobody
Status: ASSIGNED → NEW
Severity: normal → S3