Open Bug 1298758 Opened 4 years ago Updated 1 year ago

Investigate if Sync failures caused by "Error(s) encountered during statement execution: database disk image is malformed" self-repair

Categories

(Firefox :: Sync, defect, P3)

People

(Reporter: markh, Unassigned)

Details

(Whiteboard: [data-integrity])

The Sync ping is telling us that one of the most common "unexpected" reasons Sync is failing is due to "Error(s) encountered during statement execution: database disk image is malformed" - around 1300 occurrences on Nightly and Aurora over the last month or so. Sadly we don't have stack traces.

Bug 1220723 comment 1 says "We also have a weekly task that runs in background and tries to detect and fix these corruptions. Doing this daily or such would be too expensive, it's a very costly operation."

This bug is to perform further analysis of the Sync pings to try and find evidence that this weekly task does manage to unstick users that see this error.
Priority: -- → P2
Whiteboard: [data-integrity]
I did some simple analysis, which isn't too difficult while the number of sync pings we have is relatively low - and the news seems good :) Of the ~1300 occurrences, only 8 users are affected, and 6 of those recorded a successful bookmarks sync after having seen this error. This is from a total of 47830 users recording a bookmark sync (success or otherwise).

https://gist.github.com/mhammond/1301489b8ceae7976bfe237681fb236b

What the script does is to look for all sync pings that include a bookmark Sync. It then groups those pings by the uid and sorts by date. If we see a successful bookmark sync after we've seen the specific error, we consider the issue to have been "fixed".
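
A minimal sketch of that grouping logic, for illustration only. It is not the code from the gist: the ping shape (the `uid`, `when` and `bookmarks_error` keys, with `None` meaning the bookmarks sync succeeded) is a hypothetical simplification of the real Sync ping.

```python
from collections import defaultdict

# The error text we are looking for in a failed bookmarks sync.
ERROR = "database disk image is malformed"


def classify_users(pings):
    """Return (affected, fixed) sets of uids.

    `pings` is an iterable of dicts with hypothetical keys:
      uid             - the user id
      when            - a sortable timestamp
      bookmarks_error - error string, or None if the bookmarks sync succeeded
    """
    # Group every bookmark-sync ping by the user it came from.
    by_uid = defaultdict(list)
    for ping in pings:
        by_uid[ping["uid"]].append(ping)

    affected, fixed = set(), set()
    for uid, user_pings in by_uid.items():
        seen_error = False
        # Walk each user's pings in date order.
        for ping in sorted(user_pings, key=lambda p: p["when"]):
            error = ping["bookmarks_error"]
            if error is not None and ERROR in error:
                seen_error = True
                affected.add(uid)
            elif error is None and seen_error:
                # A success after the error: consider the user "fixed".
                fixed.add(uid)
    return affected, fixed
```

Note that a success recorded before the first error does not count, which is why the pings have to be sorted by date first.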

The analysis is flawed for a few reasons - see the "XXX - we should:" comment in the gist for just a few! I believe those listed will tend to under-report the number of "fixed" profiles (tl;dr - there are cases where we already missed, or haven't yet seen, the expected successful Sync after a repair).

It looks like the automatic Places repair is working and I don't see any evidence to suggest it is worth further analysis. Bug 1220723 remains for the general corruption error, so I think we can call this done and I'm closing this.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Assignee: nobody → markh
I spoke a little too soon - a bigger fault in the analysis is that it assumes each user has a single desktop device. A user with 2 Windows devices, only one of which is corrupt, will be reported as "fixed" as soon as the working device records a successful sync. I suspect the repair is working, but it's not quite as clear-cut as I was suggesting, so I think it's worth keeping on the radar for now...
Status: RESOLVED → REOPENED
Depends on: 1288445
Resolution: FIXED → ---
doh!
Depends on: 1299784
No longer depends on: 1288445
Priority: P2 → P1
Since we added the device ID, we have seen only 2 devices with the corruption. Neither of those seemed to be repaired, but I don't think we have enough data yet, so I'm moving this back to the backlog.
Assignee: markh → nobody
Status: REOPENED → NEW
Priority: P1 → P3
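
For reference, a rough sketch of what a per-device version of the check could look like once pings carry a device ID. This is an assumption about the approach, not the actual analysis script: the `uid`, `device_id`, `when` and `bookmarks_error` keys are hypothetical, and here a device only counts as repaired if its most recent relevant ping is a success.

```python
from collections import defaultdict

# The error text we are looking for in a failed bookmarks sync.
ERROR = "database disk image is malformed"


def repaired_by_device(pings):
    """Map (uid, device_id) -> True/False for devices that saw the error.

    `pings` is an iterable of dicts with hypothetical keys:
      uid, device_id  - who and which device sent the ping
      when            - a sortable timestamp
      bookmarks_error - error string, or None if the bookmarks sync succeeded
    """
    # Key on the device, not just the user, so a healthy second device
    # cannot make a corrupt one look "fixed".
    by_device = defaultdict(list)
    for ping in pings:
        by_device[(ping["uid"], ping["device_id"])].append(ping)

    result = {}
    for key, device_pings in by_device.items():
        seen_error = repaired = False
        for ping in sorted(device_pings, key=lambda p: p["when"]):
            error = ping["bookmarks_error"]
            if error is not None and ERROR in error:
                # The error (re)appeared: not repaired as of this ping.
                seen_error, repaired = True, False
            elif error is None and seen_error:
                repaired = True
        if seen_error:
            result[key] = repaired
    return result
```

With only 2 affected devices so far, the output of something like this is too small to draw conclusions from either way.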

Note that someone has experienced this using the Livemarks add-on: https://github.com/nt1m/livemarks/issues/147#issuecomment-459876225
