Closed Bug 1392078 Opened 8 years ago Closed 8 years ago

Crash in java.lang.IllegalStateException: Decremented incorrect number of change counters at org.mozilla.gecko.sync.repositories.VersioningDelegateHelper.persistSyncVersions(VersioningDelegateHelper.java)

Tracking

(firefox-esr52 unaffected, firefox55 unaffected, firefox56 unaffected, firefox57 fixed)

Status:

RESOLVED FIXED

Milestone:

Firefox 57

Tracking Flags:

Tracking

Status

firefox-esr52

---

unaffected

firefox55

---

unaffected

firefox56

---

unaffected

firefox57

---

fixed

People

(Reporter: calixte, Assigned: Grisha)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, topcrash, Whiteboard: [clouseau])

Crash Data

Attachments

(1 file)

Bug 1392078 - More detailed strong assertions to help narrow down root cause 8 years ago :Grisha Kruglov 59 bytes, text/x-review-board-request	rnewman : review+	Details

Calixte Denizet (:calixte)

Reporter

Description

•

8 years ago

This bug was filed from the Socorro interface and is report bp-650cfd84-d609-4634-9c63-a37440170820. ============================================================= There are 1241 crashes in nightly 57 starting with buildid 20170818100308. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1364644. The signature is ranked #1 in nightly FennecAndroid top-crashers. [1] https://hg.mozilla.org/mozilla-central/rev/517c934fdec5

Flags: needinfo?(gkruglov)

:Grisha Kruglov

Assignee

Comment 1

•

8 years ago

I guess now we know this _does_ happen: https://dxr.mozilla.org/mozilla-central/source/mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/VersioningDelegateHelper.java#108

Nicholas Nethercote [inactive]

Comment 2

•

8 years ago

This is the #1 Fennec topcrash in Nightly 20170818100308, with 587 occurrences so far, which is a *tremendously* high number, making Fennec much less stable than usual. Grisha, please address this ASAP! Backing out bug 1364644 is a reasonable option.

Richard Newman [:rnewman]

Comment 3

•

8 years ago

(In reply to Nicholas Nethercote [:njn] from comment #2) > Grisha, please address this ASAP! Backing out bug 1364644 is a reasonable > option. No it's not; it involves a database schema change.

:Grisha Kruglov

Assignee

Comment 4

•

8 years ago

This can't be backed out easily. Looking at the assertion failure, it seems like there could be a couple of things going on: 1) the bookmarks table changed since the guid/version mapping was constructed, and not all guids that we're trying to update are present. One example I can think of: a bookmark was added locally with (1,0); it was then synced and uploaded, and we request its change to (1,1). Before we can execute the change, bookmark is removed. Since it's at (1,0), it's allowed to be dropped from the database, and so we get a changed count mismatch. 2) the guid/versions map changed after we fired off a call to the content provider, and before we check results of the call. This implies that there's a concurrency/timing issue somewhere in the sessions layer. A lot of things changed in the sessions layer, and so I'm inclined to think that (2) is the more likely culprit here - but I don't see it yet. Also, (1) is possible but improbable due to tight timing required. Due to large volume of assertion failures, (2) is probably the vast majority of these crashes, with perhaps a few of (1) thrown in. In case (1) happens, the impact is that we've uploaded a record, but deleted it locally without storing a tombstone. Bookmarks will have diverged. We can rectify this in two ways: heavy weight, obtain an exclusive lock for some subset of our syncing. And the easy solution: either don't drop versioned records at all if they're at (1,0) and have been deleted, or perhaps don't drop recently added ones. In case (2) happens, the impact is less clear since I don't yet understand what could be happening, but it seems that at best we'll re-upload some records unnecessarily, and at worst we won't upload some records that we should have. For now, to help stabilize nightly: - I'll submit a patch that changes this assertion failure to: if (changed < this.localVersionsOfGuids.size()) {crash!} - we'd probably still crash, but hopefully less and it'll be clear that we're crashing because of (2) - modify record deletion clean-up logic to narrow down the gap that might allow for (1) to happen. - file a bug to keep investigating (2)

Flags: needinfo?(gkruglov)

:Grisha Kruglov

Assignee

Comment 5

•

8 years ago

Actually, (changed < this.localVersionsOfGuids.size()) will still cover both (1) and (2), so that doesn't help. (changed > this.localVersionsOfGuids.size()) might be useful to have as an assertion, since we're supposed to be only adding to the 'localVersionsOfGuids' map, and so that _really_ shouldn't happen.

Comment hidden (mozreview-request)

:Grisha Kruglov

Assignee

Updated

•

8 years ago

Assignee: nobody → gkruglov

Status: NEW → ASSIGNED

:Grisha Kruglov

Assignee

Comment 7

•

8 years ago

Another possibility: duplicate GUIDs present. We have a unique index on GUIDs, but after looking through hg blame I'm unsure when it was added. It seems to have been present for a long while, going back to version 7 or earlier (we're at 39 now). But, it's possible that _if_ the index was added after the duplicate got inserted, it'll remain. The patch I pushed up adds additional assertions that test if: - the version/guid map was modified in size while the UPDATE calls were running, indicating that there is a scheduling bug in how we call 'performCleanup' - more than one record was updated in an UPDATE, indicating that there are duplicate GUIDs - zero records were updated in an UPDATE, indicating that a record was dropped from the database - something else went wrong, above assertions didn't fire, and somehow we still got a wrong count back - added checks to see if it's above or below the expected count I expect that more than one of these are at play. I think it would be worth having another bumpy nightly cycle to get a detailed distribution of problems.

Richard Newman [:rnewman]

Comment 8

•

8 years ago

mozreview-review

Comment on attachment 8899585 [details] Bug 1392078 - More detailed strong assertions to help narrow down root cause https://reviewboard.mozilla.org/r/170876/#review176066 I am very happy for Nightly to be less stable for one day in order to find the root cause for this. That's the price we pay for only having crashes as an error reporting channel. ::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/VersioningDelegateHelper.java:121 (Diff revision 1) > + if (versionMapSizeBeforeUpdate != versionMapSizeAfterUpdate) { > + throw new IllegalStateException("Version/guid map changed during syncVersion update"); > + } > + > + if (changed < versionMapSizeBeforeUpdate) { > + throw new IllegalStateException("Updated less sync versions than expected"); http://www.quickmeme.com/meme/3q3j8v

Attachment #8899585 - Flags: review?(rnewman) → review+

Comment hidden (mozreview-request)

Pulsebot

Comment 10

•

8 years ago

Pushed by gkruglov@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c7e972957ff0 More detailed strong assertions to help narrow down root cause r=rnewman

Julie McCracken (:julie)

Updated

•

8 years ago

Priority: -- → P1

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 11

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/c7e972957ff0

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

status-firefox57: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → Firefox 57

:Grisha Kruglov

Assignee

Updated

•

8 years ago

Comment 12

•

8 years ago

The first of these assertions popped up in the crash stats. Judging by its early volume reports, I have a feeling it will account for most of the assertion failures recorded here. Also, none of the others assertions I've added above made their way into crash stats yet. Bug 1392716 shows that we're failing to modify syncVersion for a given GUID. Our UPDATE syncVersion...WHERE guid=? statement returned 0 modified, even though we were sure that this GUID will be present a few moments ago while we were constructing a version/guid map. So, few possibilities. 1) Users are modifying (deleting?) bookmarks at the exact wrong time. Judging by comments in the crash logs ("crashed in the background", "crashed on youtube", etc), and the volume of crashes, I think that's pretty unlikely. Generally I think we should look at causes beyond "user did X". 2) We have a race somewhere. Perhaps bookmark insertion is racing syncVersion updates? Either way, something along those lines is more likely, given that a lot of this code changed recently, and it's probably easy enough to mess up in a subtle way.

:Grisha Kruglov

Assignee

Comment 13

•

8 years ago

More crashes are coming in, only for Bug 1392716, so that's the root cause - or rather, whatever's causing _that_ to happen. To get a row changed count, underneath we're probably using sqlite3_changes() (see https://android.googlesource.com/platform/frameworks/base.git/+/master/core/jni/android_database_SQLiteConnection.cpp#459). And so even if we were setting syncVersion to a value that it already has, we'd still get "changed=1" back, and won't crash. At least, that's how changes() behaves on recent versions of SQLite. I also thought through possibility of record deletions _from_ sync somehow messing things up, but I can't think of a faulty scenario. E.g. can we somehow see a record while fetching them from a database which we also downloaded marked as 'deleted'? No, since the deletion would have been committed to the database in the first flow _before_ we ask it for records in the second flow, and RecordsChannel ensures that these operations are well ordered. And besides, we only put GUIDs into the version map either on fetch from the database, or when we reconciled a record while storing remote records. Deletions do not affect this map. It's possible I missed some concurrency related problem - there are a bunch of things going on. As far as I can trace through the execution paths, it all seems to function well enough. And so that leaves us with fennec-proper as the remaining culprit, that is to say, GUIDs disappearing because either user or something else is deleting them. The only way those actions would result in a GUID disappearing is if the record wasn't synced yet. I can reliably reproduce this scenario by pausing a debugger at the right moment (simulating a very slow network, for example), and deleting an unsynced record in the UI. It seems that we'll always have that as a possibility, unless we lock the database from writes while we try to upload bookmarks, which isn't feasible. And so either way, it seems best to _not_ delete bookmark tombstones at all, regardless if they've been synced or not. I'm going to land a patch in Bug 1392716 which disabled tombstone cleanup, and see if that affects the crash rate. If crash rate goes down, perhaps that _was_ the main cause here. If it stays relatively unaffected, something else is at play, at which point we can just disable the crashing assertion since it's not helping us anymore, and unstable nightly isn't great :(

Sylvestre Ledru [:Sylvestre]

Updated

•

8 years ago

Blocks: clouseau

BMO Automation

Updated

•

8 years ago

Product: Android Background Services → Firefox for Android

BMO Automation

Updated

•

5 years ago

Product: Firefox for Android → Firefox for Android Graveyard

You need to log in before you can comment on or make changes to this bug.