Closed Bug 941991 (opened 11 years ago, closed 10 years ago)

[email/IMAP] Synchronization occasionally falling into synchronizing from the epoch (1970, 1969 time-zone adjusted)

Categories

Firefox OS Graveyard :: Gaia::E-Mail (defect)
Platform: x86_64 Linux
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: 2.0 S5 (4july)
blocking-b2g: 2.0+
b2g-v1.4 --- unaffected
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: asuth, Assigned: mcav)

Attachments

(1 file)

pull request: 64 bytes, text/x-github-pull-request (jrburke: review+)

:gerard-majax reported seeing the following when running gaia/master against his Courier IMAP server:

I/Gecko   ( 4243): WLOG: syncDateRange: -86400000 null
I/Gecko   ( 4243): WLOG: Skewed DB lookup. Start:  -61200000 Wed, 31 Dec 1969 07:00:00 GMT End:  null null
I/Gecko   ( 4243): WLOG: searchargs:  NOT DELETED SINCE 31-Dec-1969
I/Gecko   ( 4243): WLOG: SERVER UIDS 0 60
I/Gecko   ( 4243): WLOG: Sync Completed! null days 0 messages synced
I/Gecko   ( 4243): WLOG: folder message count 39907 dbCount 0 syncedThrough null oldest known 0


He indicated that he thinks this has something to do with a network failure case, which could make a lot of sense and explain things: if a failing sync ends up recording 0 as the start of our validity range and we accidentally let that get into the accuracy ranges, that could sink us.

It's conceivable that servers differ in how they handle search criteria that pre-date the epoch.  In this case, it appears Courier does not handle it well and returns zero messages.  (Although that's probably for the best, because otherwise we would probably try to synchronize all 39907 messages and would eventually start purging what we had just synchronized, etc.)
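
As a sidebar for readers: a minimal sketch of the JavaScript coercion at play (illustrative variable names, not GELAM's actual identifiers). In JS arithmetic, null silently coerces to 0, which is how a missing timestamp becomes the epoch-adjacent dates in the log above:

  var DAY_MILLIS = 86400000;
  var endTS = null;                  // e.g. from a dead or reset slice

  // "sync one day back from endTS" becomes one day back from the epoch:
  var startTS = endTS - DAY_MILLIS;  // null - 86400000 === -86400000
  new Date(startTS).toUTCString();   // "Wed, 31 Dec 1969 00:00:00 GMT"

  // A +7h timezone skew then produces the "Skewed DB lookup" value:
  startTS + 7 * 60 * 60 * 1000;      // -61200000, i.e. 07:00:00 GMT
  // => the generated search is "NOT DELETED SINCE 31-Dec-1969"
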
I'm still able to see this occasionally. I haven't yet been able to provide valid STRs, though, and I would put the reproducibility rate at once or twice per month.
Since yesterday's master build, I have been able to reproduce this twice. This morning, my device was showing the issue on one email account folder after being left plugged into USB all night with Wi-Fi enabled. I deleted the account, reconfigured it, and resynced successfully.

Then I left home, Wi-Fi still enabled. Once on the train, I enabled data (Wi-Fi still enabled) to check my emails. The bug appeared again.

This time, thanks to the timeframe, I was able to collect logs between the first sync after re-adding the account and the failure. So far, nothing conclusive. It should be noted that at some point on my way from home to the train station, I pass near a Wi-Fi SSID that my device can connect to: this may have triggered a sync that got stuck as soon as I left the covered area.

Andrew, if you need them, I can share the logcat with you.
Flags: needinfo?(bugmail)
Please do!  Feel free to mail them to me at asuth@mozilla.com.  I've got a chunk of time coming up that I should be able to use to fix up many of our bugs like this.
Flags: needinfo?(bugmail)
I'm still reproducing this from time to time. It may only reproduce on my courier-imap account.
:ladamski provided some good logs on bug 1018828 and marked it blocking 2.0+, so we'll likely be addressing this in the 2.0 stabilization phase and duping this bug to that one.  What's surprising is that in his logs we actually eventually recover from the problem because the bisection logic repeatedly triggers until the underlying numbers wrap around again somehow.
Blocks: 1018828
Assignee: nobody → m
Target Milestone: --- → 2.0 S4 (20june)
blocking-b2g: --- → 2.0+
Reproduced again this weekend, running a master build on Saturday. My phone was sitting on my desk, Wi-Fi enabled.
Target Milestone: 2.0 S4 (20june) → 2.0 S5 (4july)
Attached file pull request
This bug occurred because the sync process occasionally attempts to sliceOpenMostRecent() a dead slice, and we currently have no guard against that happening. Dead slices have .startTS and .endTS set to null, which resulted in math like "null + tzOffset" and led to bad values being stored in accuracyRanges.

As I mentioned briefly in bug 1025727, the fact that this occurs in the first place points to a deeper issue with sync. At this stage I would advocate shipping just this targeted fix and opening a followup bug to investigate what kind of race condition is causing this to occur.

The patch effectively no-ops the slice refresh if the slice is dead. A unit test verifies that without the patch, we see the 1970 bug; with the patch, it works as expected.
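
For illustration, here is a sketch of the shape of the guard (simplified and with illustrative naming; the pull request has the actual change):

  // In the refresh path, bail out early if the slice has already died.
  // A dead slice's startTS/endTS are null, and doing date math on them
  // is what stored epoch garbage in accuracyRanges.
  refreshSlice: function(slice) {
    if (slice.isDead) {
      return; // no-op the refresh rather than "sync from 1970"
    }
    // ... normal refresh path using slice.startTS / slice.endTS ...
  },
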

James, I'm passing r? to you first; when we think it's ready, I'll set the flag for Andrew. Regarding the follow-on sync stuff, I don't have a lot of concrete data at this point, other than noticing that 'oncronsyncstop' frequently never fires. If we decide to open a followup for that, I can gather more data there.
Attachment #8444797 - Flags: review?(jrburke)
Does this bug fix affect bug 1018828? If yes, please mark this bug as a dependency of the other one.
It blocks bug 1018828, and it is already marked as such.
I'm still looking into this code, but some thoughts:

* isDead checks for a null _bridgeHandle, which gets nulled when a slice die()s, but die() does not necessarily set startTS to null. startTS can become null as part of a .reset(), though. Maybe a reset() followed by the slice being told to .die()?
* Do you notice the missing oncronsyncstop only once this fix is in place, or before it as well? If only after, perhaps skipping the refresh means the oncronsyncstop pathway never completes, since it depends on a refresh actually proceeding rather than being skipped.

I am a bit rusty with this code, so the null startTS might follow directly from the die() pathway, but it still feels weird that the refresh is skipped based on a missing _bridgeHandle rather than a null startTS. The missing _bridgeHandle is a good signal to use; I'm just concerned there is a window for failure between startTS going null and _bridgeHandle going null, unless the failure case is a reset() followed very closely by a die(). Maybe a cronsync slice that sticks around from a previous sync gets a reset(), but then gets a die() to make way for a new cronsync slice? I'm really speculating now; I have not traced the cronsync slice yet and could be totally wrong. I need to trace the code more tomorrow.
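
To make the window described above concrete, here is a simplified sketch of the lifecycle pieces being discussed (illustrative, not the actual MailSlice code):

  MailSlice.prototype.reset = function() {
    this.startTS = null;         // reset() nulls the date range...
    this.endTS = null;
  };
  MailSlice.prototype.die = function() {
    this._bridgeHandle = null;   // ...while die() nulls the handle
  };                             // that isDead checks.

  // So a slice can sit with startTS === null (post-reset) while isDead
  // is still false (pre-die); a refresh landing in that window would
  // slip past a guard keyed on isDead.
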
The behavior I saw regarding 'oncronsyncstop' existed before I changed anything. I had just hooked up logs in a bunch of places to understand what was going on, and noticed that 'oncronsyncstop' wasn't always called. Eventually I realized that the slice was dead, but I didn't get to the point of understanding whether it came from a prior slice or not. If you'd like me to gather more details on this part, I can; I just opted to focus initially on finding the cause of the 1970 bug so as not to fall too far down the rabbit hole.

I was initially going to guard on startTS being null, but it seemed hackier than keying on isDead. Admittedly, I also haven't spent the time to fully understand the desired slice lifecycle.
(Similarly, startTS _was_ null in the failure case here, so that's the pathway it was taking for this particular error.)
I skimmed this since I was doing review-ish things.  f=asuth, since the isDead check seems inherently sane/safe.  I am also a bit confused about what control-flow path would make this relevant.  I'll think on it after resting a bit, but in the meantime don't let me hold you up.
Comment on attachment 8444797 [details] [review]
pull request

r+ with the log suggestion in the pull request; :mcav and I will continue to do some tracing of the conditions that could cause a refresh on a dead slice.

My concern is that even with this fix, the _cronSync slice in MailUniverse will effectively be a dead no-op for actual syncing until the mail app is killed. At least we will not get stuck in weirder states with the 1970 issue, though.
Attachment #8444797 - Flags: review?(jrburke) → review+
gelam: https://github.com/mozilla-b2g/gaia-email-libs-and-more/commit/6252b2c6b3bb6cddbe7dabe43f331dd68342f373

master: https://github.com/mozilla-b2g/gaia/commit/515914905d40d171ea6b927f4c8cbbb93c5bde20

Following up with :jrburke on other cronsync gotchas.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Excellent. I'm updating my Desire Z build with current master to check whether it fixes things on my side.

Marcus and James: considering I have a copy of the profile that is currently in this broken state, should I expect the update to recover from the bogus state, or will it just avoid getting back into this state in the future?
Flags: needinfo?(m)
Flags: needinfo?(jrburke)
So, I tested current master on my old profile on the Desire Z, and I still get "No message in this folder", and logcat shows:

> I/Gecko   ( 1570): WLOG: syncDateRange: -86400000 null
> I/Gecko   ( 1570): WLOG: Skewed DB lookup. Start:  -61200000 Wed, 31 Dec 1969 07:00:00 GMT End:  null null
> I/Gecko   ( 1570): WLOG: searchargs:  NOT DELETED SINCE 31-Dec-1969
I don't think it will automatically recover; it likely requires a new profile / account recreation.
Flags: needinfo?(m)
From what I recall of hitting the 1970 bug, I did have to delete the account and then add it back.
Flags: needinfo?(jrburke)
Now that enough time has passed since this landed, I can say that I have not been able to reproduce this bug.