Closed Bug 941991 (opened 11 years ago, closed 10 years ago)

[email/IMAP] Synchronization occasionally falling into synchronizing from the epoch (1970, 1969 time-zone adjusted)

Categories

Firefox OS Graveyard :: Gaia::E-Mail (defect)
Platform: x86_64 Linux
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: 2.0 S5 (4july)
blocking-b2g: 2.0+
b2g-v1.4 --- unaffected
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: asuth, Assigned: mcav)

Attachments

(1 file)

pull request: 64 bytes, text/x-github-pull-request (jrburke: review+)

:gerard-majax reported seeing the following when running gaia/master against his Courier IMAP server:

I/Gecko   ( 4243): WLOG: syncDateRange: -86400000 null
I/Gecko   ( 4243): WLOG: Skewed DB lookup. Start:  -61200000 Wed, 31 Dec 1969 07:00:00 GMT End:  null null
I/Gecko   ( 4243): WLOG: searchargs:  NOT DELETED SINCE 31-Dec-1969
I/Gecko   ( 4243): WLOG: SERVER UIDS 0 60
I/Gecko   ( 4243): WLOG: Sync Completed! null days 0 messages synced
I/Gecko   ( 4243): WLOG: folder message count 39907 dbCount 0 syncedThrough null oldest known 0


He indicated that he thinks this has something to do with a network failure case, which could make a lot of sense and explain things: if a failing sync ends up recording 0 as the start of our validity range and we accidentally let that get into the accuracy ranges, that could sink us.

It's conceivable that servers differ in how they handle search criteria that pre-date the epoch.  In this case, it appears Courier does not handle it well and returns zero messages.  (Although that's probably for the best, because otherwise we would probably try to synchronize all 39907 messages and would eventually start purging what we had just synchronized, etc.)
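
As a sidebar for readers: a minimal sketch of the JavaScript coercion at play (illustrative variable names, not GELAM's actual identifiers). In JS arithmetic, null silently coerces to 0, which is how a missing timestamp becomes the epoch-adjacent dates in the log above:

  var DAY_MILLIS = 86400000;
  var endTS = null;                  // e.g. from a dead or reset slice

  // "sync one day back from endTS" becomes one day back from the epoch:
  var startTS = endTS - DAY_MILLIS;  // null - 86400000 === -86400000
  new Date(startTS).toUTCString();   // "Wed, 31 Dec 1969 00:00:00 GMT"

  // A +7h timezone skew then produces the "Skewed DB lookup" value:
  startTS + 7 * 60 * 60 * 1000;      // -61200000, i.e. 07:00:00 GMT
  // => the generated search is "NOT DELETED SINCE 31-Dec-1969"
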
I'm still able to see this occasionally. I haven't yet been able to provide valid STRs, though, and I would put the reproducibility rate at once or twice per month.
Since yesterday's master build, I have been able to reproduce this twice. This morning, my device was showing the issue on one email account folder after being left plugged into USB all night with Wi-Fi enabled. I deleted the account, reconfigured it, and resynced successfully.

Then I left home, Wi-Fi still enabled. Once on the train, I enabled data (Wi-Fi still enabled) to check my emails. The bug appeared again.

This time, thanks to the timeframe, I was able to collect logs between the first sync after re-adding the account and the failure. So far, nothing conclusive. It should be noted that at some point on my way from home to the train station, I pass near a Wi-Fi SSID that my device can connect to: this may have triggered a sync that got stuck as soon as I left the covered area.

Andrew, if you need them, I can share the logcat with you.
Flags: needinfo?(bugmail)
Please do!  Feel free to mail them to me at asuth@mozilla.com.  I've got a chunk of time coming up that I should be able to use to fix up many of our bugs like this.
Flags: needinfo?(bugmail)
I'm still reproducing this from time to time. It may only reproduce on my courier-imap account.
:ladamski provided some good logs on bug 1018828 and marked it blocking 2.0+, so we'll likely be addressing this in the 2.0 stabilization phase and duping this bug to that one.  What's surprising is that in his logs we actually eventually recover from the problem because the bisection logic repeatedly triggers until the underlying numbers wrap around again somehow.
Blocks: 1018828
Assignee: nobody → m
Target Milestone: --- → 2.0 S4 (20june)
blocking-b2g: --- → 2.0+
Reproduced again this weekend, running a master build on Saturday. My phone was sitting on my desk, Wi-Fi enabled.
Target Milestone: 2.0 S4 (20june) → 2.0 S5 (4july)
Attached file pull request
This bug occurred because the sync process occasionally attempts to sliceOpenMostRecent() a dead slice, and we currently have no guard against that happening. Dead slices have .startTS and .endTS set to null, which resulted in math like "null + tzOffset" and led to bad values being stored in accuracyRanges.

As I mentioned briefly in bug 1025727, the fact that this occurs in the first place points to a deeper issue with sync. At this stage I would advocate shipping just this targeted fix and opening a followup bug to investigate what kind of race condition is causing this to occur.

The patch effectively no-ops the slice refresh if the slice is dead. A unit test verifies that without the patch, we see the 1970 bug; with the patch, it works as expected.
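
For illustration, here is a sketch of the shape of the guard (simplified and with illustrative naming; the pull request has the actual change):

  // In the refresh path, bail out early if the slice has already died.
  // A dead slice's startTS/endTS are null, and doing date math on them
  // is what stored epoch garbage in accuracyRanges.
  refreshSlice: function(slice) {
    if (slice.isDead) {
      return; // no-op the refresh rather than "sync from 1970"
    }
    // ... normal refresh path using slice.startTS / slice.endTS ...
  },
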

James, I'm passing r? to you first; when we think it's ready, I'll set the flag for Andrew. Regarding the follow-on sync stuff, I don't have a lot of concrete data at this point, other than noticing that 'oncronsyncstop' frequently never fires. If we decide to open a followup for that, I can gather more data there.
Attachment #8444797 - Flags: review?(jrburke)
Does this bug fix affect bug 1018828? If yes, please mark this bug as a dependency of the other one.
It blocks bug 1018828, and it is already marked as such.
I'm still looking into this code, but some thoughts:

* isDead checks for a null _bridgeHandle, which gets nulled when a slice die()s, but die() does not necessarily set startTS to null. startTS can become null as part of a .reset(), though. Maybe a reset() followed by the slice being told to .die()?
* Do you notice the missing oncronsyncstop only once this fix is in place, or before it as well? If only after, perhaps skipping the refresh means the oncronsyncstop pathway never completes, since it depends on a refresh actually proceeding rather than being skipped.

I am a bit rusty with this code, so the null startTS might follow directly from the die() pathway, but it still feels weird that the refresh is skipped based on a missing _bridgeHandle rather than a null startTS. The missing _bridgeHandle is a good signal to use; I'm just concerned there is a window for failure between startTS going null and _bridgeHandle going null, unless the failure case is a reset() followed very closely by a die(). Maybe a cronsync slice that sticks around from a previous sync gets a reset(), but then gets a die() to make way for a new cronsync slice? I'm really speculating now; I have not traced the cronsync slice yet and could be totally wrong. I need to trace the code more tomorrow.
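
To make the window described above concrete, here is a simplified sketch of the lifecycle pieces being discussed (illustrative, not the actual MailSlice code):

  MailSlice.prototype.reset = function() {
    this.startTS = null;         // reset() nulls the date range...
    this.endTS = null;
  };
  MailSlice.prototype.die = function() {
    this._bridgeHandle = null;   // ...while die() nulls the handle
  };                             // that isDead checks.

  // So a slice can sit with startTS === null (post-reset) while isDead
  // is still false (pre-die); a refresh landing in that window would
  // slip past a guard keyed on isDead.
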
The behavior I saw regarding 'oncronsyncstop' existed before I changed anything. I had just hooked up logs in a bunch of places to understand what was going on, and noticed that 'oncronsyncstop' wasn't always called. Eventually I realized that the slice was dead, but I didn't get to the point of understanding whether it came from a prior slice or not. If you'd like me to gather more details on this part, I can; I just opted to focus initially on finding the cause of the 1970 bug so as not to fall too far down the rabbit hole.

I was initially going to guard on startTS being null, but it seemed hackier than keying on isDead. Admittedly, I also haven't spent the time to fully understand the desired slice lifecycle.
(Similarly, startTS _was_ null in the failure case here, so that's the pathway it was taking for this particular error.)
I skimmed this since I was doing review-ish things.  f=asuth, since the isDead check seems inherently sane/safe.  I am also a bit confused about what control-flow path would make this relevant.  I'll think on it after resting a bit, but in the meantime don't let me hold you up.
Comment on attachment 8444797 [details] [review]
pull request

r+ with the log suggestion in the pull request; :mcav and I will continue to do some tracing of the conditions that could cause a refresh on a dead slice.

My concern is that even with this fix, the _cronSync slice in MailUniverse will effectively be a dead no-op for actual syncing until the mail app is killed. At least we will not get stuck in weirder states with the 1970 issue, though.
Attachment #8444797 - Flags: review?(jrburke) → review+
gelam: https://github.com/mozilla-b2g/gaia-email-libs-and-more/commit/6252b2c6b3bb6cddbe7dabe43f331dd68342f373

master: https://github.com/mozilla-b2g/gaia/commit/515914905d40d171ea6b927f4c8cbbb93c5bde20

Following up with :jrburke on other cronsync gotchas.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Excellent. I'm updating my Desire Z build with current master to check whether it fixes things on my side.

Marcus and James: considering I have a copy of the profile that is currently in this broken state, should I expect the update to recover from the bogus state, or will it just avoid getting back into this state in the future?
Flags: needinfo?(m)
Flags: needinfo?(jrburke)
So, I tested current master on my old profile on the Desire Z, and I still get "No message in this folder", and logcat shows:

> I/Gecko   ( 1570): WLOG: syncDateRange: -86400000 null
> I/Gecko   ( 1570): WLOG: Skewed DB lookup. Start:  -61200000 Wed, 31 Dec 1969 07:00:00 GMT End:  null null
> I/Gecko   ( 1570): WLOG: searchargs:  NOT DELETED SINCE 31-Dec-1969
I don't think it will automatically recover; it likely requires a new profile / account recreation.
Flags: needinfo?(m)
From what I recall of hitting the 1970 bug, I did have to delete the account and then add it back.
Flags: needinfo?(jrburke)
Now that enough time has passed since this landed, I can say that I have not been able to reproduce this bug.