Closed Bug 1018828 Opened 10 years ago Closed 6 years ago

[meta] Email sync fails, never recovers without restart

Categories

(Firefox OS Graveyard :: Gaia::E-Mail, defect)

ARM
Gonk (Firefox OS)
defect
Not set
blocker

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: ladamski, Unassigned)

References

Details

(Keywords: meta)

Attachments

(9 files, 1 obsolete file)

I run into this problem constantly (i.e. dozen+ times a day) trying to use the email app on Flame and other devices.  I have 2 accounts setup (1 activesync moz, 1 gmail).  When using the devices normally--on the move--email inevitably fails after a few syncs and then never recovers.

New emails are never retrieved.  I've attached logcat and email log for an example where I managed to reproduce it after just a few minutes of trying to my computer.  The phone otherwise has connectivity and can retrieve data normally.  This issue also occurs in 1.3, haven't tried 1.4 due to broken RIL but I suspect it will occur there too.

Even worse, in this state outbound emails appear to send (including the related sound effect) but instead vanish.  They don't come through and there are not sent when the app is restarted.  I've attached a logcat of that as well.

build id: 20140601160201
blocking-b2g: --- → 2.0+
Summary: Email sync fails, never recovers without restart → Email sync fails, never recovers, outbound emails disappear
Summary: Email sync fails, never recovers, outbound emails disappear → Email sync fails, never recovers without restart
Comment on attachment 8432340 [details]
logcat of second failed sync

This logcat is full of useful stuff, thank you very much for making the effort to capture it and the others.  While I sincerely hope you never have cause to gather logcats for the email app again, but if you do have to, it'd be great if you could also pass the arg "-v time" so we can get timestamps.

Aside: Although Mozilla's Zimbra can speak ActiveSync, autoconfig is mapped to use IMAP for mozilla.com.  You would need to manually configure your account to use ActiveSync, and per the log you have not done this.  In the abstract I would advise against doing that because ActiveSync is a much more limited protocol, but it does not / cannot have the time warp bug.  (Although that looks like that's the gmail account)

Other thing: Do you know what the deal is with all the failures to save to the 'sdcard' DeviceStorage?

=== Key notes from "logcat of second failed sync":

- The time warp/1970 bug occurred!  And it seemed to "fix" itself!  (The bisection logic apparently freaks out at the prospect of 63,201 message and keeps trying until it somehow ends up with a legal-ish value again.  We need to look into this more via the structured logs, but it looks like there is indeed a correlation with connection loss.

- Many misleading errors that are just the CSS parser in the sanitizer freaking out.

- Probably non-element click target issues.  I think we fixed an issue like this at some point recently, not sure if this is a new/different one or it's a question of the build?  Error is:
onerror reporting: TypeError: t.classList is undefined @ app://email.gaiamobile.org/js/mail_app.js : 846
Ah ok, I had just assumed ActiveSync. FWIW my gmail account is small (a few hundred emails) but yeah the Moz one is large. :)

'sdcard' DeviceStorage failures are due to having the SD card mounted on the computer when trying to save the log file.  Doesn't work until you unmount, so that's all me.

BTW I may have restarted the app during one of those logs, I'm not sure.  Let me know if you need more data, I run into it constantly so I can get another logcat with -v.
Attaching debug log dump from a failed sync today. I have two accounts set up, @mozilla and a google apps. I first noticed the sync failure when switching over to the @google account.
Attachment #8433502 - Attachment description: debug dump - doliver failed sync → debug dump - doliver failed sync; is just the log metadata; the email app either was not restarted after enabling logging or logging was not enabled
Attachment #8433502 - Attachment is obsolete: true
I've noticed this recently too, as early as a few weeks ago when I started dogfooding full-time. I have logging enabled, so I'll see if I can grab full logs soonish.
Comment on attachment 8432343 [details]
logcat of vanished outbound email

This logcat indicates the 1970 bug (bug 941991) is occurring, again appearing to relate to connection loss.

In terms of message sending, the log indicates that the message sent successfully on account 1.  I'm not 100% sure which account is account 1, but since we didn't queue an append job, I think this means it was the google account and it indicates that any failure to append to the Sent folder would be a failure on the google server side.
(In reply to Marcus Cavanaugh [:mcav] <mcav@mozilla.com> from comment #9)
> I've noticed this recently too, as early as a few weeks ago when I started
> dogfooding full-time. I have logging enabled, so I'll see if I can grab full
> logs soonish.

Judging from the logs on this bug, I think we are primarily dealing with bug 941991 (the 1970 bug) and the tendency of the cronsync mechanism to permanently leave us in a wedged state (newly filed bug 1025727).  I'm not sure there is much more in terms of logs/investigation we need to ideally close this out, especially since the logs do provide a good idea of what's causing the 1970 problem.

Specifically, it looks like if we lose our connection while an IMAP sync was on the stack a 0 (or something that coerces to a 0 or that new Date() treats as a 0 when passed it) gets into our accuracy range information so a subsequent sync tries using that.  I would suggest tracing the error code flow paths first.

:mcav, if you could delve into the 1970 bug, that would be great.  I am definitely the appropriate reviewer for that and the review should likely be easy since the error will likely be obvious in retrospect and there will definitely need to be a unit test.  Feel free to request review from me on that despite the vacay thing.
Depends on: 941991
(In reply to Andrew Sutherland [:asuth] (important r/f/ni reqs only thru Jun 30) from comment #10)
> In terms of message sending, the log indicates that the message sent
> successfully on account 1.  I'm not 100% sure which account is account 1,
> but since we didn't queue an append job, I think this means it was the
> google account and it indicates that any failure to append to the Sent
> folder would be a failure on the google server side.

NB: For IMAP we *do not* currently locally put the message in the Sent folder, ever.  We depend on it to be resynchronized from the server.  So if you don't sync with the server in the sent folder, it will look like the message didn't get sent.  This is because of IMAP/UID complexity and us not having had time to spend on enhancing this yet.  (And the gmail case is somewhat non-trivial since it requires reconciliation logic unless google's SMTP implementation provides us with a UID in some type of comment on send.  The non-gmail case can be optimized when we begin using UIDPLUS on servers which support it.)
(In reply to Andrew Sutherland [:asuth] (important r/f/ni reqs only thru Jun 30) from comment #11)
> :mcav, if you could delve into the 1970 bug, that would be great.  I am
> definitely the appropriate reviewer for that and the review should likely be
> easy since the error will likely be obvious in retrospect and there will
> definitely need to be a unit test.  Feel free to request review from me on
> that despite the vacay thing.

Sure thing. Appreciate the clarity on the vacay, though I'd better not see a continuous stream of bugmail from you throughout your trip. :)
No longer blocks: 1008046
Assigning to mcav to investigate. Might be fixed by https://bugzilla.mozilla.org/show_bug.cgi?id=941991
Assignee: nobody → m
Target Milestone: --- → 2.0 S5 (4july)
mcav and I did a pair investigation. We added a bunch of logging, tracked in this branch:

https://github.com/jrburke/gaia/tree/instrumented

and can be run using this zip file:

http://jrburke.com/work/gaia/email-cron-7.zip

and it looks like the sync writes out to the imap connection, but just does not hear anything back. The problem manifested like so:

* Syncing on cell network only, no wifi.
* One imap account configured (was a zimbra account), syncing on 5 minutes.
* The email app was kept running after initial account setup, the phone was just left alone to go to a lock state.

The first account setup sync showed the logs that are part of an IMAP connection, but the next sync did not show the connection setup logs, so it looks like the IMAP connection was reused.

This was not 100% reproducible. Notably, the cell network provider or possibly location to cell towers seem to be in effect. For me, I could not get the it to fail on my cell connection in Canada at the home office. This was an encouraging data point as it helped to indicate that was not something fundamentally broken in the sync slice code.

So, that is the current understanding of the issue. We expect more people will see this issue, particularly with the Flame devices, since it has more RAM and so can keep email around longer.

Possible fixes:

1) Figure out if we can get some better signal from mozTCPSocket or something when the connection goes bad, so we can just shut down the one we have and start fresh next time.

2) If we are syncing, and in the background, at the end of the sync, just shut down the connection. This is probably a good idea anyway. Even though connection startup has a cost, it probably pales to the full app startup cost and likely has some downsides as far as network/radio utilization if the connection is held open for the 5 minutes between sync intervals.

Ideally we could get both. However, for a 2.0 timeframe, if we do #2, that is likely a good enough fix.

Note also that bug 1025727 is now in 1.3 and greater branches, which means it will kill the email app if the wake locks for sync reach their timeout of 45 seconds. So, in a way, it "fixes" the connection problem now by just shutting down the app. However, it is a pretty extreme fix, and a nicer fix would be best.

For testing a fix, you may need to disable the app close introduced with bug 1025727 to see it working well. The log branch above did that so that we could observe and trace the problem. On the other hand, with a proper fix, there should be no "cronsync wake locks expired, force closing app" in the logs either.

The instrumented code included the "don't refresh if slice is dead" 1970 fix in bug 941991. However, the instrumented build used a throw in the isDead case instead of just a log, so that if it did happen, we could get a handle on the code path.

But, we did not get 1970 bug for those instrumented tests, so the root cause of why a dead slice was trying to refresh was not discovered in this test round.
I'm 0/3 for sending emails this weekend.  Each time the sent just hangs with the spinner and never completes, email is not sent and draft is lost.  So I'm not sure bug 1025727 will help as it still results in data loss on send.

Another frustrating symptom with this bug is that the email client flat-out lies about connection state.  It'll say things like "last sync: 5 minutes ago" yet if you restart the app you'll suddenly get hours worth of new emails.
(In reply to Lucas Adamski [:ladamski] (plz needinfo) from comment #16)
> I'm 0/3 for sending emails this weekend.

Is this still on the original build, or master? If it's a newer build, could you pass along another logcat?
Flags: needinfo?(ladamski)
Its on Aurora (2.0), 201406300000201.

Pretty easy to reproduce on 3G, just hit refresh a few time and then try to send an email, and you get stuck.  Attaching new logs that I just pulled.
Flags: needinfo?(ladamski)
Attached file logcat of stuck send
(In reply to Lucas Adamski [:ladamski] (plz needinfo) from comment #16)
> I'm 0/3 for sending emails this weekend.  Each time the sent just hangs with
> the spinner and never completes, email is not sent and draft is lost.  So
> I'm not sure bug 1025727 will help as it still results in data loss on send.

The log indicates that our SMTP connection never got to the point like we felt like we could send the message.  Our logging is not verbose enough to indicate whether this is a network issue or something weird happening in the state machine.

The email secret debug log is boring because the failsafe mechanism shut the app down, taking any interesting tidbits with it.  Although there would be no interesting tidbits since all our interesting SMTP logging at this point is just console.log-ish stuff.

We should definitely up the logging for SMTP.  If this involves timeouts/other things, we're in that unpleasant situation, however where we really want to upgrade to the email.js versions of the libs rather than using/modifying the node.js version we're currently using.  Which has uplift complications/ramifications for v2.0.

> Another frustrating symptom with this bug is that the email client flat-out
> lies about connection state.  It'll say things like "last sync: 5 minutes
> ago" yet if you restart the app you'll suddenly get hours worth of new
> emails.

Do you know which account(s) you're referring to here?  gmail's IMAP implementation has some weird caching heuristics that we might be running afoul of.  It's conceivable we are continually syncing against the same checkpoint/view of the gmail folder state.
Er, but I should indicate that according to the log, the draft appears to have saved correctly.  Are you sure you're looking in the right "local drafts" folder for the drafts?
That weekend was mostly Zimbra, but I'm pretty sure I've run into that problem with both.  Also, there was not a copy in Local Drafts or anywhere else that I could find.
Hm, the draft issue implies the IndexedDB write is not hitting the disk.  Which is theoretically possible since the write is asynchronously triggered in a blind (but coherent) fashion following the draft saving op's completion.  But I would expect a structured clone error to be logged or our IndexedDB generalized/fatal error handler to trigger and log.  Maybe it has a bug or we somehow have a gap somewhere.  Alternately, the virtual list widget could be getting faked out, but I would expect that to affect all users.

I just checked into the lastSyncedAt being a lie.  The IMAP code lacks test coverage for this, which is not surprising because it appears there may also be an IMAP bug that we update our sync timestamp regardless of whether we failed to sync or not.  ActiveSync is okay.  POP3's behaviour seems wrong and like it might do some sketchy stuff like multiply invoke an unguarded callback.  I'll spin off a bug on the IMAP case now and an investigation POP3 bug and mark them as blockers against this bug.

I'm putting increasing the logging around the DB writes and the draft saves on the to-do for this bug along with the more extensive SMTP logging.
While filing the POP3 bug I dug a little deeper and noticed it does not have lastSyncedAt issues.  Some error-handling improvements could be made so I filed bug 1035578 but am not making it block this bug.
Faramarz has run into this too (cc'ed).  I certainly don't know the code here, but the naive observed behavior is that when sync "breaks", getting, sending and reading (i.e. retrieving email contents) all fails... even when sync claims to succeed it doesn't retrieve any emails.  

I really haven't noticed any observed difference between Mozilla and Gmail accounts... when sync breaks, I believe both actually stop working but I'll verify that next time it happens.
Target Milestone: 2.0 S5 (4july) → 2.0 S6 (18july)
I think we now have an improved handle on what's happening and some fixes ready to land soon.

I think bug 987931 may be related to an apparent lack of sync progress in cases where we are self-closing.  I have a fix for this up on bug 921050 that can be segregated out for uplift.


End-of-day brain-dump mainly for me:

While looking into that I was able to investigate the IMAP sync control flow a bit more, especially as it relates to us providing misleading claims about when we last synchronized.  It looks like there are some potential complicated failure modes in which IMAP sync will trigger the abort/deathback path but then we still might subsequently fire the success path.  While we will avoid double-firing the completion notification, we oddly would still call markSyncRange before bailing.  This might be able to explain a 1970 bug; some additional code-reading required.

An additional set of things to investigate/consider is the potential for sync requests to get backlogged in the case where the email app does not close itself at cronsyncs.  Also, a pathological (although hopefully impossible) situation would be that each sync request ends up with a basically dead TCP connection.  The sync asks for a connection, gets a dead one, then fails.  Our retry logic fires up a new connection which succeeds, but has no consumer.  That connection then goes stale, so the next time we fire up a sync request it gets a new stale connection.  While we likely do want to grow some retrying for sync (without reusing closures!), adopting a stale connection assumption as proposed by jrburke/mcav at least for 2.0 seems appropriate.

Existing failure log traces may ideally provide some quick feedback as it relates to these theories.  The fault injecting socket mock can probably also be quickly adapted to imitate stale connections and can give us our desired permutations.
Target Milestone: 2.0 S6 (18july) → 2.1 S1 (1aug)
Depends on: 1042190
Depends on: 1042197
Following investigations into the sync issues, we are converting this to a meta bug and will be nominating the dependent bugs instead as the individual units of work that will improve the sync behavior for 2.0.
blocking-b2g: 2.0+ → ---
Keywords: meta
Assignee: m → nobody
For tracking purposes: Bug 1042190 and bug 1042197 have both been landed, implementing fixes to IMAP connection handling discussed elsewhere in this thread. They are awaiting uplift to v2.0.
Target Milestone: 2.1 S1 (1aug) → 2.1 S3 (29aug)
blocking-b2g: --- → backlog
Whiteboard: [priority]
Target Milestone: 2.1 S3 (29aug) → ---
[priority] --> tracking-b2g:+ conversion
tracking-b2g: --- → +
We're still seeing bugs of the general family where periodic sync does not reliably work being filed.  Without logs it's hard to know what's wrong to fix it, so we've filed bug 1090481 to make it significantly more tractable for dogfooders/the email developers/etc. to be able to provide useful logs.

Since this bug is already more of a meta bug, I'm marking it as such and we're going to transfer the tracking+ status to bug 1090481 which we will work as soon as possible.  (Note: don't read anything into removing [priority], it was previously used for what tracking+ is used for now and is deprecated in the email component now.)
blocking-b2g: backlog → ---
tracking-b2g: + → ---
Summary: Email sync fails, never recovers without restart → [meta] Email sync fails, never recovers without restart
Whiteboard: [priority]
Firefox OS is not being worked on
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: