Bug 1220406 (Closed): Opened 9 years ago, Closed 9 years ago

[email] Possible sync wedging when two syncs overlap

Categories

(Firefox OS Graveyard :: Gaia::E-Mail, defect)

Hardware: ARM
OS: Gonk (Firefox OS)
Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Tracking Status
b2g-v2.5 --- fixed
b2g-master --- fixed

People

(Reporter: jrburke, Assigned: jrburke)

Details

Attachments

(2 files)

If two different background syncs fire at nearly the same time, the app can fail to complete the cronsync.syncAccount callback for one of the accounts.

I have not log-instrumented syncAccount further yet, but plan to. My first pass just logged before syncAccount is called in cronsync and in its callback, to see whether it completes.

In a case that involved ActiveSync, the wedging may happen before the full set of network activity completes: in the success cases, that account would log a message like this:

"This site makes use of a SHA-1 Certificate; it's recommended you use certificates with signature algorithms that use hash functions stronger than SHA-1."

However, in the failure case, these messages were not seen.

I could also get it to occur with two IMAP accounts, one set to 30-second syncs and the other to 60-second syncs, when the syncs lined up, which was not often.

This situation is not dire due to the following mitigations:

* We have a wakelock timeout mechanism that kills the app in this case, and later syncing is fine (a rough sketch of that pattern is below, after this list).

* If another sync comes in before that wakelock timeout, syncing continues. It might be a bit wobbly on that first one, but after that, normal syncing continued for me.

* The syncs have to line up very closely. Now that mozAlarms is used for syncing, even if the user chooses an interval that is a multiple of another account's interval, the alarms do not fire at exactly the same time. Over time their firings do eventually cross, then drift apart again.

So the worst case in these tests seemed to be a missed sync or two, and only with rapid sync intervals that an end user is unlikely to be able to trigger.

Due to these mitigations, though, it can be hard to observe the failure case.
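
For illustration only, here is one plausible shape for that wakelock-plus-timeout safety net. This is not the actual Gaia email code; the timeout value and the finishSync name are invented for the example.

// Illustrative only: hold a cpu wakelock for the duration of the background
// sync; if the completion callback never fires before the timeout, release
// the lock and close the (background) app so a later alarm starts clean.
var lock = navigator.requestWakeLock('cpu');

var timeoutId = setTimeout(function() {
  lock.unlock();
  window.close(); // app is backgrounded; the next alarm relaunches it
}, 5 * 60 * 1000); // made-up timeout value

function finishSync() {
  clearTimeout(timeoutId);
  lock.unlock();
}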

Found during testing of the fix for bug 1219853. Since this state is reached in a part of the code unrelated to whether request-sync or mozAlarms is used, it seems to be a pre-existing issue unrelated to that bug's change.

Assigning to myself for now to get a more accurate diagnosis of the wedge condition.
I forgot the other mitigation: if the user sets multiple accounts to all sync at the same interval, our code already coalesces those into one alarm and one sync pass. I tested this with three accounts all on the same interval, two IMAP and one ActiveSync, and it worked fine.

So the issue only arises for two sync intervals that are different but where one divides evenly into the other, and where the mozAlarms for those two actually end up firing near the same time.
At some point not too long ago when mhoye was experiencing sync problems I had looked into things and thought we should be handling syncs running into each other / overlapping acceptably.  Which is not to contradict what you're seeing (it only takes 1 bug! ;), but more to say that we should also be ensuring that it's not just a lower level sync hanging.  In particular, XHR requests can conceivably end up waiting for their 30 second timeouts and/or might not be tracking errors the best.

Running with super-verbose realtime logging is likely the way to go when analyzing, although it does produce a lotta logcat.
After capturing logs, and with :asuth providing the insight, the issue is with the _killSlices/_activeSlices usage in GELAM's cronsync.js. Specifically:

* in onAlarm, _killSlices kills any active slices. This was an effort to prevent long-running/stuck syncs from zombifying and consuming resources.
* However, when a second sync comes in via the onAlarm call while the first, legitimate one is still running, it kills off that first sync, so the completion callback for the first sync never gets run (see the sketch after this list).
* This was not an issue for accounts on the same sync interval, since we group them and process them together after _killSlices has run.
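
Here is a heavily simplified sketch of the problematic flow. The onAlarm, _killSlices, and _activeSlices names mirror cronsync.js, but the bodies and the startSyncSlice helper are invented for illustration; this is not the real implementation.

var cronsync = {
  _activeSlices: [],

  _killSlices: function() {
    // Intended to reap long-running/stuck syncs so they do not zombify and
    // hold resources...
    this._activeSlices.forEach(function(slice) {
      slice.die(); // ...but a killed slice never fires its completion callback.
    });
    this._activeSlices = [];
  },

  onAlarm: function(accountIds) {
    // If a previous alarm's sync is still legitimately in flight, this kills
    // it, so the completion callback for that account never runs and the
    // whole sync pass looks wedged.
    this._killSlices();

    accountIds.forEach(function(id) {
      var slice = startSyncSlice(id); // hypothetical helper, not the real API
      slice.oncomplete = function() {
        // record the account as synced, release the wakelock, etc.
      };
      this._activeSlices.push(slice);
    }, this);
  }
};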

The motivations for the _killSlices logic are now met in other ways:

* There is now more extensive connection-level close-on-idle cleanup logic.
* Our wake_lock timeout will kill the app if background sync is really frozen, and the app has not been made visible to the user.

So the solution will be to just remove the _killSlices mechanism and the _activeSlices tracking.
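
In sketch form (same caveats as above: simplified, with an invented startSyncSlice helper), the direction of the fix is roughly:

var cronsync = {
  onAlarm: function(accountIds) {
    // No _killSlices / _activeSlices bookkeeping: an in-flight sync from an
    // earlier alarm is left alone and can complete normally. Truly stuck
    // syncs are handled by connection-level close-on-idle cleanup and by the
    // wakelock timeout.
    accountIds.forEach(function(id) {
      var slice = startSyncSlice(id); // hypothetical helper, not the real API
      slice.oncomplete = function() {
        // record the account as synced, release the wakelock, etc.
      };
    });
  }
};

The design choice is to let an overlapping alarm start its own sync pass alongside a still-running one, rather than preemptively killing the older pass.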
Attached file Gelam pull request
I goofed the pull request title at first, so I am manually attaching the pull request; notes are in the pull request.
Attachment #8682888 - Flags: review?(bugmail)
Comment on attachment 8682888 [details] [review]
Gelam pull request

Woo!
Attachment #8682888 - Flags: review?(bugmail) → review+
Comment on attachment 8682913 [details] [review]
[gaia] jrburke:bug1220406-email-bg-sync-wedge > mozilla-b2g:master

Carrying r+ from gelam to the gaia change since they match.
Attachment #8682913 - Flags: review+
[Blocking Requested - why for this release]:

Merged in gaia master:
https://github.com/mozilla-b2g/gaia/commit/4d4dd9591fdaed0bb20263fac8acae2cdeb9f718

Asking for blocking 2.5+, as I would like to uplift this for the 2.5 branch for better behavior when more than one account sync interval is involved. If it goes to 2.5+ then I will ask for approval.
Status: ASSIGNED → RESOLVED
blocking-b2g: --- → 2.5?
Closed: 9 years ago
Resolution: --- → FIXED
Looking now at https://github.com/mozilla-b2g/gaia/commits/v2.5, it looks like this change made it in, so I am removing the blocking flag. I also forgot that we can just ask for approval without the blocking flag.
blocking-b2g: 2.5? → ---