Closed Bug 1025727 Opened 11 years ago Closed 11 years ago

[email/cronysync] Failsafe abort/window.close after (multiple?) failed cronsync(s)

Tracking

(blocking-b2g:1.3+, b2g-v1.3 fixed, b2g-v1.3T fixed, b2g-v1.4 fixed, b2g-v2.0 fixed, b2g-v2.1 fixed)

Status:

RESOLVED FIXED

Milestone:

2.0 S5 (4july)

Project Flags:

blocking-b2g

1.3+

Tracking Flags:

Tracking

Status

b2g-v1.3

---

fixed

b2g-v1.3T

---

fixed

b2g-v1.4

---

fixed

b2g-v2.0

---

fixed

b2g-v2.1

---

fixed

People

(Reporter: asuth, Assigned: jrburke)

References

Details

(Whiteboard: [cert])

Attachments

(1 file)

GitHub pull request 11 years ago James Burke [:jrburke] 46 bytes, text/x-github-pull-request	mcav : review+ asuth : feedback+ bajaj : approval-gaia-v1.3+	Details \| Review

Andrew Sutherland [:asuth] (he/him)

Reporter

Description

•

11 years ago

In various bugs we're seeing the email app fail during a cronsync in a way that retains the Inbox mutex and never completes. The job/operation queue may also end up broken. The cronsync auto-window.close mechanism also never triggers (if applicable). The result is that the email app is broken for at least the Inbox until the user explicitly closes the email app or it gets reaped by the low-memory killer. Although we should endeavour to fix the root causes of the problems, as a fail-safe we should have the email app force a window.close if it's clear that the cronsync mechanism went off the rails. The benefits to this are: - Non-persisting probabilistic failures get recovered from. (By non-persisting I mean ones that don't screw up our DB state. The 1970 bug sometimes may be persisting.) - By closing, persistent/deterministic failures that reproduce will reproduce when the user tries to use the email app, which is also when they're likely to be trying to provide us with a logcat. In contrast, errors that happened in the past are likely to have been cycled out of the circular buffer. This makes determining the failure modes much more likely. We do want to avoid gratuitous message composition data-loss with this fix, but there's also serious potential for draft saving to be busted (as reported in bug 1018828), so any guard for that should be time-bounded too. Ideally we'll be in the background and can window.close without the user noticing so much.

Andrew Sutherland [:asuth] (he/him)

Reporter

Updated

•

11 years ago

Blocks: 1008046

Andrew Sutherland [:asuth] (he/him)

Reporter

Comment 1

•

11 years ago

Requesting 1.3? since this is believed to be the root cause of that current 1.3+ blocker bug.

blocking-b2g: --- → 1.3?

Andrew Sutherland [:asuth] (he/him)

Reporter

Comment 2

•

11 years ago

Gah, over copy-and-paste. This bug is believed to be why breakage ends up being near-permanent in cases where the email app does not get killed in the background or by the user.

James Burke [:jrburke]

Assignee

Updated

•

11 years ago

Assignee: nobody → jrburke

Target Milestone: --- → 2.0 S5 (4july)

Vance Chen [:vchen][skype:vance.lucida][vchen@mozilla.com]

Comment 3

•

11 years ago

Cert blocker for DT

James Burke [:jrburke]

Assignee

Comment 4

•

11 years ago

Attached file GitHub pull request — Details

This pull request introduces a failsafe close mechanism based on the cronsync mechanism. The bulk of the work is done in wake_locks, since we ideally need the wake locks still active if some other UI data actions is occurring, like a mid-stream draft save or message send, and the wake lock timeout gives us a strong signal that something has gone wrong. For the wake_lock timeout, it now uses a new function, closeIfNoDataOps, to decide if the app should be closed. If there are no in-process UI data actions, then it will proceed with closing the app if the app is currently in a hidden state. It was previously confirmed that the system will free any locks automatically for closed apps. If the app is visible when the wake lock timeout occurs, then the locks are just cleared, and we wait for the next sync pass or regular other-app use to OOM email. compose triggers uiDataOperationStart and uiDataOperationStop events to indicate in-process UI data actions: either saving a draft or sending a message. If there are in-process UI data actions, there is a 5 second grace period to finish before the failsafe kills the app. This is done because one of those UI data actions could be contributing to the reason sync has gone bad, and it is more important to just shut down and reset. The odds of reaching that state should be very low, the UI data actions will likely complete in time, but we need an ultimate failsafe pathway just in case. This will likely need to be revisited with background send. If it is an email send with a large attachment, that could also be a problem, but only if email is in the background when that happens. The most likely cause of getting into this state though, when sync fails so that this pathway is considered is if the network connections have gotten into a weird state, so sending is not likely to complete anyway. Tests were performed by putting in an early return in sync.js at the top of its api.oncronsyncstop function, at line 108. This made it appear that syncing did not complete. Then the following things were tried, using a 60 second test interval: * regular background sync with email closed. Saw the "force closing app" message and confirmed email was not running after that. * regular background sync with email, but then tapping email icon so it is visible. Saw "app still visible" and email kept running. * Changed compose to not emit the uiDataOperationStop event, then waited for a sync to start, opened a compose and saved draft, then went back to home screen. Saw the "waiting on email data operations" message with "saveDraft" mentioned, then 5 seconds later the app was closed. Asking mcav for the formal review, but f? asuth in case he has other feedback, hopefully less burden than a full review.

Attachment #8443760 - Flags: review?(m)

Attachment #8443760 - Flags: feedback?(bugmail)

Andrew Sutherland [:asuth] (he/him)

Reporter

Comment 5

•

11 years ago

From your detailed description of the patch (thank you!), this sounds like a very good strategy in terms of trying to avoid killing the app when it's visible and adding some protection related to draft saving and sending (which can fail because draft saving uses the operation queue which can get wedged; however "loss" of the inbox mutex because of a hung sync won't affect draft saving/etc.). I feel like a good balance is being struck, although we know that we'd like to generally clean things up a lot more in the future both with the heartbeats mcav has introduced for background send, plus better task tracking and ideally wrapping all async events so they are tracked to a specific task/operation that we'll abort if something bad happens. (ex: Since we already check out IMAP connections, any exception thrown while delivering the ondata event would be cause us to terminate the task much like connection loss currently causes us to trigger the "errback" of the current owner of connection.) This is likely a great topic of discussion for both us and our calendar peeps at our upcoming in-person meeting. Especially since mcav has just finished doing some related stuff for the sending time-outs, I think he is indeed a great/the right reviewer for this. I'd like to do a quick skim after mcav completes his review or does the bulk of his first review pass just to make sure we've considered any weird interaction patterns, but your testing and analysis seem quite thorough already.

Jason Smith [:jsmith]

Updated

•

11 years ago

blocking-b2g: 1.3? → 1.3+

Whiteboard: [cert]

Marcus Cavanaugh [:mcav]

Comment 6

•

11 years ago

Comment on attachment 8443760 [details] [review] GitHub pull request This seems okay from a failsafe perspective, so r+ for the purpose of uplift and being safer than we have been before. However, I think we should revisit the sync logic at some point; in working on the 1970-date-sync bug, I'm seeing some other sync-related weirdness that I don't fully understand yet. More specifically, cronsyncs occur at the correct intervals, but they never fire the "oncronsyncstop" signal. I don't currently know if that behavior results in any user-facing errors; I'll open/make a note in a more relevant bug when I can better distill what's going on. My observations there seem like a different issue than the one we're seeing here, since sync continues to work. So I think this is good to have, and the implementation seems sound and worth adding here, with a mental note that we should revisit sync at some point to make sure we don't cover over and forget about any underlying issues.

Attachment #8443760 - Flags: review?(m) → review+

Marcus Cavanaugh [:mcav]

Comment 7

•

11 years ago

Also, in terms of background send: It's worthwhile to have a failsafe mechanism like this, even if we have a flawless sync mechanism, just to make sure we don't hose ourselves in the future. This patch is compact and doesn't change much for background send. So, looking back at the thread here, I agree with Andrew that the workweek might be a good time to sanity-check ourselves with regard to making syncing a more robust/job-tracked/analyzable thing.

James Burke [:jrburke]

Assignee

Comment 8

•

11 years ago

:asuth indicated a desire to look at the patch one more time after :mcav's review. We'll give him a day to comment (or say he wants more time), but after that, if there is pressure to land as a cert blocker, then I will likely just land it as-is.

Flags: needinfo?(bugmail)

Andrew Sutherland [:asuth] (he/him)

Reporter

Comment 9

•

11 years ago

I should have completed my feedback pass by your tomorrow morning; expecting to review ~12-13 hours from now. Do land without me if I miss that.

Andrew Sutherland [:asuth] (he/him)

Reporter

Updated

•

11 years ago

Attachment #8443760 - Flags: feedback?(bugmail) → feedback+

Flags: needinfo?(bugmail)

James Burke [:jrburke]

Assignee

Comment 10

•

11 years ago

Merged in master: https://github.com/mozilla-b2g/gaia/commit/b4472a4633fdf5014e97500f1e50481dd01c76cf from pull request: https://github.com/mozilla-b2g/gaia/pull/20828

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

James Burke [:jrburke]

Assignee

Comment 11

•

11 years ago

Comment on attachment 8443760 [details] [review] GitHub pull request This bug is marked as a cert blocker. [Bug caused by] (feature/regressing bug #): Not a regression, but a failsafe shutdown if something goes really wrong. [User impact] if declined: User could get stuck with email unable to sync, not able to fetch new mail. [Testing completed]: See pull request comments: on device, by forcing some code conditions that would trigger the code pathway. [Risk to taking this patch] (and alternatives if risky): Low. It only activates if the wake lock timeout is triggered after 45 seconds, which means that the app is probably stuck anyway. [String changes made]: none

Attachment #8443760 - Flags: approval-gaia-v1.3?

bhavana bajaj [:bajaj]

Updated

•

11 years ago

Attachment #8443760 - Flags: approval-gaia-v1.3? → approval-gaia-v1.3+

Ryan VanderMeulen [:RyanVM]

Comment 12

•

11 years ago

v2.0: https://github.com/mozilla-b2g/gaia/commit/c80c6ae1b3214a60e8ac5b19e8206334737e3af3 v1.4: https://github.com/mozilla-b2g/gaia/commit/e68f589db4de9355c3ab1896e6f70d63d7d82809 v1.3: https://github.com/mozilla-b2g/gaia/commit/e5f7bbd6b8ee0b0014d2fef73a4ef59b620acaed

status-b2g-v1.3: --- → fixed

status-b2g-v1.3T: --- → affected

status-b2g-v1.4: --- → fixed

status-b2g-v2.0: --- → fixed

status-b2g-v2.1: --- → fixed

Ryan VanderMeulen [:RyanVM]

Comment 13

•

11 years ago

Reverted from v1.3 and v1.4 for Gaia integration test failures that James is already looking into. v1.4: https://github.com/mozilla-b2g/gaia/commit/297a069d987abe220dd4b925d44e52474c6a9670 v1.3: https://github.com/mozilla-b2g/gaia/commit/2090c498f4e6586e77b20a4ca2542dd3f0ca511a

status-b2g-v1.3: fixed → affected

status-b2g-v1.4: fixed → affected

James Burke [:jrburke]

Assignee

Comment 14

•

11 years ago

Fixed the merge issue: v1.4: https://github.com/mozilla-b2g/gaia/commit/b51199ee4238340c9229d37e382fb399e7fb0afe v1.3: https://github.com/mozilla-b2g/gaia/commit/4fc9eab905ea0951e7003cdce2c39d7dfc02f6ec

status-b2g-v1.3: affected → fixed

status-b2g-v1.4: affected → fixed

Ryan VanderMeulen [:RyanVM]

Updated

•

11 years ago

status-b2g-v1.3T: affected → fixed

James Burke [:jrburke]

Assignee

Updated

•

11 years ago

Blocks: 1031296

James Burke [:jrburke]

Assignee

Comment 15

•

11 years ago

Reverted v2.0 commit, needs similar manual merge as 1.3 and 1.4: https://github.com/mozilla-b2g/gaia/commit/032f436bb7ae1c5c21bf7f0c17271a4b004e5a0b Will apply a new one shortly.

status-b2g-v2.0: fixed → affected

James Burke [:jrburke]

Assignee

Comment 16

•

11 years ago

v2.0: https://github.com/mozilla-b2g/gaia/commit/b561073ac1c5d42fd1ea4d1497c04121e8250567

status-b2g-v2.0: affected → fixed

You need to log in before you can comment on or make changes to this bug.