Closed
Bug 1025727
Opened 11 years ago
Closed 11 years ago
[email/cronysync] Failsafe abort/window.close after (multiple?) failed cronsync(s)
Categories
(Firefox OS Graveyard :: Gaia::E-Mail, defect)
Tracking
(blocking-b2g:1.3+, b2g-v1.3 fixed, b2g-v1.3T fixed, b2g-v1.4 fixed, b2g-v2.0 fixed, b2g-v2.1 fixed)
People
(Reporter: asuth, Assigned: jrburke)
References
Details
(Whiteboard: [cert])
Attachments
(1 file)
|
46 bytes,
text/x-github-pull-request
|
mcav
:
review+
asuth
:
feedback+
bajaj
:
approval-gaia-v1.3+
|
Details | Review |
In various bugs we're seeing the email app fail during a cronsync in a way that retains the Inbox mutex and never completes. The job/operation queue may also end up broken. The cronsync auto-window.close mechanism also never triggers (if applicable). The result is that the email app is broken for at least the Inbox until the user explicitly closes the email app or it gets reaped by the low-memory killer.
Although we should endeavour to fix the root causes of the problems, as a fail-safe we should have the email app force a window.close if it's clear that the cronsync mechanism went off the rails. The benefits to this are:
- Non-persisting probabilistic failures get recovered from. (By non-persisting I mean ones that don't screw up our DB state. The 1970 bug sometimes may be persisting.)
- By closing, persistent/deterministic failures that reproduce will reproduce when the user tries to use the email app, which is also when they're likely to be trying to provide us with a logcat. In contrast, errors that happened in the past are likely to have been cycled out of the circular buffer. This makes determining the failure modes much more likely.
We do want to avoid gratuitous message composition data-loss with this fix, but there's also serious potential for draft saving to be busted (as reported in bug 1018828), so any guard for that should be time-bounded too. Ideally we'll be in the background and can window.close without the user noticing so much.
| Reporter | ||
Comment 1•11 years ago
|
||
Requesting 1.3? since this is believed to be the root cause of that current 1.3+ blocker bug.
blocking-b2g: --- → 1.3?
| Reporter | ||
Comment 2•11 years ago
|
||
Gah, over copy-and-paste. This bug is believed to be why breakage ends up being near-permanent in cases where the email app does not get killed in the background or by the user.
| Assignee | ||
Updated•11 years ago
|
Assignee: nobody → jrburke
Target Milestone: --- → 2.0 S5 (4july)
Cert blocker for DT
| Assignee | ||
Comment 4•11 years ago
|
||
This pull request introduces a failsafe close mechanism based on the cronsync mechanism.
The bulk of the work is done in wake_locks, since we ideally need the wake locks still active if some other UI data actions is occurring, like a mid-stream draft save or message send, and the wake lock timeout gives us a strong signal that something has gone wrong.
For the wake_lock timeout, it now uses a new function, closeIfNoDataOps, to decide if the app should be closed. If there are no in-process UI data actions, then it will proceed with closing the app if the app is currently in a hidden state. It was previously confirmed that the system will free any locks automatically for closed apps.
If the app is visible when the wake lock timeout occurs, then the locks are just cleared, and we wait for the next sync pass or regular other-app use to OOM email.
compose triggers uiDataOperationStart and uiDataOperationStop events to indicate in-process UI data actions: either saving a draft or sending a message.
If there are in-process UI data actions, there is a 5 second grace period to finish before the failsafe kills the app. This is done because one of those UI data actions could be contributing to the reason sync has gone bad, and it is more important to just shut down and reset. The odds of reaching that state should be very low, the UI data actions will likely complete in time, but we need an ultimate failsafe pathway just in case.
This will likely need to be revisited with background send. If it is an email send with a large attachment, that could also be a problem, but only if email is in the background when that happens. The most likely cause of getting into this state though, when sync fails so that this pathway is considered is if the network connections have gotten into a weird state, so sending is not likely to complete anyway.
Tests were performed by putting in an early return in sync.js at the top of its api.oncronsyncstop function, at line 108. This made it appear that syncing did not complete. Then the following things were tried, using a 60 second test interval:
* regular background sync with email closed. Saw the "force closing app" message and confirmed email was not running after that.
* regular background sync with email, but then tapping email icon so it is visible. Saw "app still visible" and email kept running.
* Changed compose to not emit the uiDataOperationStop event, then waited for a sync to start, opened a compose and saved draft, then went back to home screen. Saw the "waiting on email data operations" message with "saveDraft" mentioned, then 5 seconds later the app was closed.
Asking mcav for the formal review, but f? asuth in case he has other feedback, hopefully less burden than a full review.
Attachment #8443760 -
Flags: review?(m)
Attachment #8443760 -
Flags: feedback?(bugmail)
| Reporter | ||
Comment 5•11 years ago
|
||
From your detailed description of the patch (thank you!), this sounds like a very good strategy in terms of trying to avoid killing the app when it's visible and adding some protection related to draft saving and sending (which can fail because draft saving uses the operation queue which can get wedged; however "loss" of the inbox mutex because of a hung sync won't affect draft saving/etc.).
I feel like a good balance is being struck, although we know that we'd like to generally clean things up a lot more in the future both with the heartbeats mcav has introduced for background send, plus better task tracking and ideally wrapping all async events so they are tracked to a specific task/operation that we'll abort if something bad happens. (ex: Since we already check out IMAP connections, any exception thrown while delivering the ondata event would be cause us to terminate the task much like connection loss currently causes us to trigger the "errback" of the current owner of connection.) This is likely a great topic of discussion for both us and our calendar peeps at our upcoming in-person meeting.
Especially since mcav has just finished doing some related stuff for the sending time-outs, I think he is indeed a great/the right reviewer for this. I'd like to do a quick skim after mcav completes his review or does the bulk of his first review pass just to make sure we've considered any weird interaction patterns, but your testing and analysis seem quite thorough already.
Updated•11 years ago
|
blocking-b2g: 1.3? → 1.3+
Whiteboard: [cert]
Comment 6•11 years ago
|
||
Comment on attachment 8443760 [details] [review]
GitHub pull request
This seems okay from a failsafe perspective, so r+ for the purpose of uplift and being safer than we have been before. However, I think we should revisit the sync logic at some point; in working on the 1970-date-sync bug, I'm seeing some other sync-related weirdness that I don't fully understand yet. More specifically, cronsyncs occur at the correct intervals, but they never fire the "oncronsyncstop" signal. I don't currently know if that behavior results in any user-facing errors; I'll open/make a note in a more relevant bug when I can better distill what's going on. My observations there seem like a different issue than the one we're seeing here, since sync continues to work.
So I think this is good to have, and the implementation seems sound and worth adding here, with a mental note that we should revisit sync at some point to make sure we don't cover over and forget about any underlying issues.
Attachment #8443760 -
Flags: review?(m) → review+
Comment 7•11 years ago
|
||
Also, in terms of background send: It's worthwhile to have a failsafe mechanism like this, even if we have a flawless sync mechanism, just to make sure we don't hose ourselves in the future. This patch is compact and doesn't change much for background send. So, looking back at the thread here, I agree with Andrew that the workweek might be a good time to sanity-check ourselves with regard to making syncing a more robust/job-tracked/analyzable thing.
| Assignee | ||
Comment 8•11 years ago
|
||
:asuth indicated a desire to look at the patch one more time after :mcav's review. We'll give him a day to comment (or say he wants more time), but after that, if there is pressure to land as a cert blocker, then I will likely just land it as-is.
Flags: needinfo?(bugmail)
| Reporter | ||
Comment 9•11 years ago
|
||
I should have completed my feedback pass by your tomorrow morning; expecting to review ~12-13 hours from now. Do land without me if I miss that.
| Reporter | ||
Updated•11 years ago
|
Attachment #8443760 -
Flags: feedback?(bugmail) → feedback+
Flags: needinfo?(bugmail)
| Assignee | ||
Comment 10•11 years ago
|
||
Merged in master:
https://github.com/mozilla-b2g/gaia/commit/b4472a4633fdf5014e97500f1e50481dd01c76cf
from pull request:
https://github.com/mozilla-b2g/gaia/pull/20828
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
| Assignee | ||
Comment 11•11 years ago
|
||
Comment on attachment 8443760 [details] [review]
GitHub pull request
This bug is marked as a cert blocker.
[Bug caused by] (feature/regressing bug #):
Not a regression, but a failsafe shutdown if something goes really wrong.
[User impact] if declined:
User could get stuck with email unable to sync, not able to fetch new mail.
[Testing completed]:
See pull request comments: on device, by forcing some code conditions that would trigger the code pathway.
[Risk to taking this patch] (and alternatives if risky):
Low. It only activates if the wake lock timeout is triggered after 45 seconds, which means that the app is probably stuck anyway.
[String changes made]:
none
Attachment #8443760 -
Flags: approval-gaia-v1.3?
Updated•11 years ago
|
Attachment #8443760 -
Flags: approval-gaia-v1.3? → approval-gaia-v1.3+
Comment 12•11 years ago
|
||
v2.0: https://github.com/mozilla-b2g/gaia/commit/c80c6ae1b3214a60e8ac5b19e8206334737e3af3
v1.4: https://github.com/mozilla-b2g/gaia/commit/e68f589db4de9355c3ab1896e6f70d63d7d82809
v1.3: https://github.com/mozilla-b2g/gaia/commit/e5f7bbd6b8ee0b0014d2fef73a4ef59b620acaed
status-b2g-v1.3:
--- → fixed
status-b2g-v1.3T:
--- → affected
status-b2g-v1.4:
--- → fixed
status-b2g-v2.0:
--- → fixed
status-b2g-v2.1:
--- → fixed
Comment 13•11 years ago
|
||
Reverted from v1.3 and v1.4 for Gaia integration test failures that James is already looking into.
v1.4: https://github.com/mozilla-b2g/gaia/commit/297a069d987abe220dd4b925d44e52474c6a9670
v1.3: https://github.com/mozilla-b2g/gaia/commit/2090c498f4e6586e77b20a4ca2542dd3f0ca511a
| Assignee | ||
Comment 14•11 years ago
|
||
Updated•11 years ago
|
| Assignee | ||
Comment 15•11 years ago
|
||
Reverted v2.0 commit, needs similar manual merge as 1.3 and 1.4:
https://github.com/mozilla-b2g/gaia/commit/032f436bb7ae1c5c21bf7f0c17271a4b004e5a0b
Will apply a new one shortly.
| Assignee | ||
Comment 16•11 years ago
|
||
You need to log in
before you can comment on or make changes to this bug.
Description
•