Not all of our code paths are sufficiently protected against unexpected failures using try/catch blocks. A lot of the core/expected/got-burnt-already paths have try/catch blocks, but there are always new ways for an exception to get thrown, escape our application logic, and end up in the general window.onerror handler, where it's too late for us to trace back what actually failed.

node.js has a concept of (error) domains at http://nodejs.org/api/domain.html that ensures any exception thrown by code in a domain goes back to a single handler, and domains get associated with the many callback entry points. For email, the goal is to ensure that when we get some TCPSocket I/O and our code throws inside the ondata handler, we are then able to say: "okay, the connection was in use by this job-op, so let's abort that job-op and possibly let it retry again in the future." We already have logic in place to eventually give up on a job-op because it keeps failing, and this new handling could inform that logic. (We would also want to give up more quickly in these situations.)

Things we may want to do as part of this logic:
- Be able to mark a message as "bad". We have an IMAP bug for this, bug 943802, but this is also appropriate for ActiveSync and POP3. The idea is that we may experience deterministic errors for certain messages, and it's better for all involved to just ignore those. This applies both to cases where we avoid trying to sync them (for IMAP it might be just the UID we treat as bad) and to things related to fetching the body. There could be corresponding UI for this. For example, for ActiveSync we are currently seeing complexities related to text/rtf messages.
- We might want a versioned concept of "bad". For Thunderbird's gloda we designed it so that bad messages would no longer be considered bad after an upgrade, since we might have addressed the underlying bug.
- Assisted error reporting of bad messages. When we activate this failsafe, potentially have some UI to help the user report this as a problem to us along with some additional meta-information, ideally sanitized. For example, the stack could be useful, and maybe we could provide excerpts of the source from that area (partners may fork).
Update; the tl;dr is: we have a gameplan. I am prioritizing the integration test reactivation to minimize bit-rot between the work for this bug and mcav's changes for us to use email.js, and because I believe this is uplift-eligible but the integration stuff impacts integration test infrastructure and so needs to land on v2.1 sooner rather than later. (Of particular bit-rot concern are changes to the back-end tests.)

Gameplan: I ran a survey of the prior art/existing libraries during the latter half of last week. Other than node domains and some Google Closure Library internals that did not seem directly suitable, there wasn't a lot that got even close to what we want. But then I found Kris Kowal's gtor opus at https://github.com/kriskowal/gtor. It includes a task abstraction implementation at http://kriskowal.github.io/gtor/docs/task.html, and it looks quite appealing when used in conjunction with us wrapping our entry points.

I think the right course of action is for generic APIs like setTimeout/setZeroTimeout to have a module exposing explicitly wrapped versions rather than transparently/magically wrapped setTimeout/etc. For many existing callback idioms we can just convert the call structure over to promises, which the implementation automatically wraps so that a throw becomes a rejection (modulo the bit where the success callback throwing won't trigger its twin errback). For TCPSocket, which is streamy and which we remote anyway, I think the answer is that it just knows about the task; I want to avoid changing semantics there, since the raw socket standard is also bringing us an explicit stream abstraction.
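The "explicitly wrapped versions" idea could look something like the following sketch; `makeTask`, `wrapCallback`, and `taskSetTimeout` are hypothetical names for illustration, not actual GELAM API:

```javascript
// A task records failures attributed to it, instead of letting an
// exception escape to a generic top-level handler.
function makeTask(name) {
  return {
    name,
    failure: null,
    onError(err) {
      // In GELAM this is where we'd abort the job-op holding the
      // connection and release its mutex; here we just record it.
      this.failure = err;
    },
  };
}

// Wrap a callback entry point so exceptions are caught and routed to
// the owning task rather than window.onerror.
function wrapCallback(task, fn) {
  return function (...args) {
    try {
      return fn.apply(this, args);
    } catch (err) {
      task.onError(err);
    }
  };
}

// An explicitly wrapped setTimeout, exported by a module instead of
// magically replacing the global.
function taskSetTimeout(task, fn, ms) {
  return setTimeout(wrapCallback(task, fn), ms);
}

const task = makeTask('sync:inbox');
const ondata = wrapCallback(task, (chunk) => {
  throw new Error('parse failure on chunk ' + chunk);
});
ondata(42); // the throw is routed to task.onError, not propagated
console.log(task.failure.message);
```

Because the wrapping is explicit, callers opt in per entry point, which avoids the transparent/magical global replacement the comment argues against.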
I think this ends up as a net win since it:
- sorta gets the error handling for free-ish: if something bad happens, as long as promises are linked into the task chain, all the wrappers are already in place to catch and reject;
- pays down some technical debt related to callbacks by just converting them to promises;
- has formal cancellation semantics that are less GELAM-specific;
- can help make some of the job/op implementations less weird and more composable. Instead of tossing callbacks around internally they could be separate functions, etc. (We've already started doing this with the platform abstractions and latch, of course.)

I am planning to limit the scope a bit, keep control flow consistent, and primarily limit changes to making sure that error cases are reasonable for humans to trace. So although gtor has a neat thing where PromiseQueues serve as semaphores that limit activity to N active things (and some other similar constructs), I don't think we want to go there yet. Especially because one of the reasons I avoided using promises early on in GELAM was that experience had shown the emergent control flow could be a nightmare, especially without "long" stacks.
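The "error handling for free-ish" point is just standard promise behavior: a throw inside a handler becomes a rejection observable at a single terminal catch. A tiny sketch (the command string and messages are invented for illustration):

```javascript
let aborted = null; // what the terminal .catch observed

// Once a callback call structure is converted to promises, a throw
// anywhere in the chain surfaces as a rejection at the one .catch at
// the end -- no per-callsite try/catch wrappers needed.
Promise.resolve('FETCH 1 BODY[]')
  .then((cmd) => {
    throw new Error('server hung up during ' + cmd);
  })
  .catch((err) => {
    aborted = err.message;
    console.log('job-op aborted:', err.message);
  });
```

Note the caveat from the comment still applies: a throw inside a success handler rejects the *returned* promise, so it is seen by downstream handlers, not by that same `.then`'s twin errback.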
Created attachment 8480862 [details] [review]
Link to Github pull-request:
https://github.com/mozilla-b2g/gaia-email-libs-and-more/pull/332

The real action happens in js/disaster-recovery.js. When cleaning up after a socket exception, we clean up any running job op, release any mutex, and/or close the socket itself. The unit test in test/unit/test_disaster_recovery.js covers the cases where (a) we have a mutex but no job op (e.g. during sync), and (b) we have a mutex and a job op that both need to be cleaned up.
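As a rough, non-authoritative sketch of the shape described above (the real logic lives in js/disaster-recovery.js in the pull request; these names, arguments, and structure are simplified guesses, not the actual code):

```javascript
// Simplified sketch: run a socket event handler under a guard so that
// an unexpected throw closes the socket and aborts the job op that
// owned the connection, instead of escaping to a global error handler.
const DisasterRecoverySketch = {
  catchSocketExceptions(account, socket, fn) {
    try {
      fn();
    } catch (err) {
      socket.close(); // closing also triggers the sync "deathback" path
      if (account.runningJobOp) {
        account.abortJobOp(err); // job cleanup also releases mutexes
      }
    }
  },
};

// Tiny fake objects to exercise the sketch:
const socket = { closed: false, close() { this.closed = true; } };
const account = {
  runningJobOp: { name: 'move' },
  abortedWith: null,
  abortJobOp(err) {
    this.abortedWith = err.message;
    this.runningJobOp = null;
  },
};

DisasterRecoverySketch.catchSocketExceptions(account, socket, () => {
  throw new Error('unexpected parse error');
});
console.log(socket.closed, account.abortedWith);
```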
Attachment #8480862 - Flags: review?(bugmail)
Comment on attachment 8480862 [details] [review]
Link to Github pull-request:
https://github.com/mozilla-b2g/gaia-email-libs-and-more/pull/332

r=asuth with the suggested fixes I pushed to your pull request.

First, awesome work, per usual. (And super-fast turnaround on this patch after landing your also awesome emailjs mega-patch!)

My primary concern without the review fixes is the multiplicity of mutexes: each job-op can acquire multiple mutexes (the move operation, for example), so tracking a single mutex per account doesn't quite work. My fix was to have disaster-recovery.js not bother tracking mutexes, since the postJobCleanup mechanism already tracked all outstanding mutexes and released them. (The pull request's move test was failing for me, and one of the reported problems was an attempt to double-release the mutex. Which promises/friends could address, of course, but... future!)

The mutexes, however, arguably do matter for sync purposes, since syncs are not jobs. But we run into other wrinkles there: it's entirely possible for a sync to be operating concurrently in a different folder (with a different mutex) from whatever the job-op is getting up to. The good news is that since catchSocketExceptions closes the connection, the "deathback" will be invoked and the failed sync will clean itself up correctly. And the tests verify this.

There is some potential for overkill where a sync failure could end up nuking a legitimate job-op, but I think that's an acceptable trade-off. Specifically, although we could try to ensure that the connection generating the error is associated with the job-op, we run into the multiplicity problem again: the move does hold two connections at the same time, and we don't have a trivially easy way to map from the connection to the job right now. I think we want that in a fancier future, but it's worth waiting on that until we can switch to doing more tasks/promises stuff.
Given how rare we expect these failures to be, overkill is not a major concern, but accidentally failing to clean up would defeat much of the purpose here.

To support those tests I made runMutexed's completion log via 'slog' and had us propagate errors to the release so we can verify the reason for the release. I tried to improve the feature set of LogChecker and, I think, to make its semantics conform more to the long-term plan. Also, it now logs the objects it's checking, which is nice if you get a failure. I think these changes make sense, but I'm open to alternate strategies.

I do appreciate that disaster-recovery was going in a somewhat more hygienic, hypervisor-type direction than the arguably more cooperative resource management of the job-ops' use of _accessFolderForMutation. But I think _accessFolderForMutation is actually pretty hygienic if you ignore our weird mix-in idiom. I do think we can be more hygienic in the task future, though. Assuming we still have mutexes (which I'd prefer not to), for example we could say: "gimme the mutex, this is my task, and here's optionally a promise that I can resolve to release the mutex before that. I know that you will do Promise.race() against both of them so that the mutex most def. gets released."

1: https://www.youtube.com/watch?v=RY7S6EgSlCI
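The Promise.race() idea at the end could be sketched like this; `runMutexedSketch` is a hypothetical stand-in for illustration, not the real runMutexed:

```javascript
let releases = 0; // count releases, to show it happens exactly once

// The holder hands over its task promise plus an optional early-release
// promise; whichever settles first releases the mutex, and the `held`
// flag guarantees a single release even if both eventually settle.
function runMutexedSketch(taskPromise, earlyRelease) {
  let held = true;
  const release = () => {
    if (held) {
      held = false;
      releases += 1;
    }
  };
  const candidates = [taskPromise];
  if (earlyRelease) candidates.push(earlyRelease);
  // Either outcome of the race -- resolution or rejection -- releases
  // the mutex, so it most definitely gets released.
  return Promise.race(candidates).then(
    (value) => { release(); return value; },
    (err) => { release(); throw err; }
  );
}

let result = null;
runMutexedSketch(Promise.resolve('op done')).then((v) => { result = v; });
```

The design point is that the mutex manager, not the holder, owns the release path, which sidesteps double-release bugs like the one seen in the move test.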
Attachment #8480862 - Flags: review?(bugmail) → review+
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Hi Andrew -

I know the fix for this bug has landed on master, but this bug will probably become an IOT blocker for some 2.0 devices, so I am wondering: is it possible to uplift the fix to 2.0 as well?

Thanks
Vance
(In reply to Vance Chen [:vchen][firstname.lastname@example.org] from comment #5)
> I know the fix for this bug has landed on master, but this bug will
> probably become an IOT blocker for some 2.0 devices, so I am wondering:
> is it possible to uplift the fix to 2.0 as well?

It's probably not something I'd suggest trying to do. This is a preventative fix that doesn't directly address any known bugs and is layered on top of other changes that really can't be uplifted to v2.0 without significant risk. Specific bug scenarios should likely be mitigated on their own or we need to make it possible to update the mail app via the marketplace.
(In reply to Andrew Sutherland [:asuth] from comment #6)
> (In reply to Vance Chen [:vchen][email@example.com] from comment #5)
> > I know the fix for this bug has landed on master, but this bug will
> > probably become an IOT blocker for some 2.0 devices, so I am wondering:
> > is it possible to uplift the fix to 2.0 as well?
>
> It's probably not something I'd suggest trying to do. This is a
> preventative fix that doesn't directly address any known bugs and is
> layered on top of other changes that really can't be uplifted to v2.0
> without significant risk. Specific bug scenarios should likely be
> mitigated on their own or we need to make it possible to update the mail
> app via the marketplace.

Hi Andrew

Understood. But the partner still wants to know if there is anything from this patch they can cherry-pick into their own 2.0 branch to enhance the error-handling mechanism. Could you kindly comment?

Thanks
Vance
(In reply to Vance Chen [:vchen][firstname.lastname@example.org] from comment #7)
> Understood. But the partner still wants to know if there is anything from
> this patch they can cherry-pick into their own 2.0 branch to enhance the
> error-handling mechanism. Could you kindly comment?

If the partner wants us to be able to support their v2.0-based email app, then they need to stick to what is in our v2.0 branch. They absolutely should not arbitrarily cherry-pick patches from v2.1 into v2.0; it's a bad idea. The email app is complex, and our various patches build on top of each other in non-trivial ways. What is preferable is that if they see failures in v2.0, they file a bug here against our v2.0 code-base, and we see what we can do on v2.0 specifically.