Closed Bug 888807 Opened 11 years ago Closed 7 years ago

[email/backend] Refactor use of account operations and mutexes to avoid apparent UI delays or per-folder hangs

Categories

(Firefox OS Graveyard :: Gaia::E-Mail, defect, P3)

x86_64 Linux

Tracking

(tracking-b2g:+)

RESOLVED WONTFIX

People

(Reporter: asuth, Unassigned)

References

Details

(Keywords: perf, Whiteboard: [priority][p=13][c= p= s= u=][LibGLA,TD80220,WW,B])

User actions are translated into jobs / 'operations', which may contain a local component and a server component. I will refer to them as jobs for the remainder of this bug because it's less ambiguous. Jobs are scheduled (via MailUniverse._queueAccountOp) and tracked on a per-account basis. When user actions want to manipulate the state of messages, they must acquire a mutex on the (per-folder) FolderStorage they are interested in. These requests are serialized and issued in FIFO order. Both jobs and synchronization requests acquire mutexes in order to do their work. Only jobs are allowed to acquire more than one mutex at a time. This, and the current lack of support for cross-account operations (like moving messages between accounts), means there is no potential for deadlock from multiple actors acquiring overlapping sets of resources in an inconsistent ordering.

Job-running logic maintains two separate queues for jobs: local jobs and server jobs. Local jobs do not require network communication and can run as fast as local I/O allows *provided they don't have to wait for any resources like mutexes*. Server jobs require network I/O and inherently may block for long periods of time. Whenever it comes time to run a job, we take the first enqueued local job, or if there are none, the first enqueued server job. [Aside: Previously, local jobs did not serialize their execution and instead were run as quickly as possible. This had to change once message moves were implemented, since moves have complicated side-effects.]
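As a concrete illustration, here is a minimal sketch of that scheduling rule. All names here are hypothetical, not the actual gaia-email-libs code; the real job logic carries much more state.

```typescript
// Hypothetical sketch of the per-account scheduling rule described
// above; jobs are strictly serialized, and the first enqueued local
// job always wins over the first enqueued server job.
interface Job {
  run(done: () => void): void;
}

class AccountJobQueue {
  private localJobs: Job[] = [];
  private serverJobs: Job[] = [];
  private running = false;

  enqueue(job: Job, kind: 'local' | 'server'): void {
    (kind === 'local' ? this.localJobs : this.serverJobs).push(job);
    this.runNext();
  }

  private runNext(): void {
    if (this.running) {
      return; // only one job runs at a time per account
    }
    // Take the first enqueued local job, or if there are none,
    // the first enqueued server job.
    const job = this.localJobs.shift() ?? this.serverJobs.shift();
    if (!job) {
      return;
    }
    this.running = true;
    job.run(() => {
      this.running = false;
      this.runNext();
    });
  }
}
```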
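Similarly, a minimal sketch of the per-folder FIFO mutex acquisition described in the first paragraph (again hypothetical; the real FolderStorage implementation differs in detail):

```typescript
// Hypothetical per-folder FIFO mutex: waiters are serviced strictly
// in the order they requested the mutex.
type Release = () => void;

class FolderMutex {
  private waiters: Array<(release: Release) => void> = [];
  private held = false;

  acquire(waiter: (release: Release) => void): void {
    this.waiters.push(waiter);
    if (!this.held) {
      this.next();
    }
  }

  private next(): void {
    const waiter = this.waiters.shift();
    if (!waiter) {
      this.held = false;
      return;
    }
    this.held = true;
    // If a job dies without ever calling release(), everything queued
    // behind it hangs until the app restarts -- the failure mode this
    // bug wants to mitigate (e.g. with a dead-man timer).
    waiter(() => this.next());
  }
}
```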
Synchronization logic also acquires mutexes, both for online purposes (talking to the server to synchronize) and for local-only purposes (fetching message headers from the database). The net result of all of this synchronization is that local operations such as flagging a message or scrolling the list of messages may be held up by network I/O. Another serious problem is that if a job 'dies' so that it never completes and never releases its mutexes, the folders whose mutexes it holds are basically dead until the app is restarted. While it's hard to completely avoid that class of problem, we could at least fail less horribly when it occurs. (And then add a dead-man timer to mitigate.)

We don't want to throw everything away from this implementation; it's still important to serialize and order the execution of operations, if only so we don't parallelize a thousand things at once and cause an out-of-memory error that kills our app. These are the bad things we are afraid of that mutexes protect us from:

A) Retrieving message state, holding onto that perceived state, then yielding control flow and having some other asynchronous state change pull that state out from under us.

B) Disagreements about the set of messages in a folder between our offline logic and our synchronization logic. We are mainly worried that we might be performing a local message move between two folders while a synchronization is active in one of those folders. The (IMAP) sync logic, at a high level, figures out the difference between the set of messages it knows about in a given time range and what the server claims is there. The state of the database is currently latched when the sync for that time range is initiated; it doesn't expect messages to go away. The local move logic would probably also get upset if the sync logic messed with what it was touching.

It's important to note that our current approach to offline moved messages already plays poorly with sync (see https://bugzilla.mozilla.org/show_bug.cgi?id=839273#c2). The code was written assuming we wouldn't sync a folder until all offline moves targeting that folder had already been played, but we don't actually do that, so things just break. And that design decision is now clearly wrong anyway, because it means we have to delay our syncs.

I think the right solution for this particular case ('B') is:

- For the sync logic to ignore the speculatively moved header when syncing the target folder before the offline header has actually been completely moved.

- For the offline move logic to leave behind a 'tombstone' for the moved message, so that the online sync logic will not re-synchronize that message back into existence. The tombstone would be a specially marked header with no body.

- To make sure we don't end up with these special-case headers lying around all over the place. This would primarily be handled by nuking these headers when the server operation fails to actually perform the move. But we would also want a fail-safe (that logs a very loud error) that would do something like put a generation id on these speculative messages: whenever we fully clear out our job list, we'd bump the generation number, and whenever we see speculative messages of an older generation, we'd know they're an indication of a bug/failure and should be cleaned out. (See the sketch below.)

If we make the changes discussed in 'B', the back-end would be restructured to work like so:

- Local jobs are still run serialized amongst themselves, but they do not block on anything else. The source message of a local move is atomically converted into a tombstone by FolderStorage. The target message is added in such a way that sync logic doesn't really see it.

- Header fetches for scrolling and such run without blocking on anything.

- The FolderStorage mutex is converted into a server mutex for the folder. Both server jobs and folder synchronization logic need to acquire this mutex. This is a reasonable course of action because IMAP connections are stateful and can only be in one folder at a time, and we already use mutexes to help manage connection life-cycles. It is also a good idea because it helps eliminate potential state inconsistencies that could result if we overlapped things like flag mutations with an active synchronization. We would eventually converge, but in practice overlapping them seems needlessly confusing and unlikely to net us huge performance wins.

Because there will always be potential asynchrony relating to asynchronous fetching of database blocks / message headers, we will continue to maintain a strictly linearized view of database mutations on a per-folder basis. (See _deferredCalls on FolderStorage.) Also, we want to favor atomic header mutations, where FolderStorage triggers the mutation under its control so it can ensure stale state never gets written into the database, clobbering more up-to-date information. We may need to do this for the body too, although at first glance it seems like body mutations can just require the owner to be holding the server mutex. (Local moves still complicate things, since their new ability to run concurrently moots our previous invariant that we would have rename info available to us, etc. That requires more thought, but if we just detect the failure and retry, we will converge thanks to our rename/move maps and that previously valid guarantee.)
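To make the tombstone/generation idea from the 'B' solution concrete, here is a rough sketch; all names and shapes are hypothetical, and the real header representation is much richer than this:

```typescript
// Hypothetical shapes for the tombstone/generation fail-safe.
interface MessageHeader {
  suid: string;
  // A tombstone is a specially marked header with no body; sync in
  // the source folder skips it rather than re-synchronizing the
  // moved message back into existence.
  isTombstone?: boolean;
  // Fail-safe stamp for speculative (not-yet-server-confirmed)
  // headers created by offline moves.
  speculativeGeneration?: number;
}

let currentGeneration = 0;

// Bump the generation whenever the job list fully drains; any
// speculative header still carrying an older generation was left
// behind by a bug/failure and should be cleaned out (loudly).
function onJobListDrained(): void {
  currentGeneration += 1;
}

function isStaleSpeculativeHeader(header: MessageHeader): boolean {
  return header.speculativeGeneration !== undefined &&
    header.speculativeGeneration < currentGeneration;
}
```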
The other issues that come to mind that could fall out from this are:

- Cache flushing. Currently we flush the cache when the FolderStorage mutex gets released. Since local jobs won't acquire a mutex, we need a new way to make sure things get evicted. Since we will probably keep our helper logic that knows what folder the local job is working in, we can flush that folder's cache when we clean up the job; but we would first check whether the folder's server mutex is active, and if it is, avoid triggering the flush (see the first sketch below). (Note: because of the atomic modification logic noted above, we should ideally be immune to logic problems even if we go around flushing the cache willy-nilly; see source comments for more on that.)

- Database coherency. Our guarantees are actually a bit weak right now, but the general idea is that if we 1) have more things happening in parallel, but 2) are still using a shared database persistence model (our blocks), it's very possible for a local job and a server job to interleave in such a way that an atomic commit leaves one of them in an inconsistent state. This needs more thought, but it's worth noting that since the server is our ground truth, it's really a question of minimizing data loss of user actions that have not yet been reflected to the server. Thanks to our 'job'/operation model this is very feasible; we just need to make sure to start writing those out in their own transactions. The right thing here might be to queue up the persistence of the operation completion in the same database transaction in which the folder mutations it triggered also hit the disk (see the second sketch below).
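A minimal sketch of the flush rule from the first bullet; serverMutexHeld() and flushCache() are assumed helpers, not existing APIs:

```typescript
// Hypothetical flush-on-job-cleanup rule for a folder's cache.
interface FolderCacheOwner {
  serverMutexHeld(): boolean;
  flushCache(): void;
}

function onLocalJobCleanup(folder: FolderCacheOwner): void {
  // If a server job or sync currently holds the folder's server
  // mutex, defer; the eventual mutex release triggers the flush.
  if (folder.serverMutexHeld()) {
    return;
  }
  folder.flushCache();
}
```

And a rough sketch of the single-transaction idea from the second bullet, assuming hypothetical IndexedDB store names and record shapes:

```typescript
// Hypothetical: persist the operation-completion record in the same
// IndexedDB transaction as the folder block mutations it caused, so
// a crash cannot commit one without the other. The 'operations'
// store is assumed to use keyPath 'id'.
function persistOpCompletion(
  db: IDBDatabase,
  opRecord: { id: string; state: 'done' | 'failed' },
  dirtyBlocks: Array<{ blockId: string; data: unknown }>
): void {
  const txn = db.transaction(['operations', 'folderBlocks'], 'readwrite');
  // Either both writes hit disk, or neither does.
  txn.objectStore('operations').put(opRecord);
  const blocks = txn.objectStore('folderBlocks');
  for (const block of dirtyBlocks) {
    blocks.put(block.data, block.blockId);
  }
}
```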
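Note on the design choice in the second sketch: grouping the two writes does not make the database agree with the server; it only guarantees that our local record of "this operation completed" can never get ahead of (or fall behind) the folder state it produced, which is exactly the coherency window the bullet above worries about.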
Keywords: perf
Whiteboard: [c= p= s= u=][perf-reviewed]
Target Milestone: --- → 1.2 FC (16sep)
Whiteboard: [c= p= s= u=][perf-reviewed] → [c= p= s= u=]
Target Milestone: 1.2 FC (16sep) → ---
blocking-b2g: --- → backlog
Whiteboard: [c= p= s= u=] → [priority][p=13]
Whiteboard: [priority][p=13] → [priority][p=13][c= p= s= u=]
Priority: -- → P3
Whiteboard: [priority][p=13][c= p= s= u=] → [priority][p=13][c= p= s= u=][LibGLA,TD80220,WW,B]
[priority] --> tracking-b2g:+ conversion
tracking-b2g: --- → +
blocking-b2g: backlog → ---
Blocks: 1165275
Firefox OS is not being worked on
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX