Closed
Bug 1070429
Opened 10 years ago
Closed 10 years ago
[email/IMAP] getBody returning null, looks like possible fetch corruption [Problem handling message type: gotBody TypeError: body is null]
Categories
(Firefox OS Graveyard :: Gaia::E-Mail, defect)
Tracking
(blocking-b2g:2.1+, b2g-v2.0 affected, b2g-v2.1 fixed, b2g-v2.2 fixed)
People
(Reporter: asuth, Assigned: asuth)
References
Details
Attachments
(2 files)
On trunk (and probably v2.1) I've been seeing some errors of the form: 09-19 21:09:26.352 I/GeckoDump( 4105): ERR: Problem handling message type: gotBody TypeError: body is null The code in question is setting body.onChange in the getBody return. getBody returning null indicates a very bad problem. In this case, I see that the snippet appears to consist of the headers for the message, suggesting that there's a parser error of some type occurring. Previous times when this has happened, the phenomenon is that I get a valid body when the body is first fetched in its entirety, but the body is invalid on subsequent database loads. This would be consistent with the default stub body being valid, but things getting badly screwed up if the message parse doesn't work out right. Testing has been against gmail in these cases. I think I need to re-enable protocol dumps via slog somehow or use standard ArbPL loggers or something.
Assignee | ||
Updated•10 years ago
|
Summary: [email/IMAP] getBody returning null, looks like possible fetch corruption → [email/IMAP] getBody returning null, looks like possible fetch corruption [Problem handling message type: gotBody TypeError: body is null]
Assignee | ||
Comment 3•10 years ago
|
||
(blocking flag collision, nuking my comment) Gregor, can you indicate what domain/server type you experienced this error on? I've seen it on gmail so far, but I'd somewhat expect you to be using mozilla.com?
Flags: needinfo?(anygregor)
Comment 4•10 years ago
|
||
(In reply to Andrew Sutherland [:asuth] from comment #3) > (blocking flag collision, nuking my comment) > > Gregor, can you indicate what domain/server type you experienced this error > on? I've seen it on gmail so far, but I'd somewhat expect you to be using > mozilla.com? This is on my mozilla email. The strange thing is that I could see the content already. I probably closed the app and once I wanted to go back I couldn't get the content any more.
Flags: needinfo?(anygregor)
Comment 5•10 years ago
|
||
Blocker justification: broken functionality, needs to be fixed
blocking-b2g: 2.1? → 2.1+
Updated•10 years ago
|
Target Milestone: --- → 2.1 S6 (10oct)
Assignee | ||
Comment 6•10 years ago
|
||
I've determined the problem and have a fix in places and am currently cleaning up all the potentially similar hazards that exist while removing the footgun that found us here. == Root Cause The specific cause is that IMAP logic (both sync and downloadBodyReps) was issuing async database calls without specifying callbacks. Instead it was using runAfterDeferredCalls to try and make sure that its aggregate callback didn't fire until after all of those calls had succeeded. This would have been sound if runAfterDeferredCalls decided whether to operate based on _pendingLoads.length like all the internal functions do, but it didn't. Instead it was gated on _deferredCalls which only would get populated if a second database call was issued after a call when async. Generally speaking, this problem could result in the database being saved in a non-coherent state where the metadata block-info structures were partially updated but the actual body block data structure was not saved to disk. These are race conditions that I believe have existed since at least v1.1 (And maybe got uplifted to v1.0.x?), but is mitigated by various things. == Repro Details The repro case is basically this: - updateMessageBody is called on a body that is the only message in its block. This is most likely to occur for HTML messages or messages with at least 4k of content whose adjacent messages are similarly at least 4k so they will end up in their own blocks. - The adjacent block(s) are not in the block cache when updateMessageBody is called. What happens is: - We call deleteMessageBody followed by addMessageBody for simplicity in dealing with block sizing. (We didn't have to write delta-logic.) - Given the above, we end up nuking the block that contains the message. - When we go to add the message back, we find the adjacent block which we will try and insert it into. However, the adjacent block is not in memory so we need to load it from disk. At this point we have modified the metadata for the block we are about to load; it will indicate that the message body can be found there when in fact it is not there. Although !!_pendingLoads.length, !_deferredCalls.length, so downloadBodyReps calls its callback prematurely in this inconsistent state. - The save happens! - The block finishes loading and we insert the body, almost certainly resulting in a block split occurring. The block is appropriately marked as dirty and our in-memory state is consistent. If the app then quits, the problem happens because our database state was that inconsistent state. If the app stays open and the user does almost anything else (read a message, flag a message, trigger a sync), we will trigger an account save and the correct database state will be written out. == Mitigation, Expected Failure Cases In general we expect for this body download case: - we would "corrupt" at most one message per session. The user would only notice this if they go back to look at the message again later. - this to be most likely for users who manually close the email app or are on limited memory devices - this to happen much less on pre-v1.4 devices since we only made the body block eviction logic more aggressive starting in the infinite scroll patch (which we uplifted to v1.4) - this to be significantly more rare on pre-v1.3 devices since I only updated the block sizes to 8k from 32k in v1.3 as part of bug 871897 - this to not happen for users dealing with smaller message bodies Note that there are also might be periodic sync ramifications. Analytically that's a complex one to think about since there is just so much more going on there. The good news is that the "so much" means that runAfterDeferredCalls probably was doing the right thing there, and in the cases where it wasn't, we probably already had the blocks in memory anyways. == Fix Discussions I'm working on the trunk and v2.1 fix right now where I am removing our uses of runAfterDeferredCalls and the method itself and replacing them with proper callback dependencies. runAfterDeferredCalls was a footgun with sketchy semantics that usually required a "I swear I know what I'm doing" comment right before its use, or no-comment from cargo-culted uses from the before times. runAfterDeferredCalls was somewhat of a hack that grew out of time pressure, a lack of callbacks in the high-level FolderStorage operations, and a lack of allback.latch. (Note that POP3, being newer and giving us the great allback.latch(), was already doing the right thing, although it still cargo-culted the runAfterDeferredCalls use-case.) I'm torn on whether we should try and defer the manipulations of the bodyBlockInfo structure until the block has been loaded. In the updateMessageBody case, we already have a window of time where the body doesn't exist in the database at all, so the serialized database manipulation calls are still required for correctness, so I'm leaning towards no. (Noting that our gameplan to eliminate mutexes is about not blocking access on network communication; our serialized 'actor' model for the FolderStorage is still appropriate.) For v2.0 if we can take the fix, I think the right course of action is just a minimal change to runAfterDeferredCalls to check for _pendingLoads.length.
Assignee | ||
Comment 7•10 years ago
|
||
The core of this patch is removing runAfterDeferredCalls and replacing its usages with usages of allback.latch() (or similar) so that we properly depend on the actual database operations completing. This involved replacing next() idioms. In some cases, next() was accumulating an error flag, I added a helper function to allback to extract the first error it saw (and a unit test) to keep parity. This also meant that those uses of latch had to defer(name) instead of just using defer() which has potential GC ramifications since we'll hold onto return values somewhat, but I don't think there's any actual problems from that. The main things to keep in mind when sanity-checking latch is that latch.then() passes the aggregating object as its first arg, so care needs to be taken when using latch.then() with external callbacks that it does't get interpreted as an error/etc. I did manage to screw that up a few times, but the good news is the tests caught all of then. In many cases I've used bind to explicitly ensure that the specified N arguments to the callback are strongly controlled. I used bind because a lot of the time we were in a loop idiom where there was some argument that a lexical closure would not latch properly unless we were using "let". There's a spot or two where I may have also transformed an existing next() idiom where we weren't using runAfterDeferredCalls but where I was otherwise cleaning the file up and it seemed good to make progress on that too. I also transformed uses of self to 'this' via bind where the changes required were minimal and were already in hunks being modified. The only real ugly part is in searchfilter.js where the control flow was already ugly. I think the implementation is a slight improvement on what we had before, but we really need to clean up the iteration idiom to let bodies be provided in concert with headers more cleanly. (Maybe when we can use generators.) I ran the tests locally a bunch of times to make sure we got some run permutations (in case of races) and I'm feeling good that existing test coverage helps ensure there are no new regressions. Travis note: - activesync:fake failures are the gecko regression bug 1075302 whose re-landed fix has not yet made it to mozilla-central or builds. Other notable things in here which are of the testing/debugging-support variety: - The changes to test_folder_storage.js and its helper logic in folder_storage_shared.js consist of additional tightening of the test constraints because I initially thought we had a bug where we were failing to dirty pages rather than a race. It turns out that logic was already correct but it can't really heart to have those improved constraints. - New logged "generatePersistenceInfo" for super-verbose logging to indicate what dirty pages a save is writing to disk. - I updated mail-fakeservers with some slipstream activesync debugging support pre-emptively landed in https://github.com/mozilla-b2g/mail-fakeservers/pull/23 per the rationale I've been using where if you don't like the change I can fix it and just push a new build but this way travis and your testing are easier. This change was originally made for the bug 1075302 regression. - similar deal with jsas to give slightly better error messages when debugging things like bug 1075302. very trivial, already landed on https://github.com/mozilla-b2g/jsas/pull/20/files but not yet taken until we bump our git rev.
Attachment #8501593 -
Flags: review?(jrburke)
Assignee | ||
Comment 8•10 years ago
|
||
Assignee | ||
Comment 9•10 years ago
|
||
(In reply to Andrew Sutherland [:asuth] from comment #6) > I'm torn on whether we should try and defer the manipulations of the > bodyBlockInfo structure until the block has been loaded. In the > updateMessageBody case, we already have a window of time where the body > doesn't exist in the database at all, so the serialized database > manipulation calls are still required for correctness, so I'm leaning > towards no. (Noting that our gameplan to eliminate mutexes is about not > blocking access on network communication; our serialized 'actor' model for > the FolderStorage is still appropriate.) So :khuey potentially has an instance of header block corruption in another bug, and :jrburke raised some good points in his excellent review and while I did some hand-waving there, I'm now thinking I want to go ahead and do this. (Specifically, in my hand-waving I pointed out some other risks that are quite difficult to mitigate in small pieces, but we can address them fairly conclusively here.) Specifically, I am proposing that blockInfo manipulations would only occur at the same time that their corresponding block is updated. I need to look into how to make the code not particularly horrible. Because existing test coverage goes for covering all branches I think I'll probably update the existing test_folder_storage tests with the following idiom: - We stub out the block load so that we control it. - Perform a manipulation that impacts a block that is (allegedly) not currently in memory. - We check that _loadBlock was called and that the block info structures have not been altered yet. - We release the block. - We verify that now the block info structures have been altered. I believe this will address the saveAccountState hazards of periodic sync operating in conjunction with other user actions that I discuss on the review at https://github.com/mozilla-b2g/gaia-email-libs-and-more/pull/343#issuecomment-58448408
Assignee | ||
Comment 10•10 years ago
|
||
As discussed on IRC, I spent some time investing sync failures experienced by :khuey and determined that they do not seem to overlap with the problem here. As such, we're going ahead with this fix as it stands. :jrburke, I've pushed the revised logic and a corresponding test so testing/review can conclude. If you don't apply the mailslice.js changes in the patch the test will accordingly fail. (Or if you move the update application up a little. That's how I verified; I don't think you need to unless you really feel like it.) I implemented the check by relying on the new dirty-checking assertions I added; we now fake transfer the dirty blocks to another dictionary when resetting dirty state so a load step is compelled. The insertBody/deleteBody helper logic is then able to prove state equivalence of the block info structures before and after the insert call but before releasing the block-load through gratuitous use of JSON.stringify and string equivalence. I'm not sure the test contortions are entirely worth it, but it's in test support code and it does feel nice to be verifying these things since it reduces the "what could be going wrong in the database?" permutation space.
Comment 11•10 years ago
|
||
Comment on attachment 8501593 [details] [review] GELAM PR: remove runAfterDeferredCalls and replace usages with proper callback dependencies Tested the final state of this pull request on flame master (taking this pull request, then make install-into-gaia to create modified email app), trying an activesync and imap account. I was able to reproduce the getBody null with the imap account before applying the patch, then trying the same steps after the patch, it did not fail. The test was to open up an HTML email, then once loaded, close the email app, then reopen the app and try to view the message. I also tried a smoke test of search for both accounts. Those both worked similar to the current master state. I think I found a display bug in the activesync search case, but it also occurs on master, so I think it is just an event timing bug with vscroll, and that the email account only has two messages in the inbox. So I will file a separate bug for that once I get more details. In any case, not caused by this changeset.
Attachment #8501593 -
Flags: review?(jrburke) → review+
Assignee | ||
Comment 12•10 years ago
|
||
landed on gaia-email-libs-and-more/master: https://github.com/mozilla-b2g/gaia-email-libs-and-more/pull/343 https://github.com/mozilla-b2g/gaia-email-libs-and-more/commit/3ad8f0353340b0b98be1a7ab32f10d1f81c5ea2e landed on gaia/master: https://github.com/mozilla-b2g/gaia/pull/24927 https://github.com/mozilla-b2g/gaia/commit/717ad4e8b7fc10ab8248500d00ba5ba0977fa8ab
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 13•10 years ago
|
||
Comment on attachment 8501594 [details] [review] GAIA PR that is just the GELAM PR installed into gaia (propagating r=jrburke from the backend GELAM pull request; note that that's also where the automated test coverage is) [Approval Request Comment] [Bug caused by] (feature/regressing bug #): Longstanding bug. [User impact] if declined: Inability to view most recent message body/contents viewed prior to closing app. [Testing completed]: Back-end tests and manual testing by friendly humans. [Risk to taking this patch] (and alternatives if risky): Relatively little risk. Not taking this potentially could lead to a situation later on where if we need a v2.1 uplift that we end up having to take this later or end up in a scary amalgam that does not have back-end tests. [String changes made]: No strings touched.
Attachment #8501594 -
Flags: review+
Attachment #8501594 -
Flags: approval-gaia-v2.1?
Updated•10 years ago
|
Attachment #8501594 -
Flags: approval-gaia-v2.1? → approval-gaia-v2.1+
Comment 14•10 years ago
|
||
v2.1: https://github.com/mozilla-b2g/gaia/commit/7061d510208ae9552daba2e092e46266151a4e47
You need to log in
before you can comment on or make changes to this bug.
Description
•