Closed Bug 1145473 Opened 6 years ago Closed 6 years ago

[email/backend] MailAPI's bridge-processing logic may break the UI if a sequence of events involving dead slices and contact lookups interacts. ["Mismatched message completion!"]

Categories

(Firefox OS Graveyard :: Gaia::E-Mail, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(b2g-v2.2 fixed, b2g-master fixed)

RESOLVED FIXED
2.2 S12 (15may)
Tracking Status
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: asuth, Assigned: mcav)

Details

Attachments

(5 files, 1 obsolete file)

Gregor experienced total breakage of the email UI in a debug build and provided a log.  The log shows this most-smoking gun:
  ERR: onerror reporting: Error: Mismatched message completion! @ app://email.gaiamobile.org/js/ext/main-frame-setup.js : 1

Once we see this error, the back-end will never be able to respond to any of the front-end's requests.  But the front-end will be able to seem like it's working enough to trick people.  Happily, the problem is resolved by restarting the mail app.  And since we effectively leak everything the back-end tells us until we get OOM-killed, this will eventually happen as 't' goes to infinity.

This was more subtle than it first appeared because it also turned out that the usually harmless "batchSlice Error: Received message about nonexistent slice: 8" errors can in fact be serious errors but I of course initially discounted them since they are harmless on their own when at the top-level.

In a nutshell, there are two interacting problems:

1) When the contacts lookup completes, it synchronously calls all of its callbacks in a row, one of those callbacks is _doneProcessingMessage and then it synchronously spins a call to _processMessage.  (The good news is no horrible recursion-of-doom happens or can happen.)  It clears its callback list after issuing all the calls.

2) All of the error-handling code along these code paths looks like it logs and safely returns control flow but it's lies.  LIES.
  - _recv_batchSlice calls unexpectedBridgeDataError and then does "return true".
  - _processMessage has an explicit catch and calls internalError and then returns.  (So surely even if _recv_batchSlice somehow threw...)
  - both internalError and unexpectedBridgeDataError are aliases to reportError which calls console.error and throws an exception.
  - So if we receive a slice splice that the ContactCache is not able to resolve synchronously and then we receive a slice splice for a dead slice before the contacts API completes, that dead slice splice events breaks the ContactCache callback loop; it will fail to clear its list of callbacks and the _processMessage state machine invariant will be broken the next time a valid slice splice is received that needs ContactCache to do some async stuff, resulting in the "Mismatched message completion" error we see.
  - As the author of this error reporting footgun, I helpfully included a prescient comment that indicated we probably want different behaviors under different runtime situations.  I got wrong the part about "By funneling all errors through one spot, we help reduce inadvertent breakage later on."  It's arguably a bad idea to hide a "throw" in a helper function, although I'm almost certain my intent was that the "throw" only be active during unit testing as a workaround for the limitation that we did not instrument the front-end/mailapi.js with loggest logging infrastructure which knows how to fail the unit test.  (And I avoided boobytrapping console.error.)  The one consolation to my ego is that this specific bug was introduced significantly later by the batching logic retrofit that was the victim of these heinous error reporting footguns.


A minimal non-test fix for this is simple enough; we just need to not throw exceptions in reportError and audit that there is no code in there whose correctness depends on a throw.  I just did the audit.  All the code is erroneously assuming the error methods are not footguns.  It's safe.

The slightly more complex question is what's the best way to do this without hurting the back-end unit test coverage.  Options would seem to be:
- Some kind of build-time preprocessing.  Risky/dangerous/ugly.
- Put it in a module and use require.js magic to use a testing-friendly one only at test time and a safe one at runtime.  (The error functions wildly pre-date the use of AMD modules for main-page-context code...)
- Have the testing infrastructure monkeypatch/otherwise meddle to make those functions testing-friendly.

If we're doing testing-specific stuff, using an explicit loggest logger (probably a lazy logger) to explicitly trigger an error would actually give us a nice testing coverage improvement.


We want to fix this soon to get the fix on v2.2, but I'd like to stew on this for a day or two before pursuing a fix.  Thoughts from :jrburke and :mcav appreciated.  NI'ing myself to make sure it doesn't fall off my radar.
Flags: needinfo?(bugmail)
Flags: needinfo?(bugmail)
Summary: [email/backend] The "Received message about nonexistence slice" error message message-processing state machine with "Mismatched message completion!" → [email/backend] MailAPI's bridge-processing logic may break the UI if a sequence of events involving dead slices and contact lookups interacts. ["Mismatched message completion!"]
Flags: needinfo?(bugmail)
Assignee: nobody → m
Given uplift considerations, I'd like to propose the following and see if this solution makes sense to you:

Logic already includes a function called `logic.fail(err)`, which fails the currently-running test with the given error -- and only logs an error if not in a test. Essentially, that's the behavior we'd like to see here; a require.js solution and/or monkeypatch would just duplicate that behavior.

However, since 2.2 doesn't include logic, we can't just uplift that. But for production code (i.e. anyone not running the GELAM tests directly), we're never in a test; we merely want to remove the throw.

I suggest we:

- For master, hook in `logic.fail()` in `reportError()`, so that we fail when in a test, but only log an error in production. We do _not_ uplift this one.

- For v2.2, create a separate patch that just replaces/removes the throw and does nothing else. Since we don't include GELAM tests in Gaia and no one will go to the trouble to run obsolete GELAM unit tests, the relevant `logic`-failing portions of the test would be excluded from Gaia anyway.

Thoughts?
Flags: needinfo?(bugmail)
[apparently bugzilla doesn't let me clear and reset the needinfo flag in one go.]
Flags: needinfo?(bugmail)
Sounds good!
Flags: needinfo?(bugmail)
Attachment #8600009 - Flags: review?(bugmail)
Attachment #8600010 - Flags: review?(bugmail)
Attachment #8600010 - Flags: review?(bugmail) → review+
Comment on attachment 8600009 [details] [review]
[gaia-email-libs-and-more] mcav:reportError-master > mozilla-b2g:master

Clearing review until dependency issue is addressed.  Maybe temporarily create a synthetic failure and briefly test on device/in b2g-desktop.
Attachment #8600009 - Flags: review?(bugmail)
Comment on attachment 8600009 [details] [review]
[gaia-email-libs-and-more] mcav:reportError-master > mozilla-b2g:master

Guilty as charged; I was being sloppy. I deserve an r- for that one; sorry.

As penance I've added a test that verifies the behavior change, using a fake message passed to MailAPI (which calls reportError if it sees an invalid message). I had to make a couple further changes to ensure `logic.fail` triggers when called immediately during the test run, as well as a couple module relativization changes to satisfy the Gaia build / require.js.

The app appears to work correctly on-device with these changes.
Attachment #8600009 - Flags: review?(bugmail)
Comment on attachment 8600009 [details] [review]
[gaia-email-libs-and-more] mcav:reportError-master > mozilla-b2g:master

Excellent, thanks.  You've successfully turned that r- into a r++.
Attachment #8600009 - Flags: review?(bugmail) → review+
Attachment #8600010 - Attachment is obsolete: true
Green TC result for both Gaia and GELAM.

gaia master: https://github.com/mozilla-b2g/gaia/commit/1e655fa7c16707cd1c6ab84a2b9d9c984b8f35c5
gelam master: https://github.com/mozilla-b2g/gaia-email-libs-and-more/commit/81386a419ffed42f08a0434ac019a3faefa0e310
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Comment on attachment 8600421 [details] [review]
[gaia] mcav:reportError-v2.2 > mozilla-b2g:v2.2

[Approval Request Comment]
[Bug caused by] (feature/regressing bug #): always present
[User impact] if declined:

In a race condition, our error handling logic may throw an exception that breaks the state of the app until a subsequent restart.

[Testing completed]:

manual + automated


[Risk to taking this patch] (and alternatives if risky):

Super low risk; just removing a throw in an error-handling function.

[String changes made]:
Attachment #8600421 - Flags: approval-gaia-v2.2?
Attachment #8600421 - Flags: approval-gaia-v2.2? → approval-gaia-v2.2+
Target Milestone: --- → 2.2 S12 (15may)
You need to log in before you can comment on or make changes to this bug.