[Tarako][email] OOM death/email killed from message scrolling with snippet fetching at a message density of 166 messages/day

VERIFIED FIXED in 2.0 S1 (9may)

Status

defect
VERIFIED FIXED
5 years ago
5 years ago

People

(Reporter: nhirata, Assigned: asuth)

Tracking

unspecified
2.0 S1 (9may)
ARM
Gonk (Firefox OS)
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:1.3T+, b2g-v1.3T fixed, b2g-v1.4 unaffected, b2g-v2.0 unaffected)

Details

(Whiteboard: [tarako_only][c= p=1])

Attachments

(4 attachments)

Posted file email.log
1. launch email app
2. set it up for your mozilla/work email (or whatever has a lot of emails)
3. keep scrolling

Expected: the app doesn't get killed
Actual: the app gets killed

Note: 
Gaia      f0872318d46781bb59d0a13a2e1fffb115b8ca2b
Gecko     https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/32bb713e6d0d
BuildID   20140418014001
Version   28.1
ro.build.version.incremental=eng.cltbld.20140418.052257
ro.build.date=Fri Apr 18 05:23:05 EDT 2014
Tarako
blocking-b2g: --- → 1.3T?
Summary: [Tarako] Email app can get LMK just browsing emails → [Tarako] Email app can get killed just browsing emails
We can uplift the body block cache management improvements from the virtual scroll patch on bug 796474 which should help since the log shows we are triggering snippet fetches which will result in body blocks being touched and thereby cached.  Between this and the snippet fetching rate of one batch every 2 seconds generating transient garbage, it probably gets bad.

The log indicates per-day messages of 167 and 166 which I would classify as medium-high density.
Summary: [Tarako] Email app can get killed just browsing emails → [Tarako][email] OOM death/email killed from message scrolling with snippet fetching at a message density of 166 messages/day
(aside: your logcat somehow lacked the pretty ANSI color codes that look cool; if it's possible to tee the file or otherwise retain the color codes, it's appreciated for the time being since our logging overhaul doesn't happen until 2.0).
ni? asuth
how risky for you think bug 796474 for tarako? since we do want to wrap up by the end of the month
or should we consider something like limiting the number of emails in the email app? 
what's your suggestion if we want to limit the number of emails? 
Thanks
Flags: needinfo?(bugmail)
Bug 796474 itself cannot be taken on v1.3T; it is way too risky.  The block cache management improvements are not risky and can be extracted and prepared as a v1.3T-specific patch.  (We could also land it on v1.3 if we wanted, too, I suppose.)
Flags: needinfo?(bugmail)
hi Andrew, if you believe block cache management can help this bug, can we try it? Thanks
Flags: needinfo?(bugmail)
Sure, we can try it, but we're busy with blockers and this is not a blocker yet.  (And in fact the code that we would uplift has not yet been reviewed; it's part of a 1.4 blocker.)  I'll leave the needinfo active so this is on the shortlist to focus on as email hackers become available.
let's have bug 796474 reviewed and landed first and then we can discuss further on whether the block cache management part can be extracted and prepared for 1.3T. thanks
given that bug 796474 has landed, Andrew, is it possible to try the block cache management improvement on 1.3T? thanks
Yes, I will prepare a branch and give some numbers to help justify approving it for uplift.  I'm nearly certain we'll want the change anyways, but the numbers will be very useful for any other 1.3T email decisions we have to make.
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Flags: needinfo?(bugmail)
Here's an uplift of the block handling patch.

=== Pre-patch baseline block usage
I established a baseline / verified prior block caching badness using just debug logging here:
https://github.com/asutherland/gaia/compare/mozilla-b2g:v1.3t...debug-report-block-usage?expand=1

Such a log looks like:
  WLOG: post flush blocks. header: 6 (27950 bytes) body: 20 (119155 bytes)

Note that the byte size estimate is actually an on-disk size estimate.  We expect memory usage to exceed that a bit.

Also note that was basically the high water mark for just scrolling.  Because of how the pre bug 796474 logic works, we effectively end up with an upper bound on the number of body blocks that can be in memory at a time.  But if someone went through their list of messages and displayed them all in the reader, it could end up much worse.  MUCH worse.  Thankfully there's no next/previous UI in v1.3.

=== Post-patch block usage
And then I ran with the fix and the debug logging on top:
https://github.com/asutherland/gaia/compare/mozilla-b2g:v1.3t...v1.3t-reduce-block-mem-withlog?expand=1

Looks like:
  WLOG: post flush blocks. header: 9 (44720 bytes) body: 1 (7545 bytes)


=== Bad news
We still die eventually.  I ran a bunch of "MOZ_IGNORE_NUWA_PROCESS=1 tools/get_about_memory.py" invocations.  What I basically found is:

- Sometimes having to load a font really betrays us.  We lose 2.88 MB of memory when we have to render Chinese characters.  Bug 997007 tracks :ting's investigation of this.

- What most seems to trigger the email app's death is just after performing "load more messages".  This will tend to be a relative peak memory usage time since we've got the following stuff that's either recent garbage or not yet garbage:
  - all the mozTCPSocket stuff from the sync process
  - the dirty header and (tiny) body blocks in the worker, plus they end up on the main page too because we have to manually proxy our IndexedDB
  - we've got that IndexedDB transaction which includes both the serialization and then the SQLite transaction/dirty pages/etc and includes our aggregate bundle of block-infos for all folders which is going to be larger.

- Second place killer is body fetching.  This is also going to generate some about of IndexedDB overhead.


=== Land the uplift?
Yeah, I think we should land the patch.  It's a clear win and it has worst-case protection built-in.  It might be buying us some time before we get killed in Naoki's use-case here, it's definitely going to help us survive longer if the user is reading all the messages they got.


=== What else can we do?
Not a lot.  We could figure out what the IndexedDB-related transient memory spike is but overhauling email's IndexedDB schema or IndexedDB's schema/etc. do not seem likely to be in the cards for v1.3t.
Attachment #8416326 - Flags: review+
Improved news, everyone!

I was wondering what the SQLite transaction delta was so I made a little IndexedDB patch to log the data at the start/end of the transaction.  Installing this resulted in all the b2g processes getting reset.  I did not run any about:memory runs, meaning that any side-effecty bloat from that was avoided.

I managed to scroll down through the ~220 already-synced messages, and then "load more messages" and have background snippet fetching happen while I scrolled through another ~230 to bring us up to 450 messages synchronized.  We did, sadly, die at that point.  I revised my IDB patch, turned off marionette support and rebooted.  The next time we got to 621 messages

Note that I was using an account that's basically just subscribed to dev-gaia, so message density is about ~20 messages a day, but our sync sizes ended up being somewhat in the same ballpark of what Naoki was seeing since we'd end up in the 24-59 range usually.

Reported SQLite transactional memory swing (which is process-wide, not scoped to a connection) seemed to be 167K-ish most of the time, with the occasional example like the below where it was 246K.  I'll attach the log in the same attachment.

I was a little worried that a lot of our calls to saveAccountState do not wait for the write transaction to complete and so we might be causing build-up of non-trivial IndexedDB requests.  However, from the logging it appears we're actually doing okay and my (naive) reading of the kernel state dump when it killed us is that there was no file-system write backlogs built up.  (Those would in fact be hard to accumulate since IndexedDB-on-SQLite will be triggering fsync's.)
Note that I realize I misspoke about our sync sizes being comparable to what Naoki was testing with.  sync sizes of 60 are not comparable to sync sizes of 166 given the limited amount of memory we have available.

So I tested with a test account furnished by :doliver that's subscribed to an awful lot of bugzilla.

It had 667 messages in the last day as of sync time.  I triggered sync immediately following account setup.  We managed to get to SYNCDONE, but the UI (which unfortunately receives all 667 of them on v1.3 before deciding it really only needs about 45 of them) got pretty badly swamped.  We managed to get the writes to IndexedDB issued but ended up dead and so our transaction got backed out.

Hilariously, the cookie cache made it seem like the sync had succeeded when I reopened the email app.  Since the sync was an initial sync again we did a 3 day fetch that bisected down to one day and now wanted 670 messages.  This sync succeeded and checkpointed to the database.  However, the sync then proceeded to try and synchronize the day before that while cramming the results into the UI.  The sync died.

I reopened the app again; now sane scalability happened.  I had all 670 messages and could even trigger a refresh.  The neat thing is that I think v2.0 or v1.4 on Tarako might actually be able to survive that situation, especially if we optimize the virtual list in v2.0 to shrink more or have an explicit fixed window size around an anchorpoint.
Blocks: 1004326
Target Milestone: --- → 2.0 S1 (9may)
triage: 1.3T+
ni? Andrew, is this ready to land? i supposed this is [tarako_only]
blocking-b2g: 1.3T? → 1.3T+
Whiteboard: [tarako_only]
Flags: needinfo?(bugmail)
landed on v1.3t:
https://github.com/mozilla-b2g/gaia/pull/18875
https://github.com/mozilla-b2g/gaia/commit/3492deabe5db9e1f96cb44f0e4f727f0e1c244ac
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(bugmail)
Resolution: --- → FIXED
Whiteboard: [tarako_only] → [tarako_only][c= p=1]
Over 700 mails to test OOM issue.

[Environment]
Gaia      dbc2b00f538df1d93960794fbb52aa84f6f19483
Gecko     https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/3b6c97b5fe8b
BuildID   20140514164005
Version   28.1
ro.build.version.incremental=eng.cltbld.20140514.202602
ro.build.date=Wed May 14 20:26:10 EDT 2014

[Result]
PASS
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.