Closed Bug 998395 Opened 8 years ago Closed 8 years ago
[Tarako][email] OOM death/email killed from message scrolling with snippet fetching at a message density of 166 messages/day
109.44 KB, text/x-log
46 bytes, text/x-github-pull-request
|Details | Review|
2.31 KB, patch
|Details | Diff | Splinter Review|
48.30 KB, text/plain
1. launch email app 2. set it up for your mozilla/work email (or whatever has a lot of emails) 3. keep scrolling Expected: the app doesn't get killed Actual: the app gets killed Note: Gaia f0872318d46781bb59d0a13a2e1fffb115b8ca2b Gecko https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/32bb713e6d0d BuildID 20140418014001 Version 28.1 ro.build.version.incremental=eng.cltbld.20140418.052257 ro.build.date=Fri Apr 18 05:23:05 EDT 2014 Tarako
8 years ago
We can uplift the body block cache management improvements from the virtual scroll patch on bug 796474 which should help since the log shows we are triggering snippet fetches which will result in body blocks being touched and thereby cached. Between this and the snippet fetching rate of one batch every 2 seconds generating transient garbage, it probably gets bad. The log indicates per-day messages of 167 and 166 which I would classify as medium-high density.
Summary: [Tarako] Email app can get killed just browsing emails → [Tarako][email] OOM death/email killed from message scrolling with snippet fetching at a message density of 166 messages/day
(aside: your logcat somehow lacked the pretty ANSI color codes that look cool; if it's possible to tee the file or otherwise retain the color codes, it's appreciated for the time being since our logging overhaul doesn't happen until 2.0).
ni? asuth how risky for you think bug 796474 for tarako? since we do want to wrap up by the end of the month or should we consider something like limiting the number of emails in the email app? what's your suggestion if we want to limit the number of emails? Thanks
Bug 796474 itself cannot be taken on v1.3T; it is way too risky. The block cache management improvements are not risky and can be extracted and prepared as a v1.3T-specific patch. (We could also land it on v1.3 if we wanted, too, I suppose.)
hi Andrew, if you believe block cache management can help this bug, can we try it? Thanks
Sure, we can try it, but we're busy with blockers and this is not a blocker yet. (And in fact the code that we would uplift has not yet been reviewed; it's part of a 1.4 blocker.) I'll leave the needinfo active so this is on the shortlist to focus on as email hackers become available.
let's have bug 796474 reviewed and landed first and then we can discuss further on whether the block cache management part can be extracted and prepared for 1.3T. thanks
given that bug 796474 has landed, Andrew, is it possible to try the block cache management improvement on 1.3T? thanks
Yes, I will prepare a branch and give some numbers to help justify approving it for uplift. I'm nearly certain we'll want the change anyways, but the numbers will be very useful for any other 1.3T email decisions we have to make.
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Here's an uplift of the block handling patch. === Pre-patch baseline block usage I established a baseline / verified prior block caching badness using just debug logging here: https://github.com/asutherland/gaia/compare/mozilla-b2g:v1.3t...debug-report-block-usage?expand=1 Such a log looks like: WLOG: post flush blocks. header: 6 (27950 bytes) body: 20 (119155 bytes) Note that the byte size estimate is actually an on-disk size estimate. We expect memory usage to exceed that a bit. Also note that was basically the high water mark for just scrolling. Because of how the pre bug 796474 logic works, we effectively end up with an upper bound on the number of body blocks that can be in memory at a time. But if someone went through their list of messages and displayed them all in the reader, it could end up much worse. MUCH worse. Thankfully there's no next/previous UI in v1.3. === Post-patch block usage And then I ran with the fix and the debug logging on top: https://github.com/asutherland/gaia/compare/mozilla-b2g:v1.3t...v1.3t-reduce-block-mem-withlog?expand=1 Looks like: WLOG: post flush blocks. header: 9 (44720 bytes) body: 1 (7545 bytes) === Bad news We still die eventually. I ran a bunch of "MOZ_IGNORE_NUWA_PROCESS=1 tools/get_about_memory.py" invocations. What I basically found is: - Sometimes having to load a font really betrays us. We lose 2.88 MB of memory when we have to render Chinese characters. Bug 997007 tracks :ting's investigation of this. - What most seems to trigger the email app's death is just after performing "load more messages". This will tend to be a relative peak memory usage time since we've got the following stuff that's either recent garbage or not yet garbage: - all the mozTCPSocket stuff from the sync process - the dirty header and (tiny) body blocks in the worker, plus they end up on the main page too because we have to manually proxy our IndexedDB - we've got that IndexedDB transaction which includes both the serialization and then the SQLite transaction/dirty pages/etc and includes our aggregate bundle of block-infos for all folders which is going to be larger. - Second place killer is body fetching. This is also going to generate some about of IndexedDB overhead. === Land the uplift? Yeah, I think we should land the patch. It's a clear win and it has worst-case protection built-in. It might be buying us some time before we get killed in Naoki's use-case here, it's definitely going to help us survive longer if the user is reading all the messages they got. === What else can we do? Not a lot. We could figure out what the IndexedDB-related transient memory spike is but overhauling email's IndexedDB schema or IndexedDB's schema/etc. do not seem likely to be in the cards for v1.3t.
Attachment #8416326 - Flags: review+
Improved news, everyone! I was wondering what the SQLite transaction delta was so I made a little IndexedDB patch to log the data at the start/end of the transaction. Installing this resulted in all the b2g processes getting reset. I did not run any about:memory runs, meaning that any side-effecty bloat from that was avoided. I managed to scroll down through the ~220 already-synced messages, and then "load more messages" and have background snippet fetching happen while I scrolled through another ~230 to bring us up to 450 messages synchronized. We did, sadly, die at that point. I revised my IDB patch, turned off marionette support and rebooted. The next time we got to 621 messages Note that I was using an account that's basically just subscribed to dev-gaia, so message density is about ~20 messages a day, but our sync sizes ended up being somewhat in the same ballpark of what Naoki was seeing since we'd end up in the 24-59 range usually. Reported SQLite transactional memory swing (which is process-wide, not scoped to a connection) seemed to be 167K-ish most of the time, with the occasional example like the below where it was 246K. I'll attach the log in the same attachment. I was a little worried that a lot of our calls to saveAccountState do not wait for the write transaction to complete and so we might be causing build-up of non-trivial IndexedDB requests. However, from the logging it appears we're actually doing okay and my (naive) reading of the kernel state dump when it killed us is that there was no file-system write backlogs built up. (Those would in fact be hard to accumulate since IndexedDB-on-SQLite will be triggering fsync's.)
Note that I realize I misspoke about our sync sizes being comparable to what Naoki was testing with. sync sizes of 60 are not comparable to sync sizes of 166 given the limited amount of memory we have available. So I tested with a test account furnished by :doliver that's subscribed to an awful lot of bugzilla. It had 667 messages in the last day as of sync time. I triggered sync immediately following account setup. We managed to get to SYNCDONE, but the UI (which unfortunately receives all 667 of them on v1.3 before deciding it really only needs about 45 of them) got pretty badly swamped. We managed to get the writes to IndexedDB issued but ended up dead and so our transaction got backed out. Hilariously, the cookie cache made it seem like the sync had succeeded when I reopened the email app. Since the sync was an initial sync again we did a 3 day fetch that bisected down to one day and now wanted 670 messages. This sync succeeded and checkpointed to the database. However, the sync then proceeded to try and synchronize the day before that while cramming the results into the UI. The sync died. I reopened the app again; now sane scalability happened. I had all 670 messages and could even trigger a refresh. The neat thing is that I think v2.0 or v1.4 on Tarako might actually be able to survive that situation, especially if we optimize the virtual list in v2.0 to shrink more or have an explicit fixed window size around an anchorpoint.
triage: 1.3T+ ni? Andrew, is this ready to land? i supposed this is [tarako_only]
blocking-b2g: 1.3T? → 1.3T+
Over 700 mails to test OOM issue. [Environment] Gaia dbc2b00f538df1d93960794fbb52aa84f6f19483 Gecko https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/3b6c97b5fe8b BuildID 20140514164005 Version 28.1 ro.build.version.incremental=eng.cltbld.20140514.202602 ro.build.date=Wed May 14 20:26:10 EDT 2014 [Result] PASS
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.