Closed Bug 998395 Opened 10 years ago Closed 10 years ago

[Tarako][email] OOM death/email killed from message scrolling with snippet fetching at a message density of 166 messages/day

Tracking

(blocking-b2g:1.3T+, b2g-v1.3T fixed, b2g-v1.4 unaffected, b2g-v2.0 unaffected)

Status:

VERIFIED FIXED

Milestone:

2.0 S1 (9may)

Project Flags:

blocking-b2g

1.3T+

Tracking Flags:

Tracking

Status

b2g-v1.3T

---

fixed

b2g-v1.4

---

unaffected

b2g-v2.0

---

unaffected

People

(Reporter: nhirata, Assigned: asuth)

References

Details

(Whiteboard: [tarako_only][c= p=1])

Attachments

(4 files)

email.log 10 years ago Naoki Hirata :nhirata (please use needinfo instead of cc) 109.44 KB, text/x-log		Details
Uplift of improved block handling 10 years ago Andrew Sutherland [:asuth] (he/him) 46 bytes, text/x-github-pull-request	asuth : review+	Details \| Review
indexeddb SQLite3 high water mark logging 10 years ago Andrew Sutherland [:asuth] (he/him) 2.31 KB, patch		Details \| Diff \| Splinter Review
dmesg death, logcats from run with previous idb patch after clean reboot 10 years ago Andrew Sutherland [:asuth] (he/him) 48.30 KB, text/plain		Details

Naoki Hirata :nhirata (please use needinfo instead of cc)

Reporter

Description

•

10 years ago

Attached file email.log — Details

1. launch email app
2. set it up for your mozilla/work email (or whatever has a lot of emails)
3. keep scrolling

Expected: the app doesn't get killed
Actual: the app gets killed

Note: 
Gaia      f0872318d46781bb59d0a13a2e1fffb115b8ca2b
Gecko     https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/32bb713e6d0d
BuildID   20140418014001
Version   28.1
ro.build.version.incremental=eng.cltbld.20140418.052257
ro.build.date=Fri Apr 18 05:23:05 EDT 2014
Tarako

Naoki Hirata :nhirata (please use needinfo instead of cc)

Reporter

Updated

•

10 years ago

blocking-b2g: --- → 1.3T?

status-b2g-v1.3T: --- → affected

Summary: [Tarako] Email app can get LMK just browsing emails → [Tarako] Email app can get killed just browsing emails

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 1

•

10 years ago

We can uplift the body block cache management improvements from the virtual scroll patch on bug 796474 which should help since the log shows we are triggering snippet fetches which will result in body blocks being touched and thereby cached.  Between this and the snippet fetching rate of one batch every 2 seconds generating transient garbage, it probably gets bad.

The log indicates per-day messages of 167 and 166 which I would classify as medium-high density.

Andrew Sutherland [:asuth] (he/him)

Assignee

Updated

•

10 years ago

Summary: [Tarako] Email app can get killed just browsing emails → [Tarako][email] OOM death/email killed from message scrolling with snippet fetching at a message density of 166 messages/day

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 2

•

10 years ago

(aside: your logcat somehow lacked the pretty ANSI color codes that look cool; if it's possible to tee the file or otherwise retain the color codes, it's appreciated for the time being since our logging overhaul doesn't happen until 2.0).

Joe Cheng [:jcheng] (please needinfo)

Comment 3

•

10 years ago

ni? asuth
how risky for you think bug 796474 for tarako? since we do want to wrap up by the end of the month
or should we consider something like limiting the number of emails in the email app? 
what's your suggestion if we want to limit the number of emails? 
Thanks

Flags: needinfo?(bugmail)

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 4

•

10 years ago

Bug 796474 itself cannot be taken on v1.3T; it is way too risky.  The block cache management improvements are not risky and can be extracted and prepared as a v1.3T-specific patch.  (We could also land it on v1.3 if we wanted, too, I suppose.)

Flags: needinfo?(bugmail)

Joe Cheng [:jcheng] (please needinfo)

Comment 5

•

10 years ago

hi Andrew, if you believe block cache management can help this bug, can we try it? Thanks

Flags: needinfo?(bugmail)

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 6

•

10 years ago

Sure, we can try it, but we're busy with blockers and this is not a blocker yet.  (And in fact the code that we would uplift has not yet been reviewed; it's part of a 1.4 blocker.)  I'll leave the needinfo active so this is on the shortlist to focus on as email hackers become available.

Joe Cheng [:jcheng] (please needinfo)

Comment 7

•

10 years ago

let's have bug 796474 reviewed and landed first and then we can discuss further on whether the block cache management part can be extracted and prepared for 1.3T. thanks

Joe Cheng [:jcheng] (please needinfo)

Comment 8

•

10 years ago

given that bug 796474 has landed, Andrew, is it possible to try the block cache management improvement on 1.3T? thanks

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 9

•

10 years ago

Yes, I will prepare a branch and give some numbers to help justify approving it for uplift.  I'm nearly certain we'll want the change anyways, but the numbers will be very useful for any other 1.3T email decisions we have to make.

Assignee: nobody → bugmail

Status: NEW → ASSIGNED

Flags: needinfo?(bugmail)

Joe Cheng [:jcheng] (please needinfo)

Comment 10

•

10 years ago

thanks Andrew

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 11

•

10 years ago

Attached file Uplift of improved block handling — Details

Here's an uplift of the block handling patch.

=== Pre-patch baseline block usage
I established a baseline / verified prior block caching badness using just debug logging here:
https://github.com/asutherland/gaia/compare/mozilla-b2g:v1.3t...debug-report-block-usage?expand=1

Such a log looks like:
  WLOG: post flush blocks. header: 6 (27950 bytes) body: 20 (119155 bytes)

Note that the byte size estimate is actually an on-disk size estimate.  We expect memory usage to exceed that a bit.

Also note that was basically the high water mark for just scrolling.  Because of how the pre bug 796474 logic works, we effectively end up with an upper bound on the number of body blocks that can be in memory at a time.  But if someone went through their list of messages and displayed them all in the reader, it could end up much worse.  MUCH worse.  Thankfully there's no next/previous UI in v1.3.

=== Post-patch block usage
And then I ran with the fix and the debug logging on top:
https://github.com/asutherland/gaia/compare/mozilla-b2g:v1.3t...v1.3t-reduce-block-mem-withlog?expand=1

Looks like:
  WLOG: post flush blocks. header: 9 (44720 bytes) body: 1 (7545 bytes)


=== Bad news
We still die eventually.  I ran a bunch of "MOZ_IGNORE_NUWA_PROCESS=1 tools/get_about_memory.py" invocations.  What I basically found is:

- Sometimes having to load a font really betrays us.  We lose 2.88 MB of memory when we have to render Chinese characters.  Bug 997007 tracks :ting's investigation of this.

- What most seems to trigger the email app's death is just after performing "load more messages".  This will tend to be a relative peak memory usage time since we've got the following stuff that's either recent garbage or not yet garbage:
  - all the mozTCPSocket stuff from the sync process
  - the dirty header and (tiny) body blocks in the worker, plus they end up on the main page too because we have to manually proxy our IndexedDB
  - we've got that IndexedDB transaction which includes both the serialization and then the SQLite transaction/dirty pages/etc and includes our aggregate bundle of block-infos for all folders which is going to be larger.

- Second place killer is body fetching.  This is also going to generate some about of IndexedDB overhead.


=== Land the uplift?
Yeah, I think we should land the patch.  It's a clear win and it has worst-case protection built-in.  It might be buying us some time before we get killed in Naoki's use-case here, it's definitely going to help us survive longer if the user is reading all the messages they got.


=== What else can we do?
Not a lot.  We could figure out what the IndexedDB-related transient memory spike is but overhauling email's IndexedDB schema or IndexedDB's schema/etc. do not seem likely to be in the cards for v1.3t.

Attachment #8416326 - Flags: review+

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 12

•

10 years ago

Attached patch indexeddb SQLite3 high water mark logging — Details — Splinter Review

Improved news, everyone!

I was wondering what the SQLite transaction delta was so I made a little IndexedDB patch to log the data at the start/end of the transaction.  Installing this resulted in all the b2g processes getting reset.  I did not run any about:memory runs, meaning that any side-effecty bloat from that was avoided.

I managed to scroll down through the ~220 already-synced messages, and then "load more messages" and have background snippet fetching happen while I scrolled through another ~230 to bring us up to 450 messages synchronized.  We did, sadly, die at that point.  I revised my IDB patch, turned off marionette support and rebooted.  The next time we got to 621 messages

Note that I was using an account that's basically just subscribed to dev-gaia, so message density is about ~20 messages a day, but our sync sizes ended up being somewhat in the same ballpark of what Naoki was seeing since we'd end up in the 24-59 range usually.

Reported SQLite transactional memory swing (which is process-wide, not scoped to a connection) seemed to be 167K-ish most of the time, with the occasional example like the below where it was 246K.  I'll attach the log in the same attachment.

I was a little worried that a lot of our calls to saveAccountState do not wait for the write transaction to complete and so we might be causing build-up of non-trivial IndexedDB requests.  However, from the logging it appears we're actually doing okay and my (naive) reading of the kernel state dump when it killed us is that there was no file-system write backlogs built up.  (Those would in fact be hard to accumulate since IndexedDB-on-SQLite will be triggering fsync's.)

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 13

•

10 years ago

Attached file dmesg death, logcats from run with previous idb patch after clean reboot — Details

Note that I realize I misspoke about our sync sizes being comparable to what Naoki was testing with.  sync sizes of 60 are not comparable to sync sizes of 166 given the limited amount of memory we have available.

So I tested with a test account furnished by :doliver that's subscribed to an awful lot of bugzilla.

It had 667 messages in the last day as of sync time.  I triggered sync immediately following account setup.  We managed to get to SYNCDONE, but the UI (which unfortunately receives all 667 of them on v1.3 before deciding it really only needs about 45 of them) got pretty badly swamped.  We managed to get the writes to IndexedDB issued but ended up dead and so our transaction got backed out.

Hilariously, the cookie cache made it seem like the sync had succeeded when I reopened the email app.  Since the sync was an initial sync again we did a 3 day fetch that bisected down to one day and now wanted 670 messages.  This sync succeeded and checkpointed to the database.  However, the sync then proceeded to try and synchronize the day before that while cramming the results into the UI.  The sync died.

I reopened the app again; now sane scalability happened.  I had all 670 messages and could even trigger a refresh.  The neat thing is that I think v2.0 or v1.4 on Tarako might actually be able to survive that situation, especially if we optimize the virtual list in v2.0 to shrink more or have an explicit fixed window size around an anchorpoint.

Jason Smith [:jsmith]

Updated

•

10 years ago

Blocks: 1004326

Candice Serran (:cserran)

Updated

•

10 years ago

Target Milestone: --- → 2.0 S1 (9may)

Joe Cheng [:jcheng] (please needinfo)

Comment 14

•

10 years ago

triage: 1.3T+
ni? Andrew, is this ready to land? i supposed this is [tarako_only]

blocking-b2g: 1.3T? → 1.3T+

Whiteboard: [tarako_only]

Joe Cheng [:jcheng] (please needinfo)

Updated

•

10 years ago

Flags: needinfo?(bugmail)

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 15

•

10 years ago

landed on v1.3t:
https://github.com/mozilla-b2g/gaia/pull/18875
https://github.com/mozilla-b2g/gaia/commit/3492deabe5db9e1f96cb44f0e4f727f0e1c244ac

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

status-b2g-v1.3T: affected → fixed

status-b2g-v1.4: --- → unaffected

status-b2g-v2.0: --- → unaffected

Flags: needinfo?(bugmail)

Resolution: --- → FIXED

Whiteboard: [tarako_only] → [tarako_only][c= p=1]

Edward Chen[:edchen]

Comment 16

•

10 years ago

Over 700 mails to test OOM issue.

[Environment]
Gaia      dbc2b00f538df1d93960794fbb52aa84f6f19483
Gecko     https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/3b6c97b5fe8b
BuildID   20140514164005
Version   28.1
ro.build.version.incremental=eng.cltbld.20140514.202602
ro.build.date=Wed May 14 20:26:10 EDT 2014

[Result]
PASS

Status: RESOLVED → VERIFIED

You need to log in before you can comment on or make changes to this bug.