Closed Bug 1120875 Opened 9 years ago Closed 9 years ago

[email] Email app OOM-killed while downloading ~4M attachment on 319MB-configured v2.2 flame device

Categories

(Firefox OS Graveyard :: Gaia::E-Mail, defect)

Platform: ARM / Gonk (Firefox OS)
Type: defect
Priority: Not set
Severity: normal

Tracking

(blocking-b2g:-, b2g-v2.1 affected, b2g-v2.2 fixed, b2g-master fixed)

RESOLVED FIXED
2.2 S4 (23jan)
blocking-b2g -
Tracking Status
b2g-v2.1 --- affected
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: yue.xia, Assigned: asuth)

References

Details

Attachments

(9 files)

Attached file logcat
[1.Description]:
[Flame][v2.1&v2.2][Email] Log in to a mail account that has received a mail containing a ~4M video/music attachment. Download the attachment; after about 2~3 minutes, the device fails to download the attachment and automatically returns to the Home screen.
See attachment: logcat_0828.txt & Video2.MP4
Found time: 08:28

[2.Testing Steps]: 
Pre: Logged in to a mail account; the Wi-Fi network is poor.
1. Launch Email.
2. Receive a mail which contains a 4M video/music attachment.
3. Tap the mail to enter the detail view.
4. Tap the download button beside the 4M video/music attachment.

[3.Expected Result]: 
4. After about 2~3 minutes, the attachment is downloaded successfully.

[4.Actual Result]: 
4. After about 2~3 minutes, the device fails to download the attachment and automatically returns to the Home screen.

[5.Reproduction build]: 
Flame 2.1 build:
Gaia-Rev        1975241ac29f723479e6c60b2bf74ebed54da91a
Gecko-Rev       https://hg.mozilla.org/releases/mozilla-b2g34_v2_1/rev/0863fe4b75c3
Build-ID        20150112001215
Version         34.0

Flame 2.2 build:
Gaia-Rev        7c5b27cad370db377b18a742d3f3fdb0070e899f
Gecko-Rev       https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/2c37b89bdd86
Build-ID        20150112153951
Version         37.0a2

[6.Reproduction Frequency]: 
Occasional recurrence, 2/5

[7.TCID]: 
Free test
Attached video Video
Attached file logcat
Hi Eric,
I found that in the Flame 2.1/2.2 Browser, pinch-to-zoom on a webpage makes the device automatically exit the Browser and return to the Home screen. Maybe this is the root cause; could you help with it? Thanks!
STR:
1. Launch Browser and navigate to any webpage.
2. Pinch to zoom the screen several times.
** The device automatically exits the Browser and returns to the Home screen.
See attachments: logcat_1736.txt & Browser_video.MP4
Found time: 17:36
Rate: 5/5

Flame 2.1 build:
Gaia-Rev        6957ac8a322234ec99c8abb7cc18dc6a2e0176db
Gecko-Rev       https://hg.mozilla.org/releases/mozilla-b2g34_v2_1/rev/6600eba54256
Build-ID        20150113161204
Version         34.0
Device-Name     flame
FW-Release      4.4.2
FW-Incremental  eng.cltbld.20150113.192836
FW-Date         Tue Jan 13 19:28:47 EST 2015
Bootloader      L1TC000118D0

Flame 2.2 build:
Gaia-Rev        7c5b27cad370db377b18a742d3f3fdb0070e899f
Gecko-Rev       https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/748b20315f75
Build-ID        20150113162504
Version         37.0a2
Device-Name     flame
FW-Release      4.4.2
FW-Incremental  eng.cltbld.20150113.194114
FW-Date         Tue Jan 13 19:41:25 EST 2015
Bootloader      L1TC000118D0
Flags: needinfo?(echang)
Attached video Browser video
(In reply to Shine from comment #0)
> Created attachment 8548079 [details]
> logcat
> 

01-13 08:28:38.026 I/GonkMemoryPressure(  211): Checking to see if memory pressure is over.
01-13 08:28:38.026 I/GonkMemoryPressure(  211): Memory pressure is over.
01-13 08:28:38.026 I/GonkMemoryPressure( 3478): Checking to see if memory pressure is over.
01-13 08:28:38.026 I/GonkMemoryPressure( 3478): Memory pressure is over.

I see this memory-pressure information in the log.
(In reply to Shine from comment #2)
> Created attachment 8548689 [details]
> logcat
> 
> Hi Eric,
> I found that in Flame 2.1/2.2 browser, pinch to zoom in webpage, device will
> auto exit browser and go back to Home screen, maybe this is the root case,
> could you help with it, thanks!
> STR:
> 1. Launch Browser and link to any webpage.
> 2. pinch to zoom in screen several times.
> ** Device will auto exit browser and go back to Home screen.
> See attachments: logcat_1736.txt & Browser_video.MP4
> Found time: 17:36
> Rate: 5/5
> 
> Flame 2.1 build:
> Gaia-Rev        6957ac8a322234ec99c8abb7cc18dc6a2e0176db
> Gecko-Rev      
> https://hg.mozilla.org/releases/mozilla-b2g34_v2_1/rev/6600eba54256
> Build-ID        20150113161204
> Version         34.0
> Device-Name     flame
> FW-Release      4.4.2
> FW-Incremental  eng.cltbld.20150113.192836
> FW-Date         Tue Jan 13 19:28:47 EST 2015
> Bootloader      L1TC000118D0
> 
> Flame 2.2 build:
> Gaia-Rev        7c5b27cad370db377b18a742d3f3fdb0070e899f
> Gecko-Rev      
> https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/748b20315f75
> Build-ID        20150113162504
> Version         37.0a2
> Device-Name     flame
> FW-Release      4.4.2
> FW-Incremental  eng.cltbld.20150113.194114
> FW-Date         Tue Jan 13 19:41:25 EST 2015
> Bootloader      L1TC000118D0

Hi, Shine,

I suggest you file a separate bug; we prefer not to track two user scenarios in the same bug. We can mark it as a duplicate if the developers decide the root cause is the same. So, could you please file a new bug?
Thank you.
Flags: needinfo?(echang) → needinfo?(yue.xia)
As :whsu's comment 4 implies, it looks like there might be an out-of-memory problem going on. The provided logs also seem to indicate there may be trouble involving other processes.

I'd suggest reproducing using "firewatch" or the one built into the Firefox WebIDE while reproducing the email test.  Details on both can be found here: https://developer.mozilla.org/en-US/Apps/Build/Performance/Apps_for_low-memory_Firefox_OS_devices

Maybe take a screenshot of the firewatch/webIDE while running if it shows something weird happening.  Until bug 943798 in email is fixed, we do expect email's memory usage to grow, but it really should not be getting OOM-killed on a Flame device, especially if it's running with full memory.  The firewatch graph should help make it clear if this is email's problem or something else on the device is eating up memory.  (For example, a broken child-process causing the parent process to bloat, etc.)
See Also: → 1121871
> Hi, Shine,
> 
> I suggest you to file another new bug.
> We do not prefer to talk two user scenarios on the same bug.
> We can mark it as duplicate bug if developers think the root cause is the
> same.
> So, could you please file a new bug?
> Thanks you.

Hi William,
I have filed another new bug, please see bug 1121871, thanks!
Flags: needinfo?(yue.xia)
(In reply to Andrew Sutherland [:asuth] from comment #6)
> Like :whsu's comments implies in comment 4, it looks like there might be an
> out-of-memory thing going on.  The provided logs also seem to indicate there
> may be difficulties involving other processes.
> 
> I'd suggest reproducing using "firewatch" or the one built into the Firefox
> WebIDE while reproducing the email test.  Details on both can be found here:
> https://developer.mozilla.org/en-US/Apps/Build/Performance/Apps_for_low-
> memory_Firefox_OS_devices
> 
> Maybe take a screenshot of the firewatch/webIDE while running if it shows
> something weird happening.  Until bug 943798 in email is fixed, we do expect
> email's memory usage to grow, but it really should not be getting OOM-killed
> on a Flame device, especially if it's running with full memory.  The
> firewatch graph should help make it clear if this is email's problem or
> something else on the device is eating up memory.  (For example, a broken
> child-process causing the parent process to bloat, etc.)

Yes! I agree with Andrew.
To understand the memory usage, please provide a screenshot of the firewatch/WebIDE monitor as we did in bug 1121871 comment 2.
Many thanks.
Keywords: qawanted
(In reply to Shine from comment #7)
> > Hi, Shine,
> > 
> > I suggest you to file another new bug.
> > We do not prefer to talk two user scenarios on the same bug.
> > We can mark it as duplicate bug if developers think the root cause is the
> > same.
> > So, could you please file a new bug?
> > Thanks you.
> 
> Hi William,
> I have filed another new bug, please see bug 1121871, thanks!

Thanks Shine! :)
Hi William,
I have uploaded the screenshot of the webIDE and the logcat of Flame 2.2
See attachment:WebIDE Monitor Screenshot.PNG and logcat_1345.txt

Flame 2.2 build:
Gaia-Rev        7c5b27cad370db377b18a742d3f3fdb0070e899f
Gecko-Rev       https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/ce27f2692382
Build-ID        20150115002505
Version         37.0a2
Device-Name     flame
FW-Release      4.4.2
FW-Incremental  eng.cltbld.20150115.035734
FW-Date         Thu Jan 15 03:57:45 EST 2015
Bootloader      L1TC000118D0
Attached file logcat_1345.txt
Keywords: qawanted
(In reply to Sue from comment #10)
> Created attachment 8550116 [details]
> WebIDE Monitor Screenshot.PNG
> 
> Hi William,
> I have uploaded the screenshot of the webIDE and the logcat of Flame 2.2
> See attachment:WebIDE Monitor Screenshot.PNG and logcat_1345.txt
> 
> Flame 2.2 build:
> Gaia-Rev        7c5b27cad370db377b18a742d3f3fdb0070e899f
> Gecko-Rev      
> https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/ce27f2692382
> Build-ID        20150115002505
> Version         37.0a2
> Device-Name     flame
> FW-Release      4.4.2
> FW-Incremental  eng.cltbld.20150115.035734
> FW-Date         Thu Jan 15 03:57:45 EST 2015
> Bootloader      L1TC000118D0

Thanks Sue! :)
Good Job.
Yes, thanks for the combination of the log and the webide screenshot!

Based on the WebIDE graph, it looks like email was not using a lot of memory when it died, suggesting this could actually be a real crash. (Possibly memory pressure causing something incorrect to happen, or the GC pause making some type of race condition, and therefore a crash, much more probable.)

I guess the main questions are:
- Was crash reporting enabled and/or is there some other way to get a crash report like https://developer.mozilla.org/en-US/Firefox_OS/Debugging/Firefox_OS_crash_reporting describes?
- What does "adb shell dmesg" show right after the crash?  It seems like the OomLogger is not actually reporting anything to logcat for the device in question, so the contents of dmesg would need to be manually reported if this was an OOM case.  It should also perhaps have crash data in there, otherwise.

One other possibility that shouldn't be happening is that about:memory is getting triggered by something and that's causing a worker crash per bug 1122322.
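For reference, the "adb shell dmesg" check above can be scripted. The following is a hypothetical helper, not an existing tool; the message patterns it matches are assumptions, since exact kernel wording varies by device and kernel version:

```javascript
// Hypothetical helper: filter captured `adb shell dmesg` output down to
// kernel OOM / low-memory-killer lines. The patterns matched here are
// assumptions -- the exact message wording varies by kernel version.
function findOomKills(dmesgText) {
  return dmesgText
    .split('\n')
    .filter((line) => /lowmemorykiller|oom-killer|Out of memory/i.test(line));
}
```

One might save `adb shell dmesg > dmesg.txt` and run this filter over the file's contents in Node to spot kill events quickly.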
and for posterity/my laziness: here are my notes from the log and the screenshot:

log interesting stuff:
- 13:43:53 memory pressure over (and check)
- 13:44:17, search apparently killed
- 13:44:18, find my device apparently killed  [ looks like 13:43:47 on the pic? ]
- 13:44:19, homescreen apparently killed [does look a sec later]
- 13:44:41 memory pressure over (and check)
- 13:44:57, email crashed [ looks like 1:45:02 in pic? ]
- 13:45:01 memory pressure over (and check)


pic interesting stuff:
- 01:43:57 or so, big vertical memory drop in email
- 01:44:41 or so, drop in system memory in response to email memory spike?
- 01:44:47 or so, smaller, gradual memory drop in email
(In reply to Andrew Sutherland [:asuth] from comment #14)
> and for posterity/my laziness: here are my notes from the log and the
> screenshot:
> 
> log interesting stuff:
> ...
> - 13:44:57, email crashed [ looks like 1:45:02 in pic? ]
> - 13:45:01 memory pressure over (and check)
> 
> pic interesting stuff:
> - 01:43:57 or so, big vertical memory drop in email
> - 01:44:41 or so, drop in system memory in response to email memory spike?
> - 01:44:47 or so, smaller, gradual memory drop in email

WoW! Cool. Thanks for your information! :)
> I guess the main questions are:
> - Was crash reporting enabled and/or is there some other way to get a crash
> report like
> https://developer.mozilla.org/en-US/Firefox_OS/Debugging/
> Firefox_OS_crash_reporting describes?
> - What does "adb shell dmesg" show right after the crash?  It seems like the
> OomLogger is not actually reporting anything to logcat for the device in
> question, so the contents of dmesg would need to be manually reported if
> this was an OOM case.  It should also perhaps have crash data in there,
> otherwise.

Hi, Sue,

Sorry to bother you again and again.
Could I have your help?
Please follow the guideline Andrew provided and collect 1) the crash report and 2) the Linux kernel log ("adb shell dmesg").
Thanks!
Flags: needinfo?(lulu.tian)
Attached file logcat_1115.txt
Hi William,
I have followed the guideline to collect the crash report, but I couldn't acquire it.
After I enter the E-Mail app, it refreshes the mail list, and after about 2 minutes the E-Mail app exits. See attachment: logcat_1115.txt
After that, I ran the command "adb shell ls -l /data/b2g/mozilla/Crash\ Reports/submitted/" to get the crash ID, but it reports "/data/b2g/mozilla/Crash Reports/submitted/: No such file or directory".
I also ran "adb shell dmesg"; see attachment: dmesg.txt

Flame 2.2 build:
Gaia-Rev        f5b3d1b6cfa3e702033f613915ae637cb735cbfb
Gecko-Rev       https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/8067c111ddff
Build-ID        20150118002501
Version         37.0a2
Device-Name     flame
FW-Release      4.4.2
FW-Incremental  eng.cltbld.20150118.035516
FW-Date         Sun Jan 18 03:55:27 EST 2015
Bootloader      L1TC000118D0
Flags: needinfo?(lulu.tian) → needinfo?(whsu)
Attached file dmesg.txt
Thank you very much for the dmesg log!  It indicates this is an OOM kill, so we would not expect a crash report in about:crashes; don't worry about that.

We're going to need to spin off a bug about the OOM killer logger no longer being useful (or make sure the bug already exists).

Sue, can you indicate how much memory your device is configured to run with?  This information can be sorta retrieved by doing "adb shell cat /proc/meminfo".  Doing a quick tally from the dmesg, it looks like the device is running with something like 343+ MiB?
Flags: needinfo?(lulu.tian)
Hi Andrew,
The device memory is 319M (via the command "fastboot getvar mem").
The output of "adb shell cat /proc/meminfo" is:
MemTotal:         219772 kB
MemFree:            3196 kB
Buffers:             760 kB
Cached:            45924 kB
SwapCached:          524 kB
Active:            71288 kB
Inactive:          66892 kB
Active(anon):      49516 kB
Inactive(anon):    49412 kB
Active(file):      21772 kB
Inactive(file):    17480 kB
Unevictable:        1100 kB
Mlocked:               0 kB
HighTotal:             0 kB
HighFree:              0 kB
LowTotal:         219772 kB
LowFree:            3196 kB
SwapTotal:        196604 kB
SwapFree:         175372 kB
Dirty:                44 kB
Writeback:             0 kB
AnonPages:         92244 kB
Mapped:            43224 kB
Shmem:              6332 kB
Slab:              21740 kB
SReclaimable:       7628 kB
SUnreclaim:        14112 kB
KernelStack:        3352 kB
PageTables:         2032 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:      306488 kB
Committed_AS:     211992 kB
VmallocTotal:     704512 kB
VmallocUsed:       25760 kB
VmallocChunk:     542212 kB
Flags: needinfo?(lulu.tian) → needinfo?(bugmail)
Hm, so, this sounds like a case where the garbage collection may need to be more aggressively tuned on 319MB/small-memory devices.  I'll ask around/investigate some.  Leaving needinfo against me active.
Ugh, so, this seems to be more than just a GC-tuning thing.  The email app is generating way too much garbage way too fast.  Like 45 megabytes in 7 seconds for an 11 megabyte download over wi-fi where that time sample was definitely not the entire file.

This is probably going to have been an email.js regression and we're going to need to look into the browserbox code path.  Note that there also may be some improvements that could be made from a GC/GC-tuning perspective (and 2.1 and 2.2 may have regressed here on the GC front), but there's only so much the GC can do if we're generating unreasonable amounts of garbage, especially since I think our garbage may usually involve the cycle collector which has a non-trivial runtime cost.

I'll do some quick preliminary investigations.
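To illustrate the class of problem being described (this is an illustrative sketch, not the actual mimefuncs code or patch): building a binary string from a downloaded buffer one byte at a time creates a short-lived intermediate string per byte, while converting in large chunks produces the same result with orders of magnitude fewer allocations for the GC to clean up.

```javascript
// Illustrative sketch only -- NOT the actual mimefuncs code or patch.
// Appending one character per byte allocates a new intermediate string on
// every iteration: exactly the kind of garbage churn that pressures the GC
// during a multi-megabyte attachment download.
function bytesToBinaryStringNaive(bytes) {
  let str = '';
  for (let i = 0; i < bytes.length; i++) {
    str += String.fromCharCode(bytes[i]); // new string allocated each pass
  }
  return str;
}

// Converting in large chunks drastically cuts intermediate allocations
// while producing the same output.
function bytesToBinaryStringChunked(bytes, chunkSize = 0x8000) {
  const parts = [];
  for (let i = 0; i < bytes.length; i += chunkSize) {
    parts.push(String.fromCharCode.apply(null, bytes.subarray(i, i + chunkSize)));
  }
  return parts.join('');
}
```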
blocking-b2g: --- → 2.2?
I've tracked down some obvious badness in mimefuncs and prepared a pull request to get us back to our pre-email.js-conversion behaviour: https://github.com/whiteout-io/mimefuncs/pull/10

This causes the heap to grow at less than half the prior rate during initial ramp-up.  I was also able to download a 4262K file on a 319M flame with the fix.  However, the memory usage showed by the monitor still demonstrated some ridiculous memory swings.  We're not good at logging the various parts of the download/save life-cycle; it's possible these are the instantaneous memory doublings of base64-decoding or the like (and it's possible there are regressions there too).  So more investigation is needed.  (And an 11 megabyte file didn't work, but it's not clear if that would actually constitute a regression.)
(In reply to Andrew Sutherland [:asuth] from comment #23)
> I've tracked down some obvious badness in minefuncs and prepared a pull
> request to get us back to our pre-email.js conversion behaviour:
> https://github.com/whiteout-io/mimefuncs/pull/10
> 
> This causes the heap to grow at less than half the prior rate during initial
> ramp-up.  I was also able to download a 4262K file on a 319M flame with the
> fix.  However, the memory usage showed by the monitor still demonstrated
> some ridiculous memory swings.  We're not good at logging the various parts
> of the download/save life-cycle; it's possible these are the instantaneous
> memory doublings of base64-decoding or the like (and it's possible there are
> regressions there too).  So more investigation is needed.  (And an 11
> megabyte file didn't work, but it's not clear if that would actually
> constitute a regression.)

Thanks, Andrew, for the detailed information (and the pull request).
If you still need to collect more data, please feel free to let me know.
Many thanks.
Flags: needinfo?(whsu)
triage: not blocking, but :asuth is actively working on the fix and will request 2.2 approval.
blocking-b2g: 2.2? → -
Flags: needinfo?(bugmail)
Summary: [Flame][Email]About 2~3 minutes later, the attachment haven't been downloaded and device auto go back to Home screen. → [email] Email app OOM-killed while downloading ~4M attachment on 319MB-configured v2.2 flame device
landed a garbage generation mitigation on trunk:

https://github.com/whiteout-io/mimefuncs/pull/10
https://github.com/whiteout-io/mimefuncs/commit/e37b8f7d976d83d6ae126a33c1b9adeef880332a

https://github.com/mozilla-b2g/gaia-email-libs-and-more/pull/360
https://github.com/mozilla-b2g/gaia-email-libs-and-more/commit/504e6873d13d04bd02ae9c40b5c5aa51371bc9b3

https://github.com/mozilla-b2g/gaia/pull/27552
https://github.com/mozilla-b2g/gaia/commit/afe7398fb15b96b83f3d176d0e1135ee137da1f6


It's important to note that this is not a 100% fix, and it can't be without platform improvements.  The email bug for streaming attachments to disk is bug 943798 and it is dependent on some platform changes that have not landed.

The main set of mitigations we have after this are:
- Change the value of MAX_ATTACHMENT_SIZE in apps/email/js/cards/message_reader.js to a smaller value, such as 5 megabytes.  This is what we did on Tarako.  https://github.com/mozilla-b2g/gaia/blob/afe7398fb15b96b83f3d176d0e1135ee137da1f6/apps/email/js/cards/message_reader.js#L996 is what would be modified

- Look for additional easy optimizations that can be made in the email.js dataflow-path.

- Tune some GC parameters.
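As a minimal sketch of what the first mitigation above amounts to: the constant name comes from the cited message_reader.js, but the helper function and attachment shape here are hypothetical, for illustration only.

```javascript
// Hypothetical sketch of the MAX_ATTACHMENT_SIZE mitigation described above.
// The constant name matches apps/email/js/cards/message_reader.js; the
// attachment shape and guard function are illustrative, not the real Gaia code.
const MAX_ATTACHMENT_SIZE = 5 * 1024 * 1024; // e.g. lowered to 5 MiB, as on Tarako

function isDownloadable(attachment) {
  // Refuse downloads likely to OOM-kill the app on small-memory devices.
  return attachment.sizeEstimateInBytes <= MAX_ATTACHMENT_SIZE;
}
```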


I am marking this bug fixed because we have a landed patch and it gets confusing to have multiple patches on multiple bugs.  In the event this does not sufficiently address the desired use-cases, we should spin off a new bug.  Especially if multiple things need to be considered, it might make sense to file a new meta bug that we can discuss on and then do any specific fixes/investigations on bugs that block that bug.  Since most of this bug was us not being sure that this was an OOM issue, I don't think there's any big loss in using a new bug.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
[Approval Request Comment]
[Bug caused by] (feature/regressing bug #): Memory usage was altered by the adoption of email.js; but there's also been a lot of JS engine and GC changes.  Some of this is emergent stuff.
[User impact] if declined: The email app seems to get OOM killed with high probability while downloading an ~4M file without this patch.  It seems happy (or much happier) with the patch.
[Testing completed]: Testing on v2.2 flame device configured with 319MB, existing unit tests, upstream unit tests.
[Risk to taking this patch] (and alternatives if risky):  Very low risk, pretty clear-cut win.  Certainly significantly less risky than doing things like changing the GC settings for all apps!
[String changes made]: No string changes made.
Attachment #8552354 - Flags: review+
Attachment #8552354 - Flags: approval-gaia-v2.2?
Attachment #8552354 - Flags: approval-gaia-v2.2? → approval-gaia-v2.2+
v2.2: https://github.com/mozilla-b2g/gaia/commit/010c4c69f101b0747d939089a8df9ab012485e6d
Assignee: nobody → bugmail
Target Milestone: --- → 2.2 S4 (23jan)
Blocks: 1133668
Hi Helen,

Could you help to verify this bug? thanks.
Flags: needinfo?(hedan)
Keywords: verifyme
It seems that there is no improvement in this bug.

According to comment 26, let us look at this bug again after bug 943798 is fixed.

Leaving the verifyme keyword as a reminder.
Flags: needinfo?(hedan)
This issue is verified as NOT fixed. I can reproduce the low-memory kill on Flame master with 319MB. We're no longer using the 319MB configuration for testing, so I'm not sure if this bug is still relevant.

Bug reproduced 1 out of 1 attempt on each of the following branches:

Device: Flame 2.6 (319MB, full flashed)
BuildID: 20151110030237
Gaia: c0482775b1526add626b170dd53a72d10bcaf07c
Gecko: cc473fe5dc512c450634506f68cbacfb40a06a23
Gonk: 205ac4204bbbb2098a8046444acba551ba5dc75a
Version: 45.0a1 (2.6) 
Firmware Version: v18Dv4
User Agent: Mozilla/5.0 (Mobile; rv:45.0) Gecko/45.0 Firefox/45.0

Device: Flame 2.5 (319MB, full flashed)
BuildID: 20151109004552
Gaia: cf646c52bb947af28329b0a100df91d1b1f2a907
Gecko: 4eafef5b80f8985c94c4a067f130d37513e1a581
Gonk: 205ac4204bbbb2098a8046444acba551ba5dc75a
Version: 44.0a2 (2.5) 
Firmware Version: v18Dv4
User Agent: Mozilla/5.0 (Mobile; rv:44.0) Gecko/44.0 Firefox/44.0

Device: Flame 2.2 (319MB, full flashed)
BuildID: 20151110032503
Gaia: 885647d92208fb67574ced44004ab2f29d23cb45
Gecko: ac5fce5a78e5
Gonk: bd9cb3af2a0354577a6903917bc826489050b40d
Version: 37.0 (2.2) 
Firmware Version: v18Dv4
User Agent: Mozilla/5.0 (Mobile; rv:37.0) Gecko/37.0 Firefox/37.0

Removing verifyme tag and NI assignee as FYI.
QA Whiteboard: [QAnalyst-Triage?] [failed-verification]
Flags: needinfo?(jmercado)
Flags: needinfo?(bugmail)
Keywords: verifyme
Since we did land a mitigation long ago and I doubt the mitigation regressed, if we cared about 319M we would spin off a new bug.  But I don't believe we do care about 319M.
Flags: needinfo?(bugmail)
I think that since we are no longer testing on 319 MB, and this DOES work on 512 MB, we can consider this bug closed.
QA Whiteboard: [QAnalyst-Triage?] [failed-verification] → [QAnalyst-Triage+]
Flags: needinfo?(jmercado)