Closed Bug 998408 Opened 6 years ago Closed 6 years ago

[tarako] sending a large email can cause the email app to get killed

Categories

(Firefox OS Graveyard :: Gaia::E-Mail, defect)

ARM
Gonk (Firefox OS)
defect
Not set

Tracking

(blocking-b2g:-)

RESOLVED WORKSFORME
2.0 S1 (9may)
blocking-b2g -

People

(Reporter: nhirata, Assigned: asuth)

References

Details

(Whiteboard: [tarako-bug-bash-1.3T][partner-blocker][sprd308061])

Attachments

(1 file)

Attached file email_logcat
1. launch email
2. setup email
3. create an email to some address
4. attach a 2 meg video

Expected: email doesn't bother sending the large email due to size
Actual: email tried and dies.

Note: 
1. the occurance is towards the end of the logcat

Gaia      f0872318d46781bb59d0a13a2e1fffb115b8ca2b
Gecko     https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/32bb713e6d0d
BuildID   20140418014001
Version   28.1
ro.build.version.incremental=eng.cltbld.20140418.052257
ro.build.date=Fri Apr 18 05:23:05 EDT 2014
Tarako
blocking-b2g: --- → 1.3T?
Duplicate of this bug: 998430
(Nick filed the bug in bug bash)
Whiteboard: [tarako-bug-bash-1.3T]
The log shows we were intelligently streaming from disk in a chunked fashion with a network transmission rate of 136 KiB/sec (17 Mbits/sec).  My guess is we were killed by transient memory usage because of our high turnover because of the good network connection.  I suspect we wouldn't have crashed if the cellular modem was in play.

I'm not sure there's too much e-mail can do about this other than artificially lowering our throughput.  Unless we're somehow leaking or badly fragmenting memory (which is not indicated by tests when our functionality landed where files on the order of 100 MB were sent), we need the garbage collector to be much more aggressive.
Hm, so upon seeing the description in bug 998430 ("Sending and email closes (crashes?) the app.  Email not listed under sent, but was received.") I'm a little confused.  The log implies that we never actually finished loading the last chunk, but receipt of the message implies we did finish loading the chunk and sent it and closed out the SMTP connection.

The lack of the matching log messages implies one of the following:
- we did finish, but buffering betrayed us.  This means __android_log_print is buffered or dump() is buffered prior to __android_log_print in the main window context now.  http://dxr.mozilla.org/mozilla-central/source/dom/base/nsGlobalWindow.cpp#5968 still looks the same and implies synchronous logging, so I must concede to being confused.
- The server somehow interpreted our aborted connection as "what the hell, send the e-mail!".

Q1: Nick, can you verify that the message is not corrupt and/or provide a copy of the message you received?  https://wiki.mozilla.org/Gaia/Email/ProvidingEmailsForDebugging lists how to provide the raw message, but just saving the attachment and verifying the length/md5 on the source file and the received file will likely provide sufficient clarity there.

If the logging inconsistency is a question of log buffering, then it's possible that something weird happened when we went to APPEND the message to the sent folder, so...

Q2: Nick: Was this a gmail account?  We don't IMAP APPEND on gmail accounts.

Assuming it happened, in general the IMAP APPEND should be a similar situation to the SMTP send.  We intentionally do not persist the job to disk to avoid wasted disk space (also beneficial for memory purposes), and the IMAP APPEND should leverage the same Blob-slicing magic that SMTP uses, so it's memory usage shouldn't be any worse (unless there are platform issues).
Flags: needinfo?(nick)
I sent the email to Marcia, who may or may not still have it.  It was from a mozilla.com to a mozilla.com email address.  Marcia, do you still have the first test email I sent you yesterday?
Flags: needinfo?(nick) → needinfo?(mozillamarcia.knous)
(In reply to Nick Desaulniers [:\n] from comment #5)
> I sent the email to Marcia, who may or may not still have it.  It was from a
> mozilla.com to a mozilla.com email address.  Marcia, do you still have the
> first test email I sent you yesterday?

Looks as if I deleted it, but I do recall the email had an animated gif of some sort at the bottom.
Flags: needinfo?(mozillamarcia.knous)
qawanted to confirm this bug again. thanks
Keywords: qawanted
Flags: needinfo?(nhirata.bugzilla)
Naoki is going to try this again and have more comments in here.
While commenting on another bug I realized that although we take tiny bites (96 KiB) of Blobs when sending/appending Blobs, when base64-encoding the attachments we take much bigger bites for efficiency reasons (~512KiB) but these also result in non-trivial output "bites" (~700 KiB).  The comment and constant BLOB_BASE64_BATCH_CONVERT_SIZE can be found at:
https://github.com/asutherland/gaia-email-libs-and-more/blob/master/data/lib/mailapi/drafts/jobs.js#L33

Note that in the provided log the encoding stage has been over for a while by the time the email app gets killed.  Refreshing my understanding of jemalloc briefly from its manpage (http://www.canonware.com/download/jemalloc/jemalloc-latest/doc/jemalloc.html), assuming these allocations come from jemalloc and not a JS arena allocator, we appear to be falling in the "large" chunk bucket.  (The general gecko setting in memory/build/jemalloc_config.c sets our defaults to lg_chunk:20 which is 2^20 = 1MiB).  Our choice of constants could be interacting badly here to bloat our image.  More investigation or just shifting the constant down and trying that out (in the generated apps/email/js/ext/mailapi/worker-bootstrap.js) may be advisable.  Choosing a value of "1896 * 57" should bring us in at just under 256 KiB per batch so we should end up fitting 4 batches in a single jemalloc chunk.
blocking-b2g: 1.3T? → 1.3T+
(In reply to Joe Cheng [:jcheng] from comment #7)
> qawanted to confirm this bug again. thanks

Following the STR from Comment 0 on the latest 1.3T build, the behavior is the Email is attempted to be sent for a file labeled 2MB (2015kb = 1.96MB). On the sending email page, a spinning black square displays in a clockwise manner for over 7 minutes then the echo effect is heard resembling a sent message. The message is received at the recipient email address with the 2MB attachment and can be viewed after download. 

Following the STR from Comment 0 for a 7MB file, a message appears that the attachment is too large without an attempt made to send. 

Environmental Variables:
Device: Tarako 1.3T MOZ
BuildID: 20140429014002
Gaia: b5adc5a943d3abbd6ab070a47c847f2c24891cc5
Gecko: e9890f5d4709
Version: 28.1
Firmware Version: sp6821a_gonk4.0_user.pac
Keywords: qawanted
Taking this to see if we can lower our (peak) memory usage a bit by adjusting the setting in comment 9 and possibly adding some throttling to give GC a chance to catch up.

Re: comment 10 repro where we did successfully send the message, it sounds like the reproduction was done using either slow wi-fi or a cellular network connection instead of wi-fi.  This is sorta consistent with my guess in comment 3 about giving GC a chance. to catch up with us.  We may also be benefiting from increased headroom from JS optimizations.

I'll make sure to test with a file size just under our current send limit on 1.3T (5 MiB unless it got changed on branch; will check).
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Flags: needinfo?(nhirata.bugzilla)
Whiteboard: [tarako-bug-bash-1.3T] → [tarako-bug-bash-1.3T][partner-blocker][sprd308061]
Target Milestone: --- → 2.0 S1 (9may)
I ran a bunch of tests on high-speed wi-fi with the current v1.3T code (case A) and the code modified as I proposed in comment 9 (case B).  In all cases I was able to send a video recorded on the device with an on-disk size of 4824271 bytes (4.6 MiB, 4.82 MB) without problems as long as the screen stayed on.  The exception I am excluding is the one time I let the screen turn off and then the email app died.

Sadly, it's not going to be intuitive to the user that they need to keep tapping the screen to stop the email app from getting dead.  I think what we need to do / the only thing we can do here is to create a v1.3t variant on bug 965139.  In that bug for v1.3 :psingapati provided a patch where we hold the wi-fi wake-lock during the time of the send.  But we'll need to amend it slightly further.  So what we'd do is:

- Hold the 'screen' wakelock during the send process.
- Cause the composite/account sendMessage process not return success until the sent message has been saved to the sent folder.  Otherwise we may send the message but fail to save it in the sent folder.
- We need to add a cancel button to the UI.  Right now there is no cancel button, unfortunately because of the risks below, if we're adding a wakelock like this, we *reallllllly* need a cancel button, and that's basically a new feature.  (Unless we expect the user to just try and stop things by hitting the power button / hitting the home button, and that kills us.)  Note that we will need to reuse an existing "cancel" string.  Hopefully they were all localized generically.

The general risks to this strategy are that:
- The user gets tired of watching us slowly send the message and hits the power button anyways to turn off the screen or hits the home button.  And we still die anyways.
- There's risk that things fail in such a way that we never want to release our wake-lock.  The user drops the phone in their pocket / bag / whatever and we end up draining their battery and making the device real hot, etc.

I am moving this back to 1.3T? since I think the central issue of this bug has now morphed and addressing that issue has non-trivial risk.  I see there is an associated partner bug which may have a different understanding of the bug, but it's my understanding those bugs aren't in English, so I'll need someone to help out there too.

== Other Notes

- As alluded to in comment 10, at some point in the send process a memory-pressure event or something like that will cause the spinner progress graphic to get killed and replaced by just a black square.

== Details of memory investigation

For data I ended up using the graphing web UI of https://github.com/digitarald/firewatch with "hideSystem" in static/index.js set back to "false" so that I could see what the b2g system process was doing.  The system app ends up with a lot of the memory overhead of this process, so it's important to be aware of its memory usage.

As far as I can tell both ways are equally good, which is to say they're equally bad.  Our memory usage always peaks during the attachment encoding process.  Peaks are comparable.  Memory falls off afterwards.  Usage varies afterwards with the system app's memory usage rising a lot.

I think an ideal tool for investigating this would track GC's/CC's on a timeline along with some of the cheaper-to-compute memory statistics.  I think there is some work in this direction, but I don't think it's yet available and definitely not on v1.3t.  (And perturbation is likely to be a major issue.)  If someone is aware of such things, please let me know (and let everyone dev-gaia know too.)

Tests run using:
Gecko     https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/dfc111de2faa
BuildID   20140504164001
and my gaia checkout was: 1177a857a3caeb8fd1feae94c83298a9144c2ff5

My testing procedure was:
- Use a non-gmail IMAP server so we have to APPEND our message to the sent folder.
- Trigger compose from inside the email app.
- Trigger attaching from inside the email app via the paperclip icon
- Pick the videos app
- Pick the video which is reported as 5 megs that I previously recorded.
- Verify the choice by hitting the confirm/whatever icon in the upper right of the video app.
- Watch the logcat to confirm that the attaching process has completed.  (This is not strictly required, because of the was the job queueing is done, the only thing that hitting send does before we're ready is ensure that there is zero slack time.)
- Wait a varying amount of time to sometimes see if anything interesting happens GC-wise.
- Hit the send button.
- Make sure to tap the screen during the process so we don't turn off
- Wait for the send to succeed.
- Watch the logcat and wait for the append to succeed, tapping the header occasionally to make sure the screen doesn't turn off.
blocking-b2g: 1.3T+ → 1.3T?
Andrew, could we hold a wakelock while we are sending data? Would that mitigate the issue when we go offscreen?
need to retest.
Flags: needinfo?(nhirata.bugzilla)
Keywords: qawanted
Using today's Tarako build I WAS able to reproduce this using slightly different STR

1) Launch Email app
2) Setup Email account
3) Create a new email message
4) Tap on the attachment icon> Camera> Video
5) Take a video for about 1 min 30 sec (to get a 2MB video)
6) Send Email

The email will try to send for a while and then crash and not send

1.3t Environmental Variables:
Device: Tarako 1.3t
BuildID: 20140507014006
Gaia: 25a17d9d7143977ea9a81ed310098e326609d248
Gecko: 68a2f24960d2
Version: 28.1
Firmware Version: SP6821a-Gonk-4.0-4-29
Keywords: qawanted
Would comment 15 be completely different or would it hit the same code path you were looking at?
Flags: needinfo?(nhirata.bugzilla) → needinfo?(bugmail)
(In reply to Fabrice Desré [:fabrice] from comment #13)
> Andrew, could we hold a wakelock while we are sending data? Would that
> mitigate the issue when we go offscreen?

Yes, we could try that but there are complicating factors, see comment 12.  Note that I made a typo about the enhancement that made us hold a wifi wakelock in bug 965139; that was landed for v1.4 and is not present on the v1.3 branch.

Also worth noting that it's possible :gsvelto's work on bug 994518 (as initially raised on bug 989713 comment 72) could address the LMK implications us becoming doomed by the lock screen and its homescreen bias and eliminate a need to hold the screen wakelock.  It might still be appropriate to uplift at least the wifi wakelock from bug 965139 (which gets us cpu wakelock too) to help make sure the file goes through.

Aside: We are implementing send-in-background for v2.0 and that's drastically improving our wake-lock story for sending emails.  It will integrate keep-alive-based-on-activity with failsafe timeouts so we don't need to worry about pathological failure cases.
Flags: needinfo?(bugmail)
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #16)
> Would comment 15 be completely different or would it hit the same code path
> you were looking at?

From the email perspective, comment 15 is this bug.  But a question is whether recording a new video results in bloating the system processes's memory footprint in some way and/or causes the kernel to have dirty filesystem-data that reduces the effective available memory.

I'm still rather unclear about what process is being triggered when the email app triggers an inline activity.  I don't see any other process in b2g-info.  Are they running in the email process and potentially bloating up our process's memory footprint?
asuth which would help more: getting a profile of the email app, getting more b2g-info, or getting an about-memory report ?
Flags: needinfo?(bugmail)
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #19)
> asuth which would help more: getting a profile of the email app, getting
> more b2g-info, or getting an about-memory report ?

Looking at comment 15 again, it involves creating a new account.  As discussed on other email Tarako bugs, I think at this point we just need to accept that email has a high probability of getting low-memory-killed if an account has been freshly created given how GC seems to be working on the device.

If we crash with a variant of comment 15 where you start up email with an existing account rather than creating a new account, then that's a more serious problem.

For now I'm just going to make this bug depend on bug 994518 which :gsvelto is already working.  If that gets fixed and we're still having problems sending, I think we definitely should pursue the wakelock uplift.


As an aside about comment 15:  Please, please, please, please *always* provide a logcat whenever doing anything with the email app as requested at https://wiki.mozilla.org/Gaia/Email/RequiredBugInfo.  Also please make sure that the logcat includes timestamp info and the OomLogger class of information as described therein.
Depends on: 994518
Flags: needinfo?(bugmail)
(In reply to Andrew Sutherland (:asuth) from comment #20)
> Looking at comment 15 again, it involves creating a new account.  

I didn't have to create a new account to reproduce the issue.  Steps 3 to 6 worked with an old account as well.
Flags: needinfo?(bugmail)
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #21)
> I didn't have to create a new account to reproduce the issue.  Steps 3 to 6
> worked with an old account as well.

Interesting.  Do you have/can you produce a logcat?  We could also be seeing bug 982779.  (Not the memory-backed parts, but possibly mystery blob things.)  I would like to determine whether we're OOM dying versus bad IPC dying.
Flags: needinfo?(bugmail) → needinfo?(nhirata.bugzilla)
Er, and because of bug 1001206 we'll also need the dmesg from the device since the OomLogger is unreliable.
This is weird.  I can't seem to reproduce this on today's build.
Looking at the repo's there looks like some kernel changes.  I asked dharris to retest.
Flags: needinfo?(nhirata.bugzilla)
Flags: needinfo?(dharris)
I also was UNABLE to reproduce this issue on today's tarako build after 10 tests. Also by using video's that were 3MB and 4MB



1.3t Environmental Variables:
Device: Tarako 1.3t
BuildID: 20140509014003
Gaia: d07f2dff31e71aed9711d988677fbc5b2c6e83e4
Gecko: 59b3ea7e75fb
Version: 28.1
Firmware Version: SP6821a-Gonk-4.0-4-29
Flags: needinfo?(dharris)
Hi Andrew, we can't seem to reproduce the issue.  I think there might have been some kernel fixes that helped fix the issue.  Should we close this off as WFM now then?
Flags: needinfo?(bugmail)
Close a bug as WFM?  Music to my ears!  (And since two separate people failed to repro it's less likely to be a fluke.)  Resolving WFM.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: needinfo?(bugmail)
Resolution: --- → WORKSFORME
clearing nom flag since bug doesn't occur anymore.
blocking-b2g: 1.3T? → -
You need to log in before you can comment on or make changes to this bug.