Closed Bug 1063658 Opened 10 years ago Closed 10 years ago

Work around memory-backed Blob lifetime issues when passed between processes by Web Activities by saving them to DeviceStorage-backed temporary files in .gallery/tmp and returning those instead.

Categories

(Firefox OS Graveyard :: Gaia::Gallery, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(blocking-b2g:2.0+, b2g-v2.0 verified, b2g-v2.0M verified, b2g-v2.1 verified, b2g-v2.2 verified)

VERIFIED FIXED
2.1 S6 (10oct)
blocking-b2g 2.0+
Tracking Status
b2g-v2.0 --- verified
b2g-v2.0M --- verified
b2g-v2.1 --- verified
b2g-v2.2 --- verified

People

(Reporter: asuth, Assigned: djf)

References

Details

(Keywords: regression, smoketest)

Attachments

(13 files)

On bug 1062262 we're seeing the following phenomenon on FxOS v2.0 through trunk that's very reminiscent of bug 982779 which was supposed to fix this phenomenon (and landed on v2.0):

- Email app triggers gallery app via web activity.  Gallery app was not previously running, so a process is created.
- User selects an image that needs to be rotated per EXIF.  Gallery app rotates it and creates a memory-backed Blob.  User hits done, the memory-backed Blob is returned via the activity.
- The email app gets the Blob and starts doing its game where it slices up the Blob into chunks and uses XHR to fetch those chunks sequentially.
- The email app dies!  Well, sometimes it doesn't die and instead the email app's attaching process hangs.


From :djf and the bug, my understanding is that we haven't been hearing more about this because all FxOS devices except Tarako have the cameras configured so that we do not need to perform dynamic rotation.  And in general Camera/Gallery are using disk-backed Blobs.  Also, it seems like this depends on the gallery app's process getting killed; if the gallery app is open in the background for other reasons, things don't die horribly, etc.


Here's an excerpt from the v2.1 logcat I'm attaching.  PID-wise, 3884 is email, 4036 is the gallery process, 1429 is the parent process.

09-05 13:08:36.896 I/GeckoDump( 3884): LOG: compose: attach activity success: attachment1 [details] [diff] [review].jpg
09-05 13:08:37.086 I/Gecko   ( 3884): WLOG: attachBlobToDraft: fetching 0 to 524286 of 1923917
09-05 13:08:37.186 I/Gecko   ( 3884): WLOG: attachBlobToDraft: fetched
09-05 13:08:37.206 I/Gecko   ( 4036): ###!!! [Child][MessageChannel] Error: Channel closing: too late to send/recv, messages will be lost
09-05 13:08:37.226 I/Gecko   ( 1429): ###!!! [Parent][MessageChannel] Error: Channel error: cannot send/recv
09-05 13:08:37.226 I/Gecko   ( 1429): ###!!! [Child][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
 * that last one repeated a lot *
09-05 13:08:37.226 I/Gecko   ( 1429): ###!!! [Child][MessageChannel] Error: Channel error: cannot send/recv
09-05 13:08:37.226 I/Gecko   ( 1429): [Parent 1429] WARNING: FileDescriptorSet destroyed with unconsumed descriptors: file ../../../gecko/ipc/chromium/src/chrome/common/file_descriptor_set_posix.cc, line 20
 * that last one repeated a lot *
09-05 13:08:37.276 I/Gecko   ( 3884): WLOG: attachBlobToDraft: flushing
09-05 13:08:37.416 I/Gecko   ( 3884): WLOG: attachBlobToDraft: fetching 524286 to 1048572 of 1923917
09-05 13:08:37.426 I/Gecko   ( 1429): ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1A0003,name=PContent::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
09-05 13:08:37.426 I/Gecko   ( 1429): ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x60003,name=???) Route error: message sent to unknown actor ID
09-05 13:08:37.426 I/Gecko   ( 1429): [Parent 1429] WARNING: waitpid failed pid:3884 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 261
 * asuth thinks everything else is boring because we never hear from PID 3884 or 4036 again *


A 2.0 log excerpt is much less chatty and looks more like (3769=email, 5039=gallery 619=parent), although the chattiness could be random:

I/Gecko   ( 3769): WLOG: attachBlobToDraft: fetching 0 to 524286 of 1923917
I/Gecko   ( 5039): ###!!! [Child][MessageChannel] Error: Channel closing: too late to send/recv, messages will be lost
I/Gecko   ( 3769): WLOG: attachBlobToDraft: fetched
I/Gecko   (  619): [Parent 619] WARNING: pipe error (238): Connection reset by peer: file ../../../gecko/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 452
I/Gecko   ( 3769): WLOG: attachBlobToDraft: fetching 524286 to 1048572 of 1923917
I/Gecko   (  619): ###!!! [Parent][DispatchAsyncMessage] Error: Value error: message was deserialized, but contained an illegal value
I/Gecko   (  619): [Parent 619] WARNING: waitpid failed pid:3769 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 261
I/Gecko   (  619): ###!!! [Parent][MessageChannel] Error: Channel error: cannot send/recv
any chance you can grab a log with MOZ_IPC_MESSAGE_LOG=1 ?
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #1)
> any chance you can grab a log with MOZ_IPC_MESSAGE_LOG=1 ?

Sure.  https://wiki.mozilla.org/IPDL/Five_minute_example and the fact that my first attempt didn't work seems to imply that I need a DEBUG build, so spinning one of those.
Flags: needinfo?(bugmail)
This is potentially the less boring log to look at.  I grepped it down but will also include the full log next.  Key facts:

- PIDs: root=8244, email=8646, gallery=8703

- I configured/built by using B2G_DEBUG=1 in my environment and putting it in the b2g .config file.  The resulting build is somewhat slow and sucky on the Flame.  This impacts things.

- The email app did not crash in this case, but this appears to be a timing thing.  The gallery app is dead before the email app manages to even start processing the Blob and our IndexedDB transaction manages to die somehow even.  I am assuming all the mystery blob stuff may still be useful.

- I bounded the pre-grepped log to start at the first mention of the gallery process being triggered.  Its first line lines up with the last line of this excerpt here:
"""
09-05 22:49:37.061 I/GeckoDump( 8646): LOG: compose: attach: triggering web activity
09-05 22:49:38.871 I/Gecko   ( 8244): [Parent 8244] ###!!! ASSERTION: old rule tree still referenced: 'Not Reached', file ../../../gecko/layout/style/nsStyleSet.cpp, line 1969
09-05 22:49:38.871 I/Gecko   ( 8244): [Parent 8244] ###!!! ASSERTION: old rule tree still referenced: 'Not Reached', file ../../../gecko/layout/style/nsStyleSet.cpp, line 1969
09-05 22:49:39.181 I/GeckoDump( 8244): XXX FIXME : Got a mozContentEvent: activity-choice
"""

- The log excerpt lines up with when things got boring here:
09-05 22:50:11.391 I/Gecko   ( 8244): [Parent 8244] ###!!! ASSERTION: old rule tree still referenced: 'Not Reached', file ../../../gecko/layout/style/nsStyleSet.cpp, line 1969

- The filter I used is:
grep -i --regexp blob --regexp LOG --regexp ERR --regexp WARN --regexp IndexedDB --regexp activity flame-kk-trunk-blob-sadness-MOZ_IPC_MESSAGE_LOG\=1.txt > flame-kk-trunk-blob-sadness-filtered.txt
Flags: needinfo?(bugmail)
uh, and for posterity, since I wasn't actually quite sure how to get the log to happen, this is what I ran from my linux command line to get my logs after first doing "adb shell stop b2g".  If you are lazy and don't provide all the environment variables, then web activities won't work and you'll have a bad time:

adb shell MOZ_IPC_MESSAGE_LOG=1 GRE_HOME=/system/b2g LD_PRELOAD=/system/b2g/libmozglue.so LD_LIBRARY_PATH=/vendor/lib:/system/lib:/system/b2g /system/b2g/b2g &> /path/to/logfile
I worked around whatever that IndexedDB problem is by using an existing draft.  (Draft saving worked fine if I just did "start b2g", but my MOZ_IPC_MESSAGE_LOG=1 usage would break it.)

- PIDs: root=10699, email=11134, gallery=11169

Not much extra happens in the log.  Mainly this:

WLOG: attachBlobToDraft: fetching 0 to 524286 of 1923917
[time:1409974229582548][11134->10699][PContentChild] Sending Msg_PBlobConstructor([TODO])
[time:1409974229586959][11134->10699][PBlobChild] Sending Msg_PBlobStreamConstructor([TODO])
[time:1409974229594530][10699<-11134][PIndexedDBTransactionParent] Received Msg_AllRequestsFinished([TODO])
[time:1409974229600288][10699<-11134][PContentParent] Received Msg_PBlobConstructor([TODO])
[Parent 10699] WARNING: NS_ENSURE_SUCCESS(rv, nullptr) failed with result 0x8000FFFF: file ../../../gecko/dom/ipc/Blob.cpp, line 1876
[time:1409974229619743][10699->0][PIndexedDBTransactionParent] Sending Msg_Complete([TODO])
[Parent 10699] WARNING: pipe error (192): Connection reset by peer: file ../../../gecko/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 451
[Parent 10699] WARNING: pipe error: Broken pipe: file ../../../gecko/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 728
[Parent 10699] WARNING: pipe error: Broken pipe: file ../../../gecko/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 728
[Parent 10699] WARNING: pipe error: Broken pipe: file ../../../gecko/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 728
There's some stuff in the logcat that doesn't seem to have come out in the log too:

09-05 23:30:28.930 I/Gecko   (11134): WLOG: attachBlobToDraft: retrieving message
09-05 23:30:29.180 E/OomLogger(10699): [Kill]: lowmem_shrink: convert oom_adj to oom_score_adj:
09-05 23:30:29.300 E/OomLogger(10699): [Kill]: lowmem_shrink: convert oom_adj to oom_score_adj:
09-05 23:30:29.390 E/OomLogger(10699): [Kill]: lowmem_shrink: convert oom_adj to oom_score_adj:
09-05 23:30:29.570 I/Gecko   (11134): WLOG: attachBlobToDraft: fetching 0 to 524286 of 1923917
09-05 23:30:29.590 I/Gecko   (10699): [Parent 10699] WARNING: NS_ENSURE_SUCCESS(rv, nullptr) failed with result 0x8000FFFF: file ../../../gecko/dom/ipc/Blob.cpp, line 1876
09-05 23:30:29.590 I/Gecko   (10699): 
09-05 23:30:29.590 I/Gecko   (10699): ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1C0003,name=PContent::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
09-05 23:30:29.590 I/Gecko   (10699): 
09-05 23:30:29.590 I/Gecko   (10699): 
09-05 23:30:29.590 I/Gecko   (10699): ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x60003,name=???) Route error: message sent to unknown actor ID
09-05 23:30:29.590 I/Gecko   (10699): 
09-05 23:30:29.620 I/Gecko   (10699): [Parent 10699] WARNING: pipe error (192): Connection reset by peer: file ../../../gecko/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 451
09-05 23:30:29.620 I/Gecko   (10699): [Parent 10699] WARNING: pipe error: Broken pipe: file ../../../gecko/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 728
09-05 23:30:29.620 I/Gecko   (10699): [Parent 10699] WARNING: pipe error: Broken pipe: file ../../../gecko/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 728
09-05 23:30:29.620 I/Gecko   (10699): [Parent 10699] WARNING: pipe error: Broken pipe: file ../../../gecko/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 728
09-05 23:30:29.630 I/Gecko   (10699): 
09-05 23:30:29.630 I/Gecko   (10699): ###!!! [Parent][MessageChannel] Error: Channel error: cannot send/recv
09-05 23:30:29.630 I/Gecko   (10699): 
09-05 23:30:29.630 I/Gecko   (10699): 
09-05 23:30:29.630 I/Gecko   (10699): ###!!! [Parent][MessageChannel] Error: Channel error: cannot send/recv
09-05 23:30:29.630 I/Gecko   (10699): 
09-05 23:30:29.630 I/Gecko   (10699): 
09-05 23:30:29.630 I/Gecko   (10699): ###!!! [Parent][MessageChannel] Error: Channel error: cannot send/recv
09-05 23:30:29.630 I/Gecko   (10699): 
09-05 23:30:29.630 I/Gecko   (10699): 
09-05 23:30:29.630 I/Gecko   (10699): ###!!! [Parent][MessageChannel] Error: Channel error: cannot send/recv
09-05 23:30:29.630 I/Gecko   (10699): 
09-05 23:30:29.630 I/Gecko   (10699): 
09-05 23:30:29.630 I/Gecko   (10699): ###!!! [Parent][MessageChannel] Error: Channel error: cannot send/recv
09-05 23:30:29.630 I/Gecko   (10699): 
09-05 23:30:29.650 I/Gecko   (10699): 
09-05 23:30:29.650 I/Gecko   (10699): ###!!! [Parent][MessageChannel] Error: Channel error: cannot send/recv
09-05 23:30:29.650 I/Gecko   (10699):
:bent, can you or someone else review the MOZ_IPC_MESSAGE_LOG=1 logs?  I do understand you are all super-busy and last week was a focused work week.

[Blocking Requested - why for this release]:
Memory-backed Blobs causing process death is a potentially major problem for anything using web activities.  This affects v2.0-v2.2 on at least Flame and Nexus 4; per :djf it sounds like devices we actually ship don't need to perform EXIF rotation and so the gallery/email repro case may not be a huge problem for our shipped apps on known devices.  However, I worry about the ecosystem ramifications, so I think since we're still well inside 2.1 stabilization and trunk drift should hopefully be contained, it seems good try and get this addressed for v2.1 and evaluate v2.0 uplift.  (Hopefully the recent IndexedDB/PBackground stuff is orthogonal.)
blocking-b2g: --- → 2.1?
Flags: needinfo?(bent.mozilla)
We landed a workaround for SMS in bug 1059233 (not yet in v2.1 or v2.0, so it's easy to check on these branches), I think it's related to this issue. I'd like to remove the workaround if possible.

Andrew, do you see the issue in v1.4? I also asked for a regression window in bug 1059233, hopefully we'll know what caused this.
"This affects v2.0-v2.2 on at least Flame and Nexus 4; per :djf it sounds like devices we actually ship don't need to perform EXIF rotation and so the gallery/email repro case may not be a huge problem for our shipped apps on known devices. "

Andrew, do we truly need to hold 2.1 on this? If it's not this, it's another activity-chain with another memory inflation scenario, right? I'm also concerned about the risk of changes like this in stabilization period.
blocking-b2g: 2.1? → backlog
Flags: needinfo?(bugmail)
Sorry, I'm not able to look at this any time soon...
[Blocking Requested - why for this release]:  Re-requesting because there was some confusion in comment 10 about the manifestation of this problem and the risk profile.  See below for more.

(In reply to Dietrich Ayala (:dietrich) from comment #10)
> Andrew, do we truly need to hold 2.1 on this? If it's not this, it's another
> activity-chain with another memory inflation scenario, right? I'm also
> concerned about the risk of changes like this in stabilization period.

This is not a memory-inflation scenario, it's death-by-web-activity-by-Blob.  The problem case here is:
- App "gonna-die" triggers an activity that opens app "returns-a-memory-backed-Blob" which was not already active.
- "returns-a-memory-backed-Blob" does new Blob([stuff], stuff) and returns it via its activity.
- RACE RACE RACE RACE RACE
  - "returns-a-memory-backed-Blob"'s process is terminated because it's no longer needed
  - "gonna-die" gets a reference to the Blob.  Attempts to use the Blob may either fail or result in "gonna-die" dying.  If "gonna-die" is very fast it may be able to use the Blob before "returns-a-memory-backed-Blob".

This is fundamentally a platform issue.  One that we can mitigate in our own apps (as we did for Tarako and 1.3 and may end up doing for 2.0/2.1/2.2), but if we want to be all "hey, other developers, it's a platform you can use", it's arguably friendly for us to not leave random app-breaking and app-killing edge-cases all over the place.

I'm arguing for trying to fix this earlier in 2.1 stabilization rather than later because the longer we wait to address this the greater the chance that an uplift to v2.1 becomes infeasible.

There are mitigations we can put in place:
- (limited) just putting some Gallery-mitigations back in place that already existed (see https://bugzilla.mozilla.org/show_bug.cgi?id=1059233#c16) to protect apps from death-by-Gallery-app, but the platform issues still remain.
- (limited) having the system app keep triggered processes around "long enough" to ensure the Blob is fine.  Unfortunately "long enough" is not something you can really determine.
blocking-b2g: backlog → 2.1?
Flags: needinfo?(bugmail)
(In reply to Julien Wajsberg [:julienw] from comment #9)
> Andrew, do you see the issue in v1.4? I also asked for a regression window
> in bug 1059233, hopefully we'll know what caused this.

Hadn't tried to reproduce, just updated my hamachi on 1.4 to usefully reproduce, but its wi-fi is broken.  So perhaps it makes sense to trust the bug 1059233 regression check, but keeping in mind that because of the involvement of races and all the other factors involved and the lack of a canonical STR on bug 1059233 with the regression check (it doesn't address things like making sure the gallery app was closed when triggering the activity), I'm not going to put too much stock in it.
I understand what you mean Andrew. I'll do more test today. I think we need a fix at least in v2.0 about this, and maybe the fix should be the workaround in the Gallery app. 

We also need to fix this issue for good, and find why bug 982779 regressed or maybe did not fix the issue correctly. I have a NI in bug 1059233 to look further today.
For the issue we're seeing in bug 1059233, I'm confident it works in v1.4 (tried latest peak build from geeksphone) and fails in v2.0 (latest Flame build from OTA).

Now I'll try to reproduce the regression window found by QA.
This is also affecting SMS in bug 1059233. As alternatives:

* we could uplift the workaround from bug 1059233 already landed in master. I'd rather backout it and this wouldn't fix the email bug anyway.
* apply the ugly workaround in Gallery from Bug 975599 (was removed in bug 1006919) again, but I think :djf doesn't want this.

The regression window in bug 1059233 wasn't helpful and only pointed to bug 1006919, which is strange because it was marked as "verified" back at the time. Anyway it doesn't point to any gecko regression. When I try to check the regression window myself, I get a crash as soon as I get the "crop" window, which makes it impossible to verify the bug (I don't know how QA managed to get a regression window..).

I'll now try the STR from bug 1062262.
blocking-b2g: 2.1? → 2.0?
I reproduce the issue in bug 1062262 in v1.4; I also reproduce the issue in SMS if I use a small enough image. This is quite funny (sort of) because I definitely don't reproduce bug 1059233 on v1.4.
STR for SMS:
1. have attachment 8492258 [details]
2. open SMS and start a new message
3. press "pick" button and choose gallery
4. choose image from step 1, don't crop and accept
5. tap the image in SMS' composer, then tap view

=> the gallery crashes instantly
Bhavana,

Please see blocking requests above from Andrew and Julien
Flags: needinfo?(bbajaj)
It is several months too late to block 2.0 on this.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #21)
> It is several months too late to block 2.0 on this.

Can you clarify?  Is this a pedantic statement or is a fix intractable for v2.0 for some reason?  For the email app at least we're still getting/have v2.0 blockers.  I think the idea would be that it would be good to prioritize an investigation of this problem with targeted uplift of any fix (if viable) to v2.0.  We will likely have to workaround in gaia on v2.0 and arguably partners are going to be throwing around "acceptance blocker" and all that stuff if it's not fixed one way or the other.
I have no idea what the cause of the bug is, so I can't say whether it's possible to put together a fix for 2.0 or not.  Based on our past experience here I'd expect any fix to be relatively complicated.

But more importantly, it's unlikely that we will have resources (meaning me or bent) to investigate and fix this on any short time scale.
(In reply to Julien Wajsberg [:julienw] from comment #19)
> => the gallery crashes instantly

Julien, if this is reliably reproducible for you, is there any chance you can attach gdb to the gallery app before you do this and get a stack trace of the crash?
Flags: needinfo?(bent.mozilla)
Yep, I'll try this.
Flags: needinfo?(felash)
Triage group reviewed - blocking+ for 2.0. There are enough comments about the user impact in this bug to have great concern. If porting a fix to 2.0 becomes impossible, renom.
Ben, I have some difficulties having a debug build. I built for Hamachi but the /system partition is too small for a debug build; and I can't have a working backup-flame for Flame. So I'm quite stuck for today.

If you want to try on your side, you need to backout bug 1059233 to be able to reproduce the STR in comment 19.

Keeping the NI.
David, how do you feel having your ugly workaround on v2.0 again?
Flags: needinfo?(dflanagan)
blocking-b2g: 2.0? → 2.0+
(In reply to Julien Wajsberg [:julienw] from comment #27)
> Ben, I have some difficulties having a debug build.

Thanks Julien. You shouldn't need a DEBUG build just to get a stack trace of the crash though. As long as you build it yourself I think gdb should be able to produce something usable.

Of course, a DEBUG build would be a lot better, since it would have assertions/warnings too...
Got a debug build, but couldn't get a gdb stacktrace... There is no more preallocated process so I can't attach to it, and I couldn't get a stacktrace when email failed using the STR from bug 1062262.

Will try to get some help tomorrow but I'm sure it would be easier and faster if someone else try to do it.
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #29)
> (In reply to Julien Wajsberg [:julienw] from comment #27)
> > Ben, I have some difficulties having a debug build.
> 
> Thanks Julien. You shouldn't need a DEBUG build just to get a stack trace of
> the crash though. As long as you build it yourself I think gdb should be
> able to produce something usable.

So actually not, I couldn't get any stack trace, either from a DEBUG or non-DEBUG build. It's probably me though :/
Flags: needinfo?(felash)
Flags: needinfo?(felash)
(In reply to Julien Wajsberg [:julienw] from comment #31)
> (In reply to ben turner [:bent] (use the needinfo? flag!) from comment #29)
> > (In reply to Julien Wajsberg [:julienw] from comment #27)
> > > Ben, I have some difficulties having a debug build.
> > 
> > Thanks Julien. You shouldn't need a DEBUG build just to get a stack trace of
> > the crash though. As long as you build it yourself I think gdb should be
> > able to produce something usable.
> 
> So actually not, I couldn't get any stack trace, either from a DEBUG or
> non-DEBUG build. It's probably me though :/

Alex, can you help Julien here?
Flags: needinfo?(lissyx+mozillians)
(In reply to Julien Wajsberg [:julienw] from comment #28)
> David, how do you feel having your ugly workaround on v2.0 again?

I don't like the idea, but if that is what we need to do, at least we know we have some tested code we can use.

Note, however, that the workaround from 975599 won't work as-is. That workaround was only for the case where the user was picking an image and cropping it (turning the original file-backed blob into a memory-backed blob.)

The STR in comment 19 explicitly say not to crop the image.  In this case, however, the test image that Julien has attached uses EXIF orientation.  Since EXIF orientation does not work on the web, the gallery rotates the image before returning it from the pick activity. So in this case (with this specific test image) the image is being converted from a file-backed blob to a memory-backed blob without any cropping.

The share activity does something similar: it will return a memory-backed blob if the image being shared relies on EXIF orientation.

So we'll need to modify the patch from bug 975599 so that it saves any memory-backed image, either picked or shared, to the card.

Let me know if this becomes necessary. It is probably most efficient for me to do that work.
Flags: needinfo?(dflanagan)
So far, we were unable to reproduce on my Flame.
Flags: needinfo?(lissyx+mozillians)
Attached file stacktrace
09-26 15:53:36.157   297   297 I Gecko   : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1C0003,name=PContent::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
So the signal received is a SIGTERM, that comes from the creation of the mMonitor() at http://dxr.mozilla.org/mozilla-central/source/dom/ipc/Blob.cpp#167
So we managed to reproduce on Alexandre's Flame, but only once we reduced the available memory. (on my own Flame I had 1G and the bug reproduced... let's just say it can depend on the available memory).

Stack trace has been provided so removing NI for now. But please ask if you need anything from me.
Flags: needinfo?(felash)
This looks like we're just out of memory...
I doubt it, the workaround patch in bug 1059233 fixes the issue, but it's cloning the blob in memory... That would surely fail if we're out of memory.
Ben do you see any reason for getting SIGTERM in a memset() call?
Flags: needinfo?(bent.mozilla)
So, SIGTERM comes from http://hg.mozilla.org/mozilla-central/annotate/5e704397529b/ipc/chromium/src/base/process_util_posix.cc#l70

Let's continue to find why.
Assignee: nobody → lissyx+mozillians
Flags: needinfo?(bent.mozilla)
Attached file ipc error stacktrace
stacktrace when KillProcess is called
It means that:
> 09-29 15:46:01.616  3928  3928 I Gecko   : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1C0003,name=PContent::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value

is a valid error message, I think.

> (gdb) f 5
> #5  0xb5192ce6 in mozilla::ipc::MessageChannel::OnMaybeDequeueOne (this=<optimized out>) at ../../../gecko/ipc/glue/MessageChannel.cpp:1098
> 1098	        DispatchMessage(recvd);
> (gdb) l
> 1093	        AutoEnterRPCTransaction transaction(this, &recvd);
> 1094	
> 1095	        MonitorAutoUnlock unlock(*mMonitor);
> 1096	
> 1097	        CxxStackFrame frame(*this, IN_MESSAGE, &recvd);
> 1098	        DispatchMessage(recvd);
> 1099	    }
> 1100	    return true;
> 1101	}
> 1102	
> (gdb) print recvd 
> $1 = {<Pickle> = {static kPayloadUnit = 1179403647, header_ = 0xac58db00, header_size_ = 32, capacity_ = 128, variable_buffer_offset_ = 0}, _vptr.Message = 0xb664e490, file_descriptor_set_ = {
>     mRawPtr = 0x0}, name_ = 0xb626927a "PContent::Msg_PBlobConstructor"}
(gdb) print params
$2 = {mValue = {VChildBlobConstructorParams = "\000\021ҩ", '\000' <repeats 13 times>, "\004\000\000\000\000\000\000FR+\266\000\000\000\000\003\000\000\000@\201\064\251\003\000\000", 
    VParentBlobConstructorParams = "\000\021ҩ", '\000' <repeats 13 times>, "\004\000\000\000\000\000\000FR+\266\000\000\000\000\003\000\000\000@\201\064\251\003\000\000\000\254\321?\264\001\000\000"}, mType = mozilla::dom::BlobConstructorParams::TParentBlobConstructorParams}
Comment on attachment 8496843 [details]
MsgValueError stacktrace

It appears that AllocPBlobParent returned null, which means you probably want to look at BlobParent::Create.
That also means that the child is sending a MysteryBlob constructor.
No, sorry, the Slice call could be failing instead.
So, the AllocPBlobParent() call, which is the one failing in attachment 8496843 [details] indeed calls BlobParent::Create(), http://hg.mozilla.org/mozilla-central/annotate/5e704397529b/dom/ipc/Blob.cpp#l1117

I see no code handing the type visible comment 46: BlobConstructorParams::TParentBlobConstructorParams
Looks like it's slice-related.
According to attachment 8496852 [details], it would be CreateSlice() that fails because of mActor being NULL, and making the AllocPBlobParent() failing in result.
FYI, I have very easy reproductibility with:
 - the image provided by julien
 - flame set at 319MB
 - revering bug 1059233
> 09-29 17:17:32.056 13662 13662 I Gecko   : BlobParent::RemoteBlob::CreateSlice this=0xac148280 mActor=0xa92e8800
> 09-29 17:17:36.406 13662 13662 I Gecko   : BlobParent::RemoteBlob::CreateSlice this=0xac148280 mActor=0x0
> 09-29 17:17:36.406 13662 13662 I Gecko   : 
> 09-29 17:17:36.406 13662 13662 I Gecko   : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1C0003,name=PContent::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
Ben, maybe you can have another look with all those new infos ?
Flags: needinfo?(bent.mozilla)
Ok, it looks like the child sends the parent a blob and we copy it over. Then the child dies for some reason (attachment 8496896 [details] says it's a channel error, but I bet that's just due to bug 1035454). Then the parent tries to slice the blob and fails because we've already cleaned up the child actors.

There's no short-term fix for this. We've been talking about redoing the way we slice blobs, but that's going to be a scary and invasive patch.
Flags: needinfo?(bent.mozilla)
Thanks! What should we be doing then, if there is no safe fix we can provide given it's 2.0+ ?
Flags: needinfo?(bent.mozilla)
Flags: needinfo?(anygregor)
[Blocking Requested - why for this release]:

(In reply to Dietrich Ayala (:dietrich) from comment #26)
> Triage group reviewed - blocking+ for 2.0. There are enough comments about
> the user impact in this bug to have great concern. If porting a fix to 2.0
> becomes impossible, renom.
blocking-b2g: 2.0+ → 2.0?
Flags: needinfo?(anygregor)
Well, I'd retest now that bug 994190 landed, but if that doesn't help then we could change the activities code to keep the child process alive a little bit longer...
Flags: needinfo?(bent.mozilla)
IMO the best bet is to land the workaround in Gallery. Kind of sad but the safest, and possible for v2.0.

The consequences in SMS are not that bad, I could live with it, but the consequences in E-Mail are really bad.
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #61)
> Well, I'd retest now that bug 994190 landed, but if that doesn't help then
> we could change the activities code to keep the child process alive a little
> bit longer...

Do we expect this to 100% address the problem or just make us more likely to win a race?  Specifically, does:
A) the recipient process (email) need the originating process (gallery) to still be alive when it goes to use the Blob
or just
B) the parent process needs the originating process (gallery) to still be alive to finish the process of completely replicating the Blob in the parent process, but then email can wait as long as it wants to do things with the Blob.

Because if it's "B" that sounds great.
Flags: needinfo?(bent.mozilla)
[Blocking Requested - why for this release]:

Going to 2.1? as it seems that this would not impact the known to be releasing 2.0 devices.

Please let me know if my understanding is incorrect, otherwise lets get this fixed on 2.1 and do an 2.0 uplift approval if available.
blocking-b2g: 2.0? → 2.1?
Looks like I reproduced this on my Flame, after restoring my Nexus S profile and opening a smart collection (Social).
[Blocking Requested - why for this release]:

(In reply to Wayne Chang [:wchang] from comment #64)
> [Blocking Requested - why for this release]:
> 
> Going to 2.1? as it seems that this would not impact the known to be
> releasing 2.0 devices.

Where do you get this, Wayne?
blocking-b2g: 2.1? → 2.0?
Leaving the bug so that someone else can address this.
Assignee: lissyx+mozillians → nobody
If we land the workaround in the gallery, does it fix the problem for email?

There's an unknown amount of risk here (with an unknown fix) and at this point in 2.0's timeline it seems tight.
Flags: needinfo?(bugmail)
(In reply to Andrew Overholt [:overholt] from comment #68)
> If we land the workaround in the gallery, does it fix the problem for email?

Not the existing workaround, no.  See comment 33.
Flags: needinfo?(bugmail)
:djf,

Given the above comment can you comment on how risky would it be to re-write and uplift a patch per https://bugzilla.mozilla.org/show_bug.cgi?id=1063658#c33 to fix this issue in 2.0 ?
Flags: needinfo?(bbajaj)
Keywords: qawanted
(NI implied per comment 70)
Flags: needinfo?(dflanagan)
I think this QA-Wanted tag was added by accident instead of a NI
Keywords: qawanted
(In reply to Julien Wajsberg [:julienw] (PTO until 7th October, ask :steveck or :azasypkin for SMS stuff) from comment #66)
> [Blocking Requested - why for this release]:
> 
> (In reply to Wayne Chang [:wchang] from comment #64)
> > [Blocking Requested - why for this release]:
> > 
> > Going to 2.1? as it seems that this would not impact the known to be
> > releasing 2.0 devices.
> 
> Where do you get this, Wayne?

I thought comment 6 suggested that devices need to be doing EXIF rotation to hit this.

Andrew,
Can you provide an easy to follow STR so we can ask QA to test this on the known launching 2.0 devices to assess this? Thanks!

Mike,
Once we have the STR can you please test a few devices (check with Rachelle/Howie on which to test)?

Thanks.
Flags: needinfo?(bugmail)
Flags: needinfo?(mlien)
(In reply to Wayne Chang [:wchang] from comment #73)
> Can you provide an easy to follow STR so we can ask QA to test this on the
> known launching 2.0 devices to assess this? Thanks!

Download the first attachment (90 degrees) from bug 1062262 to any Firefox OS device.  Try to attach it to an email via the "gallery" app.  Observe failure to attach with high probability.

As noted above in the bug, this may also reproduce on freshly taken pictures on devices that take pictures with EXIF rotation rather than rotating the image before saving to disk.  I have no innate knowledge of shipping 2.0 devices or their camera capabilties and would have to do extensive research of private documents to figure this out.  I'm currently dealing with several blockers and I believe we have multiple layers of management that keep track of this, so I'll defer to them to perform the analysis.
Flags: needinfo?(bugmail)
NI me, I want to check whether the email bug happens when cropping too.
Flags: needinfo?(felash)
(In reply to bhavana bajaj [:bajaj] from comment #70)
> :djf,
> 
> Given the above comment can you comment on how risky would it be to re-write
> and uplift a patch per
> https://bugzilla.mozilla.org/show_bug.cgi?id=1063658#c33 to fix this issue
> in 2.0 ?

Not very risky. We had it working in 1.3 and 1.4. I think I can generalize it to handle the EXIF rotation as well as the cropping case and also the share activity as well as the pick activity.

Given Ben's pessimism about a platform fix in comment #58, and given that bug 1077106 is probably the same underlying issue, I'm taking this bug and will put the hack back in gallery and will request uplift approval for 2.1 and 2.0.
Assignee: nobody → dflanagan
Flags: needinfo?(dflanagan)
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #58)
> 
> There's no short-term fix for this. We've been talking about redoing the way
> we slice blobs, but that's going to be a scary and invasive patch.

Ben: do you have a bug filed for the long-term fix? I'd like to have a bug number I can cite in my workaround.
Yep, bug 1079546.
Flags: needinfo?(bent.mozilla)
I can reproduce this issue in the following cases:

1) Start email app. Pick a photo from Gallery that uses EXIF rotation. (test image 55 e.g.)
2) Start MMS app. Pick a photo from Gallery that uses EXIF rotation. (test image 55 e.g.)
3) Start MMS app. Pick a photo photo that does not have a small file size (a photo or either of the test images that display test patterns, e.g.) and crop it.
4) Start Gallery share a photo with EXIF rotation (test photo 54, e.g.) with the email app
5) Share test photo 54 with the MMS app
6) Share test photo 54 with the Wallpaper app
This patch fixes the bug in each of the scenarios described above.

I'm asking for Punam's review because she worked on this workaround the first time around. Punam: I've restructured the code a bit, but this should all look very familiar to you.

Andrew: I'm asking for your review because you've got fresh eyes and your app is most impacted by this bug.

When we landed this workaround the first time for v1.3 and 1.4, we never had any tests written for it. I'm comfortable landing it again without tests. (But if there is disagreement about that from the reviewers, I'll see what I can do, though it will depend on whether we've got a good MockDeviceStorage already written somewhere.)
Attachment #8501423 - Flags: review?(pdahiya)
Attachment #8501423 - Flags: review?(bugmail)
I should mention that the test images I refer to in comment #79 are in apps/gallery/test/images/
blocking-b2g: 2.0? → 2.0+
(In reply to David Flanagan [:djf] from comment #79)
> I can reproduce this issue in the following cases:
> 
> 1) Start email app. Pick a photo from Gallery that uses EXIF rotation. (test
> image 55 e.g.)
> 2) Start MMS app. Pick a photo from Gallery that uses EXIF rotation. (test
> image 55 e.g.)
> 3) Start MMS app. Pick a photo photo that does not have a small file size (a
> photo or either of the test images that display test patterns, e.g.) and
> crop it.

Actually, the trick is really to crop it small enough so that the resulting image is < 300K, the bug should happen with any image in that case.


> 4) Start Gallery share a photo with EXIF rotation (test photo 54, e.g.) with
> the email app
> 5) Share test photo 54 with the MMS app
> 6) Share test photo 54 with the Wallpaper app
Comment on attachment 8501423 [details] [review]
link to patch on github

r=asuth.  I also have some minor enhancements that I think would be nice to have, I've put them in 2 separate commits available at https://github.com/asutherland/gaia/compare/mozilla-b2g:master...asutherland:gallery-nuts-to-temp-files?expand=1 that you should feel free to squash into your patch as you like.

They are:
1) Have the pick activity provide the original filename as a separate "name" in the activity so that the filename doesn't end up being the new random filename you generated.  Email already had some old fallback logic in place for this, but in the wrong order for this use-case.  I'm assuming we used to use "name" somewhere or still do somewhere elsewhere.

2) Have the file killer also guard against files that are from the future.  In general I think our devices prefer living in the past rather than the future so this shouldn't be necessary, but it seems dangerous to otherwise potentially let a huge number of files accumulate if someone does run into this problem.  The flip side of this is that if the device has some kind of absurd time-date oscillation that the temp files might end up only living as long as a single pick/share activity.  However, that's fine for the email app.


Over-engineering-wise, I think it'd be preferable to have some kind of numeric/LRU-style bound on killing the temporary files but am fine with what we have right now on the basis that:
- We shipped something like this before
- Simple is good
- It helps protect apps that make risky assumptions about the lifetime of the user's images.  A day seems to strike a nice balance.
- The most semantically correct option of using/abusing reference-counted IndexedDB blobs seems likely to have its own scary risk problems, especially given the recent regressions we've seen in this area.


I tested on a v2.2 flame device with share and pick and manually used "touch -t" to verify that the reaping logic obliterates the future and the past correctly.
Attachment #8501423 - Flags: review?(bugmail) → review+
(In reply to Julien Wajsberg [:julienw] (PTO until 7th October, ask :steveck or :azasypkin for SMS stuff) from comment #82)
> (In reply to David Flanagan [:djf] from comment #79)
> > I can reproduce this issue in the following cases:
> > 
> > 1) Start email app. Pick a photo from Gallery that uses EXIF rotation. (test
> > image 55 e.g.)
> > 2) Start MMS app. Pick a photo from Gallery that uses EXIF rotation. (test
> > image 55 e.g.)
> > 3) Start MMS app. Pick a photo photo that does not have a small file size (a
> > photo or either of the test images that display test patterns, e.g.) and
> > crop it.
> 
> Actually, the trick is really to crop it small enough so that the resulting
> image is < 300K, the bug should happen with any image in that case.
> 

I've got a bunch or text images now that are just text on a solid background so they end up with a small filesize. And when cropped, the cropped region is also small, and I can't reproduce the bug on those. Ben thinks that the underlying bug is all about slicing (which is what the Email app does).  So I'm thinking that maybe what's going on here (and the reason that bug 1077106 is reported only on 2.2) is that I convinced you to switch to using shared/js/image_utils.js which slices the first 32kb of jpeg files. So with jpegs smaller than 32kb the bug is not triggered.  In any case, according to Ben this isn't a lifetime bug anymore but a slice() bug.  I'm not sure how deeply he's looked into it.

In any case, I think I'm going to try modifying apps/sms/test/marionette/apps/activitycaller to see if I can make it reproduce this bug and land that change along with this patch so we'll know if we break it again.
Andrew,

Thanks for the review and the add-on code. I'll take a look and see if I can do anything about the name issue.  Somewhat related: with this fix in place, your app can remove the nocrop:true property from the activity request and allow users to crop the images that they're attaching, if you feel so inclined.

Also: does email have any integration tests that could be modified to exercise this bug and its patch?
Flags: needinfo?(bugmail)
I'm striking out on my idea of modifying the sms app marionette tests since they all fail in the marionette harness when I run them locally. If I can't even run them, I certainly can't modify them. Also, I've now realized that the sms tests uses their own custom app to do the activities, and modifying the tests to actually use gallery would probably be harder and would introduce an unwanted dependency between the two apps.  So maybe it is the python marionette tests where this should go? In either case, possibly too much work right now for a patch that we just need to get landed.
The patch for this bug fixes bug 1077106, so I've closed that bug as a dupe of this one.
That bug is a SMS regression and a smoketest blocker, so I'm carrying those flags over here.
Given that this is now a smoketest issue, I think I should get the workaround landed today.  I'll file a followup bug to try to get an automated test that can ensure we don't regress this for a third time.
(In reply to David Flanagan [:djf] from comment #89)
> Given that this is now a smoketest issue, I think I should get the
> workaround landed today.  I'll file a followup bug to try to get an
> automated test that can ensure we don't regress this for a third time.

Yes, a separate bug is good enough for me.

(clearing my NI; I still don't know why this worked in v1.4, but it's ok, I can live with this).

Thanks a lot for your work David.
Flags: needinfo?(felash)
I just reverted the patch in bug 1059233 and duped the bug to this bug. David, please check with an uptodate master version of SMS!
Comment on attachment 8501423 [details] [review]
link to patch on github

Patch looks good and fixes the crash seen in email and SMS app due to memory backed blobs. 
I see a bug on latest master flame-kk while trying to view attached image in SMS app. Symptoms are similar to second part of bug 1059233, since bug 1059233 is  marked resolved, I have created a new bug 1080020 to track this issue.
Attachment #8501423 - Flags: review?(pdahiya) → review+
I've been pondering whether to take Andrew's additional commits in the patch. Leaning against the original name patch since in the case that we need a name, the blob we are passing is actually different (cropped or rotated) than the original file. So it would arguably be misleading to assert that it is the same.  I'm not sure what to do about the time one. But then I heard that :bent is actively working on a solution in gecko, so now I'm leaning toward just landing what we've got and calling it good enough.
Landed on master: https://github.com/mozilla-b2g/gaia/commit/b46e28a4d9d042452b2f1fe07a191317caccc6fa
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Changing product and component so that I can request uplift of the gaia workaround
Component: DOM → Gaia::Gallery
Product: Core → Firefox OS
Version: Trunk → unspecified
Comment on attachment 8501423 [details] [review]
link to patch on github

[Approval Request Comment]
[Bug caused by] (feature/regressing bug #): I'm not sure why this bug came back. We had a workaround very much like this in 1.3 and 1.4. Then the gecko bug was fixed and we removed it. But now we have problems again in 2.0 and later. This issue is different than and predates the changes in bug 994190, however.


[User impact] if declined: Images with EXIF orientation can't be attached to email or shared as wallpaper. Cropped images or images with EXIF orientation can't be attached to SMS messages.

[Testing completed]: on master

[Risk to taking this patch] (and alternatives if risky): not very risky because this workaround has been in v1.3 for quite a while now.

[String changes made]: none
Attachment #8501423 - Flags: approval-gaia-v2.1?(fabrice)
Attachment #8501423 - Flags: approval-gaia-v2.0?(fabrice)
Filed bug 1080104 for removal of this workaround when the underlying bug 1079546 gets fixed.
Filed bug 1080122 for adding integration tests to make sure this doesn't regress again.
VERIFIED FIX in v2.2 Master Flame KK by STR of bug child duplicate: bug 1077106
Results: User can attach cropped gallery image to messages app without app closing.

Worth noting that bug 1077667 was RESOLVED DUPLICATE to 1077106, however that issue REPROES following it's own STR and identical Environmental variables. 


Enviromental Variables:
----------------------------------------
Device: Flame 2.2 Master
BuildID: 20141009040206
Gaia: 7b92615bdc97e5c675cd385ec68bc5e47e0c5288
Gecko: f0bb13ef0ee4
Gonk: 52c909e821d107d414f851e267dedcd7aae2cebf
Version: 35.0a1 (2.2 Master)
Firmware: V180
User Agent: Mozilla/5.0 (Mobile; rv:35.0) Gecko/35.0 Firefox/35.0
Flags: needinfo?(pbylenga)
Keywords: verifyme
Status: RESOLVED → VERIFIED
Comment on attachment 8501423 [details] [review]
link to patch on github

QA please mkae sure to verify once this lands on 2.0, thanks !
Attachment #8501423 - Flags: approval-gaia-v2.1?(fabrice)
Attachment #8501423 - Flags: approval-gaia-v2.1+
Attachment #8501423 - Flags: approval-gaia-v2.0?(fabrice)
Attachment #8501423 - Flags: approval-gaia-v2.0+
(Updating the bug title to more accurately reflect what the fix on this bug actually does.)
Flags: needinfo?(bugmail)
Summary: Memory-backed Blobs aren't surviving long enough in multi-process environments again (again) → Work around memory-backed Blob lifetime issues by saving them to DeviceStorage-backed temporary files in .gallery/tmp and returning those instead.
Summary: Work around memory-backed Blob lifetime issues by saving them to DeviceStorage-backed temporary files in .gallery/tmp and returning those instead. → Work around memory-backed Blob lifetime issues when passed between processes by Web Activities by saving them to DeviceStorage-backed temporary files in .gallery/tmp and returning those instead.
QA Whiteboard: [QAnalyst-Triage+]
Flags: needinfo?(pbylenga)
Flags: needinfo?(mlien)
Bug 1087464 also bit blobs particularly, so you might want to see if workarounds can be removed now that that is fixed.
Verified fixed in the latest Flame 2.1 and 2.0 builds using the STR from bug 1077106.
MMS attachment is cropped, viewable, and sent properly.

Environmental Variables:
Device: Flame 2.1 (319MB)
BuildID: 20141105001204
Gaia: 154da5e17029a51002d5d9b7df39563d509edde6
Gecko: 3b0c3580a58d
Version: 34.0 (2.1)
Firmware: V188
User Agent: Mozilla/5.0 (Mobile; rv:34.0) Gecko/34.0 Firefox/34.0

Environmental Variables:
Device: Flame 2.0 (319MB)
BuildID: 20141105000201
Gaia: fe2167fa5314c7e71c143a590914cbf3771905a8
Gecko: 093de6b632c5
Version: 32.0 (2.0)
Firmware: V188
User Agent: Mozilla/5.0 (Mobile; rv:32.0) Gecko/32.0 Firefox/32.0
QA Whiteboard: [QAnalyst-Triage+] → [QAnalyst-Triage?]
Flags: needinfo?(pbylenga)
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(pbylenga)
Thanks Martin!
Remove "verifyme"
Keywords: verifyme
This issue has been verified successfully on Woodduck 2.0 with STR from bug 1077106.
Reproducing rate: 0/10
See attachment: Verify_Woodduck_MMS.mp4

build version:
Gaia-Rev        87b23fa81c3b59f2ba24b84f7d686f4160d4e7cb
Gecko-Rev       d049d4ef127844121c9cf14d2e8ca91fd9045fcb
Build-ID        20141125050313
Version         32.0
Attached video Verify_Woodduck_MMS.mp4
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: