Closed Bug 1048615 Opened 10 years ago Closed 10 years ago

[Flame] FX OS crash in nsJARChannel::OnStartRequest(nsIRequest*, nsISupports*) with Sketchbook Squad app

Categories

(Core :: Networking, defect)

32 Branch
ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla34
blocking-b2g 2.1+
Tracking Status
b2g-v2.0 --- unaffected
b2g-v2.1 --- verified

People

(Reporter: marcia, Assigned: swu)

References

()

Details

(Keywords: crash, regression, reproducible, Whiteboard: [b2g-crash])

Crash Data

Attachments

(4 files, 2 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-57efad21-76df-46a5-a619-d9b8a2140804.
=============================================================

Flame, running:

Gaia   5fd14b8bc428f87f9b5cf9cc49f9a4f362a970fb
SourceStamp e6614d8d85f9
BuildID 20140804040204
Version 34.0a1
Base image: 123

STR:
1. Download Sketchbook Squad app from the Marketplace
2. Launch the app
3. Immediate crash.

Same thing happens with another app from the same developer, Chrono Cash. The crash doesn't happen using the latest 2.0 build on Flame - the app launches, but I don't seem to be able to actually play the game.
[Blocking Requested - why for this release]:

Reproducible crash regression on target apps on marketplace.
blocking-b2g: --- → 2.1?
QA Contact: pcheng
mozilla-inbound regression window:

Last Working Environmental Variables:
Device: Flame
BuildID: 20140717193829
Gaia: 7fbcfbacf286c2a8e41a3a96b6d82f1541880617
Gecko: daa82439f77f
Version: 33.0a1 (2.1 Master)
Firmware V123
User Agent Mozilla/5.0 (Mobile; rv:33.0) Gecko/33.0 Firefox/33.0

First Broken Environmental Variables:
Device: Flame
BuildID: 20140717194830
Gaia: 7fbcfbacf286c2a8e41a3a96b6d82f1541880617
Gecko: a5446bb5fa7d
Version: 33.0a1 (2.1 Master)
Firmware V123
User Agent Mozilla/5.0 (Mobile; rv:33.0) Gecko/33.0 Firefox/33.0

Gaia is the same.

Gecko pushlog:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=daa82439f77f&tochange=a5446bb5fa7d

Caused by bug 988816 ?
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(jmitchell)
Caused by bug 988816 ?
Blocks: 988816
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(jmitchell) → needinfo?(swu)
It is reproducible on my Flame device with latest mozilla-central, looking at it.
Assignee: nobody → swu
The crash happens if there are multiple remote open file requests for same JAR file triggered simultaneously when mEnsureChildFd enabled.

In bug 988816, when a different AsyncOpen() request triggered during a progressing AsyncOpen() for same JAR file, we let it do a new IPC, and expect two of them get file descriptor separately.  However, they belong to same TabChild, and use same JAR path in TabChild::GetCachedFileDescriptor(), so the info->mCallback in mCachedFileDescriptorInfos can store only one of them.  That will cause state error and may result in a OnStartRequest() after OnStopRequest(), which is the crash scenario of this bug.

The patch lets the later simultaneous requests wait until first request for same file completes, and obtain file descriptor from JAR cache at that time. This change also reduces unnecessory IPC.
Attachment #8472891 - Flags: review?(jduell.mcbugs)
Flags: needinfo?(swu)
Fabrice, could you help to review on the test case?
Attachment #8473467 - Flags: review?(fabrice)
Attachment #8473467 - Flags: review?(fabrice) → review+
Comment on attachment 8472891 [details] [diff] [review]
Patch: Fix crash when simultaneously opening remote JAR file with mEnsureChildFd enabled.

Review of attachment 8472891 [details] [diff] [review]:
-----------------------------------------------------------------

This code looks fine to me, except... I can't figure out if it will correctly handle the case where multiple JARChannels open the same remoteopenfile:// uri, AND the 1st one does NOT specify EnsureChildFd() but later ones do.  It looks to me like in that case the 1st request will complete but not store the fd in the JarCache, and then the 2nd, 3rd, etc JarChannels will neither have the fd from IPDL, nor get it from the JarCache.  But perhaps I'm missing something?  In any case I'd feel better if we had a test for that use case (one can be added in an xpcshell test fairly easily I suspect).

::: modules/libjar/nsJARChannel.cpp
@@ +1061,5 @@
> +                if (!jarCache) {
> +                    rv = NS_ERROR_FAILURE;
> +                }
> +                PRFileDesc *jar_fd = nullptr;
> +                jarCache->GetFd(mJarFile, &jar_fd);

shouldn't you test for success here? What if the jarCache doesn't have the fd, as I suspect can happen?
^^^ see previous comment
Flags: needinfo?(swu)
Note: if I'm wrong and this code does handle the case I mention in comment 7, feel free to mark +r and land.  If it needs a fix flag me for review again.
(In reply to Jason Duell (:jduell) from comment #7)
> Comment on attachment 8472891 [details] [diff] [review]
> Patch: Fix crash when simultaneously opening remote JAR file with
> mEnsureChildFd enabled.
> 
> Review of attachment 8472891 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This code looks fine to me, except... I can't figure out if it will
> correctly handle the case where multiple JARChannels open the same
> remoteopenfile:// uri, AND the 1st one does NOT specify EnsureChildFd() but
> later ones do.  It looks to me like in that case the 1st request will
> complete but not store the fd in the JarCache, and then the 2nd, 3rd, etc
> JarChannels will neither have the fd from IPDL, nor get it from the
> JarCache.  But perhaps I'm missing something?  In any case I'd feel better
> if we had a test for that use case (one can be added in an xpcshell test
> fairly easily I suspect).

That's for catching this, that's a good point.  In such case, the latter JARChannels will fail to get fd.

Currently I have two ideas in mind for this case:
1) For JARChannel request with EnsureChildFd(), pass different path name as index to TabChild::GetCachedFileDescriptor().  For example, if the original path name is "/path/application.zip", we use "/path/application.zip_1" as index for EnsureChildFd() case.  This breaks the meaning of path a little, but should be reasonable if the purpose of path argument is just for indexing.  

2) Once JARChannel decided to try remote opening, set jarCache->SetMustCacheFd() with true before checking the RemoteOpenFileInProgress().  This should make the fd be stored in JarCache for the 1st request, but it may need more handling to avoid possible racing.

            if (mEnsureChildFd && jarCache) {
                jarCache->SetMustCacheFd(remoteFile, true);
            }    

            if (gJarHandler->RemoteOpenFileInProgress(remoteFile, this)) {
                // JarHandler will trigger OnRemoteFileOpen() after the first
                // request for this file completes and we'll get a JAR cache
                // hit.
                return NS_OK;
            }    

Do you have any suggestions?

> 
> ::: modules/libjar/nsJARChannel.cpp
> @@ +1061,5 @@
> > +                if (!jarCache) {
> > +                    rv = NS_ERROR_FAILURE;
> > +                }
> > +                PRFileDesc *jar_fd = nullptr;
> > +                jarCache->GetFd(mJarFile, &jar_fd);
> 
> shouldn't you test for success here? What if the jarCache doesn't have the
> fd, as I suspect can happen?

If GetFd() fails to get fd, rv returned from the next line will also fail.
  rv = SetRemoteNSPRFileDesc(jar_fd);

Then it resulted a NotifyError(rv) in same function.
Flags: needinfo?(swu) → needinfo?(jduell.mcbugs)
(In reply to Shian-Yow Wu [:swu] from comment #10)
> That's for catching this, that's a good point.  In such case, the latter

s/That's/Thanks/
Changes in this patch:
1. Added xpcshell test for comment 7 case.
2. Fixed comment 7 case by approach 2 of comment 10.
Attachment #8472891 - Attachment is obsolete: true
Attachment #8472891 - Flags: review?(jduell.mcbugs)
Attachment #8475187 - Flags: review?(jduell.mcbugs)
Comment on attachment 8475187 [details] [diff] [review]
Patch(v2): Fix crash when simultaneously opening remote JAR file with mEnsureChildFd enabled.

Review of attachment 8475187 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/libjar/test/unit/test_jarchannel.js
@@ +228,5 @@
> +        chan1.asyncOpen(new Listener(function(l) {
> +        }), null);
> +
> +        // Open channel 2 with ensureChildFd()
> +        var chan2 = ios.newChannel(jarBase + "/inner2.zip", null, null)

why ask for inner1, then inner2?  Aren't you best off asking for the exact same URI?  (IIUC it shouldn't make a difference because these are inner zips, and we'll wind up remoteOpenFile-ing the same test_bug637286.zip, right?)
Attachment #8475187 - Flags: review?(jduell.mcbugs) → review+
> 2) Once JARChannel decided to try remote opening, set jarCache->SetMustCacheFd() with true 
> before checking the RemoteOpenFileInProgress().  This should make the fd be stored in 
> JarCache for the 1st request, but it may need more handling to avoid possible racing.

I think that there's hopefully no race: if a 2nd JarChannel with the same URI comes along and there's already one pending, we'll call the JarCache>SetMustCacheFd synchronously, so it'll be set whenever the 1st request finally finishes and stores the entry in the Jar cache.   That was my understanding, at least, but I didn't look at it too rigorously.
Flags: needinfo?(jduell.mcbugs)
QA Whiteboard: [QAnalyst-Triage+] → [QAnalyst-Triage+][lead-review+]
Comment on attachment 8475187 [details] [diff] [review]
Patch(v2): Fix crash when simultaneously opening remote JAR file with mEnsureChildFd enabled.

Removing r+ pending discussion in bug 988816 comment 47 - 48.  We need to figure out if it's safe to share file offsets (which is what we'll do if we cache the fd like this).  It's definitely not safe in the general case--i.e. the semantics are very different than regular nsIFile.Open().  But perhaps we can get away with it for the use cases.

If that doesn't pan out we'll need to do a remote IPDL open for each fd requested.
Attachment #8475187 - Flags: review+
One other possible idea here:  since we know the only file we're remoteOpen-ing is application.zip, for performance we could implement some kind of cache of separately open()'d fd's.  I.e we could specify how many fd's we want to cache on the child, and whenever one is handed out, we could IPDL back to the parent to get a new one, without actually suffering the latency of waiting for the IPDL and the parent-side I/O.

But that's just an optimization idea--we might be fine with something that just does IPDL for each fd, but fixes the issue where TabChild only supports a single pending request at a time.
(In reply to Jason Duell (:jduell) from comment #15)
> Comment on attachment 8475187 [details] [diff] [review]
> Patch(v2): Fix crash when simultaneously opening remote JAR file with
> mEnsureChildFd enabled.
> 
> Removing r+ pending discussion in bug 988816 comment 47 - 48.  We need to
> figure out if it's safe to share file offsets (which is what we'll do if we
> cache the fd like this).  It's definitely not safe in the general case--i.e.
> the semantics are very different than regular nsIFile.Open().  But perhaps
> we can get away with it for the use cases.

So far, the fd from JAR cache is only for the purpose of mmap, which is safe.  If there is no requirement to use the cached fd on other purpose than mmap in the future, it seems worth to use cached fd on limited use case to win performance over multiple IPDL open() requests.

(In reply to Jason Duell (:jduell) from comment #16)
> One other possible idea here:  since we know the only file we're
> remoteOpen-ing is application.zip, for performance we could implement some
> kind of cache of separately open()'d fd's.  I.e we could specify how many
> fd's we want to cache on the child, and whenever one is handed out, we could
> IPDL back to the parent to get a new one, without actually suffering the
> latency of waiting for the IPDL and the parent-side I/O.

We don't know the potential number of burst remote open requests in advance, if cache buffer underrun happens, the remaining requests will be pended.  Therefore, if we decided to support separate fd openings, enhancing TabChild to support multiple pending requests should be the first thing to do.

> 
> But that's just an optimization idea--we might be fine with something that
> just does IPDL for each fd, but fixes the issue where TabChild only supports
> a single pending request at a time.
(In reply to Jason Duell (:jduell) from comment #16)
> One other possible idea here:  since we know the only file we're
> remoteOpen-ing is application.zip

Bug 1034143 is basically the same problem, but for temporary files used to implement jar:http(s):// URLs.  We seem to consider our IPC framework to be too inefficient to use it in the critical path here; would that also apply to those files?
Comment on attachment 8475187 [details] [diff] [review]
Patch(v2): Fix crash when simultaneously opening remote JAR file with mEnsureChildFd enabled.

Review of attachment 8475187 [details] [diff] [review]:
-----------------------------------------------------------------

Yes, I need to think about how we'll handle bug 103143 for that reason.

For now, since this code apparently works despite the dup(), I'm going to +r and we'll document the limitation in bug 1055966
Attachment #8475187 - Flags: review+
Carrying r+ from jduell and:
1) Addressed comment 13
2) Fixed try server failure on Windows platform
Attachment #8475187 - Attachment is obsolete: true
Attachment #8476590 - Flags: review+
Setting this as fixed on 2.1 given it landed before the merge.
blocking-b2g: 2.1? → 2.1+
Attached video Verify_video.3gp
This bug has been verified to fail on Flame 2.1
STR:
1. Download Sketchbook Squad app from the Marketplace
2. Launch the app
**The app launches, but I don't seem to be able to actually play the game.

See attachment: Verify_video.3gp and logcat_flame_0137.txt
Reproducing rate: 5/5
Flame2.1 build:
Gaia-Rev        38e17b0219cbc50a4ad6f51101898f89e513a552
Gecko-Rev       https://hg.mozilla.org/releases/mozilla-b2g34_v2_1/rev/8b92c4b8f59a
Build-ID        20141205001201
Version         34.0
Flags: needinfo?(mlien)
Attached file logcat_flame_0137.txt
Refer to comment 0 and comment 24, crash problem doesn't happen anymore
Mark as VERIFIED FIXED
Status: RESOLVED → VERIFIED
Flags: needinfo?(mlien)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: