Closed Bug 1049806 Opened 7 years ago Closed 7 years ago

b2g process leaks FD

Categories

(Firefox OS Graveyard :: Stability, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:2.0+, firefox32 wontfix, firefox33 wontfix, firefox34 fixed, b2g-v1.3T ?, b2g-v1.4 ?, b2g-v2.0 fixed, b2g-v2.1 fixed)

RESOLVED FIXED
2.1 S2 (15aug)
blocking-b2g 2.0+
Tracking Status
firefox32 --- wontfix
firefox33 --- wontfix
firefox34 --- fixed
b2g-v1.3T --- ?
b2g-v1.4 --- ?
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: tkundu, Assigned: jduell.mcbugs)

References

Details

(Keywords: memory-footprint, perf, Whiteboard: [caf priority: p1][MemShrink])

Attachments

(2 files)

There is not test steps  for this. But please note that we are running stability test on device for more than 24 hours. 

Observation:
1) b2g is using 430 FDs and it has opened 202 FD with /data/local/webapps/marketplace.firefox.com/application.zip . This points to FD leak in b2g process
2) about-memory-45 (collected at device timestamp [1]) shows that b2g is allocating 
     23.14 MB (100.0%) -- explicit
     ├──16.58 MB (71.63%) -- js-non-window
     │  ├──15.77 MB (68.13%) -- zones/zone(0xNNN)
     │  │  ├───7.25 MB (31.34%) ── unused-gc-things [3]
     │  │  ├───6.79 MB (29.35%) -- compartment([System Principal], inProcessTabChildGlobal?ownedBy=chrome://b2g/content/shell.html)



[1] b2g memory growth device timestamp 2014-08-06 11:04:14 . You can try to b2g memory usage using this timestamp in all attached log files.
[2] FULL LOG LOCATION: https://drive.google.com/file/d/0B1cSMS8_GuAERVJlVE1OM1dvc28/edit?usp=sharing . It has DMD logs, gc/cc logs, about-memory* reports, lsof logs etc.


I also attached a diff between about-memory-0 and about-memory-45 as attachment to this bug.
[Blocking Requested - why for this release]:
blocking-b2g: --- → 2.0?
Whiteboard: [CR 700743]
Bhavana -- this is the new bug causing framework reboot and affecting MTBF numbers. Please do the needful.
Flags: needinfo?(bbajaj)
Jason, it seems we're leaking remote-file FD. Any idea about that?
Flags: needinfo?(jduell.mcbugs)
Does this branch contain the fix from bug 918595, which landed on mozilla 32?
Flags: needinfo?(jduell.mcbugs)
I'm trying to see if I can duplicate the leak, but meanwhile, a couple questions:

Tapas: can you say anything about what the stability test is doing?  I'm wondering if it's a realistic use case or not--that might affect whether this needs to be a blocker.

Also, can you say which branch this bug is on?  "version" is marked unspecified in this bug.

So the leak is in the child process, not the parent.  That means we're somehow remotely opening the marketplace app's fd 200 times.  The original remoteOpenFile code is designed to only do the remote open once, and then all further app:// uris wind up hitting the JAR cache, so only one fd should be ever opened. That means that either 1) my original code was broken and we've never been hitting the jar cache; 2) the jar cache is somehow getting full and our entry is getting evicted (strange since most apps can only open a single file, i.e. their own application.zip file, so the jarcache population should have only one entry), or 3) some recent code change has broken something, most likely bug 988816.

The patch in bug 918595 only closes fds when the TabChild is getting destructed.  That is probably too late, i.e, won't fix this bug even if that patch is in the tree.

I think the most likely scenario here is that 1) we've always leaked an fd for each RemoteFileOpen(), but until bug 988816 landed we only did one of those per process.  So if this is a blocker I suggest we consider backing out 988816 (which would also require backing out bug 945152 and bug 957497, so I don't know how feasible it is), since those allow lots of RemoteFileOpen's to occur, and seeing if that fixes things.  

Meanwhile I'm going to look into whether we actually do leak an fd per RemoteFileOpen, and if so if we can easily fix it, which would be the best possible solution.
Flags: needinfo?(tkundu)
(In reply to Jason Duell (:jduell) from comment #5)
> I'm trying to see if I can duplicate the leak, but meanwhile, a couple
> questions:
> 
> Tapas: can you say anything about what the stability test is doing?  I'm
> wondering if it's a realistic use case or not--that might affect whether
> this needs to be a blocker.

Its a realistic usecase and it will happen users use their phone for a long time without rebooting. 
STR will be very difficult and you won't be able to reproduce on your device. I am afraid if you try to reproduce it on your device then you may waste your time without any result.

> 
> Also, can you say which branch this bug is on?  "version" is marked
> unspecified in this bug.
> 
This is reproduced on FFOS 2.0 .
here is the reference code for you:

https://www.codeaurora.org/cgit/quic/lf/b2g/mozilla/gaia/log/?h=mozilla/v2.0
https://www.codeaurora.org/cgit/quic/lf/b2g/mozilla/gecko/log/?h=mozilla/v2.0


> So the leak is in the child process, not the parent.  
>..................................
> Meanwhile I'm going to look into whether we actually do leak an fd per
> RemoteFileOpen, and if so if we can easily fix it, which would be the best
> possible solution.

Nice find. Could you please give us a logging patch which can confirm this theory ? 
We can test with your patch and get back to you asap.
Flags: needinfo?(tkundu)
Flags: needinfo?(jduell.mcbugs)
Flags: needinfo?(bbajaj)
Jason,

Patch in https://bugzilla.mozilla.org/show_bug.cgi?id=918595 is in the FirefoxOs gecko branch for 2.0.

Here are the gecko (https://hg.mozilla.org/releases/mozilla-b2g32_v2_0) and gaia (https://github.com/mozilla-b2g/gaia/tree/v2.0) branches on Mozilla side which ship 2.0 nightlies. The bug reported here is by Code aurora and the branches are in comment #6 that Tapas mentioned..
Whiteboard: [CR 700743] → [caf priority: p1][CR 700743]
blocking-b2g: 2.0? → 2.0+
Keywords: footprint, perf
Whiteboard: [caf priority: p1][CR 700743] → [caf priority: p1][CR 700743][MemShrink]
Assignee: nobody → jduell.mcbugs
As khuey has pointed out, bug 988816 isn't on Mozilla32, so my theory in comment 5 is bunk.
Flags: needinfo?(jduell.mcbugs)
Note: dougt tells me that marketplace is the only 3rd party app we ship (and this the only one that uses RemoteFileOpen, since all the apps in /system wind up opening their app.zip via regular nsIFile open() calls).
We might just be accidentally leaking fds in HandleFileDescriptorAndNotifyListener().  I think we need to make that assertion (which is a noop in opt builds) handle the case where mNSPRFileDesc is non-null.  We also should add a ton of debugging statements and have Tapas rerun the test.
Attached patch v1Splinter Review
Attachment #8469653 - Flags: review?(bent.mozilla)
Attachment #8469653 - Flags: review?(bent.mozilla) → review+
Tapas:  can you run this patch ASAP through the test harness?  We have a strong suspicion this will fix the bug but obviously haven't been able to repro.
Flags: needinfo?(tkundu)
(In reply to Doug Turner (:dougt) from comment #13)
> isn't it closed:

Unfortunately no, FileDescriptor() internally dup's.
I've verified that the leak was actually on the parent, which is good, since that's what the patch we've got affects.
(In reply to Jason Duell (:jduell) from comment #15)
> Tapas:  can you run this patch ASAP through the test harness?  We have a
> strong suspicion this will fix the bug but obviously haven't been able to
> repro.

Thanks a lot, we will run automation and confirm you asap. Do you think that it will also fix js-non-window leak (16MB) of Comment 0 ?
Flags: needinfo?(jduell.mcbugs)
> Do you think that it will also fix js-non-window leak (16MB)

No, 200 fds is horrible from an fd limit perspective, but the data structures involved are not going to be 16 MB worth.  Much much less. I'd be surprised if fd's are even 1KB each.

I've also looked over some other uses of FileDescriptor to make sure we didn't make the same mistake (not closing the fd we init with).  Both of these locations look fine:

  http://mxr.mozilla.org/mozilla-central/source/dom/ipc/TabParent.cpp#152
  http://mxr.mozilla.org/mozilla-central/source/dom/asmjscache/AsmJSCache.cpp#1277
Flags: needinfo?(jduell.mcbugs)
After talking to khuey I suspect we'll need to fork this bug to handle the js-non-window leak issue.  The memory is in zones and none of the RemoteOpenFile code allocates in zones.
Doug, we mentioned to have a revised patch with additional logging available here in the call, given CAF's build cut-off time (5:30pm PT), if that does not come in the next few , they are going to cherry pick the v1 patch in this bug and go ahead with testing
Flags: needinfo?(dougt)
bajaj, that is what is needed. yes.
Flags: needinfo?(dougt)
So, this is two bugs.  I am going to clone this bug for the "and hits >70MB in 24 hours" part of the problem.
Summary: b2g process leaks FD and hits >70MB in 24 hours → b2g process leaks FD
Blocks: 1051051
This fixes a bug, right?  We're just not sure if it's the only cause of what CAF is seeing?  Can we just land this then?
Flags: needinfo?(jduell.mcbugs)
Yes, this *should* fix the fd leak (but not the memory leak) with a high degree of certainty.  Of course once QC's test run is finished we should have actual data on whether the fd leak is fixed.  But I'm fine with landing this before we get results back if that expedites things.
Flags: needinfo?(jduell.mcbugs)
We are not seeing fd leak issue after picking the fix here, please go ahead and land it. Of course, we are still waiting for bug 1051051 fix.
Flags: needinfo?(tkundu)
Whiteboard: [caf priority: p1][CR 700743][MemShrink] → [caf priority: p1][MemShrink]
https://hg.mozilla.org/mozilla-central/rev/a5e9d4753ede
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → 2.1 S2 (15aug)
Flags: in-moztrap?(bzumwalt)
No STR is present to create test case to address bug.
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Flags: in-moztrap?(bzumwalt)
Flags: in-moztrap-
Can I land this patch on v1.3t and 1.4?
You need to log in before you can comment on or make changes to this bug.