Closed Bug 919864 Opened 6 years ago Closed 6 years ago

Leaking screenshot blobs in system app via blob URIs

Categories

(Firefox OS Graveyard :: Gaia::System, defect, P1, critical)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:koi+, b2g-v1.2 fixed)

RESOLVED FIXED
1.2 C3(Oct25)
blocking-b2g koi+
Tracking Status
b2g-v1.2 --- fixed

People

(Reporter: m1, Assigned: alive)

References

Details

(Keywords: perf, Whiteboard: [MemShrink:P1] [c=memory p= s= u=1.2])

Attachments

(3 files, 1 obsolete file)

For the past 3-4 days we've consistently observed abnormal USS growth in the b2g process. 

STR:
1) Run orangutan for 8 hours while logging |adb shell busybox watch b2g-procrank|.


Initial memory usage is usually something like:
  APPLICATION        PID      Vss      Rss      Pss      Uss  cmdline
  b2g                125  104368K   67620K   58470K   50860K  /system/b2g/b2g

and then ~8 hours later:

  APPLICATION        PID      Vss      Rss      Pss      Uss  cmdline
  b2g                125  189144K  152396K  143676K  142460K  /system/b2g/b2g

Above was run on a 512MB device, but on a 256MB device the UI would likely be useable at this point.  The expectation is that b2g memory usage remain well under 100MB at all times.
(I've filed this bug /early/ to get more eyes on it.  I'm still gathering data and will share as it becomes available.  It would be very good for Moz folks to try to reproduce as well, the STR are pretty simple.  This bug has been reproduced here on a build without any commercial XPCOM extensions as well FYI.)
blocking-b2g: --- → koi+
Whiteboard: [MemShrink]
Not sure if it's related, but I'm trying to diagnose an OoM problem in bug 851626. In that bug, I can't reproduce the problem with a b2g18 build, so it's something we've introduced since then.
Partial about:memory from an occurrence of this bug overnight, looks like we have a big pile of dark matter :(

105.31 MB (100.0%) -- explicit
├───60.50 MB (57.45%) ── heap-unclassified
├────7.39 MB (07.02%) -- window-objects
├────6.10 MB (05.79%) -- images
├────4.74 MB (04.51%) -- heap-overhead
├────3.24 MB (03.07%) ++ (15 tiny)
├────2.78 MB (02.64%) -- dom
├────2.45 MB (02.32%) -- workers/workers()
└────1.35 MB (01.28%) ── freetype
USS=104144K.  Top DMD entry is 52.25% of the heap, with next entry is is 1.91% and then the rest is noise (not shown).
Mike, is there someone on your team who could take a look at this?
Flags: needinfo?(mlee)
Jed, need you to work with Kyle to troubleshoot and if possible, fix this.
Assignee: nobody → jld
Status: NEW → ASSIGNED
Flags: needinfo?(mlee) → needinfo?(jld)
Keywords: perf
Priority: -- → P1
Whiteboard: [MemShrink] → [MemShrink] [c=memory p= s= u=1.2]
Looks like we're leaking blobs here or something.
So the obvious step forward here is to get GC/CC logs and see if we can find the blobs and what is holding them alive.

If we can't do that then we can dive in with gdb if we can reproduce.

mvines can you get the GC/CC logs for jld?
Flags: needinfo?(mvines)
In this run we saw USS of b2g main process start at 48508K, finish at 148640K. 

gecko: 72011db807e6a5e46462676974cd45e208e1a3e3
gaia: 85c4af3f48a91878d565f518ba0eed68f0628e21
(recent v1.2 bits)
Flags: needinfo?(mvines)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #8)
> So the obvious step forward here is to get GC/CC logs and see if we can find
> the blobs and what is holding them alive.

Maybe I'm missing something, but Blobs don't look either cycle collected nor related to the GC heap, so I don't think the CC/GC logs will be very useful. At least, there doesn't seem to be much related to blobs in the CC log.
If they're being used by JS they'll have JS reflections and we might be able to trace them back that way.  But yes, they probably won't appear in the CC log.
So I actually don't see anything that looks like a blob in the GC log :-(

However I do see 218 unique blob URI strings in it.  I wonder if we're leaking blob URIs.
Does orangutan use the cards view at all?  I see some stuff there that looks a little sketchy.
Flags: needinfo?(mvines)
I think there are 7 Blobs in the GC log, but I can't tell if they are the leaky ones or not.  jld noticed that from the DMD log the average size of the blob string things is 128k.
A little less than 128k — 42857254 / 337 = 127172 = 124 * 1024 + 196.  (Or, for the number with shoes on: 43495424 / 337 = 129066 = 126 * 1024 + 42.)

So probably most of them are 128k and some are smaller.  Or half of them are 256k.  Or... something.

I don't suppose there's any chance b2g can pretend to be a reasonable platform and give me a core dump and debug info?  Or will I have to reproduce this locally and live-debug it?  Because at this point the next thing I'd do is walk the jemalloc metadata.
Flags: needinfo?(jld)
If we're leaking in the parent process, this should all be reproducible in the desktop builds where you will get dumps and debug info.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #13)
> Does orangutan use the cards view at all?  I see some stuff there that looks
> a little sketchy.

Yeah, every N events it sends a long home key to bring up the cards view.
Flags: needinfo?(mvines)
So, going into IPCMessageUtils.h and making ParamTraits<nsACString>::Read log whenever the string is >4KiB gets me messages like this if I enter the window manager from an app:

I/Gecko   ( 3720): Read large nsACString: 23642
I/Gecko   ( 3720): Read large nsACString: 32378
I/Gecko   ( 3720): Read large nsACString: 32378

If I go to the homescreen instead, it's like this:

I/Gecko   ( 3720): Read large nsACString: 32518
I/Gecko   ( 3720): Read large nsACString: 32518

I suspect that my hamachi's screen is smaller than the one used to discover this bug.
(In reply to Jed Davis [:jld] from comment #18)
> I suspect that my hamachi's screen is smaller than the one used to discover
> this bug.

Perhaps physically, but the screen resolution is identical with the QRD device that we're running here.
I tried using DMD.  It crashed on startup, and gdb doesn't want to do anything symbolic with the result.  However:

Program received signal SIGSEGV, Segmentation fault.
0x400129b0 in ?? ()
(gdb) x/7i $pc
=> 0x400129b0:  ldr     r3, [r0, #0]
   0x400129b2:  lsls    r2, r3, #1
   0x400129b4:  ite     mi
   0x400129b6:  orrmi.w r3, r3, #2147483648     ; 0x80000000
   0x400129ba:  bicpl.w r3, r3, #2147483648     ; 0x80000000
   0x400129be:  adds    r0, r0, r3
   0x400129c0:  bx      lr

This is an implementation of prel31 resolution from the ARM exception handling spec.  The address in question:

(gdb) p/x $r0
$1 = 0x42c6c43c

But also, just because gdb is confused doesn't mean I am.

4000d000-40015000 r-xp 00000000 1f:01 932        /system/b2g/libdmd.so

$ addr2line -ie objdir-gecko/dist/bin/libdmd.so 0x59b0
/tmp/android-build-bb7e003d31d08f72cabc269a652912b7/src/build/../gcc/gcc-4.4.3/libgcc/../gcc/config/arm/unwind-arm.c:535

As for the would-be exception index entry:

42c65000-42c6b000 r-xp 00000000 1f:01 724        /system/lib/hw/sensors.default.so
42c6b000-42c6c000 rwxp 00006000 1f:01 724        /system/lib/hw/sensors.default.so
42c85000-42e47000 rwxs 00e67000 00:0b 2587       /dev/graphics/fb0

So maybe it's sensors.default.so.  Let's try `readelf -u`:

[...]
0x4790: 0x1 [cantunwind]

0x2368: 0x80b0b0b0
[...]

It violates the ABI by having exception index entries out of order; this is bug in ICS's version of gold, but I don't know yet if it's why the unwinder is going out of bounds.


Of course, none of this gets me any closer to finding the original memory leak.
I got some help on blobs and createObjectURL from :khuey on IRC, and we found a leak in window_manager.js.  (Also, a simpler way to check for this kind of leak: break into gdb and print gDataTable->mTable.entryCount.)

I've attached a patch against v1.2 gaia which should avoid this leak, and (in case we still need more info) logs a message when it does so.  There's also a less serious issue in that we hold on to a screenshot even after an app is closed; I haven't tried to address that.

This patch conflicts with bug 920890, and won't apply to master, and also won't apply to v1.2 after that bug is uplifted (it's currently waiting on merge resolution).

All that said: Does this fix the original leak?
Flags: needinfo?(mvines)
Depends on: 920890
No longer depends on: 924696
Bug 920890 is uplifted now, so I need to see what's going on post-that....
Flags: needinfo?(jld)
Screenshot leaks I've found on master:

1. With an app open, repeat: long-press the home button, then touch the card to return to that app.
2. Open an app, go to card view, kill app by swiping it upwards; repeat with the same app.
3. Open more apps than will fit in memory, then go to card view; repeat with the same sequence of apps.

I've been testing leaks by attatching with gdb and looking at gDataTable.mTable.entryCount; if I can do a sequence of actions repeatedly and the count keeps going up, I figure it's probably a leak.

#1 is fixable by revoking _screenshotURL before reassigning it.  #2 is fixable by revoking the screenshot during removeFrame in window_manager.js.  #3 I don't understand what's going on.  So I should probably hand off the bug, I think.
Assignee: jld → alive
Flags: needinfo?(mvines)
Flags: needinfo?(jld)
Attachment #815173 - Attachment is obsolete: true
I have some ideas on how we can make this better.  I'll touch base with alive next week.
Just noticed I am the assignee :)
I do some improvement about screenshot in https://bugzilla.mozilla.org/show_bug.cgi?id=925609
not sure it satisfies all cases in comment 24.
For the screenshot doesn't revoked when app is terminated:
Because currently the kill function is living in window manager instead of app window, so the revokeObjectURL in appWindow is never called. We could improve this by moving mozbrowsererror event handler into appWindow.
Does this make sense?: Keep the blob instead of created ObjectURL and create a new ObjectURL each time we want to display the screenshot, revoking it immediately we show the screenshot.
(In reply to Alive Kuo [:alive] from comment #29)
> Does this make sense?: Keep the blob instead of created ObjectURL and create
> a new ObjectURL each time we want to display the screenshot, revoking it
> immediately we show the screenshot.

Yes.  This is sort of what I was thinking in comment 25.  There is even a provision in the spec for blob URIs that revoke themselves after being used once, but I don't know that we implemented that yet.
Also we still have to make sure our references to the blob are cleaned up, but we can rely on the GC to do that if we get rid of the objects the blobs are stored on.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #30)
> Yes.  This is sort of what I was thinking in comment 25.  There is even a
> provision in the spec for blob URIs that revoke themselves after being used
> once, but I don't know that we implemented that yet.

If it's automatically revoked already then currently we should have some flicking when we from an app go to cardview :/
Anyway I am going to have a patch for
1. Be sure to revoke objectURL when app is killed
2. Store blob instead objectURL in appWindow and create/revoke objectURL via stored blob each time we need to display image.

My question is how do I know I have already fixed the leak problem? No idea how to confirm.
(In reply to Alive Kuo [:alive] from comment #32)
> My question is how do I know I have already fixed the leak problem? No idea
> how to confirm.

We can get mvines to have his people do some QA on the patch.  jld also managed to reproduce several different issues in comments 24, perhaps he can comment on that.
Flags: needinfo?(jld)
I tried to see if I could automate the cases I found, so other people could use them more easily.

Orangutan needs a certain amount of manual makefile editing to get it built with a b2g tree and pushed to a b2g device, and then I couldn't get it to actually generate any kind of effect.

I also tried Marionette, but it looks like I'd need to spend some time reading through the system app to figure out how to use it automate these actions.

It might also help if we exposed the number / total size of blobs — in about:memory, to sufficiently privileged script, that kind of thing.  That might be bug 924696.
Flags: needinfo?(jld)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #33)
> We can get mvines to have his people do some QA on the patch.

Yes, we can certainly try out a v1.2 patch in our setup here pretty easily.  Which patch did you want our monkeys to kick and scream at?
Target Milestone: --- → 1.2 C3(Oct25)
(In reply to Michael Vines [:m1] [:evilmachines] from comment #35)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #33)
> > We can get mvines to have his people do some QA on the patch.
> 
> Yes, we can certainly try out a v1.2 patch in our setup here pretty easily. 
> Which patch did you want our monkeys to kick and scream at?

Working in Progress, stay tuned :)
Depends on: 925609
Better blob and objectURL for screenshot management.

Michael, please note this is based on https://bugzilla.mozilla.org/show_bug.cgi?id=925609
Attachment #816926 - Flags: review?(timdream)
Attachment #816926 - Flags: feedback?(mvines)
(In reply to Alive Kuo [:alive] from comment #37)
> Created attachment 816926 [details]
> https://github.com/mozilla-b2g/gaia/pull/12842

This patch doesn't apply on gaia/v1.2, could you please rebase it.
Whiteboard: [MemShrink] [c=memory p= s= u=1.2] → [MemShrink:P1] [c=memory p= s= u=1.2]
(In reply to Michael Vines [:m1] [:evilmachines] from comment #38)
> (In reply to Alive Kuo [:alive] from comment #37)
> > Created attachment 816926 [details]
> > https://github.com/mozilla-b2g/gaia/pull/12842
> 
> This patch doesn't apply on gaia/v1.2, could you please rebase it.

Don't need rebase, it's because 925609 isn't uplifted.
https://github.com/mozilla-b2g/gaia/pull/12872 here for v1.2 including 925609 commit.
Component: General → Gaia::System
Summary: b2g process memory leak on v1.2 → Leaking screenshot blobs in system app via blob URIs
Thanks.  Running monkeys with these two patches now, will report back with data in a day or two.
Flags: needinfo?(mvines)
Attachment #816926 - Flags: review?(timdream) → review+
USS of the b25 process looked pretty good overnight, maxing out at 73mb-ish.  So far so good, I think we should land the fix.
Flags: needinfo?(mvines)
Attachment #816926 - Flags: feedback?(mvines) → feedback+
Travis red is unrelated. Merged
https://github.com/mozilla-b2g/gaia/commit/56ce4212d45941536ffd814b1c92167af91e0efe

Thank everyone in the bug: Kyle, Jed, Michael.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Uplifted 56ce4212d45941536ffd814b1c92167af91e0efe to:
v1.2: a884560d00563def6f2fda20ab1b6a53c855ed4a
Depends on: 933561
No longer depends on: 933561
You need to log in before you can comment on or make changes to this bug.