don't lock images on the active page on B2G

VERIFIED FIXED in Firefox 23

Status

()

Core
ImageLib
VERIFIED FIXED
4 years ago
3 years ago

People

(Reporter: gal, Assigned: Joe Drew (not getting mail))

Tracking

unspecified
mozilla23
x86
Mac OS X
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:tef+, firefox21 wontfix, firefox22 wontfix, firefox23 fixed, b2g18 fixed, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 fixed)

Details

(Whiteboard: [MemShrink])

Attachments

(2 attachments, 13 obsolete attachments)

12.60 KB, patch
Details | Diff | Splinter Review
12.44 KB, patch
Details | Diff | Splinter Review
(Reporter)

Description

4 years ago
We OOM on image-rich pages with B2G. We currently lock images on the current tab. Instead, we will disable image locking and set the timeout and max decode memory to very high values, and we will rely on the OOM handler to clear the decoded image cache.
(Reporter)

Comment 1

4 years ago
Created attachment 738643 [details] [diff] [review]
patch
Assignee: nobody → gal
(Reporter)

Comment 2

4 years ago
we have to check that bug 807143 doesn't reappear with this fix
(Assignee)

Comment 3

4 years ago
I understand the locking change, but why the timeout & max decoded size change?
(Reporter)

Comment 4

4 years ago
Created attachment 738647 [details] [diff] [review]
patch
Attachment #738643 - Attachment is obsolete: true
(Reporter)

Comment 5

4 years ago
I see some issues with the patch. Decoding gets stuck half way through the image on pintrest.
(Reporter)

Comment 6

4 years ago
In general it works pretty well though. Decoding is fast enough for normal reading/scrolling.
(Reporter)

Comment 7

4 years ago
however, on GP I don't think the OOM handler kicks in
(Assignee)

Comment 8

4 years ago
(In reply to Andreas Gal :gal from comment #5)
> I see some issues with the patch. Decoding gets stuck half way through the
> image on pintrest.

That sounds like bug 807143.
(Assignee)

Comment 9

4 years ago
(Bug 807143 was not just about flickering, but about images not redrawing/not redrawing fully, for reasons that was never explored.)
(Assignee)

Updated

4 years ago
Component: Graphics → ImageLib
(Assignee)

Comment 10

4 years ago
(In reply to Andreas Gal :gal from comment #7)
> however, on GP I don't think the OOM handler kicks in

With lock disabled, I'd be a little surprised if we stored enough decoded image memory to reach OOM, fwiw.
(Reporter)

Comment 11

4 years ago
probably because the GP has more memory. this needs testing on a production device, but otherwise it looks promising
(Reporter)

Comment 12

4 years ago
Sounds like we need someone to debug the "never explored" part of comment 9. Joe, are you available to do that?

As for comment 10, look at the patch. We never discard anything except via the OOM handler.
(Assignee)

Comment 13

4 years ago
(In reply to Andreas Gal :gal from comment #12)
> Sounds like we need someone to debug the "never explored" part of comment 9.
> Joe, are you available to do that?

Not immediately, but in a couple of days I think I will be.

> As for comment 10, look at the patch. We never discard anything except via
> the OOM handler.

That's why I asked in comment 3 why you were doing this :), but I think you're right that this will make us not discard except via OOM or exceeding 512 MB. There are occasions when we discard immediately, but that's mostly in corner cases.
Please note that it's not an out-of-memory hook, it's a low-memory notification.  These are throttled to run no more than once every 5s (gonk.systemMemoryPressureRecoveryPollMS"), and even then only when we transition between a low-memory and a not-low-memory state.

So suppose we have the following situation:

 * Use your phone.  Lots of processes are loaded and running in the background.
 * Load the browser.  This causes us to start up a new process, which causes the amount of free memory on the system to go under 10mb.  We fire a low-memory notification.
 * Load Pintrest in the browser.

At this point, we will not throw away any images until we free up enough memory so the system has more than 10mb free (because we must leave the low-memory state before we fire the low-memory notification again.)  That's unlikely to happen; I expect we'll crash instead.

We could fire the low-memory notification more often, but it's not clear how to do that correctly.  We do not want every process on the system to GC every 5s while we're under memory pressure.
(Assignee)

Comment 15

4 years ago
Andreas, given comment 14, if we didn't have the redrawing problem, could we ship with the image flashing that happens without the long timeout and huge decoded data limit?
Flags: needinfo?(gal)
Whiteboard: [MemShrink]
Other things you almost certainly want to do here:

 * Avoid the 5ms sync decode after the first time we draw an image.  In fact, we should probably avoid the sync decode the first time we draw an image, too.  This will break reftests, but leaving in the sync decode will probably lead to serious jank when scrolling.

 * Discard all images in a document when it is destroyed.  Images can be shared between documents, so this isn't totally correct, but it's probably close enough.

If relying on the low-memory notifications as-is is good enough, then it's good enough.  I just think we may want to tweak them somehow, particularly if they're a primary means of notifying us that we should drop memory.  It's a very delicate balance; you can imagine us dropping all images every 5s while we're under memory pressure (but not GC'ing etc), but that would be a bad experience if we're low on memory for some reason other than images.
(Reporter)

Comment 17

4 years ago
I have never seen the state of comment 15 but I can go and reproduce that. In the meantime, we need to diagnose the partial decoding either way.
Flags: needinfo?(gal)
There are plenty of other ways to get into the state from comment 14.  The point is that after we get into a low-memory state, we will not discard any images until we leave the low-memory state.  That may be undesirable.
Let's keep this bug up to date tomorrow as well, Joe can get further involved, but note the six hour time difference.
(Assignee)

Comment 20

4 years ago
(In reply to Justin Lebar [:jlebar] from comment #16)
> leaving in the sync decode will
> probably lead to serious jank when scrolling.

Really? I thought scrolling was done in a different thread from the main Gecko thread.

>  * Discard all images in a document when it is destroyed.  Images can be
> shared between documents, so this isn't totally correct, but it's probably
> close enough.

Note that nsDocument::DeleteShell() already does this. I don't know whether that's good enough for the b2g case, though.
(Assignee)

Comment 21

4 years ago
(In reply to Andreas Gal :gal from comment #17)
> I have never seen the state of comment 15 but I can go and reproduce that.
> In the meantime, we need to diagnose the partial decoding either way.

I'm pretty sure it's partial *redrawing,* but regardless I will try to diagnose it tomorrow.
(Reporter)

Comment 22

4 years ago
I still feel that the low memory notification approach is the right way to go. jlebar, can we somehow tell that we are "still in low memory" state? In that case I would prefer to fire of a timer that keeps throwing away images until we no longer are in low memory state.
(Reporter)

Comment 23

4 years ago
Partial redrawing is fine. We should schedule another paint after that when we have the full image right?
> jlebar, can we somehow tell that we are "still in low memory" state?

Absolutely; hopefully it's clear in GonkMemoryPressureMonitoring.cpp how to do this, but let me know if it's not.

> Really? I thought scrolling was done in a different thread from the main Gecko thread.

If I'm wrong, that's awesome.  I didn't think we had async scrolling everywhere in b2g.  Maybe in the browser but not in apps, or vice versa?

> Note that nsDocument::DeleteShell() already does this.

That should probably be sufficient.
(Assignee)

Comment 25

4 years ago
You'd think so! Obviously something's going wrong, though; I wouldn't want to speculate as to what in particular.
(Reporter)

Comment 26

4 years ago
Any updates here Joe? Did you debug the cause for the partial paints?
(Reporter)

Comment 27

4 years ago
Created attachment 738966 [details] [diff] [review]
Keep sending low-memory-ongoing notifications every 5s which flush caches but don't GC.
Attachment #738647 - Attachment is obsolete: true
(Reporter)

Comment 28

4 years ago
Comment on attachment 738966 [details] [diff] [review]
Keep sending low-memory-ongoing notifications every 5s which flush caches but don't GC.

This is completely untested. Justin, can you please ask Lucas and Faramarz to find you someone who can take this and finish this today? Also, please ask Jeff for help with the rendering glitch (partial image rendering) I am seeing on pintrest. Looks like no updates from Joe, so we should do our own investigation.
Attachment #738966 - Flags: review?(justin.lebar+bug)
(In reply to Andreas Gal :gal from comment #28)
> Comment on attachment 738966 [details] [diff] [review]
> Keep sending low-memory-ongoing notifications every 5s which flush caches
> but don't GC.
> 
> This is completely untested. Justin, can you please ask Lucas and Faramarz
> to find you someone who can take this and finish this today? Also, please
> ask Jeff for help with the rendering glitch (partial image rendering) I am
> seeing on pintrest. Looks like no updates from Joe, so we should do our own
> investigation.

If we can cover it in Madrid great, but note that Joe's updates not being there between 7pm and 7am local time is not an indication that he's not going to look at it.  I expect more information in a few hours, so if anything happens in the meantime, let's make sure this bug is up to date.
I'm going to play around with this on an unagi.
Flags: needinfo?(josh)
Comment on attachment 738966 [details] [diff] [review]
Keep sending low-memory-ongoing notifications every 5s which flush caches but don't GC.

Changes to GonkMemoryPressureMonitoring look good.  But certainly the pref changes should be reviewed separately (and need tests).

I'm not familiar with this code, but CompositorChild sends a message to the parent on memory pressure (CompositorChild.cpp::MemoryPressureObserver), and then the parent does stuff on behalf of the child.  We may not want to do this, since it encourages priority inversion (low-priority bg processes making the main process do work on their behalf) when we're under long-lasting memory pressure.

It's also not clear to me how expensive nsFontCache::Compact() is.  Similarly, memory pressure causes us to run gfxFont.h::FlushShapedWordCaches, which looks like something we don't want to do every 5s.

I'm most worried about CompositorChild, followed by FlushShapedWordCaches.

Better safe than sorry?

P.S. If you use git-bz [1], it will automatically attach -U8 patches.  :)

[1] https://github.com/jlebar/moz-git-tools
Attachment #738966 - Flags: review?(justin.lebar+bug) → review+
> But certainly the pref changes should be reviewed separately (and need tests).

s/tests/testing/
Are we considering this for tef?
(Assignee)

Comment 34

4 years ago
(In reply to Andreas Gal :gal from comment #26)
> Any updates here Joe? Did you debug the cause for the partial paints?

No, because of the 6 hour time difference. I'm starting on it now.
(Reporter)

Comment 35

4 years ago
1.0.1 is crashing on pintrest and many other popular sites, so yes, if we can swing a decent fix, I will go and beg for an uplift.
(In reply to Andreas Gal :gal from comment #35)
> 1.0.1 is crashing on pintrest and many other popular sites, so yes, if we
> can swing a decent fix, I will go and beg for an uplift.

JST and I just had a mini-triage, tef+ this.
blocking-b2g: --- → tef+
(Assignee)

Comment 37

4 years ago
(In reply to Andreas Gal :gal from comment #5)
> I see some issues with the patch. Decoding gets stuck half way through the
> image on pintrest.

So I'm clear: This happens when loading m.pinterest.com in the browser, right?
(In reply to Joe Drew (:JOEDREW! \o/) from comment #37)
> (In reply to Andreas Gal :gal from comment #5)
> > I see some issues with the patch. Decoding gets stuck half way through the
> > image on pintrest.
> 
> So I'm clear: This happens when loading m.pinterest.com in the browser,
> right?

Yes.
Blocks: 860818
(Reporter)

Comment 39

4 years ago
Joe, any updates here?
(Assignee)

Comment 40

4 years ago
I am currently debugging why bug 807143 happens with image locking disabled, with the hope that fixing that bug (very easily reproduced) will fix pinterest partial display problems (difficult to reproduce, at least for me on an Unagi phone).
(Reporter)

Comment 41

4 years ago
Cool, thanks. I will focus on this patch then. On the TCL device loading pintrest kills the browser and takes down the system. Looks like the OOM killer doesn't work there.
(Reporter)

Comment 42

4 years ago
I/Gecko   (  144): 
I/Gecko   (  144): ###!!! [Parent][AsyncChannel] Error: Channel error: cannot send/recv
I/Gecko   (  144): 
E/GeckoConsole(  144): Content JS WARN at http://connect.facebook.net/en_US/all.js:52 in i: The "fb-root" div has not been created, auto-creating
I/Gecko   (  144): 
I/Gecko   (  144): ###!!! [Parent][AsyncChannel] Error: Channel error: cannot send/recv
I/Gecko   (  144): 
I/Gecko   (  144): 
I/Gecko   (  144): ###!!! [Parent][AsyncChannel] Error: Channel error: cannot send/recv
I/Gecko   (  144): 

I get this while pintrest is sucking up all the memory and the device dies. Wonderful. This is a device with a kernel that might not be up to date.

Updated

4 years ago
Blocks: 852643
(Reporter)

Comment 43

4 years ago
Joe, here is an easy test case with this patch. Applied it and then hold down the finger on an icon on the homescreen. We go into icon edit mode and all the icons get a little "x" on the top right and a second later all icons disappear (only the red x remains for each icon).
(Reporter)

Comment 44

4 years ago
Yeah, memory pressure triggering definitely doesn't work on my device but with this patch things on pintrest are much much better (still running OOM eventually though, because memory pressure purge never comes). Patch also slightly changes some perceived painting so definitely not zero risk (we don't decode on load any more).
(Assignee)

Comment 45

4 years ago
FWIW: Even after this works, I have some reservations about what the UX is going to be like for blessed apps if their icons, etc can be discarded from under them. There's going to be a bunch of flashing.

I have no idea if it's possible to have different prefs for those apps, but if so, it'd be a good idea to enable locking for them.
(Assignee)

Comment 46

4 years ago
We are definitely re-decoding these images after discard, but for some reason we're never redrawing them. Still trying to track down how the invalidation is getting lost.
(Reporter)

Comment 47

4 years ago
Yeah, lets see what it looks like. I am using a device of the other vendor now. That device properly kills the browser and the system survives. I will try my patch now.
(Reporter)

Comment 48

4 years ago
Note: flashing would only occur when we are in memory pressure, which basically means "omg omg we are about to die", so I think its ok to trade off a little flashing vs not dying. Outside memory pressure we should never flash since we never discard images.
(Assignee)

Comment 49

4 years ago
So: we redecode an image and mark it dirty, but because we're in memory pressure or hit our limit or whatever, by the time we get back around to actually reading the image, it's been discarded again. 

This repeats in a continuous loop forever.
(Reporter)

Comment 50

4 years ago
Cool. That problem should not happen with this patch. I am trying to get my device build to work to test with the kernel that actually properly sends memory pressure events. Thanks for the diagnosis Joe.

Here is what I expect to happen:
- we go into memory pressure
- we throw away all decoded images
- we start painting, re-decoded whatever is visible
- we don't throw away, as long we are no longer in memory pressure
- since we no longer lock all the images, we should come out of memory pressure, and everything is happy

A corner case might be when lots of apps are open and we are surfing the memory pressure threshold. We will see in a few.
(Assignee)

Comment 51

4 years ago
I could definitely see some other corner cases where having a giant image on-screen causes us to go into memory pressure again, or it takes a little while for memory pressure to "clear", but hopefully in practice we should be OK.
(Reporter)

Comment 52

4 years ago
Ok, I need someone else to test this. My proto device is borked. Genlock is taking a mental health day.
(Assignee)

Comment 53

4 years ago
Does jdm have a device that knows how to fire memory-pressure? (I don't. :( )
(Assignee)

Comment 54

4 years ago
Created attachment 739256 [details] [diff] [review]
port to b2g18
(Reporter)

Comment 55

4 years ago
The main issue I am seeing are disappearing icons when long-tapping on the home screen (red x appears, icons disappear). This might be related to drawImage path. We should focus on this. This is unrelated to memory pressure and pintrest. I will work with jlebar on the actual OOM testing tomorrow.
(Assignee)

Comment 56

4 years ago
Unfortunately I can't reproduce this problem at all. Andreas tells me that it happens on first boot, even.

I need someone who *can* reproduce to break in CanvasRenderingContext2D::DrawImage and find out what happens in SurfaceForElement. If that fails to decode the image for whatever reason, it'd explain this behaviour, and we'd be able to move on from there.

(Note that my attempts are all on b2g18.)
> I get this while pintrest is sucking up all the memory and the device dies. Wonderful.

FWIW we need dmesg output in order to understand OOMs.  logcat doesn't tell us anything helpful.
(Assignee)

Comment 58

4 years ago
I tried to build B2G against trunk mozilla-central with Andreas' patch, but it just ended up with a build that wouldn't boot. I blew away all objdirs and am trying again, but now I'm out of time until tomorrow.
I suspect you'll have much better luck building b2g18 than m-c.  Nobody tests m-c; it's perpetually broken.
(Assignee)

Comment 60

4 years ago
Yeah, okay. Hopefully someone can tell me what is happening in CanvasRenderingContext2D::DrawImage; if I have that information I can figure it out.
(Reporter)

Comment 61

4 years ago
milan, can we get some help today here?
Sure, just waiting for people to show up.
(Reporter)

Comment 63

4 years ago
Ok, so I think this is actually a blob that gets rendered that we snapshotted from a canvas. We read img.naturalWidth and if that is zero we don't paint it.
(Reporter)

Comment 64

4 years ago
So I think we fire onload before .width and .height is available. The gaia homescreen seems to rely on that being available and before it was because we start decode early.
(Reporter)

Comment 65

4 years ago
We might be revoking the blob url after onload and we set src to that url but the image isn't decoded yet (not locked!) so it can't be decoded later when we paint it.
(Reporter)

Comment 66

4 years ago
Created attachment 739508 [details] [diff] [review]
updated patch, decode early, but still don't lock
Attachment #738966 - Attachment is obsolete: true
Attachment #739256 - Attachment is obsolete: true
(In reply to Andreas Gal :gal from comment #66)
> Created attachment 739508 [details] [diff] [review]
> updated patch, decode early, but still don't lock

This seems reasonable if it works.
With this patch, I don't get the problem described in comment 43, but can't test pinterest.com itself, I'm running into a "can't turn wifi on" known bug.
On pinterest.com I managed to get the "oh, how embarrassing, this is taking too long" once after some scrolling, then within 5s on the reload.  OK the second time.
Created attachment 739540 [details] [diff] [review]
Patch that builds on mozilla-b2g18

No idea if Andreas' patch builds on other trees, but it certainly didn't on b2g18. Here's a version that does.
Flags: needinfo?(josh)
(Assignee)

Comment 71

4 years ago
(In reply to Andreas Gal :gal from comment #64)
> So I think we fire onload before .width and .height is available. The gaia
> homescreen seems to rely on that being available and before it was because
> we start decode early.

Barring a bug, this definitely isn't the case - that's the whole point of onload.

(In reply to Andreas Gal :gal from comment #65)
> We might be revoking the blob url after onload and we set src to that url
> but the image isn't decoded yet (not locked!) so it can't be decoded later
> when we paint it.

What do you mean by "revoking the blob url"? drawImage should always synchronously decode the image as long as it's loaded.
(Assignee)

Comment 72

4 years ago
This patch looks reasonable. I don't think it should cause much (if any!) main-thread jank.

I still want to know what's happening in drawImage, though; the RequestDecode part of this patch shouldn't be necessary.
What seems to happen is that

 * We have an image,
 * We drawImage it to a canvas
 * We pull the canvas out as a blob
 * /that blob/ is not decoded.
(Assignee)

Comment 74

4 years ago
Reading page.js in the homescreen app, it's entirely possible that the Object URL *is* being revoked before it can be painted, meaning that if we don't explicitly requestDecode as the image is added to the document (like Andreas' patch does), we can never decode it because the blob URL is gone.
(Assignee)

Comment 75

4 years ago
Specifically, displayRenderedIcon calls revokeObjectURL during onload, which is basically guaranteed to be too early; onload often fires before any painting is done, and painting is the only thing *guaranteed* to start the image decoding process.
(In reply to Joe Drew (:JOEDREW! \o/) from comment #75)
> Specifically, displayRenderedIcon calls revokeObjectURL during onload, which
> is basically guaranteed to be too early; onload often fires before any
> painting is done, and painting is the only thing *guaranteed* to start the
> image decoding process.

That sounds like a reasonable explanation for why this only happens with blobs.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #76)
> (In reply to Joe Drew (:JOEDREW! \o/) from comment #75)
> > Specifically, displayRenderedIcon calls revokeObjectURL during onload, which
> > is basically guaranteed to be too early; onload often fires before any
> > painting is done, and painting is the only thing *guaranteed* to start the
> > image decoding process.
> 
> That sounds like a reasonable explanation for why this only happens with
> blobs.

But there's still some weird stuff going on.
(Reporter)

Comment 78

4 years ago
Created attachment 739591 [details] [diff] [review]
patch
Attachment #739508 - Attachment is obsolete: true
Assigning to me to drive it, it still needs the full effort by everybody involved to actually resolve it :)
One thing I've thought about. With this patch, won't we accumulate decoded images from pages that we've navigated away from until we hit OOM?
(In reply to Jeff Muizelaar [:jrmuizel] from comment #80)
> One thing I've thought about. With this patch, won't we accumulate decoded
> images from pages that we've navigated away from until we hit OOM?

It depends on when nsDocument::DeleteShell is called.  If that's called when we navigate away from a page, we're fine.
Whiteboard: [MemShrink] → [MemShrink][status: needs new patch, our top people are on it]
What was the status of "mImageLocking is false" before?  We never bothered decoding the images in that case.
(Assignee)

Comment 83

4 years ago
We would decode them if they were painted.
(Assignee)

Comment 84

4 years ago
(In reply to Jeff Muizelaar [:jrmuizel] from comment #80)
> One thing I've thought about. With this patch, won't we accumulate decoded
> images from pages that we've navigated away from until we hit OOM?

Presumably we'd hit a low memory notification first, though.
To put in writing here what we've already discussed: We were observing a low-memory notification in the master process, but the child wasn't getting it before it was killed.  jdm tried increasing the threshold at which we get notified, and then the child gets the low-memory notification and then immediately dies.
(In reply to Milan Sreckovic [:milan] from comment #79)
> Assigning to me to drive it

This time for real :)
Assignee: gal → milan
(Reporter)

Comment 87

4 years ago
jlebar, how is it possible that the child process OOMs so quickly? Can we tell remaining physical memory from user space so we avoid getting so close?
(Assignee)

Comment 88

4 years ago
Our current idea is to do what we used to do on Maemo, and check for memory pressure when we're trying to create a new image frame; if we're currently under pressure, we'll just abort the decode.
> jlebar, how is it possible that the child process OOMs so quickly?

I don't see how it should be impossible.  Note that allocations in the main process (for example, Necko buffers) can cause us to OOM in the child.

> Can we tell remaining physical memory from user space so we avoid getting so close?

Yes; working on it.
Created attachment 739742 [details] [diff] [review]
WIP, measure available system memory before allocating decoded image data.

This compiles, but it doesn't run for me, probably because the b2g18 base rev I
pulled is broken.  As a result, I haven't tested this; it probably doesn't work
right.

I need to go for the evening, and I'm on a plane tomorrow, but maybe this will
help someone.
(Assignee)

Comment 91

4 years ago
It looks like this isn't enough, which isn't *too* surprising - with overcommit it's possible for memory to go from fine to too-low without calling EnsureFrame.

We're going to need a hook in FrameUpdated too, but one that forcibly kills the decoder and discards the frames.
> It looks like this isn't enough

I'm really surprised that my patch worked perfectly on the first try.  Did you
check that it's reading the amount of free memory correctly?  Sorry I haven't
been able to test it myself.  It sucks that our trees are so unstable.

> with overcommit it's possible for memory to go from fine to too-low without calling 
> EnsureFrame.

Yeah, that's a good point.  I can imagine us kicking off N decodes all at about the same time and saying "yep, plenty of space for you" for each of them.  I can also imagine us discarding all images and still falling over due to the images currently being decoded.

Before someone asks, I don't think we can disable overcommit without first modifying jemalloc and possibly the JS engine.  I suspect thirdy-party code like drivers may also rely on overcommit.

jdm and I talked for a while, and we think we still have some options here.
Not all of these are mutually-exclusive.

1. As Joe suggested, we could cancel decodes on memory pressure, or we could
   even read the amount of free memory on the system while we decode.

2. We could fire the low-memory event at a threshold above 10mb.  jdm observed
   that when the threshold was 15mb, the child process at least received the
   low-memory notification from the kernel before it was killed.  Maybe we
   need to go to 20mb.

3. We could fire the memory-pressure event faster after receiving the
   notification from the kernel.  Right now we dispatch an event to the main
   thread, but we could use nsThread.cpp::ScheduleMemoryPressureEvent, which
   cuts the line in the event loop.

4. We could allocate our images in discardable memory.  This would be a pretty
   heavyweight change.  There's a sketch of an API and implementation in bug
   748598.  I suspect it may be hard to get the right discarding policy in the
   kernel, though.
(Reporter)

Comment 93

4 years ago
The discardable memory stuff looks like exactly what we need. I have a sick day in Madrid, so I will grab that bug and go for it.
Not to be Debbie Downer, but an important part of using ashmem will be ensuring that processes get "charged" correctly for memory that they own, and that we free memory owned by low-priority processes before we free memory owned by high-priority processes.

Since ashmem is a shared memory system, I suspect it may not do this well.
(Reporter)

Comment 95

4 years ago
Ok. In that case I think the next best thing is to take your patch and in the alloc path purge the discard tracker and then trying to allocate again. We want to avoid failing in that path.
(Reporter)

Comment 96

4 years ago
jlebar, I have a nice DiscardableMemory abstraction patch. I guess we could still use that in the parent, especially in necko.
(Reporter)

Comment 97

4 years ago
jlebar, as far as I can tell ashmem properly accounts ashmem towards the "owner" process and when that process dies, ashmem gets deallocated. ashmem's purge policy is LRU and not bound to priorities. I think thats ok for cache-like use, so I want to start using it for the image cache and memcache and what not.
(Reporter)

Comment 98

4 years ago
Created attachment 739963 [details] [diff] [review]
Limit decoded image cache to 30MB

This patch actually works very well and the browser stays at a very manageable process size with pintrest. Flashing is minimal/not noticable after the page fully loaded. Until loading however there is an issue. I am kicking off eager decoding, and that eager decoding keeps kicking out images on top of the page while images further down load. We have to do eager decoding only until the image tracker quota is full. The next patch will do that.
Assignee: milan → gal
Attachment #739540 - Attachment is obsolete: true
Attachment #739591 - Attachment is obsolete: true
Attachment #739742 - Attachment is obsolete: true
(Reporter)

Comment 99

4 years ago
Ok, after a few hours of debugging I think the problem is that we don't start tracking images in the discard tracker until _after_ decoding is complete and we fire off too many eager decoding requests and they thresh the decoded image cache, which is why the first images on the page don't show up (they keep getting evicted by decoding requests from further down in the document).
(In reply to Andreas Gal :gal from comment #96)
> jlebar, I have a nice DiscardableMemory abstraction patch. I guess we could
> still use that in the parent, especially in necko.

Upon reflection, I have a worse concern than the one from comment 94.

Linux has an OOM killer component that kills processes when the system is (very nearly) out of memory.  Android adds a second component, the low-memory killer (LMK), which usually acts before the OOM killer.  The LMK can follow a set of strict priorities, whereas the priorities we give to the OOM killer are more like guidelines.  The LMK can also be configured to kill certain classes of apps when we reach certain thresholds of free memory, and we use this.  For example, we kill fg apps when there's less than 5mb of free memory on the system.  This seems to keep the main process from OOMing itself.

So the difficulty with discardable memory is, we need ashmem to work together with the LMK.  If a fg app has an ashmem mapping, we want that to be discarded before the fg app is killed by the LMK, otherwise ashmem isn't very useful.

Looking through the kernel source, it appears that maybe we will try to free ashmem segments before we invoke the LMK; I wasn't expecting even this level of coordination.  There doesn't appear to be any coordination causing us to kill low-priority tasks before we free a high-priority task's ashmem segments, which might cause us to favor keeping background tasks alive at the expense of being able to show images in a foreground task.

Anyway, if we went with this approach, we'd need to test it out and see what happens.  Perhaps we can tune our LMK params so that it works well enough.  Like low-memory notifications, ashmem may be a helpful tool as part of an ensemble of approaches, but I doubt it will fix all the problems.
(Assignee)

Updated

4 years ago
Attachment #739963 - Attachment is patch: true
Attachment #739963 - Attachment mime type: text/x-patch → text/plain
(Assignee)

Comment 101

4 years ago
Justin, Andreas: do you need Jeff or me to do anything at this point?
(Reporter)

Comment 102

4 years ago
Created attachment 740421 [details] [diff] [review]
updated patch, pref to limit eager image decoding on load
Attachment #739963 - Attachment is obsolete: true
The version of the patch from earlier this morning (not the comment #102 one) did not fix my problems in bug 861965. See https://bugzilla.mozilla.org/show_bug.cgi?id=861965#c14 and https://bugzilla.mozilla.org/attachment.cgi?id=740429

I haven't tried this latest version yet.
(Reporter)

Comment 104

4 years ago
Created attachment 740432 [details] [diff] [review]
remove some debug junk
Attachment #740421 - Attachment is obsolete: true
(Reporter)

Updated

4 years ago
Attachment #740432 - Flags: review?(justin.lebar+bug)
I have a gun up to my head to finish other blockers by Wednesday.  I'll get to this when I can, but you'll probably find a more timely review elsewhere.
Comment on attachment 740432 [details] [diff] [review]
remove some debug junk

Review of attachment 740432 [details] [diff] [review]:
-----------------------------------------------------------------

I'm taking liberty of offering r- on this patch, not because I have any idea what it does but because it does not compile.

::: content/base/src/nsDocument.cpp
@@ +8444,5 @@
> +  if (oldCount == 0) {
> +    if (mLockingImages)
> +      rv = aImage->LockImage();
> +    if (NS_SUCCEEDED(rv) && sOnloadDecodeLimit &&
> +        mLockingImages.Length() < sOnloadDecodeLimit)

This line fails to compile: 

The latest patch does not build for me:

/Users/djf/mozilla/mozilla-b2g18/content/base/src/nsDocument.cpp: In member function 'virtual nsresult nsDocument::AddImage(imgIRequest*)':
/Users/djf/mozilla/mozilla-b2g18/content/base/src/nsDocument.cpp:8448: error: request for member 'Length' in '((nsDocument*)this)->nsDocument::mLockingImages', which is of non-class type 'bool'
Attachment #740432 - Flags: review?(justin.lebar+bug) → review-
(Reporter)

Comment 107

4 years ago
Created attachment 741121 [details] [diff] [review]
disable image locking on b2g, handle ongoing OOM condition better, and only eagerly decode first 24 images in a document

pinterest no longer crashes for me with this patch, and memory stays around 70MB instead of eventually sucking up all the memory. Also, initial rendering seems faster since we have fewer images competing for CPU time.
Attachment #740432 - Attachment is obsolete: true
(Reporter)

Updated

4 years ago
Attachment #741121 - Attachment is patch: true
Attachment #741121 - Attachment mime type: text/x-patch → text/plain
(Reporter)

Updated

4 years ago
Attachment #741121 - Flags: review?(jmuizelaar)
(Reporter)

Comment 108

4 years ago
jlebar already reviewed the OOM pieces but I think jeff should in addition review the change to the image load behavior. This is disabled for anything but FFOS, so I think this is safe to land on trunk.
(Reporter)

Comment 109

4 years ago
I made a testcase that contains 200 images on a page. This is probably a bit more reliable to test than pinterest:

http://people.mozilla.org/~gal/imagelock/
Blocks: 849001
(Assignee)

Comment 110

4 years ago
Comment on attachment 741121 [details] [diff] [review]
disable image locking on b2g, handle ongoing OOM condition better, and only eagerly decode first 24 images in a document

Review of attachment 741121 [details] [diff] [review]:
-----------------------------------------------------------------

::: content/base/src/nsDocument.cpp
@@ +8442,5 @@
>    // If this is the first insertion and we're locking images, lock this image
>    // too.
> +  if (oldCount == 0) {
> +    if (mLockingImages)
> +      rv = aImage->LockImage();

I want this change in general, actually, so can you propose a separate patch to do it on mozilla-central?
Attachment #741121 - Flags: review?(jmuizelaar) → review+
(Reporter)

Comment 111

4 years ago
Sure, happy to do that. I was actually planning to land this on m-c and then uplift. Do you see a problem with applying this to m-c? It won't change behavior for Firefox just yet, but we can start playing with the new prefs (I think we should limit eager decoding on desktop too).
(Reporter)

Comment 112

4 years ago
Created attachment 741339 [details] [diff] [review]
patch for inbound
(Reporter)

Updated

4 years ago
Attachment #741339 - Flags: review?(joe)
(Reporter)

Updated

4 years ago
Whiteboard: [MemShrink][status: needs new patch, our top people are on it] → [MemShrink][status: patch ready for b2g18, m-c patch waiting for review]
(Assignee)

Comment 113

4 years ago
Comment on attachment 741339 [details] [diff] [review]
patch for inbound

Review of attachment 741339 [details] [diff] [review]:
-----------------------------------------------------------------

I didn't review anything regarding the memory notifications, so if the merge wasn't trivial you might want to get jlebar to look at them.

::: b2g/app/b2g.js
@@ +272,5 @@
>  pref("image.mem.decodeondraw", true);
> +pref("content.image.allow_locking", false); /* don't allow image locking */
> +pref("image.mem.min_discard_timeout_ms", 86400000); /* 24h, we rely on the out of memory hook */
> +pref("image.mem.max_decoded_image_kb", 30000); /* 30MB seems reasonable */
> +pref("image.onload.decode.limit", 24); /* don't decode more than 24 images eagerly */

image.onload.decode.limit should be added to all.js too, initially with a value of 0. It's probably best to leave it disabled like that for now, just so we limit our scope in this bug; I'm sure mobile in particular will want a value on it too, but I just don't know what that value will be.

Post bug 847223, we won't need this on desktop; if and when bug 691169 gets fixed, it won't matter for Fennec either. (I don't see an equivalent to making bug 689623 work on B2G, but I might have just missed it.)
Attachment #741339 - Flags: review?(joe) → review+
(Reporter)

Comment 114

4 years ago
This needs someone to land it. Assigning to faramarz to find an owner land the patches (m-c patch needs a tryserver run) and also add the all.js pref.
(Reporter)

Updated

4 years ago
Assignee: gal → frashed
(Reporter)

Updated

4 years ago
Whiteboard: [MemShrink][status: patch ready for b2g18, m-c patch waiting for review] → [MemShrink][status: ready to land, see last comment for instructions]

Updated

4 years ago
Assignee: frashed → fabrice
Pushed to try with the new pref in all.js:
https://tbpl.mozilla.org/?tree=Try&rev=2e5cd94c2d96
So, we have a reftests failing because of this patch. Joe, any chance you can take a look?
I'd be surprised if reftests weren't failing.  I think we should just disable them, or alternatively disable this behavior during reftests.

The right fix is bug 695763.
(Assignee)

Comment 118

4 years ago
This fails reftests on desktop platforms, even, though. The code change shouldn't have modified desktop behaviour!
(Assignee)

Comment 119

4 years ago
Argh.

    2.50 +    if (NS_SUCCEEDED(rv) && sOnloadDecodeLimit &&
    2.51 +        mImageTracker.Count() < sOnloadDecodeLimit)
    2.52        rv = aImage->StartDecoding();

should be

    2.50 +    if (NS_SUCCEEDED(rv) && (!sOnloadDecodeLimit ||
    2.51 +        mImageTracker.Count() < sOnloadDecodeLimit))
    2.52        rv = aImage->StartDecoding();
(Assignee)

Comment 120

4 years ago
(but indented correctly)
When I apply the patch (without Joe's comment 119 fix) to m-c, and try it as a fix for bug 861965, the phone reboots.
(Reporter)

Comment 122

4 years ago
Joe's fix makes sense. I can take a look at comment #121. I didn't check that use case yet. 5mb images might exceed the 24 picture limit. I didn't find an easy way to limit total size, only count.
With Joe's patch, I still crash, with a segfault.  Here are what look like the relevant lines from logcat, starting with the last debugging message from the gallery. At this point, the gallery app has successfully loaded and created previews for 3 5mp images, and is just starting to load the 4th image:

E/GeckoConsole( 3219): Content JS LOG at app://gallery.gaiamobile.org/js/metadata_scripts.js:1152 in createThumbnailAndPreview: createThumbnailAndPreview 0/IMG_20121006_171351.jpg
I/Gecko   ( 3075): 
I/Gecko   ( 3075): ###!!! [Parent][AsyncChannel] Error: Channel error: cannot send/recv
I/Gecko   ( 3075): 
I/Gecko   ( 3075): 
I/Gecko   ( 3075): ###!!! [Parent][AsyncChannel] Error: Channel error: cannot send/recv
I/Gecko   ( 3075): 
D/memalloc( 3075): /dev/pmem: Freeing buffer base:0x48955000 size:40960 offset:8048640 fd:137
D/memalloc( 3075): /dev/pmem: Freeing buffer base:0x4836b000 size:614400 offset:1847296 fd:127
I/Gecko   ( 3075): [Parent 3075] ###!!! ABORT: actor has been |delete|d: file /Users/djf/mozilla/builds/obj-mc-gonk/ipc/ipdl/PGrallocBufferParent.cpp, line 392
E/Gecko   ( 3075): mozalloc_abort: [Parent 3075] ###!!! ABORT: actor has been |delete|d: file /Users/djf/mozilla/builds/obj-mc-gonk/ipc/ipdl/PGrallocBufferParent.cpp, line 392
F/libc    ( 3075): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1)
I/Gecko   ( 3075): [Parent 3075] WARNING: waitpid failed pid:3255 errno:10: file /Users/djf/mozilla/mozilla-central/ipc/chromium/src/base/process_util_posix.cc, line 260
I/DEBUG   ( 3073): debuggerd committing suicide to free the zombie!
I/DEBUG   ( 3257): debuggerd: Apr 15 2013 10:15:46

And here's an excerpt from dmesg output:

<4>[04-25 20:30:12.421] [3219: Gallery]select 3125 (Usage), adj 6, size 5445, to kill
<4>[04-25 20:30:12.421] [3219: Gallery]send sigkill to 3125 (Usage), adj 6, size 5445
<4>[04-25 20:30:13.632] [3219: Gallery]select 3255 (plugin-containe), adj 6, size 197, to kill
<4>[04-25 20:30:13.632] [3219: Gallery]send sigkill to 3255 (plugin-containe), adj 6, size 197
<4>[04-25 20:30:15.554] [28: kswapd0]select 3133 (Homescreen), adj 4, size 4019, to kill
<4>[04-25 20:30:15.554] [28: kswapd0]send sigkill to 3133 (Homescreen), adj 4, size 4019
<4>[04-25 20:30:16.224] [3219: Gallery]select 3219 (Gallery), adj 2, size 29238, to kill
<4>[04-25 20:30:16.224] [3219: Gallery]send sigkill to 3219 (Gallery), adj 2, size 29238
(Reporter)

Comment 124

4 years ago
Yeah, so sounds like those images are so big that the 24 image limit doesn't help. david, try setting the image decode limit to 1 and run the experiment again. If that works, I can probably make some changes to the patch.
It looks like reftests are still failing with the fix from comment 119 :
https://tbpl.mozilla.org/?tree=Try&rev=7d43ad4a2ed7
(Assignee)

Comment 126

4 years ago
You missed the brackets around (!sOnloadDecodeLimit || mImageTracker.Count() < sOnloadDecodeLimit)
(Assignee)

Comment 127

4 years ago
Although in the very common case that NS_SUCCEEDED(rv) is true that won't actually modify whether the if body is executed...
Not much better with the https://tbpl.mozilla.org/?tree=Try&rev=7b69b561e4d6

Joe, can you check the tests locally? I'm playing blind there.
(Assignee)

Comment 129

4 years ago
Yeah, doing so now.
(Reporter)

Comment 130

4 years ago
This is no longer a good use of fabrice time. Sorry for giving you a faulty patch to land fabrice. Joe, want to grab this? Of all of us here you are definitely most qualified.
(Reporter)

Updated

4 years ago
Assignee: fabrice → nobody
(Assignee)

Updated

4 years ago
Assignee: nobody → joe
(Reporter)

Updated

4 years ago
Whiteboard: [MemShrink][status: ready to land, see last comment for instructions] → [MemShrink][status: joe working on fixing the patch]
(Assignee)

Comment 131

4 years ago
Hooray, I can finally reproduce. Debugging!
(Assignee)

Comment 132

4 years ago
The below patch seems to fix it. Hooray! Trying: https://tbpl.mozilla.org/?tree=Try&rev=0fb01134417c

diff --git a/content/base/src/nsDocument.cpp b/content/base/src/nsDocument.cpp
--- a/content/base/src/nsDocument.cpp
+++ b/content/base/src/nsDocument.cpp
@@ -8895,18 +8895,18 @@ nsDocument::AddImage(imgIRequest* aImage
 
   nsresult rv = NS_OK;
 
   // If this is the first insertion and we're locking images, lock this image
   // too.
   if (oldCount == 0) {
     if (mLockingImages)
       rv = aImage->LockImage();
-    if (NS_SUCCEEDED(rv) && (sOnloadDecodeLimit || 
-        mImageTracker.Count() < sOnloadDecodeLimit))
+    if (NS_SUCCEEDED(rv) && (!sOnloadDecodeLimit ||
+                             mImageTracker.Count() < sOnloadDecodeLimit))
       rv = aImage->StartDecoding();
   }
 
   // If this is the first insertion and we're animating images, request
   // that this image be animated too.
   if (oldCount == 0 && mAnimatingImages) {
     nsresult rv2 = aImage->IncrementAnimationConsumers();
     rv = NS_SUCCEEDED(rv) ? rv2 : rv;
(Assignee)

Comment 133

4 years ago
Created attachment 742468 [details] [diff] [review]
patch for inbound
Attachment #741339 - Attachment is obsolete: true
(Reporter)

Comment 134

4 years ago
I will test on device after you land on inbound. Thanks!
(Assignee)

Comment 135

4 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/fe381d10084f

Updated

4 years ago
No longer blocks: 849001
https://hg.mozilla.org/mozilla-central/rev/fe381d10084f
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Whiteboard: [MemShrink][status: joe working on fixing the patch] → [MemShrink]
Target Milestone: --- → mozilla23

Updated

4 years ago
status-b2g18: --- → affected
status-b2g18-v1.0.1: --- → affected
(Reporter)

Comment 137

4 years ago
This still needs to be uplifted to b2g18.
(Assignee)

Comment 138

4 years ago
How does one do such a thing? I haven't had to in the past.
(Assignee)

Comment 139

4 years ago
Created attachment 743053 [details] [diff] [review]
b2g18 patch
Attachment #741121 - Attachment is obsolete: true
Your friendly neighborhood merge maven takes care of the B2G uplifts ;). Thanks for the branch patch, though. Makes the job easier!

https://hg.mozilla.org/releases/mozilla-b2g18/rev/61b65b06cba0
https://hg.mozilla.org/releases/mozilla-b2g18_v1_0_1/rev/10772f6fb11b
status-b2g18: affected → fixed
status-b2g18-v1.0.0: --- → wontfix
status-b2g18-v1.0.1: affected → fixed
status-firefox21: --- → wontfix
status-firefox22: --- → wontfix
status-firefox23: --- → fixed
Depends on: 866947
See https://bugzilla.mozilla.org/show_bug.cgi?id=860818#c45 for the QA analysis study here. Good news is that the original bug filed about "failing to load" Pinterest actually was a wfm, but the issue about stacking too many images to be decoded still occurs with and without this patch.

I'm waiting for Justin's input on what's the right followup to take here.

Marking verified to indicate this was tested though.
Status: RESOLVED → VERIFIED
Assuming that the relevant process even receives the memory pressure event before it OOMs (this is easy to check in logcat), it sounds like we should consider dropping some or all ongoing decodes on memory pressure, in addition to everything else we're doing.  I know we talked about this at Madrid...
^ This is bug 867264 now.

Updated

4 years ago
Depends on: 868066
(In reply to Andreas Gal :gal from comment #2)
> we have to check that bug 807143 doesn't reappear with this fix

Bad news. I think bug 807143 did reappear. See bug 868066.
See Also: → bug 972130
You need to log in before you can comment on or make changes to this bug.