Closed Bug 862970 Opened 12 years ago Closed 12 years ago

don't lock images on the active page on B2G

Categories

(Core :: Graphics: ImageLib, defect)

x86
macOS
defect
Not set
normal

Tracking


VERIFIED FIXED
mozilla23
blocking-b2g tef+
Tracking Status
firefox21 --- wontfix
firefox22 --- wontfix
firefox23 --- fixed
b2g18 --- fixed
b2g18-v1.0.0 --- wontfix
b2g18-v1.0.1 --- fixed

People

(Reporter: gal, Assigned: joe)

References

Details

(Whiteboard: [MemShrink])

Attachments

(2 files, 13 obsolete files)

12.60 KB, patch
Details | Diff | Splinter Review
12.44 KB, patch
Details | Diff | Splinter Review
We OOM on image-rich pages with B2G. We currently lock images on the current tab. Instead, we will disable image locking and set the timeout and max decode memory to very high values, and we will rely on the OOM handler to clear the decoded image cache.
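Concretely, this approach boils down to a handful of pref changes; the values below are taken from the b2g.js hunk of the patch eventually reviewed later in this bug:

```js
// b2g/app/b2g.js (values from the final patch in this bug)
pref("image.mem.decodeondraw", true);
pref("content.image.allow_locking", false);         /* don't lock images on the active page */
pref("image.mem.min_discard_timeout_ms", 86400000); /* 24h; rely on the low-memory hook instead */
pref("image.mem.max_decoded_image_kb", 30000);      /* ~30MB decoded-image budget */
pref("image.onload.decode.limit", 24);              /* eagerly decode at most 24 images per document */
```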
Attached patch patch (obsolete) — Splinter Review
Assignee: nobody → gal
we have to check that bug 807143 doesn't reappear with this fix
I understand the locking change, but why the timeout & max decoded size change?
Attached patch patch (obsolete) — Splinter Review
Attachment #738643 - Attachment is obsolete: true
I see some issues with the patch. Decoding gets stuck halfway through the image on pinterest.
In general it works pretty well though. Decoding is fast enough for normal reading/scrolling.
however, on GP I don't think the OOM handler kicks in
(In reply to Andreas Gal :gal from comment #5)
> I see some issues with the patch. Decoding gets stuck halfway through the
> image on pinterest.

That sounds like bug 807143.
(Bug 807143 was not just about flickering, but about images not redrawing/not redrawing fully, for reasons that were never explored.)
Component: Graphics → ImageLib
(In reply to Andreas Gal :gal from comment #7)
> however, on GP I don't think the OOM handler kicks in

With locking disabled, I'd be a little surprised if we stored enough decoded image memory to reach OOM, fwiw.
probably because the GP has more memory. this needs testing on a production device, but otherwise it looks promising
Sounds like we need someone to debug the "never explored" part of comment 9. Joe, are you available to do that? As for comment 10, look at the patch. We never discard anything except via the OOM handler.
(In reply to Andreas Gal :gal from comment #12)
> Sounds like we need someone to debug the "never explored" part of comment 9.
> Joe, are you available to do that?

Not immediately, but in a couple of days I think I will be.

> As for comment 10, look at the patch. We never discard anything except via
> the OOM handler.

That's why I asked in comment 3 why you were doing this :), but I think you're right that this will make us not discard except via OOM or exceeding 512 MB. There are occasions when we discard immediately, but that's mostly in corner cases.
Please note that it's not an out-of-memory hook, it's a low-memory notification. These are throttled to run no more than once every 5s ("gonk.systemMemoryPressureRecoveryPollMS"), and even then only when we transition between a low-memory and a not-low-memory state. So suppose we have the following situation:

* Use your phone. Lots of processes are loaded and running in the background.
* Load the browser. This causes us to start up a new process, which causes the amount of free memory on the system to go under 10mb. We fire a low-memory notification.
* Load Pinterest in the browser.

At this point, we will not throw away any images until we free up enough memory so the system has more than 10mb free (because we must leave the low-memory state before we fire the low-memory notification again). That's unlikely to happen; I expect we'll crash instead.

We could fire the low-memory notification more often, but it's not clear how to do that correctly. We do not want every process on the system to GC every 5s while we're under memory pressure.
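The transition-based throttling described above can be sketched in a few lines. This is an illustrative model, not the actual GonkMemoryPressureMonitoring code; the memory reader, threshold, and function names are hypothetical stand-ins:

```javascript
// Hysteresis sketch: the notification fires only when we *enter* the
// low-memory state, and cannot fire again until free memory has first
// recovered above the threshold. All names here are illustrative.
const LOW_MEM_THRESHOLD_KB = 10 * 1024; // the "under 10mb" from the comment

function makePressureMonitor(readFreeMemoryKB, notify) {
  let inLowMemoryState = false;
  // Called once per poll (e.g. every 5s in the real code).
  return function poll() {
    const free = readFreeMemoryKB();
    if (free < LOW_MEM_THRESHOLD_KB) {
      if (!inLowMemoryState) {
        inLowMemoryState = true;
        notify("memory-pressure"); // fires on the transition only
      }
      // Already low: stay silent until we recover first.
    } else {
      inLowMemoryState = false;
    }
  };
}

// Demo: memory dips and stays low (no repeat notification),
// then recovers and dips again (a second notification).
let fired = 0;
const samples = [20480, 8000, 6000, 7000, 20480, 9000];
let i = 0;
const poll = makePressureMonitor(() => samples[i++], () => fired++);
samples.forEach(() => poll());
```

This is exactly why the pinterest scenario above never gets a second notification: the middle samples stay low, so only the two downward transitions fire.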
Andreas, given comment 14, if we didn't have the redrawing problem, could we ship with the image flashing that happens without the long timeout and huge decoded data limit?
Flags: needinfo?(gal)
Whiteboard: [MemShrink]
Other things you almost certainly want to do here:

* Avoid the 5ms sync decode after the first time we draw an image. In fact, we should probably avoid the sync decode the first time we draw an image, too. This will break reftests, but leaving in the sync decode will probably lead to serious jank when scrolling.
* Discard all images in a document when it is destroyed. Images can be shared between documents, so this isn't totally correct, but it's probably close enough.

If relying on the low-memory notifications as-is is good enough, then it's good enough. I just think we may want to tweak them somehow, particularly if they're a primary means of notifying us that we should drop memory. It's a very delicate balance; you can imagine us dropping all images every 5s while we're under memory pressure (but not GC'ing etc.), but that would be a bad experience if we're low on memory for some reason other than images.
I have never seen the state of comment 15 but I can go and reproduce that. In the meantime, we need to diagnose the partial decoding either way.
Flags: needinfo?(gal)
There are plenty of other ways to get into the state from comment 14. The point is that after we get into a low-memory state, we will not discard any images until we leave the low-memory state. That may be undesirable.
Let's keep this bug up to date tomorrow as well; Joe can get further involved, but note the six-hour time difference.
(In reply to Justin Lebar [:jlebar] from comment #16)
> leaving in the sync decode will probably lead to serious jank when
> scrolling.

Really? I thought scrolling was done in a different thread from the main Gecko thread.

> * Discard all images in a document when it is destroyed. Images can be
> shared between documents, so this isn't totally correct, but it's probably
> close enough.

Note that nsDocument::DeleteShell() already does this. I don't know whether that's good enough for the b2g case, though.
(In reply to Andreas Gal :gal from comment #17)
> I have never seen the state of comment 15 but I can go and reproduce that.
> In the meantime, we need to diagnose the partial decoding either way.

I'm pretty sure it's partial *redrawing,* but regardless I will try to diagnose it tomorrow.
I still feel that the low memory notification approach is the right way to go. jlebar, can we somehow tell that we are "still in low memory" state? In that case I would prefer to fire off a timer that keeps throwing away images until we are no longer in the low memory state.
Partial redrawing is fine. We should schedule another paint after that when we have the full image, right?
> jlebar, can we somehow tell that we are "still in low memory" state?

Absolutely; hopefully it's clear in GonkMemoryPressureMonitoring.cpp how to do this, but let me know if it's not.

> Really? I thought scrolling was done in a different thread from the main
> Gecko thread.

If I'm wrong, that's awesome. I didn't think we had async scrolling everywhere in b2g. Maybe in the browser but not in apps, or vice versa?

> Note that nsDocument::DeleteShell() already does this.

That should probably be sufficient.
You'd think so! Obviously something's going wrong, though; I wouldn't want to speculate as to what in particular.
Any updates here Joe? Did you debug the cause for the partial paints?
Comment on attachment 738966 [details] [diff] [review]
Keep sending low-memory-ongoing notifications every 5s which flush caches but don't GC.

This is completely untested. Justin, can you please ask Lucas and Faramarz to find you someone who can take this and finish this today? Also, please ask Jeff for help with the rendering glitch (partial image rendering) I am seeing on pinterest. Looks like no updates from Joe, so we should do our own investigation.
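The idea in the attachment title can be sketched as a variant of the plain hysteresis monitor. This is an illustrative model of the behavior described, not the patch itself; names and threshold are hypothetical:

```javascript
// Sketch of the "low-memory-ongoing" scheme: on *entering* low memory we
// fire the full "memory-pressure" notification (flush caches and GC);
// while we *stay* low, each subsequent poll fires a cheaper
// "low-memory-ongoing" notification that flushes caches but skips GC.
function makeOngoingMonitor(readFreeMemoryKB, notify, thresholdKB) {
  let low = false;
  return function poll() {
    const free = readFreeMemoryKB();
    if (free < thresholdKB) {
      notify(low ? "low-memory-ongoing" : "memory-pressure");
      low = true;
    } else {
      low = false;
    }
  };
}

// Demo: one dip that lasts three polls.
const events = [];
const mem = [20000, 8000, 8000, 8000, 20000];
let idx = 0;
const poll = makeOngoingMonitor(() => mem[idx++], e => events.push(e), 10240);
mem.forEach(() => poll());
// events: ["memory-pressure", "low-memory-ongoing", "low-memory-ongoing"]
```

The point is that caches keep getting flushed for as long as the system stays low, without paying for a GC in every process every 5s.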
Attachment #738966 - Flags: review?(justin.lebar+bug)
(In reply to Andreas Gal :gal from comment #28)
> This is completely untested. Justin, can you please ask Lucas and Faramarz
> to find you someone who can take this and finish this today? Also, please
> ask Jeff for help with the rendering glitch (partial image rendering) I am
> seeing on pinterest. Looks like no updates from Joe, so we should do our own
> investigation.

If we can cover it in Madrid great, but note that Joe's updates not being there between 7pm and 7am local time is not an indication that he's not going to look at it. I expect more information in a few hours, so if anything happens in the meantime, let's make sure this bug is up to date.
I'm going to play around with this on an unagi.
Flags: needinfo?(josh)
Comment on attachment 738966 [details] [diff] [review]
Keep sending low-memory-ongoing notifications every 5s which flush caches but don't GC.

Changes to GonkMemoryPressureMonitoring look good. But certainly the pref changes should be reviewed separately (and need tests).

I'm not familiar with this code, but CompositorChild sends a message to the parent on memory pressure (CompositorChild.cpp::MemoryPressureObserver), and then the parent does stuff on behalf of the child. We may not want to do this, since it encourages priority inversion (low-priority bg processes making the main process do work on their behalf) when we're under long-lasting memory pressure.

It's also not clear to me how expensive nsFontCache::Compact() is. Similarly, memory pressure causes us to run gfxFont.h::FlushShapedWordCaches, which looks like something we don't want to do every 5s.

I'm most worried about CompositorChild, followed by FlushShapedWordCaches. Better safe than sorry?

P.S. If you use git-bz [1], it will automatically attach -U8 patches. :)

[1] https://github.com/jlebar/moz-git-tools
Attachment #738966 - Flags: review?(justin.lebar+bug) → review+
> But certainly the pref changes should be reviewed separately (and need tests). s/tests/testing/
Are we considering this for tef?
(In reply to Andreas Gal :gal from comment #26)
> Any updates here Joe? Did you debug the cause for the partial paints?

No, because of the 6 hour time difference. I'm starting on it now.
1.0.1 is crashing on pinterest and many other popular sites, so yes, if we can swing a decent fix, I will go and beg for an uplift.
(In reply to Andreas Gal :gal from comment #35)
> 1.0.1 is crashing on pinterest and many other popular sites, so yes, if we
> can swing a decent fix, I will go and beg for an uplift.

JST and I just had a mini-triage; tef+ this.
blocking-b2g: --- → tef+
(In reply to Andreas Gal :gal from comment #5)
> I see some issues with the patch. Decoding gets stuck halfway through the
> image on pinterest.

So I'm clear: this happens when loading m.pinterest.com in the browser, right?
(In reply to Joe Drew (:JOEDREW! \o/) from comment #37)
> So I'm clear: this happens when loading m.pinterest.com in the browser,
> right?

Yes.
Blocks: 860818
Joe, any updates here?
I am currently debugging why bug 807143 happens with image locking disabled, with the hope that fixing that bug (very easily reproduced) will fix pinterest partial display problems (difficult to reproduce, at least for me on an Unagi phone).
Cool, thanks. I will focus on this patch then. On the TCL device loading pinterest kills the browser and takes down the system. Looks like the OOM killer doesn't work there.
I/Gecko ( 144): ###!!! [Parent][AsyncChannel] Error: Channel error: cannot send/recv
E/GeckoConsole( 144): Content JS WARN at http://connect.facebook.net/en_US/all.js:52 in i: The "fb-root" div has not been created, auto-creating
I/Gecko ( 144): ###!!! [Parent][AsyncChannel] Error: Channel error: cannot send/recv
I/Gecko ( 144): ###!!! [Parent][AsyncChannel] Error: Channel error: cannot send/recv

I get this while pinterest is sucking up all the memory and the device dies. Wonderful. This is a device with a kernel that might not be up to date.
Blocks: 852643
Joe, here is an easy test case with this patch. Apply it and then hold down your finger on an icon on the homescreen. We go into icon edit mode and all the icons get a little "x" on the top right, and a second later all icons disappear (only the red x remains for each icon).
Yeah, memory pressure triggering definitely doesn't work on my device, but with this patch things on pinterest are much, much better (still running into OOM eventually though, because the memory pressure purge never comes). The patch also slightly changes some perceived painting, so it's definitely not zero risk (we don't decode on load any more).
FWIW: Even after this works, I have some reservations about what the UX is going to be like for blessed apps if their icons, etc can be discarded from under them. There's going to be a bunch of flashing. I have no idea if it's possible to have different prefs for those apps, but if so, it'd be a good idea to enable locking for them.
We are definitely re-decoding these images after discard, but for some reason we're never redrawing them. Still trying to track down how the invalidation is getting lost.
Yeah, let's see what it looks like. I am using a device of the other vendor now. That device properly kills the browser and the system survives. I will try my patch now.
Note: flashing would only occur when we are under memory pressure, which basically means "omg omg we are about to die", so I think it's ok to trade off a little flashing vs. not dying. Outside memory pressure we should never flash since we never discard images.
So: we redecode an image and mark it dirty, but because we're in memory pressure or hit our limit or whatever, by the time we get back around to actually reading the image, it's been discarded again. This repeats in a continuous loop forever.
Cool. That problem should not happen with this patch. I am trying to get my device build to work to test with the kernel that actually properly sends memory pressure events. Thanks for the diagnosis Joe. Here is what I expect to happen:

- we go into memory pressure
- we throw away all decoded images
- we start painting, re-decoding whatever is visible
- we don't throw anything away, as long as we are no longer in memory pressure
- since we no longer lock all the images, we should come out of memory pressure, and everything is happy

A corner case might be when lots of apps are open and we are surfing the memory pressure threshold. We will see in a few.
I could definitely see some other corner cases where having a giant image on-screen causes us to go into memory pressure again, or it takes a little while for memory pressure to "clear", but hopefully in practice we should be OK.
Ok, I need someone else to test this. My proto device is borked. Genlock is taking a mental health day.
Does jdm have a device that knows how to fire memory-pressure? (I don't. :( )
Attached patch port to b2g18 (obsolete) — Splinter Review
The main issue I am seeing is disappearing icons when long-tapping on the home screen (red x appears, icons disappear). This might be related to the drawImage path. We should focus on this. This is unrelated to memory pressure and pinterest. I will work with jlebar on the actual OOM testing tomorrow.
Unfortunately I can't reproduce this problem at all. Andreas tells me that it happens on first boot, even. I need someone who *can* reproduce to break in CanvasRenderingContext2D::DrawImage and find out what happens in SurfaceForElement. If that fails to decode the image for whatever reason, it'd explain this behaviour, and we'd be able to move on from there. (Note that my attempts are all on b2g18.)
> I get this while pinterest is sucking up all the memory and the device
> dies. Wonderful.

FWIW we need dmesg output in order to understand OOMs. logcat doesn't tell us anything helpful.
I tried to build B2G against trunk mozilla-central with Andreas' patch, but it just ended up with a build that wouldn't boot. I blew away all objdirs and am trying again, but now I'm out of time until tomorrow.
I suspect you'll have much better luck building b2g18 than m-c. Nobody tests m-c; it's perpetually broken.
Yeah, okay. Hopefully someone can tell me what is happening in CanvasRenderingContext2D::DrawImage; if I have that information I can figure it out.
milan, can we get some help today here?
Sure, just waiting for people to show up.
Ok, so I think this is actually a blob that we snapshotted from a canvas and are now rendering. We read img.naturalWidth, and if that is zero we don't paint it.
So I think we fire onload before .width and .height are available. The gaia homescreen seems to rely on them being available, and previously they were, because we started decoding early.
We might be revoking the blob URL after onload. We set src to that URL, but the image isn't decoded yet (not locked!), so it can't be decoded later when we paint it.
Attachment #738966 - Attachment is obsolete: true
Attachment #739256 - Attachment is obsolete: true
(In reply to Andreas Gal :gal from comment #66)
> Created attachment 739508 [details] [diff] [review]
> updated patch, decode early, but still don't lock

This seems reasonable if it works.
With this patch, I don't get the problem described in comment 43, but can't test pinterest.com itself, I'm running into a "can't turn wifi on" known bug.
On pinterest.com I managed to get the "oh, how embarrassing, this is taking too long" once after some scrolling, then within 5s on the reload. OK the second time.
No idea if Andreas' patch builds on other trees, but it certainly didn't on b2g18. Here's a version that does.
Flags: needinfo?(josh)
(In reply to Andreas Gal :gal from comment #64)
> So I think we fire onload before .width and .height is available. The gaia
> homescreen seems to rely on that being available and before it was because
> we start decode early.

Barring a bug, this definitely isn't the case - that's the whole point of onload.

(In reply to Andreas Gal :gal from comment #65)
> We might be revoking the blob url after onload and we set src to that url
> but the image isn't decoded yet (not locked!) so it can't be decoded later
> when we paint it.

What do you mean by "revoking the blob url"? drawImage should always synchronously decode the image as long as it's loaded.
This patch looks reasonable. I don't think it should cause much (if any!) main-thread jank. I still want to know what's happening in drawImage, though; the RequestDecode part of this patch shouldn't be necessary.
What seems to happen is that

* We have an image,
* We drawImage it to a canvas,
* We pull the canvas out as a blob,
* /that blob/ is not decoded.
Reading page.js in the homescreen app, it's entirely possible that the Object URL *is* being revoked before it can be painted, meaning that if we don't explicitly requestDecode as the image is added to the document (like Andreas' patch does), we can never decode it because the blob URL is gone.
Specifically, displayRenderedIcon calls revokeObjectURL during onload, which is basically guaranteed to be too early; onload often fires before any painting is done, and painting is the only thing *guaranteed* to start the image decoding process.
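The race described here can be reduced to a toy simulation. Everything below is an illustrative stand-in (a map for the blob store, a function for the lazy paint-time decode), not Gecko or Gaia code:

```javascript
// Toy model of the displayRenderedIcon race: the blob URL is the only
// route back to the encoded bytes, so revoking it before the lazy decode
// runs leaves the image undecodable. All names here are hypothetical.
const blobStore = new Map(); // url -> encoded bytes

function createObjectURL(bytes) {
  const url = "blob:" + blobStore.size;
  blobStore.set(url, bytes);
  return url;
}
const revokeObjectURL = url => blobStore.delete(url);

// Without image locking, decode happens lazily at paint time, not at load.
const decodeAtPaint = url =>
  blobStore.has(url) ? "decoded:" + blobStore.get(url) : null;

// Buggy pattern: revoke during onload, before the first paint.
const url1 = createObjectURL("icon-bytes");
revokeObjectURL(url1);             // onload fires; URL revoked immediately
const buggy = decodeAtPaint(url1); // paint happens later -> null

// Safer pattern: make sure the decode happens while the blob still exists
// (cf. the eager-decode request in Andreas' patch), then revoke.
const url2 = createObjectURL("icon-bytes");
const ok = decodeAtPaint(url2);
revokeObjectURL(url2);
```

With locking enabled the decode happened early enough that the buggy ordering went unnoticed; disabling locking exposes it.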
(In reply to Joe Drew (:JOEDREW! \o/) from comment #75)
> Specifically, displayRenderedIcon calls revokeObjectURL during onload, which
> is basically guaranteed to be too early; onload often fires before any
> painting is done, and painting is the only thing *guaranteed* to start the
> image decoding process.

That sounds like a reasonable explanation for why this only happens with blobs.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #76)
> That sounds like a reasonable explanation for why this only happens with
> blobs.

But there's still some weird stuff going on.
Attached patch patch (obsolete) — Splinter Review
Attachment #739508 - Attachment is obsolete: true
Assigning to me to drive it; it still needs the full effort of everybody involved to actually resolve it :)
One thing I've thought about. With this patch, won't we accumulate decoded images from pages that we've navigated away from until we hit OOM?
(In reply to Jeff Muizelaar [:jrmuizel] from comment #80)
> One thing I've thought about. With this patch, won't we accumulate decoded
> images from pages that we've navigated away from until we hit OOM?

It depends on when nsDocument::DeleteShell is called. If that's called when we navigate away from a page, we're fine.
Whiteboard: [MemShrink] → [MemShrink][status: needs new patch, our top people are on it]
What was the status of "mImageLocking is false" before? We never bothered decoding the images in that case.
We would decode them if they were painted.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #80)
> One thing I've thought about. With this patch, won't we accumulate decoded
> images from pages that we've navigated away from until we hit OOM?

Presumably we'd hit a low memory notification first, though.
To put in writing here what we've already discussed: We were observing a low-memory notification in the master process, but the child wasn't getting it before it was killed. jdm tried increasing the threshold at which we get notified; with that change the child gets the low-memory notification but then immediately dies anyway.
(In reply to Milan Sreckovic [:milan] from comment #79) > Assigning to me to drive it This time for real :)
Assignee: gal → milan
jlebar, how is it possible that the child process OOMs so quickly? Can we tell remaining physical memory from user space so we avoid getting so close?
Our current idea is to do what we used to do on Maemo, and check for memory pressure when we're trying to create a new image frame; if we're currently under pressure, we'll just abort the decode.
> jlebar, how is it possible that the child process OOMs so quickly?

I don't see how it should be impossible. Note that allocations in the main process (for example, Necko buffers) can cause us to OOM in the child.

> Can we tell remaining physical memory from user space so we avoid getting
> so close?

Yes; working on it.
This compiles, but it doesn't run for me, probably because the b2g18 base rev I pulled is broken. As a result, I haven't tested this; it probably doesn't work right. I need to go for the evening, and I'm on a plane tomorrow, but maybe this will help someone.
It looks like this isn't enough, which isn't *too* surprising - with overcommit it's possible for memory to go from fine to too-low without calling EnsureFrame. We're going to need a hook in FrameUpdated too, but one that forcibly kills the decoder and discards the frames.
> It looks like this isn't enough

I'm really surprised that my patch worked perfectly on the first try. Did you check that it's reading the amount of free memory correctly? Sorry I haven't been able to test it myself. It sucks that our trees are so unstable.

> with overcommit it's possible for memory to go from fine to too-low without
> calling EnsureFrame.

Yeah, that's a good point. I can imagine us kicking off N decodes all at about the same time and saying "yep, plenty of space for you" for each of them. I can also imagine us discarding all images and still falling over due to the images currently being decoded.

Before someone asks, I don't think we can disable overcommit without first modifying jemalloc and possibly the JS engine. I suspect third-party code like drivers may also rely on overcommit.

jdm and I talked for a while, and we think we still have some options here. Not all of these are mutually exclusive.

1. As Joe suggested, we could cancel decodes on memory pressure, or we could even read the amount of free memory on the system while we decode.

2. We could fire the low-memory event at a threshold above 10mb. jdm observed that when the threshold was 15mb, the child process at least received the low-memory notification from the kernel before it was killed. Maybe we need to go to 20mb.

3. We could fire the memory-pressure event faster after receiving the notification from the kernel. Right now we dispatch an event to the main thread, but we could use nsThread.cpp::ScheduleMemoryPressureEvent, which cuts the line in the event loop.

4. We could allocate our images in discardable memory. This would be a pretty heavyweight change. There's a sketch of an API and implementation in bug 748598. I suspect it may be hard to get the right discarding policy in the kernel, though.
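The first option (checking free memory when creating an image frame and aborting the decode under pressure, as Joe describes doing on Maemo) might look roughly like this. Names and the threshold are illustrative stand-ins for the imagelib frame-allocation hook, not the real EnsureFrame code:

```javascript
// Sketch: before allocating a decoded frame, check free system memory and
// abort the decode rather than push the system over the edge. RGBA32 means
// 4 bytes per pixel; threshold and reader are hypothetical.
function ensureFrame(widthPx, heightPx, readFreeMemoryKB, thresholdKB) {
  const neededKB = Math.ceil((widthPx * heightPx * 4) / 1024);
  if (readFreeMemoryKB() - neededKB < thresholdKB) {
    return null; // abort the decode; the image stays undecoded for now
  }
  return { widthPx, heightPx, bytes: neededKB * 1024 };
}

// Plenty of memory: a 100x100 frame (~40KB) is allocated.
const plenty = ensureFrame(100, 100, () => 50 * 1024, 10 * 1024);
// Tight memory: a 2048x2048 frame (16MB) would cross the threshold.
const tight = ensureFrame(2048, 2048, () => 12 * 1024, 10 * 1024);
```

As noted above, overcommit means this check alone can't be airtight: N decodes can all pass the check before any of them actually commits its pages.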
The discardable memory stuff looks like exactly what we need. I have a sick day in Madrid, so I will grab that bug and go for it.
Not to be Debbie Downer, but an important part of using ashmem will be ensuring that processes get "charged" correctly for memory that they own, and that we free memory owned by low-priority processes before we free memory owned by high-priority processes. Since ashmem is a shared memory system, I suspect it may not do this well.
Ok. In that case I think the next best thing is to take your patch and, in the alloc path, purge the discard tracker and then try to allocate again. We want to avoid failing in that path.
jlebar, I have a nice DiscardableMemory abstraction patch. I guess we could still use that in the parent, especially in necko.
jlebar, as far as I can tell ashmem properly accounts ashmem memory toward the "owner" process, and when that process dies, the ashmem gets deallocated. ashmem's purge policy is LRU and not bound to priorities. I think that's ok for cache-like use, so I want to start using it for the image cache and memcache and whatnot.
This patch actually works very well, and the browser stays at a very manageable process size with pinterest. Flashing is minimal/not noticeable after the page fully loads. During loading, however, there is an issue. I am kicking off eager decoding, and that eager decoding keeps kicking out images at the top of the page while images further down load. We have to do eager decoding only until the image tracker quota is full. The next patch will do that.
Assignee: milan → gal
Attachment #739540 - Attachment is obsolete: true
Attachment #739591 - Attachment is obsolete: true
Attachment #739742 - Attachment is obsolete: true
Ok, after a few hours of debugging I think the problem is that we don't start tracking images in the discard tracker until _after_ decoding is complete, and we fire off too many eager decoding requests, which thrash the decoded image cache. That is why the first images on the page don't show up (they keep getting evicted by decoding requests from further down in the document).
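The fix for this thrashing is to cap eager (onload-time) decodes per document, which is what the image.onload.decode.limit pref in the later patches does. A simplified model of that logic (hypothetical names; the real code lives in nsDocument::AddImage):

```javascript
// Sketch: with locking disabled, only the first N images in a document get
// an eager decode request; the rest decode lazily when first painted, so
// early requests can't evict each other out of the decoded-image cache.
const ONLOAD_DECODE_LIMIT = 24; // image.onload.decode.limit

function makeDocument(lockingImages) {
  const tracked = [];
  return {
    tracked,
    addImage(image) {
      tracked.push(image);
      if (lockingImages) {
        image.locked = true;        // old behavior: lock everything
      } else if (tracked.length <= ONLOAD_DECODE_LIMIT) {
        image.eagerDecode = true;   // new behavior: decode only the first N
      }
      // Images past the limit decode lazily at paint time.
    },
  };
}

const doc = makeDocument(false);
for (let i = 0; i < 30; i++) doc.addImage({ id: i });
const eager = doc.tracked.filter(img => img.eagerDecode).length;
```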
(In reply to Andreas Gal :gal from comment #96)
> jlebar, I have a nice DiscardableMemory abstraction patch. I guess we could
> still use that in the parent, especially in necko.

Upon reflection, I have a worse concern than the one from comment 94.

Linux has an OOM killer component that kills processes when the system is (very nearly) out of memory. Android adds a second component, the low-memory killer (LMK), which usually acts before the OOM killer. The LMK can follow a set of strict priorities, whereas the priorities we give to the OOM killer are more like guidelines. The LMK can also be configured to kill certain classes of apps when we reach certain thresholds of free memory, and we use this. For example, we kill fg apps when there's less than 5mb of free memory on the system. This seems to keep the main process from OOMing itself.

So the difficulty with discardable memory is, we need ashmem to work together with the LMK. If a fg app has an ashmem mapping, we want that to be discarded before the fg app is killed by the LMK; otherwise ashmem isn't very useful.

Looking through the kernel source, it appears that maybe we will try to free ashmem segments before we invoke the LMK; I wasn't expecting even this level of coordination. There doesn't appear to be any coordination causing us to kill low-priority tasks before we free a high-priority task's ashmem segments, which might cause us to favor keeping background tasks alive at the expense of being able to show images in a foreground task.

Anyway, if we went with this approach, we'd need to test it out and see what happens. Perhaps we can tune our LMK params so that it works well enough. Like low-memory notifications, ashmem may be a helpful tool as part of an ensemble of approaches, but I doubt it will fix all the problems.
Attachment #739963 - Attachment is patch: true
Attachment #739963 - Attachment mime type: text/x-patch → text/plain
Justin, Andreas: do you need Jeff or me to do anything at this point?
Attachment #739963 - Attachment is obsolete: true
The version of the patch from earlier this morning (not the comment #102 one) did not fix my problems in bug 861965. See https://bugzilla.mozilla.org/show_bug.cgi?id=861965#c14 and https://bugzilla.mozilla.org/attachment.cgi?id=740429 I haven't tried this latest version yet.
Attached patch remove some debug junk (obsolete) — Splinter Review
Attachment #740421 - Attachment is obsolete: true
Attachment #740432 - Flags: review?(justin.lebar+bug)
I have a gun up to my head to finish other blockers by Wednesday. I'll get to this when I can, but you'll probably find a more timely review elsewhere.
Comment on attachment 740432 [details] [diff] [review]
remove some debug junk

Review of attachment 740432 [details] [diff] [review]:
-----------------------------------------------------------------

I'm taking the liberty of offering r- on this patch, not because I have any idea what it does but because it does not compile.

::: content/base/src/nsDocument.cpp
@@ +8444,5 @@
> + if (oldCount == 0) {
> + if (mLockingImages)
> + rv = aImage->LockImage();
> + if (NS_SUCCEEDED(rv) && sOnloadDecodeLimit &&
> + mLockingImages.Length() < sOnloadDecodeLimit)

This line fails to compile; the latest patch does not build for me:

/Users/djf/mozilla/mozilla-b2g18/content/base/src/nsDocument.cpp: In member function 'virtual nsresult nsDocument::AddImage(imgIRequest*)':
/Users/djf/mozilla/mozilla-b2g18/content/base/src/nsDocument.cpp:8448: error: request for member 'Length' in '((nsDocument*)this)->nsDocument::mLockingImages', which is of non-class type 'bool'
Attachment #740432 - Flags: review?(justin.lebar+bug) → review-
pinterest no longer crashes for me with this patch, and memory stays around 70MB instead of eventually sucking up all the memory. Also, initial rendering seems faster since we have fewer images competing for CPU time.
Attachment #740432 - Attachment is obsolete: true
Attachment #741121 - Attachment is patch: true
Attachment #741121 - Attachment mime type: text/x-patch → text/plain
Attachment #741121 - Flags: review?(jmuizelaar)
jlebar already reviewed the OOM pieces, but I think Jeff should additionally review the change to the image load behavior. This is disabled for anything but FFOS, so I think this is safe to land on trunk.
I made a testcase that contains 200 images on a page. This is probably a bit more reliable to test than pinterest: http://people.mozilla.org/~gal/imagelock/
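A page like that testcase can be generated in a couple of lines (the image filenames here are hypothetical; the real testcase is at the URL above):

```javascript
// Generate markup for a 200-image stress page, similar in spirit to the
// imagelock testcase linked above. Filenames and dimensions are made up.
const html = Array.from(
  { length: 200 },
  (_, i) => `<img src="img${i}.jpg" width="320" height="240">`
).join("\n");
```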
Blocks: 849001
Comment on attachment 741121 [details] [diff] [review]
disable image locking on b2g, handle ongoing OOM condition better, and only eagerly decode first 24 images in a document

Review of attachment 741121 [details] [diff] [review]:
-----------------------------------------------------------------

::: content/base/src/nsDocument.cpp
@@ +8442,5 @@
>   // If this is the first insertion and we're locking images, lock this image
>   // too.
> + if (oldCount == 0) {
> + if (mLockingImages)
> + rv = aImage->LockImage();

I want this change in general, actually, so can you propose a separate patch to do it on mozilla-central?
Attachment #741121 - Flags: review?(jmuizelaar) → review+
Sure, happy to do that. I was actually planning to land this on m-c and then uplift. Do you see a problem with applying this to m-c? It won't change behavior for Firefox just yet, but we can start playing with the new prefs (I think we should limit eager decoding on desktop too).
Attached patch patch for inbound (obsolete) — Splinter Review
Attachment #741339 - Flags: review?(joe)
Whiteboard: [MemShrink][status: needs new patch, our top people are on it] → [MemShrink][status: patch ready for b2g18, m-c patch waiting for review]
Comment on attachment 741339 [details] [diff] [review]
patch for inbound

Review of attachment 741339 [details] [diff] [review]:
-----------------------------------------------------------------

I didn't review anything regarding the memory notifications, so if the merge wasn't trivial you might want to get jlebar to look at them.

::: b2g/app/b2g.js
@@ +272,5 @@
>  pref("image.mem.decodeondraw", true);
> +pref("content.image.allow_locking", false); /* don't allow image locking */
> +pref("image.mem.min_discard_timeout_ms", 86400000); /* 24h, we rely on the out of memory hook */
> +pref("image.mem.max_decoded_image_kb", 30000); /* 30MB seems reasonable */
> +pref("image.onload.decode.limit", 24); /* don't decode more than 24 images eagerly */

image.onload.decode.limit should be added to all.js too, initially with a value of 0. It's probably best to leave it disabled like that for now, just so we limit our scope in this bug; I'm sure mobile in particular will want a value on it too, but I just don't know what that value will be. Post bug 847223, we won't need this on desktop; if and when bug 691169 gets fixed, it won't matter for Fennec either. (I don't see an equivalent to making bug 689623 work on B2G, but I might have just missed it.)
Attachment #741339 - Flags: review?(joe) → review+
This needs someone to land it. Assigning to faramarz to find an owner to land the patches (the m-c patch needs a tryserver run) and also add the all.js pref.
Assignee: gal → frashed
Whiteboard: [MemShrink][status: patch ready for b2g18, m-c patch waiting for review] → [MemShrink][status: ready to land, see last comment for instructions]
Assignee: frashed → fabrice
Pushed to try with the new pref in all.js: https://tbpl.mozilla.org/?tree=Try&rev=2e5cd94c2d96
So, we have a reftests failing because of this patch. Joe, any chance you can take a look?
I'd be surprised if reftests weren't failing. I think we should just disable them, or alternatively disable this behavior during reftests. The right fix is bug 695763.
This fails reftests on desktop platforms, even, though. The code change shouldn't have modified desktop behaviour!
Argh.

+    if (NS_SUCCEEDED(rv) && sOnloadDecodeLimit &&
+        mImageTracker.Count() < sOnloadDecodeLimit)
       rv = aImage->StartDecoding();

should be

+    if (NS_SUCCEEDED(rv) && (!sOnloadDecodeLimit ||
+        mImageTracker.Count() < sOnloadDecodeLimit))
       rv = aImage->StartDecoding();
(but indented correctly)
When I apply the patch (without Joe's comment 119 fix) to m-c, and try it as a fix for bug 861965, the phone reboots.
Joe's fix makes sense. I can take a look at comment #121. I didn't check that use case yet. 5mb images might exceed the 24 picture limit. I didn't find an easy way to limit total size, only count.
With Joe's patch, I still crash, with a segfault. Here are what look like the relevant lines from logcat, starting with the last debugging message from the gallery. At this point, the gallery app has successfully loaded and created previews for 3 5mp images, and is just starting to load the 4th image:

E/GeckoConsole( 3219): Content JS LOG at app://gallery.gaiamobile.org/js/metadata_scripts.js:1152 in createThumbnailAndPreview: createThumbnailAndPreview 0/IMG_20121006_171351.jpg
I/Gecko   ( 3075):
I/Gecko   ( 3075): ###!!! [Parent][AsyncChannel] Error: Channel error: cannot send/recv
I/Gecko   ( 3075):
I/Gecko   ( 3075):
I/Gecko   ( 3075): ###!!! [Parent][AsyncChannel] Error: Channel error: cannot send/recv
I/Gecko   ( 3075):
D/memalloc( 3075): /dev/pmem: Freeing buffer base:0x48955000 size:40960 offset:8048640 fd:137
D/memalloc( 3075): /dev/pmem: Freeing buffer base:0x4836b000 size:614400 offset:1847296 fd:127
I/Gecko   ( 3075): [Parent 3075] ###!!! ABORT: actor has been |delete|d: file /Users/djf/mozilla/builds/obj-mc-gonk/ipc/ipdl/PGrallocBufferParent.cpp, line 392
E/Gecko   ( 3075): mozalloc_abort: [Parent 3075] ###!!! ABORT: actor has been |delete|d: file /Users/djf/mozilla/builds/obj-mc-gonk/ipc/ipdl/PGrallocBufferParent.cpp, line 392
F/libc    ( 3075): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1)
I/Gecko   ( 3075): [Parent 3075] WARNING: waitpid failed pid:3255 errno:10: file /Users/djf/mozilla/mozilla-central/ipc/chromium/src/base/process_util_posix.cc, line 260
I/DEBUG   ( 3073): debuggerd committing suicide to free the zombie!
I/DEBUG   ( 3257): debuggerd: Apr 15 2013 10:15:46

And here's an excerpt from dmesg output:

<4>[04-25 20:30:12.421] [3219: Gallery]select 3125 (Usage), adj 6, size 5445, to kill
<4>[04-25 20:30:12.421] [3219: Gallery]send sigkill to 3125 (Usage), adj 6, size 5445
<4>[04-25 20:30:13.632] [3219: Gallery]select 3255 (plugin-containe), adj 6, size 197, to kill
<4>[04-25 20:30:13.632] [3219: Gallery]send sigkill to 3255 (plugin-containe), adj 6, size 197
<4>[04-25 20:30:15.554] [28: kswapd0]select 3133 (Homescreen), adj 4, size 4019, to kill
<4>[04-25 20:30:15.554] [28: kswapd0]send sigkill to 3133 (Homescreen), adj 4, size 4019
<4>[04-25 20:30:16.224] [3219: Gallery]select 3219 (Gallery), adj 2, size 29238, to kill
<4>[04-25 20:30:16.224] [3219: Gallery]send sigkill to 3219 (Gallery), adj 2, size 29238
Yeah, so it sounds like those images are so big that the 24-image limit doesn't help. David, try setting the image decode limit to 1 and run the experiment again. If that works, I can probably make some changes to the patch.
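For anyone reproducing the experiment: the eager-decode cap is just a pref, so it's a one-line change. A sketch, assuming the pref name from the patch under review; the line can go in b2g/app/b2g.js (or a local user.js) on the test build:

```js
pref("image.onload.decode.limit", 1); /* eagerly decode at most 1 image per document */
```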
It looks like reftests are still failing with the fix from comment 119 : https://tbpl.mozilla.org/?tree=Try&rev=7d43ad4a2ed7
You missed the brackets around (!sOnloadDecodeLimit || mImageTracker.Count() < sOnloadDecodeLimit)
Although in the very common case where NS_SUCCEEDED(rv) is true, the missing brackets won't actually change whether the if body is executed...
Not much better with that: https://tbpl.mozilla.org/?tree=Try&rev=7b69b561e4d6. Joe, can you check the tests locally? I'm working blind here.
Yeah, doing so now.
This is no longer a good use of fabrice time. Sorry for giving you a faulty patch to land fabrice. Joe, want to grab this? Of all of us here you are definitely most qualified.
Assignee: fabrice → nobody
Assignee: nobody → joe
Whiteboard: [MemShrink][status: ready to land, see last comment for instructions] → [MemShrink][status: joe working on fixing the patch]
Hooray, I can finally reproduce. Debugging!
The below patch seems to fix it. Hooray! Trying: https://tbpl.mozilla.org/?tree=Try&rev=0fb01134417c

diff --git a/content/base/src/nsDocument.cpp b/content/base/src/nsDocument.cpp
--- a/content/base/src/nsDocument.cpp
+++ b/content/base/src/nsDocument.cpp
@@ -8895,18 +8895,18 @@ nsDocument::AddImage(imgIRequest* aImage
   nsresult rv = NS_OK;

   // If this is the first insertion and we're locking images, lock this image
   // too.
   if (oldCount == 0) {
     if (mLockingImages)
       rv = aImage->LockImage();
-    if (NS_SUCCEEDED(rv) && (sOnloadDecodeLimit ||
-                             mImageTracker.Count() < sOnloadDecodeLimit))
+    if (NS_SUCCEEDED(rv) && (!sOnloadDecodeLimit ||
+                             mImageTracker.Count() < sOnloadDecodeLimit))
       rv = aImage->StartDecoding();
   }

   // If this is the first insertion and we're animating images, request
   // that this image be animated too.
   if (oldCount == 0 && mAnimatingImages) {
     nsresult rv2 = aImage->IncrementAnimationConsumers();
     rv = NS_SUCCEEDED(rv) ? rv2 : rv;
Attachment #741339 - Attachment is obsolete: true
I will test on device after you land on inbound. Thanks!
No longer blocks: 849001
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Whiteboard: [MemShrink][status: joe working on fixing the patch] → [MemShrink]
Target Milestone: --- → mozilla23
This still needs to be uplifted to b2g18.
How does one do such a thing? I haven't had to in the past.
Attached patch b2g18 patchSplinter Review
Attachment #741121 - Attachment is obsolete: true
Depends on: 866947
See https://bugzilla.mozilla.org/show_bug.cgi?id=860818#c45 for the QA analysis study here. Good news is that the original bug filed about "failing to load" Pinterest actually was a wfm, but the issue about stacking too many images to be decoded still occurs with and without this patch. I'm waiting for Justin's input on what's the right followup to take here. Marking verified to indicate this was tested though.
Status: RESOLVED → VERIFIED
Assuming that the relevant process even receives the memory pressure event before it OOMs (this is easy to check in logcat), it sounds like we should consider dropping some or all ongoing decodes on memory pressure, in addition to everything else we're doing. I know we talked about this at Madrid...
Depends on: 868066
(In reply to Andreas Gal :gal from comment #2) > we have to check that bug 807143 doesn't reappear with this fix Bad news. I think bug 807143 did reappear. See bug 868066.
See Also: → 972130