Bug 854799 - need the ability to discard image memory immediately
Status: RESOLVED FIXED
Whiteboard: [MemShrink:P1]
Product: Core
Classification: Components
Component: ImageLib
Version: 18 Branch
Platform: ARM Gonk (Firefox OS)
Importance: -- normal
Target Milestone: mozilla23
Assigned To: Kyle Huey [:khuey] (khuey@mozilla.com)
Depends on: 865929
Blocks: 854783
Reported: 2013-03-25 23:37 PDT by David Flanagan [:djf]
Modified: 2013-05-06 12:12 PDT
CC: 21 users
Tracking flags: tef+, wontfix, wontfix, fixed, fixed, wontfix, fixed


Attachments
link to test case on github (108 bytes, text/html)
2013-03-26 12:51 PDT, David Flanagan [:djf]
Patch (2.17 KB, patch)
2013-04-02 18:00 PDT, Kyle Huey [:khuey] (khuey@mozilla.com)
justin.lebar+bug: review+
patch, backout changeset 0caf5937f8bc due to tp regression (2.45 KB, patch)
2013-04-07 22:17 PDT, John Daggett (:jtd)
roc: review+
Patch (6.07 KB, patch)
2013-04-08 05:31 PDT, Kyle Huey [:khuey] (khuey@mozilla.com)
justin.lebar+bug: review+
Patch for b2g (9.82 KB, patch)
2013-04-11 10:56 PDT, Kyle Huey [:khuey] (khuey@mozilla.com)

Description David Flanagan [:djf] 2013-03-25 23:37:56 PDT
The FirefoxOS gallery app has to handle very large images on a very memory constrained device. This is most difficult when scanning for new images. If images have a sufficiently large EXIF preview, the gallery just uses that and all is well.

But for large images without previews, the app must decode the full-size image in order to create a preview image and a thumbnail image. Decoding a 5mp image takes 20mb of memory.  The app needs to be able to scan these new images quickly.
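As a rough sketch of that arithmetic (the ~4 bytes per pixel figure is an assumption consistent with the numbers quoted, not something stated in the bug):

```javascript
// Rough decode cost: an uncompressed RGBA bitmap needs ~4 bytes per pixel,
// so a 5 MP image decodes to ~20 MB, matching the figure quoted above.
// The 4 bytes/pixel constant is an assumption, not taken from the bug.
function decodedSizeBytes(megapixels) {
  var BYTES_PER_PIXEL = 4; // RGBA
  return megapixels * 1e6 * BYTES_PER_PIXEL;
}
```

Two such images in flight at once roughly doubles the cost, which is what makes back-to-back decodes dangerous on a constrained device.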

But there is no way to force gecko to free image memory immediately.  If the app scans and decodes preview-less images as quickly as it can, it crashes with an OOM, even though it is using a single off-screen image element for all the image decoding.  

In order to prevent the OOM crashes, the app inserts an artificial delay between images.  For 2mp images without previews, I had a 400ms delay.  For 5mp images, I've had to wait more than 3 seconds between images. I'm actually working on a patch (in bug 847060) that squares the number of megapixels, multiplies that by 150ms and pauses for that long before moving on to the next image. It is a horrible heuristic, and won't always be good enough to avoid OOMs on all hardware in all conditions.
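The heuristic as described amounts to the following (a sketch only; the real patch lives in bug 847060 and may differ in detail):

```javascript
// Hedged sketch of the delay heuristic described above: square the
// megapixel count and multiply by 150 ms. For a 5 MP image this gives
// 3750 ms, in line with the "more than 3 seconds" figure.
function scanDelayMs(megapixels) {
  return megapixels * megapixels * 150;
}
```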

I've tried reducing the image.mem.min_discard_timeout_ms pref from 10000 to 200, but it doesn't really help.  I still have to wait more than 1500 ms between images.

So I really need a way to force image memory to be freed, or at least to find out when the image memory has been freed.

Possibly this should be considered a DOM bug instead of an imagelib bug. Feel free to reassign it as appropriate.
Comment 1 Justin Lebar (not reading bugmail) 2013-03-26 02:17:28 PDT
> But there is no way to force gecko to free image memory immediately.

Setting src to "" really should work.  (I know you've tried it.)  Can you please provide a testcase?
Comment 2 David Flanagan [:djf] 2013-03-26 12:51:26 PDT
Created attachment 729727 [details]
link to test case on github

The link is to a github pull request that adds a test case to the UITests app.

To test: pull the PR into a local branch and push to your phone.

Launch the browser to use up some memory and have a realistic load on the phone

Launch the UITests app

Scroll to the bottom and tap on Big Image Decode test

Click on the Start button and watch.

You'll see a counter appear. Each time it increments the app has successfully decoded another 8mp jpeg image. It may increment a few times, but you'll probably see an OOM pretty soon.  If not, then attempting to switch to the browser app will probably cause the OOM.

The first line of test_apps/uitest/js/bigimage.js is |var pause = 0|. If you change that value to 1000, you may find that the OOM will stop happening.  If not, an even higher value may work.  For the gallery app, I need to use over 3000, but in this test case it doesn't seem as bad.

I think that without a sufficiently long pause, the test case is allocating memory for a second image before the memory for the first image has been released. So if the system does not have enough memory for two 8mp images to be decoded at the same time we get an OOM even though we only need to have one image decoded at a time.
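The pacing logic the test case relies on can be sketched like this (a simplified stand-in for test_apps/uitest/js/bigimage.js; `decodeOne` and `schedule` are hypothetical parameters standing in for setting img.src and for setTimeout):

```javascript
// Decode a list of images strictly one at a time, waiting `pause` ms
// between them. In the real page, decodeOne would set img.src on an
// off-screen <img> and schedule would be setTimeout; both are injected
// here so the pacing logic stands on its own.
function scanWithPause(items, pause, decodeOne, schedule, done) {
  var i = 0;
  (function next() {
    if (i >= items.length) {
      if (done) done(i);
      return;
    }
    decodeOne(items[i++]);
    schedule(next, pause);
  })();
}
```

With pause = 0, the next decode is queued before the previous image's memory has been released, which is the double-allocation described above.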

There is a similar test online at http://djf.net/bigimage.html  I can't reproduce the OOM when running that test in the browser app, though.
Comment 3 David Flanagan [:djf] 2013-03-26 12:52:21 PDT
Justin: I asked for your feedback on that attachment just as a way to bring the test case to your attention.
Comment 4 Milan Sreckovic [:milan] 2013-03-26 14:23:10 PDT
These "delay" tactics would probably need to be re-examined once we get to 22 with the multi-threaded image decoding.
Comment 5 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-02 17:59:39 PDT
So this is fun.  img.src = "" works, unless img is not actually in the document ...
Comment 6 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-02 18:00:08 PDT
Created attachment 732616 [details] [diff] [review]
Patch
Comment 7 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-02 18:10:29 PDT
If we're still taking b2g patches this could be a nice win ...
Comment 8 Justin Lebar (not reading bugmail) 2013-04-02 18:38:30 PDT
Comment on attachment 732616 [details] [diff] [review]
Patch

r=me, but is the comment right?  You're talking about images which were "never in the document", but that does not seem to be a necessary condition to hit this bug.
Comment 9 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-02 18:47:04 PDT
Yeah, I should just say "not in the document".
Comment 10 David Flanagan [:djf] 2013-04-02 20:59:31 PDT
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #7)
> If we're still taking b2g patches this could be a nice win ...

It would be *such* a nice win!  Nominating for tef.  

Bug 847060 was tef+ and is the one that introduced the 3 second delay between preview-less images when scanning. If we can uplift this fix, we can remove that delay, which would be really nice for gallery startup.
Comment 11 Justin Lebar (not reading bugmail) 2013-04-02 21:16:09 PDT
I don't think khuey meant to close this?
Comment 12 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-03 06:04:26 PDT
(In reply to Justin Lebar [:jlebar] from comment #11)
> I don't think khuey meant to close this?

Uh, no.  I'm not going to pretend to know what happened there.
Comment 13 Ryan VanderMeulen [:RyanVM] 2013-04-03 15:22:18 PDT
https://hg.mozilla.org/mozilla-central/rev/0caf5937f8bc
Comment 14 Boris Zbarsky [:bz] (Out June 25-July 6) 2013-04-03 23:39:06 PDT
Could this have caused Tp5 regressions?  It's in the ranges for the ones from today....
Comment 15 Ryan VanderMeulen [:RyanVM] 2013-04-04 05:32:57 PDT
(In reply to Boris Zbarsky (:bz) from comment #14)
> Could this have caused Tp5 regressions?  It's in the ranges for the ones
> from today....

Holding off on uplifting this until this is answered.
Comment 16 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-04 06:00:49 PDT
(In reply to Boris Zbarsky (:bz) from comment #14)
> Could this have caused Tp5 regressions?  It's in the ranges for the ones
> from today....

Anything is possible but I think it's pretty unlikely.
Comment 17 Justin Lebar (not reading bugmail) 2013-04-04 07:24:38 PDT
It's possible that a site sets some img.src = '' and then inserts say another image with the old src into the DOM, and we have to re-decode it.

It's not clear to me how we fix that without breaking the b2g use-case.  But you could certainly detect it on tp5 with judiciously-placed printfs.
Comment 18 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-04 07:27:40 PDT
That's possible.  Bug 839103 is in the regression range too though and already claims to regress Dromaeo (comment 30) so I'm going to blame froydnj ;-)
Comment 19 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-04 07:28:49 PDT
Plus if we land this on b2g and see a regression there we'll know it's this.
Comment 20 Boris Zbarsky [:bz] (Out June 25-July 6) 2013-04-04 08:11:04 PDT
Bug 839103 is in some of the mails, but only one or two.  But yes, it's the obvious other candidate.

This patch makes us sync-discard on ~HTMLImageElement.  How expensive is discarding?  Could it be the action of discarding all the images during page unload that matters?  Or is it possible we were keeping the images decoded until Tp looped back to the page again?
Comment 21 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-04 08:12:40 PDT
Justin can correct me if I'm wrong, but discarding appears to be pretty cheap.  Re-decoding, of course, is not.  It may be possible that we were keeping images around until Tp looped, depending on how long that loop takes ...
Comment 22 Justin Lebar (not reading bugmail) 2013-04-04 08:14:08 PDT
I'd be very surprised if discarding itself was slowing us down.
Comment 23 Boris Zbarsky [:bz] (Out June 25-July 6) 2013-04-04 08:29:31 PDT
As a side note, I wonder whether ~HTMLImageElement should in fact force-discard or whether it should be limited to src="".  Especially in cases when there are multiple images pointing to the same src...
Comment 24 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-04 08:32:20 PDT
I'm pretty sure that RequestDiscard doesn't actually discard if other things have the image locked.
Comment 25 Matt Brubeck (:mbrubeck) 2013-04-04 10:30:13 PDT
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #16)
> (In reply to Boris Zbarsky (:bz) from comment #14)
> > Could this have caused Tp5 regressions?  It's in the ranges for the ones
> > from today....
> 
> Anything is possible but I think it's pretty unlikely.

This bug or one of the others from the same push (Bug 849654, Bug 855804) is pretty clearly the culprit for this Windows PGO Tp5 regression:

http://graphs.mozilla.org/graph.html#tests=[[255,131,31]]&sel=1364931977000,1365104777000&displayrange=7&datatype=running
Comment 26 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-04 10:36:39 PDT
(In reply to Matt Brubeck (:mbrubeck) from comment #25)
> This bug or one of the others from the same push (Bug 849654, Bug 855804) is
> pretty clearly the culprit for this Windows PGO Tp5 regression:
> 
> http://graphs.mozilla.org/graph.html#tests=[[255,131,31]]&sel=1364931977000,
> 1365104777000&displayrange=7&datatype=running

That's a non-PGO build for x64 Windows 8, which is like triple unsupported.  I really wish graphserver would show me the other data series though.
Comment 27 Matt Brubeck (:mbrubeck) 2013-04-04 10:55:03 PDT
You can use the links here to view before/after numbers and graphs for other builds:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=tp5&fromchange=d30f993f121f&tochange=0d688b7c4411

For example, this Windows 7 non-PGO graph shows a clear regression where the only pushes in range are yours and mine (and mine was NPOTB on desktop):
http://graphs.mozilla.org/graph.html#tests=[[275,131,12]]&sel=none&displayrange=7&datatype=running

Windows 7 PGO had a similar-sized (5-6%) regression at the same time, though the range is larger because of fewer PGO runs:
http://graphs.mozilla.org/graph.html#tests=[[255,63,12]]&sel=none&displayrange=7&datatype=running
Comment 28 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-04 10:57:00 PDT
Ok so I guess the next question is whether or not comment 20's conjecture is true.
Comment 29 Daniel Coloma:dcoloma 2013-04-06 16:57:02 PDT
Marking the status flags accordingly so this can be uplifted to the appropriate branches
Comment 30 John Daggett (:jtd) 2013-04-07 16:40:08 PDT
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #28)
> Ok so I guess the next question is whether or not comment 20's conjecture is
> true.

Um, maybe this is a naive question but did you do a tryserver push with/without the patch?  Was there a difference in the perf numbers?

The Win7 opt tp numbers show a 5% regression between 798d734dc32a and 6bcb665562b0, so I pushed some tryserver builds to try and narrow down the exact culprit.

changeset:   127495:6bcb665562b0
user:        Kyle Huey <khuey@kylehuey.com>
date:        Wed Apr 03 09:57:11 2013 -0700
summary:     Bug 857186: Make virtualenv paths relative. r=gps

changeset:   127494:b436cad50c33
user:        Kyle Huey <khuey@kylehuey.com>
date:        Wed Apr 03 09:54:35 2013 -0700
summary:     Bug 849654: Kill nsRefPtrHashtableMT. r=bsmedberg

changeset:   127493:0caf5937f8bc
user:        Kyle Huey <khuey@kylehuey.com>
date:        Wed Apr 03 09:52:25 2013 -0700
summary:     Bug 854799: Make image.src='' discard the image immediately even if the image is not in the document. r=jlebar
https://tbpl.mozilla.org/?tree=Try&rev=a4d42bd6dc9a

changeset:   127492:ef873e1fb7e9
user:        Kyle Huey <khuey@kylehuey.com>
date:        Wed Apr 03 09:49:17 2013 -0700
summary:     Bug 855804: Add hashtable helpers for cycle collection. r=bz
https://tbpl.mozilla.org/?tree=Try&rev=02f68ce1b761

changeset:   127491:798d734dc32a
user:        Jan de Mooij <jdemooij@mozilla.com>
date:        Wed Apr 03 18:09:26 2013 +0200
summary:     No bug - Fix merge conflict on a CLOSED TREE. r=red
https://tbpl.mozilla.org/?tree=Try&rev=bda444ebb97d
Comment 31 John Daggett (:jtd) 2013-04-07 18:32:38 PDT
This patch is the cause of the 5% regression in tp5 that took place on 4/3:

without patch (798d734dc32a): https://tbpl.mozilla.org/?tree=Try&rev=bda444ebb97d
winxp opt: 347.48 win7 opt: 362.18 win8 opt: 250.23

with patch (0caf5937f8bc): https://tbpl.mozilla.org/?tree=Try&rev=a4d42bd6dc9a
winxp opt: 370.55 win7 opt: 381.64 win8 opt: 264.79

http://graphs.mozilla.org/graph.html#tests=[[255,131,12]]&sel=1364997395719.561,1365027600597.6099&displayrange=7&datatype=running

note: the changeset in between changesets 798d734dc32a and 0caf5937f8bc has similar tp results to 798d734dc32a.

I think we should back this out until the performance regression is resolved.
Comment 32 John Daggett (:jtd) 2013-04-07 18:39:59 PDT
(In reply to Boris Zbarsky (:bz) from comment #20)

> This patch makes us sync-discard on ~HTMLImageElement.  How expensive is
> discarding?  Could it be the action of discarding all the images during page
> unload that matters?  Or is it possible we were keeping the images decoded
> until Tp looped back to the page again?

The current Tp test suite loads on the same page repeatedly rather than cycling through the other pages before loading the same page again.  Which makes it very sensitive to whether something is cached or already decoded or not.
Comment 33 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-07 18:42:25 PDT
(In reply to John Daggett (:jtd) from comment #32)
> (In reply to Boris Zbarsky (:bz) from comment #20)
> 
> > This patch makes us sync-discard on ~HTMLImageElement.  How expensive is
> > discarding?  Could it be the action of discarding all the images during page
> > unload that matters?  Or is it possible we were keeping the images decoded
> > until Tp looped back to the page again?
> 
> The current Tp test suite loads on the same page repeatedly rather than
> cycling through the other pages before loading the same page again.  Which
> makes it very sensitive to whether something is cached or already decoded or
> not.

Mmm.  This is just an artifact of the way Talos measures then.
Comment 34 Justin Lebar (not reading bugmail) 2013-04-07 19:56:44 PDT
> The current Tp test suite loads on the same page repeatedly rather than cycling through
> the other pages before loading the same page again.  Which makes it very sensitive to
> whether something is cached or already decoded or not.

Given this, it sounds to me pretty unlikely we'd want to do anything differently here.  What do you think?

In any case, this bug should stay FIXED unless it's actually backed out.
Comment 35 John Daggett (:jtd) 2013-04-07 21:55:11 PDT
roc, wondering if you think we should take the 5% tp regression here or not?
Comment 36 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2013-04-07 22:03:14 PDT
I wouldn't have thought so. Reloading the current page (or part of the current page) is common.

Shouldn't this bug be prevented by having a memory pressure signal which flushes the image cache?
Comment 37 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2013-04-07 22:04:23 PDT
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #36)
> Shouldn't this bug be prevented by having a memory pressure signal which
> flushes the image cache?

Flushes decoded image data, of course.
Comment 38 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2013-04-07 22:06:11 PDT
Another approach would be to have a way to decode an image directly to a thumbnail. Not b2g18 material of course, but it could be a lot more efficient than decoding to a full-size image.
Comment 39 John Daggett (:jtd) 2013-04-07 22:17:31 PDT
Created attachment 734473 [details] [diff] [review]
patch, backout changeset 0caf5937f8bc due to tp regression
Comment 40 John Daggett (:jtd) 2013-04-07 22:50:19 PDT
Backout changeset 0caf5937f8bc pushed to mozilla-inbound
https://hg.mozilla.org/integration/mozilla-inbound/rev/6286b4578f14
Comment 41 David Flanagan [:djf] 2013-04-07 23:15:25 PDT
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #38)
> Another approach would be to have a way to decode an image directly to a
> thumbnail. Not b2g18 material of course, but it could be a lot more
> efficient than decoding to a full-size image.

That's the topic of bug 854795, which I'd love to have your thoughts on. I'm told that our jpeg library has the ability to do this, so we'd just need to hook it up somehow.
Comment 42 David Flanagan [:djf] 2013-04-07 23:42:28 PDT
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #36)
> I wouldn't have thought so. Reloading the current page (or part of the
> current page) is common.
> 
> Shouldn't this bug be prevented by having a memory pressure signal which
> flushes the image cache?

On FirefoxOS devices, the images I'm concerned with require something like 15% of available memory to decode. And during the scan process, I need to decode one right after the other. Because I'm using memory in such big chunks, I'm not convinced that a memory pressure model will do the right thing.

Could Kyle's patch land if it was behind a pref?  Or if it was modified so it only discarded large images immediately?  (Perhaps where the definition of large was a pref?)

If I'm understanding the regression correctly, this patch is causing images to be re-decoded when a document is reloaded. Is there any way the patch could be modified so that it only applies to images that have never been in the document? (See comments 8&9).  That would do the right thing for the offscreen images I use, but wouldn't affect reloading.
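The "only discard large images immediately" idea could look something like this (purely illustrative; the threshold pref and all names here are hypothetical, and nothing like this landed in the bug):

```javascript
// Discard decoded data immediately only for images above a size
// threshold, imagined here as a pref. All names are hypothetical.
var LARGE_IMAGE_THRESHOLD_PX = 1e6; // "large" = 1 megapixel, an assumed value
function shouldDiscardImmediately(widthPx, heightPx) {
  return widthPx * heightPx >= LARGE_IMAGE_THRESHOLD_PX;
}
```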
Comment 43 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2013-04-07 23:51:52 PDT
(In reply to David Flanagan [:djf] from comment #42)
> Could Kyle's patch land if it was behind a pref?  Or if it was modified so
> it only discarded large images immediately?  (Perhaps where the definition
> of large was a pref?)

That sounds like a good idea. Megapixel images are relatively rare on Web sites AFAIK.

> If I'm understanding the regression correctly, this patch is causing images
> to be re-decoded when a document is reloaded. Is there any way the patch
> could be modified so that it only applies to images that have never been in
> the document? (See comments 8&9).  That would do the right thing for the
> offscreen images I use, but wouldn't affect reloading.

That might affect real sites that construct the page programmatically.
Comment 44 Boris Zbarsky [:bz] (Out June 25-July 6) 2013-04-08 05:09:43 PDT
What if we modified the patch to _only_ discard on actual explicit src="" sets (as in, sets to the empty string)?  Would that be sufficient for b2g for the moment until we can do something smarter?
Comment 45 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-08 05:23:47 PDT
I guess this patch does effectively revert bug 791731.

*sigh*
Comment 46 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-08 05:31:18 PDT
Created attachment 734571 [details] [diff] [review]
Patch

Like every other problem in nsImageLoadingContent, we can solve this with more flags.
Comment 47 Justin Lebar (not reading bugmail) 2013-04-08 06:42:06 PDT
> Shouldn't this bug be prevented by having a memory pressure signal which flushes the image cache?

On most B2G devices, we have a memory-pressure signal which flushes the decoded image cache.  It is, however, an emergency measure that we can't rely on.

The problem is that large malloc()'s don't return null; they cause the kernel to kill some process (perhaps ours).  So we have to know if the malloc() is going to fail before we call it.

We can choose when we get a low-memory notification, but we only get one notify threshold, so that doesn't help us much with this sort of thing.

We could instead try to read the amount of free memory on the system before making a big allocation, but that doesn't take into account the fact that this process may or may not be the one to be killed when we run out of memory.  It also doesn't take into account race conditions in reading the value, and it doesn't protect us from getting close to the limit and then some other non-imagelib module (or another process altogether) OOMing us.
Comment 48 Justin Lebar (not reading bugmail) 2013-04-08 06:49:07 PDT
Not to be contrarian, but I think that the old patch's behavior was almost correct.

Although refreshing the page may be a common operation compared to the universe of all possible operations one can take in a browser, it's much less common than navigating to a new page.

Given a choice, we should optimize for navigating between pages, not for refreshing pages.  Therefore, I think we should throw out all images immediately when we navigate to a new page.  Keeping them around for 20s is just asking for excessive memory usage.  I am very happy to take a hit on TP5 to get this behavior; the problem is the benchmark, not our code.

> I guess this patch does effectively revert bug 791731.

Of course we don't want to regress bug 791731.  Is there a way to tell the difference between "document goes away" and "image loading content goes away, but document stays alive"?
Comment 49 David Flanagan [:djf] 2013-04-08 07:18:43 PDT
(In reply to Boris Zbarsky (:bz) from comment #44)
> What if we modified the patch to _only_ discard on actual explicit src=""
> sets (as in, sets to the empty string)?  Would that be sufficient for b2g
> for the moment until we can do something smarter?

That would more than suffice for the Gallery app.
Comment 50 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-08 08:07:06 PDT
(In reply to Justin Lebar [:jlebar] from comment #48)
> Not to be contrarian, but I think that the old patch's behavior was almost
> correct.
> 
> Although refreshing the page may be a common operation compared to the
> universe of all possible operations one can take in a browser, it's much
> less common than navigating to a new page.
> 
> Given a choice, we should optimize for navigating between pages, not for
> refreshing pages.  Therefore, I think we should throw out all images
> immediately when we navigate to a new page.  Keeping them around for 20s is
> just asking for excessive memory usage.  I am very happy to take a hit on
> TP5 to get this behavior; the problem is the benchmark, not our code.

FWIW, I agree.

> > I guess this patch does effectively revert bug 791731.
> 
> Of course we don't want to regress bug 791731.  Is there a way to tell the
> difference between "document goes away" and "image loading content goes
> away, but document stays alive"?

Not easily.  Since cycle collection doesn't order dtors the best we could do, even with silly things like weakrefs, is "image loading content goes away, but document hasn't gone away yet".

(In reply to David Flanagan [:djf] from comment #49)
> (In reply to Boris Zbarsky (:bz) from comment #44)
> > What if we modified the patch to _only_ discard on actual explicit src=""
> > sets (as in, sets to the empty string)?  Would that be sufficient for b2g
> > for the moment until we can do something smarter?
> 
> That would more than suffice for the Gallery app.

This is what the latest patch does, more or less.
Comment 51 Justin Lebar (not reading bugmail) 2013-04-08 08:15:08 PDT
> Since cycle collection doesn't order dtors the best we could do, even with silly things like 
> weakrefs, is "image loading content goes away, but document hasn't gone away yet".

Perhaps we could throw away the images when the document gets detached from its docshell?

But that's getting outside the scope of this bug.
Comment 52 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-08 08:16:24 PDT
(In reply to Justin Lebar [:jlebar] from comment #51)
> Perhaps we could throw away the images when the document gets detached from
> its docshell?
> 
> But that's getting outside the scope of this bug.

Well that doesn't solve the refresh problem does it?  Won't the previous document get detached before the new copy gets attached?

But yes, we're creeping far away from this bug.
Comment 53 Justin Lebar (not reading bugmail) 2013-04-08 08:17:03 PDT
Comment on attachment 734571 [details] [diff] [review]
Patch

> +  void ClearCurrentRequest(nsresult aReason, uint32_t aFlags = REQUEST_DISCARD);
> +  void ClearPendingRequest(nsresult aReason, uint32_t aFlags = REQUEST_DISCARD);

Flags that default to 0 are fine, but defaulting flags to REQUEST_DISCARD is asking for trouble, I think.
Comment 54 Justin Lebar (not reading bugmail) 2013-04-08 08:17:57 PDT
> Well that doesn't solve the refresh problem does it?

Indeed, but per comment 48 I don't think that's something we should worry about.
Comment 55 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-08 08:19:46 PDT
(In reply to Justin Lebar [:jlebar] from comment #53)
> Comment on attachment 734571 [details] [diff] [review]
> Patch
> 
> > +  void ClearCurrentRequest(nsresult aReason, uint32_t aFlags = REQUEST_DISCARD);
> > +  void ClearPendingRequest(nsresult aReason, uint32_t aFlags = REQUEST_DISCARD);
> 
> Flags that default to 0 are fine, but defaulting flags to REQUEST_DISCARD is
> asking for trouble, I think.

It makes sense though because that's what all the current callers want.  I could invert the sense of the flags here ... flags suck :-/
Comment 56 Justin Lebar (not reading bugmail) 2013-04-08 08:25:14 PDT
> It makes sense though because that's what all the current callers want.

Explicit is better than implicit?

I suppose you could invert the direction of these flags, but then we'd have a mismatch between this flag and the nsIDocument::REQUEST_DISCARD one.

> flags suck :-/

No argument here...  :)
Comment 57 Boris Zbarsky [:bz] (Out June 25-July 6) 2013-04-08 10:37:41 PDT
So the thing is.... when navigating between pages we want to not redecode the images the pages share, right?
Comment 58 Justin Lebar (not reading bugmail) 2013-04-08 12:05:09 PDT
(In reply to Boris Zbarsky (:bz) from comment #57)
> So the thing is.... when navigating between pages we want to not redecode
> the images the pages share, right?

If we want to avoid regressing TP, yes.  But it seems to me that this became the current behavior by accident.
Comment 59 Boris Zbarsky [:bz] (Out June 25-July 6) 2013-04-08 13:14:50 PDT
It seems like if we want to avoid regressing common browsing behavior, we want to avoid such redecodes...
Comment 60 Justin Lebar (not reading bugmail) 2013-04-08 13:22:53 PDT
(In reply to Boris Zbarsky (:bz) from comment #59)
> It seems like if we want to avoid regressing common browsing behavior, we
> want to avoid such redecodes...

Well, it's a trade-off.  Right now when I leave a page with lots of images, my memory usage doesn't go down for 20s or so, and that's bad too.

Perhaps we can have a hack where we throw away decoded images when we navigate between two pages of different origins, but we keep them otherwise.  Might that be satisfactory?
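The cross-origin heuristic floated here reduces to an origin comparison on navigation (a sketch of the proposal only, not shipped behavior):

```javascript
// Keep decoded images across same-origin navigations, drop them when the
// origin changes. Uses the standard URL parser for the comparison; the
// policy itself is just the proposal above.
function keepDecodedImages(fromUrl, toUrl) {
  var a = new URL(fromUrl);
  var b = new URL(toUrl);
  return a.protocol === b.protocol && a.host === b.host;
}
```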
Comment 61 Boris Zbarsky [:bz] (Out June 25-July 6) 2013-04-08 13:25:08 PDT
Or perhaps not discarding until the onload of the new page?

Or both?
Comment 62 Justin Lebar (not reading bugmail) 2013-04-08 13:33:22 PDT
> Or perhaps not discarding until the onload of the new page?

The problem is, by that point the images on the new page have finished loading, and we might have OOMed the content process on B2G as a result.

Moving images into discardable memory on B2G would help, if that's feasible.  So would lowering the limit on how much unlocked decoded data we keep around; that would help avoid pathological cases on desktop, although it doesn't help much for B2G.
Comment 63 Ryan VanderMeulen [:RyanVM] 2013-04-08 17:19:18 PDT
https://hg.mozilla.org/mozilla-central/rev/6286b4578f14
Comment 64 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2013-04-09 04:30:56 PDT
(In reply to Justin Lebar [:jlebar] from comment #60)
> Well, it's a trade-off.  Right now when I leave a page with lots of images,
> my memory usage doesn't go down for 20s or so, and that's bad too.

It's not bad if you're on desktop and have gobs of memory. In that case, hanging onto decoded images is definitely the right thing to do.

Waiting until the docshell's new page has finished onload sounds good. But as you say, it's still OOM-risky if the new page ends up not sharing images with the old page. I think making the right decisions will sometimes require us to be aware of memory pressure and there's no way around that. It sounds like we need better machinery for that.
Comment 65 Justin Lebar (not reading bugmail) 2013-04-09 07:27:03 PDT
> It's not bad if you're on desktop and have gobs of memory. In that case, hanging onto decoded images 
> is definitely the right thing to do.

If you're on desktop /and/ you have gobs of memory, it probably doesn't matter much what we do, because you probably have a fast, many-core machine, and with multithreaded image decoding, we can likely re-decode the images very quickly.

> I think making the right decisions will sometimes require us to be aware of memory pressure and 
> there's no way around that.

I've written extensively about the challenges with this approach, but unfortunately I can't find my most recent mail.  So to summarize:

If we change our behavior based on the amount of free memory on the machine -- e.g. if we drop unnecessary images and bfcache when memory gets low -- then we're allowing the other processes the user is running on the machine to change our behavior.  In essence we become the nicest process on the machine, trying never to push memory into the page file.

This might be a good strategy, and it might not be.  You can imagine machines that are chronically low on memory where this strategy would essentially mean that we disable bfcache for the user, or where we end up re-decoding lots of images.  Both of these would be especially painful because presumably the machine is slow and also loaded with processes sucking up the CPU.  Maybe letting the OS page something out would be better than taking this approach.  (Or maybe not; we are not in a position to judge.)

Separately, low-memory notifications are difficult because you have to fire them early enough that you have time to react (e.g. a GC takes a while), but not so early that you end up dropping things unnecessarily.  And you can't react synchronously to memory pressure; you have to wait for the main thread event loop to finish what it's doing.  Otherwise e.g. any call to malloc() can trigger a gc, and that is craziness.

I've experimented with this stuff on Windows -- we have the ability to intercept VirtualAlloc calls, read the amount of free memory on the system, and trigger a memory-pressure notification at the next possible moment which triggers a gc/cc and drops images and bfcache -- but I wasn't able to overcome these challenges, so this code remains mostly unused.
Comment 66 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-09 09:52:02 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/5943881c5298
Comment 67 Ryan VanderMeulen [:RyanVM] 2013-04-09 17:10:15 PDT
https://hg.mozilla.org/mozilla-central/rev/5943881c5298
Comment 68 Ryan VanderMeulen [:RyanVM] 2013-04-10 05:17:00 PDT
This has some conflicts on b2g18 that I'm not comfortable resolving. Please post a branch-specific patch.
Comment 69 Ryan VanderMeulen [:RyanVM] 2013-04-11 08:59:06 PDT
Pretty please?
Comment 70 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-11 10:56:00 PDT
Created attachment 736375 [details] [diff] [review]
Patch for b2g
Comment 72 David Flanagan [:djf] 2013-04-16 00:03:50 PDT
This patch does not solve the issue I'm having with the Gallery app.  If I wait 2-3 seconds after processing each 5mp image, I generally do not crash. With this patch I shouldn't need to wait, but when I remove the timeouts, I get an OOM.  I've tested this with nightly unagi builds of b2g18-1.0.1 and b2g18.

With a 3000 ms timeout I don't crash (at least not right away). Without it I do.

Kyle: any idea what might be happening here?  Does your patch include some kind of pref that I need to set in order to enable it?
Comment 73 David Flanagan [:djf] 2013-04-16 00:06:36 PDT
It doesn't look to me like there are any prefs to set... Just double checking: setting img.src = '' is still the official way to release the image memory, right?
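For reference, the pattern under discussion is: decode images strictly one at a time through a single off-screen element, and release the decoded buffer (by setting `img.src = ''`) before starting the next decode, so at most one full-size decode is alive at once. A rough, testable sketch of that scan loop, with the decode/release steps injected (in the real gallery app they would wrap the off-screen `<img>` element; the function names here are illustrative, not Gaia's actual code):

```javascript
// Sketch of sequential scanning with explicit release: process files one at
// a time, releasing each decoded image before decoding the next, so peak
// memory is bounded by a single full-size decode. decode() and release()
// are injected so the loop is testable; in the gallery, decode would set
// img.src to a blob URL and wait for load, and release would set img.src = ''.
function scanSequentially(files, decode, release) {
  return files.reduce(
    (chain, file) =>
      chain
        .then(() => decode(file))     // one decode in flight at a time
        .then((decoded) => {
          // ...generate the preview and thumbnail from `decoded` here...
          release(decoded);           // drop the decoded memory immediately
        }),
    Promise.resolve()
  );
}
```

Whether the release actually frees memory promptly is exactly what this bug is about; the loop structure only guarantees that the app never asks Gecko to hold more than one decode at a time.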
Comment 74 Justin Lebar (not reading bugmail) 2013-04-16 01:04:45 PDT
Come find me (I'm on IRC) and let's debug this f2f.
Comment 75 David Flanagan [:djf] 2013-04-16 10:22:53 PDT
Justin: I'm not at the Madrid work week, so I can't work on this with you f2f.

As a test case, see my pull request here: https://github.com/mozilla-b2g/gaia/pull/9216  It removes the setTimeout() idle time after scanning a large image and inserts a bunch of console.log() statements.

Combine that with the 5 images attached to bug 847060. (Make a couple of copies of each one and put them on your sdcard.) You should see the gallery OOM while trying to scan them.  (You can always run adb shell rm -r /data/local/indexedDB/*gallery* to clear the gallery data and force the scan to restart from the beginning.)
Comment 76 Justin Lebar (not reading bugmail) 2013-04-17 08:42:50 PDT
> Justin: I'm not at the Madrid work week, so I can't work on this with you f2f.

So I discovered!  Sorry to hear that.

khuey should work with you on this.
Comment 77 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-17 16:03:19 PDT
Ok, so I still see some weirdness here.  On my Windows desktop, running the test case with this patch still leaves a large buildup of memory somewhere outside the heap, so something is mmapping a bunch of memory.  I'll dig in more tonight.
Comment 78 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-18 11:27:26 PDT
Alright, what I'm seeing appears to be a Windows-specific problem.  So unfortunately I can't immediately help with comment 75 :-(
Comment 79 Justin Lebar (not reading bugmail) 2013-04-18 15:14:12 PDT
We should file a separate bug on the Windows issue, which looks really scary.

Can we get you a b2g device?  Or otherwise, can you steal someone's Mac?  If not, we need to ask for someone else's help here.  I just feel bad pulling someone off other B2G work when you're free for the picking...
Comment 80 Johnny Stenback (:jst, jst@mozilla.com) 2013-04-19 07:27:05 PDT
Reopening as this did not fix the b2g issue reported in bug 861965.
Comment 81 Milan Sreckovic [:milan] 2013-04-22 10:38:02 PDT
Kyle, what do you think?
Comment 82 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-04-22 10:39:12 PDT
I haven't been able to steal another machine yet.
Comment 83 Andreas Gal :gal 2013-04-25 17:17:10 PDT
I think we are hanging on to the image in the canvas image cache. This patch is correct, it just didn't fix all of it. I will file a blocking bug.