Closed Bug 1166136 Opened 5 years ago Closed Last year

Too many images are locked in the B2G Gallery app

Categories

(Core :: Layout: Images, Video, and HTML Frames, defect, P2)

defect

Tracking

()

RESOLVED WONTFIX
blocking-b2g 2.5+

People

(Reporter: seth, Assigned: seth)

References

(Depends on 1 open bug)

Details

(Whiteboard: [platform])

Attachments

(1 file)

I haven't had a chance to investigate in detail, but examining about:memory reports from the B2G Gallery app shows pretty clearly that we're locking *way* too many images. Something seems to be going wrong with the thumbnails in particular.

We need to get this fixed, as it's causing the Gallery app to use way too much memory with image locking turned on.
Depends on: 1168985
David, I can resolve this issue by modifying the CSS for the |thumbnail| class as follows:

- Set "height: auto" instead of "height: 0px".
- Set "padding-top: 0px", and the same for the other padding values.

It looks like the Gallery's approach of having each thumbnail div be 0px in height, and displaying each <img> element in the padding area, is triggering a bug in our image visibility code.

I'm going to try to fix this on the Gecko end, but I wanted to make you aware of this ASAP (hence the needinfo) so you can start contemplating what a Gaia fix might look like, if we end up needing to go that route.
Flags: needinfo?(dflanagan)
Thank you, Seth. That's a great find. I'll start looking into that right away. Leaving needinfo set so I don't forget.
The padding-top seems to be an old workaround for the fact that we didn't have vh units supported yet.
I can't get the thumbnails to work right with height:auto. (Maybe I just couldn't figure out "the other padding values" you were talking about)

But I've had success with vh units.  The patch I'm trying is below. Does it work for you, Seth? I can't tell for sure whether this is working. It seems like maybe it is a little harder to get the app to display gray thumbnails, but it is still possible. (I'm testing on master with 300 test images. I haven't tried 2.2 yet).

Here's my patch. I assume you'd like me to open a new bug for this gaia patch so you can use this one for the underlying gecko fix, right?

diff --git a/apps/gallery/style/gallery.css b/apps/gallery/style/gallery.css
index 99320d5..f2759ea 100644
--- a/apps/gallery/style/gallery.css
+++ b/apps/gallery/style/gallery.css
@@ -261,9 +261,8 @@ body.thumbnailListView #thumbnails > li:last-child .thumbnail:last-child {
    * Padding-bottom is used here in order to set div height with respect
    * to device width.
    */
-  width: calc((100% - 0.2rem) / 3 );
-  height: 0;
-  padding-bottom: calc((100% - 0.2rem) / 3 );
+  width: calc((100vw - 0.2rem) / 3 );
+  height: calc((100vw - 0.2rem) / 3 );
 }
 
 .thumbnail:nth-child(3n) {
@@ -287,8 +286,8 @@ html[dir=rtl] #thumbnails-number-selected {
 @media (orientation: landscape) {
   /* 4 thumbnails for each row in landscape mode. */
   .thumbnail {
-    width: calc((100% - 0.3rem) / 4);
-    padding-bottom: calc((100%  - 0.3rem) / 4);
+    width: calc((100vw - 0.3rem) / 4);
+    height: calc((100vw - 0.3rem) / 4);
   }
   .thumbnail:nth-child(3n) {
     -moz-margin-end: 0.1rem;
Flags: needinfo?(dflanagan) → needinfo?(seth)
Also setting needinfo for Dominic, so that he's aware of this issue while converting the Music app to use <img> instead of background-image.  I think I may have seen some CSS in the music app that uses this same padding-bottom hack instead of vw units.
Flags: needinfo?(dkuo)
See Also: → 1169341
I've filed bug 1169341 for the required Gallery app changes to workaround this issue.
(In reply to David Flanagan [:djf] from comment #3)
> The padding-top seems to be an old workaround for the fact that we didn't
> have vh units supported yet.
> I can't get the thumbnails to work right with height:auto. (Maybe I just
> couldn't figure out "the other padding values" you were talking about)

I just meant that I set the padding on all sides to 0px.
 
> But I've had success with vh units.  The patch I'm trying is below. Does it
> work for you, Seth? I can't tell for sure whether this is working. It seems
> like maybe it is a little harder to get the app to display gray thumbnails,
> but it is still possible. (I'm testing on master with 300 test images. I
> haven't tried 2.2 yet).

Checking for gray thumbnails will make evaluating the fix way too hard. =)

To evaluate the fix, what I've been doing is letting all the images in the Gallery get scanned (I'm using 300, just like you), then force killing the Gallery app and relaunching it. I then wait for all of the thumbnails to get loaded, scroll just slightly down and back to the top (since there seems to be another bug that scrolling helps mitigate; we'll fix that separately), and then use |tools/get_about_memory.py| to take a memory report.

At that point, you can check whether the fix worked by going to the bottom of the report for the Gallery app and comparing the two numbers "imagelib-surface-cache-estimated-locked" and "imagelib-surface-cache-estimated-total". If the two numbers are pretty similar, the fix did _not_ work. "imagelib-surface-cache-estimated-locked" should be much smaller.

> Here's my patch. I assume you'd like me to open a new bug for this gaia
> patch so you can use this one for the underlying gecko fix, right?

Yeah. =) Looks like you've already done so. It'd be good to verify that your patch fixes the problem, using the methodology above.

I'm still investigating the underlying Gecko issues. What I've determined so far is that nsLayoutUtils::GetDisplayPort() is returned a *huge* value sometimes in the Gallery app - on the order of 5000 CSS pixels. We expand that to almost 10,000 CSS pixels when calculating the region in which images should be considered visible. It's no surprise that that ends up picking up every single thumbnail.

I'll continue to investigate and try to get to the bottom of this today.

Additionally, as I mentioned above, there continues to be an issue where we fail to unlock images when they become non-visible through indirect means. I'll keep investigating that in bug 1168985.
Flags: needinfo?(seth)
(In reply to David Flanagan [:djf] from comment #4)
> Also setting needinfo for Dominic, so that he's aware of this issue while
> converting the Music app to use <img> instead of background-image.  I think
> I may have seen some CSS in the music app that uses this same padding-bottom
> hack instead of vw units.

The v2.2 patch I have in bug 1168549, I didn't change the current padding-bottom hack, after I changed the background-image to <img>, it still works. I guess that's because the tile's width/height directly use 66% without calc()? but if vw unit is the right way to apply width/height styles on squares, maybe music should also consider to use it.
Flags: needinfo?(dkuo)
Bug 1161734 was the original "blank and flashing thumbanils" bug. It was 2.2+. And it was closed after the "flashing" part was resolved. But we still have the "blank thumbnails" part. This bug is the root cause of that blank thumbnails issues that we had already decided was blocking.

So we should really block on this.  See also https://bugzilla.mozilla.org/show_bug.cgi?id=1164164#c19 where Milan suggests the same thing.
blocking-b2g: --- → 2.2?
I need to tweak my instructions for checking whether your patch works. There are actually two cases:

- When testing memory usage after rescanning all images, it's worth checking about:memory immediately after the scanning is done (without scrolling at all) and then after scrolling slightly down and backup, as I mentioned in comment 6.

- When testing memory usage on a clean launch of the Gallery app when there are no new images to scan, the best way to test is to scroll all the way to the bottom and then check about:memory.

Sorry for the confusion. There are multiple bugs in existence here at the same time, so we need to check for multiple things, unfortunately.

The good news is that I think I've identified all of the issues and a series of Gecko patches that should fix them all is coming up shortly.
Depends on: 1169879
Depends on: 1169880
No longer depends on: 1168985
Depends on: 1169881
David, it appears to me that this is fixed on central. I've requested uplift of all the patches to 2.2. I think we can resolve this bug now, right?
Flags: needinfo?(dflanagan)
When I test the gallery app in today's nightly master build with ddd enabled, I find:

- the thumbnail creation failures (bug 1164164 and bug 1163225) are gone

- the blank gray thumbnail bug (bug 1161734) is gone. No matter how much I scroll around, I always see thumbanils, even if it sometimes takes a while before they appear.

I've tested with 500 images on my device.

If I set image.downscale-during-decode.enabled to false (as you suggested), and test again I find:

- bugs 1164164 and 1163225 are still fixed

- the blank thumbnail bug still occurs.  With my 500 thumbnails, if I scroll quickly enough that APZ kicks in, then I will very easily get into a persistent missing thumbnail state.  If I scroll only a page at a time without every flinging and activating APZ, then I can scroll through all 500 thumbnails and back with no problem. But as soon as I start scrolling really quickly, I get blank gray squares that never resolve. When I do an about:memory report, I see that most of the images are not locked. But for whatever reason, the unlocked ones are still not being evicted from the cache, I guess, and we're running out of image memory.  Here are the relevant lines from about:memory

  6.98 MB ── imagelib-surface-cache-estimated-locked
 26.71 MB ── imagelib-surface-cache-estimated-total

I'm hoping that this just means that one of your patches is somehow being disabled when I change the pref and that you can fix this easily. Or that I won't have this same issue when your patches are uplifted to 2.2. 

The subject of this bug is "too many images are locked". If we interpret that narrowly, then I think you can resolve this bug. I've been treating this bug as the new bug 1161734, however, and have been assuming that it is about the blank thumbnail issue in the gallery app. With that broader interpretation then we shouldn't close it. If you do close it now, then we should re-open 1161734, I think.
Flags: needinfo?(dflanagan)
(In reply to David Flanagan [:djf] from comment #11)
> If I set image.downscale-during-decode.enabled to false (as you suggested),
> and test again I find:
> 
> - bugs 1164164 and 1163225 are still fixed
> 
> - the blank thumbnail bug still occurs.  With my 500 thumbnails, if I scroll
> quickly enough that APZ kicks in, then I will very easily get into a
> persistent missing thumbnail state.  If I scroll only a page at a time
> without every flinging and activating APZ, then I can scroll through all 500
> thumbnails and back with no problem. But as soon as I start scrolling really
> quickly, I get blank gray squares that never resolve.

That is *very* odd. Especially the fact that DDD makes a difference. Up to now I've been testing with 300 images; I'll try testing with 500 and see if I can reproduce this.

> When I do an
> about:memory report, I see that most of the images are not locked. But for
> whatever reason, the unlocked ones are still not being evicted from the
> cache, I guess, and we're running out of image memory. 

Based on the numbers you posted, it doesn't look like we're running out. (At least, we weren't running out when you took the memory report.) Only the locked image memory really matters, because we can't discard locked images if we're low on memory and need to allocate something new. Unlocked images can be discarded in that situation - i.e., we can throw away that memory whenever we need it for something else.

> I'm hoping that this just means that one of your patches is somehow being
> disabled when I change the pref and that you can fix this easily.

That's definitely not the case, unfortunately. That pref has no effect on what these patches do.

> Or that I
> won't have this same issue when your patches are uplifted to 2.2. 

Hard to know. Hopefully we can get them uplifted ASAP and actually test this.

> The subject of this bug is "too many images are locked". If we interpret
> that narrowly, then I think you can resolve this bug.

I'd prefer to treat it in the narrow sense, as we're dealing with a specific platform-level issue in this bug. It may be that there's another issue, and we should probably fix it separately.

I'm going to see if I can reproduce the blank thumbnails with 500 images, and check whether it seems likely that image locking is to blame. If not, I think we should still probably resolve this, and file a new bug. (And we can reopen bug 1161734.)
Some more data points:

I actually see this issues with just 300 images on master with ddd disabled.

And in 2.1, I can scan and scroll through 500 images without problems. So even though we're doing much better than we were on the blank thumbnails issue, we've still got a regression here.
In comment 13, I was testing with a 319mb flame.  If I switch to 512mb, then 300 images seems to work okay. But with 400, I saw the problem again a couple of times. Very hard to reproduce, though. With 500 images, it also reproduces, but not often.

It looks like I also managed to bring back the thumbnail creation failure by scrolling quickly up and down while the app was scanning for new images. Previously, this would have caused many, many failed thumbnails. In this case, I only got one, so that is a big improvement.

Seth said over IRC that during APZ scrolling all images are locked, so we're speculating that while scrolling quickly the surface cache fills up wiht locked images and there is no room to decode any new images. Also, once an image has failed to decode once, it seems that gecko does not try again, so any thumbnails that fail to decode once remain blank for the app's lifetime, it seems.
The issue is easy to reproduce with a flame configured to 400mb and 500 images.
Depends on: 1170680
OK, I believe the remaining issues should be resolved by bug 1170680, which was just pushed to inbound. I'll request uplift for it as soon as it gets merged to central.
Even with bug 1170680 applied, I can see gray thumbnails in the gallery app.

I'm using a Flame with 319mb.

I've pushed 500 copies of the reference image in test_media/reference_workload to the device.  (This is a progressive jpeg, which means that it takes more memory than it should to create the thumbnail. But once the thumbnails are created, it the pjpeg shouldn't really affect the display of the thumbanils at all).

The gallery app can scan the 500 images successfully. Without causing any black thumbnails.

If I scroll right after the scan I can see gray thumbnails.

If I relaunch the gallery app and scroll while it is still loading the thumbnails, I can pretty easily get gray thumbanils.

If I launch the gallery, view and zoom in on one of the images, and then go back to the thumbnails and scroll, I can get gray thumbnails.

If I launch gallery, wait for it to completely load and scan, and then I scroll as fast as I can (so that 5 swipes takes me all the way to the bottom) I will often get some gray thumbnails. Finding them can be difficult because I have to scroll slowly up through the thumbnails to find the ones that failed.

With the new about:memory reporting, though I can see that some failed because of lines like this:

       29 ── imagelib-surface-cache-overflow-count

If I'm counting right, the overflow count matches the number of gray squares I see.
Depends on: 1171295
Depends on: 1171356
Depends on: 1171371
I've just been testing the latest inbound build.  I think it has 1171356 applied, or maybe the other new one. Not exactly sure what I'm testing, but it was the build here: https://pvtbuilds.mozilla.org/pvt/mozilla.org/b2gotoro/tinderbox-builds/mozilla-inbound-flame-kk-eng/20150604110944/

This is on a 319mb flame.

I put 500 copies of the test image from test_media/reference_workload on the device, plus the test images from apps/gallery/test/images/ and a few photos including two 8mp and one 20mp photo from an Aries phone.

1) Test images 34.png through 38.png failed to get thumbnails. These are all PNG images of about 3 megapixels, so they would need 12mb of image memory to decode. 64.png also failed. It is a 5mp PNG.
I got a blank thumbnail and a blank preview. The only way to see the images is to tap on the thumbnail and then zoom in on the blank preview, to force it to load the original image.  (In 2.2 I would not expect blanks, I'd expect the images to be rejected as corrupt when the thumbnail can not be created.)  I tried this again without the 500 reference_workload images, and these png thumbnails did not fail.  I tried a third time with 200 reference_workload images and these PNG thumbnails did fail.

2) Twice I was able to make the app OOM. It didn't happen often, and doing it seemed to involved zooming in on some of the biggest images on the phone, then scrolling very aggressively up and down through the thumbnails. While scrolling, I would see the scrollbar stop responding to my flings. I think this was because zmem was thrashing. The dmesg log do show lots of messages like "zram: Error allocating memory for compressed page: 14499, size=4096", so I think that is why the system was slowing to a crawl. Anyway, the two times I got an actual OOM it was after this kind of zmem swapping.  I'll attach a dmesg log taken right after the OOM.

3) Twice I was able to get the device to show gray thumbnails. I don't know how to reproduce this, but I saw it. After seeing gray thumbnails, about:memory showed this:

 26.79 MB ── imagelib-surface-cache-estimated-locked
 26.79 MB ── imagelib-surface-cache-estimated-total
    1,003 ── imagelib-surface-cache-overflow-count

It looks like I managed to get to a state where everything in the surface cache was locked.

It used to be that the overflow count matched the number of gray thumbnails. That is no longer the case, so obviously the patch or patches I'm testing here are allowing images to fail and then still be displayed later.

I still want to test image editing (which is very memory intensive) and the pick activity. When another app needs to pick an image from gallery the system has to keep both apps alive, so much less memory is available.
If I edit one of the 3mp png images (35.png, for example) and then scroll though the thumbnails, I get the gray thumbnail thing again, and this time the gray thumbnails do not go away.
Attached file dmesg log
David, thanks so much for your thorough reports and for being willing to test this stuff for me.

I'm sorry we couldn't get everything fixed in time for 2.2, but at this point I feel like we're in a significantly better state than we were two weeks ago, and the problems are hopefully starting to be things that fewer users will run into.

I'm about to go on PTO, so I wanted to give a quick high-level summary of what's been achieved so far and what still needs to happen, from my perspective (which is a Gecko-centric perspective).

At this point we've handled image memory issues which were triggered by bad interactions with async pan/zoom. We seem to be in a pretty good state there.

We still have a problem with allocating too much memory for the first frame of images when we're creating image::Decoder objects. I've worked around this to some extent, but the real fix is to not allocate the first frame until we actually start running the decoder. Bug 1117607 will fix that; we need to dust that off and get it in shape to land.

We also have issues as soon as CSS background images get into the mix, because we consider those visible forever, as long as they're in the document, once they're painted once. We *really* need to fix that. Bug 1157546 will address that issue.

We also need some way of prioritizing canvas.drawImage() above other kinds of images, so we can reduce the likelihood of "permanent" failures (which usually happen because a drawImage call failed). I haven't filed a bug on that yet. I'm still working through the design space on this one, as there are many factors to consider, but I hope to have a fully fleshed-out proposal for this soon.

And finally, things will be a lot better once downscale-during-decode lands - that will take a *lot* of pressure off the rest of the system. We're getting there but there's still a lot of slog remaining before we can totally rely on DDD, unfortunately. (Though I think we'll be able to totally rely on it for *JPEG* images specifically a lot sooner.)
Depends on: 1169680
Nominate this to 3.0 as we can continue fixing rest of the issues.
blocking-b2g: 2.2? → 3.0+
Blocks: 1177940
Depends on: 1183852
Depends on: 1176124
Hi Seth, from your perspective is there still critical work that needs to be done before the 2.5 release? Triage team is trying to determine if this should still be marked as a blocker.
Flags: needinfo?(seth)
(In reply to Dylan Oliver [:doliver] from comment #23)
> Hi Seth, from your perspective is there still critical work that needs to be
> done before the 2.5 release? Triage team is trying to determine if this
> should still be marked as a blocker.

Yes. Right now the situation is pretty good (though not perfect) for images in <img> elements. Unfortunately, the Gallery app also uses CSS background images, and our story is not great there right now. The most important remaining work is improving memory management of CSS background images.

I am currently working on enabling downscale-during-decode for all image formats, which is another feature that will reduce memory usage of the Gallery app, and that work is almost done, so I hope to start in on improving our CSS background image story soon.
Flags: needinfo?(seth)
Ok, we will leave this as a blocker then. Or if there is a different bug specifically for the CSS background image work then we should probably instead block that one directly as this has become a bit of a meta bug.
Priority: -- → P2
Whiteboard: [platform]
Jet, 

If Seth is still recovering, can you please find another assignee for this one?

THanks
Flags: needinfo?(bugs)
(In reply to Mahendra Potharaju [:mahe] from comment #26)
> Jet, 
> 
> If Seth is still recovering, can you please find another assignee for this
> one?
> 
> THanks

I'm recovered.
Flags: needinfo?(bugs)
FWIW: this is a _metabug_. Please do _not_ expect patches to appear in this bug.

I just noticed that bug 1218990, which is one of the things I've been working on lately, is not blocking this bug. It should be.

We are making progress on this in every release, but there is a *lot* of plumbing and internal changes necessary to fix this stuff. At this point most of that is in place; bug 1218990 is the culmination of a lot of that work.

One that lands, we should reevaluate where we are in the Gallery app with respect to the number of images locked. I'd expect that there will no longer be any problem.

Bug 1171371 is the only blocker that hasn't landed other than bug 1218990, and while I will finish up that work as I think it makes sense to land that change, it's honestly not of major importance compared to bug 1218990. Bug 1171371 mostly helps in situations where we're already near OOM, but we really want to avoid getting near OOM in the first place, and that's what bug 1218990 does for us.
Depends on: 1218990
Mass closing as we are no longer working on b2g/firefox os.
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → WONTFIX
Mass closing as we are no longer working on b2g/firefox os.
Product: Core → Core Graveyard
Product: Core Graveyard → Core
You need to log in before you can comment on or make changes to this bug.