Closed Bug 1060869 Opened 10 years ago Closed 9 years ago

Store the first frame of a RasterImage in the SurfaceCache

Categories

(Core :: Graphics: ImageLib, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla36

People

(Reporter: seth, Assigned: seth)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

Attachments

(4 files, 19 obsolete files)

37.61 KB, patch
Details | Diff | Splinter Review
78.79 KB, patch
Details | Diff | Splinter Review
10.07 KB, patch
Details | Diff | Splinter Review
9.41 KB, patch
Details | Diff | Splinter Review
We should store the first frame of each RasterImage in the SurfaceCache. This will enable us to support downscale-during-decode, decoding multiple resolution layers of the same icon, and other important features on the roadmap for imagelib.
Depends on: 1057904
Alright, here's part 1. This patch significantly enhances SurfaceCache by adding the concept of locking and unlocking images, and the notion of a lifetime for surfaces - transient or persistent. I'm hoping this is comprehensible from the documentation I've added, so rather than launch into an explanation here I'll just recommend that you review SurfaceCache.h before the other files.

I do think it's worth saying, though, that while I think image locking is here to stay, we may not need the lifetime concept forever. I primarily added this to support the distinction in RasterImage between decoded frames (which must be persistent) and HQ scaled frames (which must be transient). However, once we get to the point where downscale-during-decode merges those two concepts, we may be able to get rid of the idea of lifetime and effectively have all surfaces be persistent. In other words: if this seems more complex than would be ideal, I agree, and I want to make this simpler, but we're not there yet.
Attachment #8483997 - Flags: review?(dholbert)
In this patch we start storing the first frame of a RasterImage in the SurfaceCache, and we only construct a FrameBlender (and store frames in it) when we actually have an animated image.

This has various implications that motivate the changes in this patch. Most importantly:

- Now when we have only one frame, there's no FrameBlender keeping it alive, so
  decoders need to hold a RawAccessFrameRef to the frame they're decoding themselves.

- We now want to be able to look up frames based upon their size and the decode
  flags used to produce them, not just their index in the animation like we used
  before. (We don't actually make much use of this capability yet, but we're
  laying the foundation here.) We also need to handle the situation where the
  frame was evicted from the cache.

- Because things can be evicted from the cache if the image isn't locked, we
  can't count on having a frame available to answer questions like "is the first
  frame opaque". We clearly don't want to have to sync decode to find that out,
  either, so we record the information we need when decoding is complete.

It's a little bit ugly that single-frame images are stored in the SurfaceCache and animated images store their frames in a FrameBlender, but we'll eventually get to the point where everything's unified.

BTW, there will be a part 3 to this bug, because I think to land this change we also need to make the SurfaceCache memory reporter into an explicit reporter, or else we'll regress AWSY metrics.
Attachment #8484009 - Flags: review?(tnikkel)
Comment on attachment 8483997 [details] [diff] [review]
(Part 1) - Add lifetime control to SurfaceCache

Review of attachment 8483997 [details] [diff] [review]:
-----------------------------------------------------------------

Partial review (I've just skimmed the SurfaceCache.cpp changes so far; more thorough review on that part coming soon):

::: image/src/SurfaceCache.cpp
@@ +141,5 @@
> +  void SetLocked(bool aLocked)
> +  {
> +    if (aLocked && mLifetime == Lifetime::Persistent) {
> +      // This may fail, and that's OK. We make no guarantees about whether
> +      // locking is successful if you call LockImage() after Insert().

This comment needs s/LockImage()/SetLocked(true)/, I think?

@@ +144,5 @@
> +      // This may fail, and that's OK. We make no guarantees about whether
> +      // locking is successful if you call LockImage() after Insert().
> +      mDrawableRef = mSurface->DrawableRef();
> +    } else {
> +      mDrawableRef = DrawableFrameRef();

"mDrawableRef.reset()" in the second case would be clearer. (That looks more different from the first case, and it's more obviously clearing things.)

(It looks like we have that  method available, based on https://bugzilla.mozilla.org/attachment.cgi?id=8481882&action=diff )

@@ +503,5 @@
> +  }
> +
> +  static PLDHashOperator DoUnlockSurface(const SurfaceKey&,
> +                                       CachedSurface*    aSurface,
> +                                       void*             aCache)

Fix indentation of params here.

::: image/src/SurfaceCache.h
@@ +176,5 @@
>     * Insert a surface into the cache. It is an error to call this function
>     * without first calling Lookup to verify that the surface is not already in
>     * the cache.
>     *
> +   * Each surface in the cache has a lifetime, either transient or persistent.

Maybe capitalize "transient" & "persistent" in this sentence, so that they match the actual enum value names.

@@ +275,5 @@
> +  /**
> +   * Removes all cached surfaces associated with the given image from the cache.
> +   * If the image is locked, it is automatically unlocked.
> +   *
> +   * This MUST be called, at a minimum, when the image is destroyed. If another

I think the phrase "when the image is destroyed" here is talking about the actual "Image" object, right? (as opposed to e.g. some other "image"-flavored object like the PNG image data itself, or the imgFrame that we're storing, or something else)

Since this is an important point ("MUST" etc), it's probably worth making it extra-clear what "image" refers to here. e.g. maybe capitalize "Image" in this paragraph, or maybe add a parenthetical, like:
<< ...when the image (i.e. the instance of an "Image" subclass) is destroyed. >>

@@ +286,5 @@
>  
> +  /**
> +   * Evicts all cached surfaces from the cache.
> +   *
> +   * Persistent surfaces associated with locked images are not evicted. Such

Maybe add ", with one exception:" to the first line. (or else that first line -- "Evicts all cached surfaces" -- is pretty absolute-sounding.)
(In reply to Daniel Holbert [:dholbert] from comment #3)
> ::: image/src/SurfaceCache.cpp
> @@ +141,5 @@
> > +  void SetLocked(bool aLocked)
> > +  {
> > +    if (aLocked && mLifetime == Lifetime::Persistent) {
> > +      // This may fail, and that's OK. We make no guarantees about whether
> > +      // locking is successful if you call LockImage() after Insert().
> 
> This comment needs s/LockImage()/SetLocked(true)/, I think?

Correction -- this comment really means to say "SurfaceCache::LockImage()" and "SurfaceCache::Insert()", I think. (right?)  Probably good to include the "SurfaceCache::" qualifier, since there are various Insert / LockImage / SetLocked() methods, and the ones you're talking about don't live on the class where this comment appears (CachedSurface).
Comment on attachment 8483997 [details] [diff] [review]
(Part 1) - Add lifetime control to SurfaceCache

Review of attachment 8483997 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with this and previous comment 2 addressed. (modulo correction in comment 3)

::: image/src/SurfaceCache.cpp
@@ -126,2 @@
>      : mSurface(aSurface)
> -    , mTargetSize(aTargetSize)

Looks like the mTargetSize/aTargetSize removal isn't really related to this bug.  The last usage of these variables seems to be in bug 1054079, so this removal probably belongs there.

But if it's too messy to shift the removal over there, though, I'm fine with it here as well.  (It affects several layers of functions, which are in context for this patch's chunks, so I suspect it might be too messy to split out.)

@@ +148,5 @@
> +      mDrawableRef = DrawableFrameRef();
> +    }
> +  }
> +
> +  bool IsLocked() { return bool(mDrawableRef); }

This accessor (CachedSurface::IsLocked) should be const.

@@ +218,5 @@
>      mSurfaces.EnumerateRead(aFunction, aData);
>    }
>  
> +  void SetLocked(bool aLocked) { mLocked = aLocked; }
> +  bool IsLocked() { return mLocked; }

IsLocked() should be a const method here, too. (in ImageSurfaceCache)

@@ +283,3 @@
>  
> +    // Remove elements in order of cost until we can fit this in the cache. Note
> +    // that locked surfaces aren't in mCosts, so we never removed them here.

s/removed/remove/,  I think? (not sure why this is past-tense)

@@ +357,5 @@
>      CostEntry costEntry = aSurface->GetCostEntry();
>  
> +    if (aSurface->IsLocked()) {
> +      MOZ_ASSERT(mLockedCost >= costEntry.GetCost(), "Costs don't balance");
> +      mLockedCost -= costEntry.GetCost();

We should assert that costEntry is *not* in mCosts here.  (It seems reasonable to worry that it might be, if we made a bookkeeping mistake with an image going from unlocked to locked -- so, good to quash such suspicions with an assertion.)

I think something like this would do:
 MOZ_ASSERT(nsTArray<CostEntry::NoIndex> ==
              mCosts.IndexOfFirstElementGt(costEntry),
            "Locked surfaces' costs should not be in cost array");

(Or you could use mCosts.Contains(), but that doesn't benefit from binary-search, and would require walking the whole array every time.)

@@ +416,5 @@
>    }
>  
> +  bool CanHoldAfterDiscarding(const Cost aCost) const
> +  {
> +    return aCost + mLockedCost <= mMaxCost;

Do we really need two separate CanHoldAfterDiscarding() vs. CanHold() methods? I think we should just update the existing CanHold() to do what CanHoldAfterDiscarding() does here. (and then we only need the one method)

In particular, I only see one CanHold caller that will remain after this patch -- VectorImage::CreateDrawableAndShow() -- and there's really no reason it needs the mLockedCost-oblivious CanHold() behavior. It would benefit from being able to get the stronger CanHoldAfterDiscarding() behavior up-front, instead of bouncing off of it when we call Insert().

(We'd need to update the CanHold documentation, too, of course.)

@@ +437,5 @@
> +  void UnlockImage(const ImageKey aImageKey)
> +  {
> +    nsRefPtr<ImageSurfaceCache> cache = GetImageCache(aImageKey);
> +    if (!cache)
> +      return;  // Already unlocked.

Maybe clarify to:
  // Already unlocked (and removed).

(Otherwise, it gives the impression that unlocking the image removes it from the cache, which isn't really true).
Attachment #8483997 - Flags: review?(dholbert) → review+
Flags: in-testsuite-
Version: unspecified → Trunk
Thanks for the review, Daniel!

(In reply to Daniel Holbert [:dholbert] from comment #4)
> Correction -- this comment really means to say "SurfaceCache::LockImage()"
> and "SurfaceCache::Insert()", I think. (right?)

Right.
(In reply to Daniel Holbert [:dholbert] from comment #5)
> s/removed/remove/,  I think? (not sure why this is past-tense)

What's going on here is that I'm referring to the old state of the code in the comment, which is bad practice. I need to reword that.

> Do we really need two separate CanHoldAfterDiscarding() vs. CanHold()
> methods? I think we should just update the existing CanHold() to do what
> CanHoldAfterDiscarding() does here. (and then we only need the one method)

I kinda don't want to expose CanHoldAfterDiscarding(), because I'm not sure that SurfaceCache is going to remain main-thread-only, and if it doesn't then I can't support that API. (The answer wouldn't be accurate, since other threads might interleave insertions/removals/locks/unlocks, and giving any answer at all would require locking, so it'd be worse than useless - better to just check in SurfaceCache::Insert().) I expect that cases where CanHoldAfterDiscarding() and CanHold() have results that differ will be quite rare in practice, because on desktop platforms the surface cache is quite large, and on mobile platforms with small surface caches we disable image locking, so the two give the same result.

The fact that CanHoldAfterDiscarding() only exists at all to make the preconditions of the loop in Insert() cleaner, so the logic in the loop can be less complex. (And also so we can avoid evicting things pointlessly if the insert will still fail.)
I should note, BTW, that CanHold() in general is mostly intended to avoid callers which are creating surfaces programmatically from creating insanely huge surfaces, intending to insert them in the surface cache. VectorImage could be coaxed to do that before I added that check. It's not really about checking whether a "reasonably sized" surface will fit at this particular time.
OK, sounds good. In that case:

* Could you make CanHoldAfterDiscarding() a private function, then? (since per comment 7, it sounds like you really only want to use it internally)

* Could you add a comment above CanHoldAfterDiscarding() with explaining the difference between it and CanHold()? e.g. with a summary of comment 7, something like:

  // This is similar to the public CanHold() method, but a bit stricter.
  // We don't expose this version publicly because {threading etc etc}.
(In reply to Daniel Holbert [:dholbert] from comment #9)
> OK, sounds good. In that case:

Sure.
So I decided to take the patch that would have been part 3 of this bug (which adds explicit memory reporting for the surface cache) and place it in bug 923302, which is a preexisting bug about the issue. I will land this bug together with that one.
Depends on: 923302
Comment on attachment 8484009 [details] [diff] [review]
(Part 2) - Store the first frame of a RasterImage in the SurfaceCache

>@@ -676,22 +688,33 @@ RasterImage::GetRequestedFrameIndex(uint
> RasterImage::FrameIsOpaque(uint32_t aWhichFrame)
>+  if (GetNumFrames() == 1) {
>+    return mFirstFrameIsOpaque;
>+  }

Is it possible that the image has more than 1 frame but we've only decoded 1 and aWhichFrame isn't the first frame? Similarly for FrameRect.

>@@ -915,22 +953,21 @@ RasterImage::GetCurrentImage()
>   RefPtr<SourceSurface> surface = GetFrame(FRAME_CURRENT, FLAG_NONE);
>   if (!surface) {
>-    // The OS threw out some or all of our buffer. Start decoding again.
>-    // GetFrame will only return null in the case that the image was
>-    // discarded. We already checked that the image is decoded, so other
>-    // error paths are not possible.
>-    ForceDiscard();
>-    RequestDecodeCore(ASYNCHRONOUS);
>+    surface = GetFrame(FRAME_CURRENT, FLAG_NONE);
>+  }

What purpose does the second GetFrame call serve that the first one didn't do?

I also took a look at SurfaceCache during this review. It looks like the SurfaceCache size will be 8mb on 512mb devices. This seems really small. For example, there was a bug for tarako (a 128mb device) where the lowest suggested value to set image.mem.max_decoded_image_kb for that device was 15mb (it's set to 30mb for b2g in general, which is I think mostly 256mb devices).
Attachment #8484009 - Flags: review?(tnikkel) → review+
After some more thinking, on b2g we don't lock any images and instead rely on the discard tracker. We tweak the prefs so images only expire after a day, so they stick around until we reach the 30mb limit.

For the surface cache it looks like surfaces expire after a minute? That could lead to discarding and re-decoding more often. I think the current design is going to need changes and tweaks before it will work as well as the current solution on b2g.

Also, how does the surface cache interact with locked images and it's max size? It seems like it would say it can't hold an image if our locked images take up all the "available" space in the cache. This doesn't seem good.
Blocks: SVGTabs
(In reply to Timothy Nikkel (:tn) from comment #13)
> For the surface cache it looks like surfaces expire after a minute? That
> could lead to discarding and re-decoding more often. I think the current
> design is going to need changes and tweaks before it will work as well as
> the current solution on b2g.

So the pref you're talking about (to control how quickly images expire) already exists today for the surface cache. We probably do need to tweak these pref values for B2G; you're right, too, that the current surface cache size is almost certainly too small on B2G. I've been planning to make those kinds of tweaks as part of a patch that will also remove the discard tracker.

I do think it's worth thinking about the consequences of freeing that memory so slowly, though. We might get an overall better user experience by being more dynamic about how memory on B2G is allocated. (Or we might not... there are clearly benefits to predictability, too.)

> Also, how does the surface cache interact with locked images and it's max
> size? It seems like it would say it can't hold an image if our locked images
> take up all the "available" space in the cache. This doesn't seem good.

It does indeed refuse to let you insert more surfaces if it's full (and that could be because of locked surfaces, sure). I don't see why that's a problem, though. (Let me know if I'm missing something!) One of the reasons I'm trying to move everything into the surface cache is that then we can have a single policy that affects *all* decoded data in imagelib, which I think is much easier to reason about.
(In reply to Timothy Nikkel (:tn) from comment #12)
> Is it possible that the image has more than 1 frame but we've only decoded 1
> and aWhichFrame isn't the first frame? Similarly for FrameRect.

So mAnim (which tracks what the current frame is) doesn't get created until we have a second frame. That means GetCurrentFrameIndex()/GetRequestedFrameIndex() would return 0 (i.e. the first frame). In other words, if we've only decoded one frame, then the first frame is always the current frame (and aWhichFrame can only request the first frame or the current frame).

> What purpose does the second GetFrame call serve that the first one didn't
> do?

Whoops! That's leftover code from when I was debugging this patch. You're totally right, that shouldn't be there, and I'll remove it.

> I also took a look at SurfaceCache during this review. It looks like the
> SurfaceCache size will be 8mb on 512mb devices. This seems really small. For
> example, there was a bug for tarako (a 128mb device) where the lowest
> suggested value to set image.mem.max_decoded_image_kb for that device was
> 15mb (it's set to 30mb for b2g in general, which is I think mostly 256mb
> devices).

As I mentioned in comment 14, that's definitely a pref that needs tweakin - it was set on the assumption that it'd only hold rasterized SVG images. The surface cache's size is set by taking the size of main memory and applying a scale factor. Based on what you said above, a reasonable first time would be to make that scale factor 8 times larger, which should give us 16mb on 128mb devices, 32mb on 256mb devices, and 64mb on 512mb devices. (We also have the ability to set a cap beyond which the surface cache won't grow even if main memory grows, BTW, so it'd probably be reasonable to use this same scale factor even on desktop.)
Addressed the reviewed comments on part 1.
Attachment #8483997 - Attachment is obsolete: true
Addressed review comments on part 2.
Attachment #8484009 - Attachment is obsolete: true
Here's a try job. I'm a bit nervous about the results, since there are several patches in the stack that I haven't pushed to try yet (beyond the ones in this bug). Let's see how things go:

https://tbpl.mozilla.org/?tree=Try&rev=bf5b3ce48724
So it looks like the failures in the try push above are a combination of the failures from bug 1057904 (which is in the stack) and some sort of bug in how I'm managing locked surfaces. I'll debug this and update the patches as needed.
Alright, I think I've fixed the bugs in the patches lower in the queue. Here's a rebased version of part 2.
Attachment #8490504 - Attachment is obsolete: true
Here's another try job, based on top of the latest m-c:

https://tbpl.mozilla.org/?tree=Try&rev=88f6ec247b3e
OK, the remaining oranges were caused by a bug in the assertion, not a bug in the actual surface cache code. Let's see if there's anything else left! New try job here:

https://tbpl.mozilla.org/?tree=Try&rev=5c08a0797031
Attachment #8490503 - Attachment is obsolete: true
Blocks: 923302
No longer depends on: 923302
Here's an updated version of part 2. A lot of the oranges in the try job above are caused by issues related to padding. Chiefly:

(1) In some places I assumed that imgFrame::GetNeedsBackground() told you whether, well, you needed a background. (In other words, I thought it returned the same thing as RasterImage::FrameIsOpaque.) That's not true, though; it needs to take padding into account.

(2) Looking up frames that have padding from the SurfaceCache was always failing since we look them up by mSize, but their size is different due to padding.

I solved (1), and a whole class of other padding-related bugs that keep cropping up lately, by making imgFrame's aware of the size of the entire image (which allows them to handle padding totally transparently; RasterImage doesn't need to know about it anymore when drawing). Similarly, we now store imgFrame's in the SurfaceCache using the size of the *entire* image as a key, not the size of the individual frame. This means in rare cases (when the first frame of an image has padding) we slightly overestimate the cost of storing a frame, but I don't think this is worth worrying about.

I also fixed some other trivial issues.

There's still a problem with this patch, though. There's a bug related to animations that's causing some tests to fail. I'll dig more into that next.
Attachment #8492549 - Attachment is obsolete: true
Here's a new try job. I'm aware of the animation issue but there may be more:

https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=c1ce1239aaa4
Alright, here's another try job for a revised version of the patch that I'm *hoping* should fix the remaining oranges. Locally, it fixes the invalidation-related reftests; I haven't run the entire suite locally, though.

https://tbpl.mozilla.org/?tree=Try&rev=f4e16f7f0ed9
I'm still trying to track down the problem. I have a hunch (totally unconfirmed at this stage) that bug 1063973 is related. If so, then on those orange tests it should be the case that we fail to draw even though the image's load event has been delivered, because we request a sync decode (due to drawWindow) and we're already in the decoder.

I cooked up a patch which will cause tons of bustage (it'll break anything involving alpha), but which may give us some insight. It draws a purple rect unconditionally whenever imgIContainer::Draw is called, even if Draw will fail to draw anything due to an error. I'd expect the test failures on Linux and Windows that the previous try job showed to also fail here, and to display a purple rect in the reftest analyzer.

https://tbpl.mozilla.org/?tree=Try&rev=b8088119ac1c
Another try job to investigate a suspicious warning I saw in the previous try results:

https://tbpl.mozilla.org/?tree=Try&rev=793658e0a442
Some trivial final tweaks to the patch.
Attachment #8499351 - Attachment is obsolete: true
So the last try job confirmed that the remaining oranges seem to be caused by images that are too large to fit in the cache. Some of those tests involve images that are just under 64MB in size, and for example the Linux test VMs only have 3.75GB of memory, which at current pref settings results in a surface cache size of 62.9MB - too small to hold the images!

Given that with this bug we are transitioning to a world where almost all image-related surfaces are managed by the surface cache (the exception is animated images, and I will fix that soon), the prefs for the surface cache need to start looking more like the prefs for the soon-to-be-obsolete DiscardTracker. In particular:

1. On desktop, we'll use the formula |min(1/4 of main memory, 1GB)| to compute the size of the surface cache. With the old paradigm there was no hard limit on desktop, though a soft limit existed on unlocked images. The surface cache's maximum size limit applies to both locked and unlocked images, and I'm not convinced we should change that - I think it's easier to reason about one limit that applies to both. We can tune the behavior of unlocked images by increasing the speed with which we discard surfaces that haven't been touched recently, though. (Such discarding obviously applies only to unlocked images.) I've left that at 60 seconds for now.

2. On B2G, we'll use the formula |min(1/4 of main memory, 128MB)|. On a 256MB device, this results in a surface cache size of 64MB, which is almost identical to the 65MB value we used before. Of course, in the past we used 65MB on all B2G devices, whatever their memory size. With these new prefs, we'd use more memory on devices with more available, up to a limit of 128MB. Just as with the discard tracker, I've set the expiration time for the surface cache to 24 hours on B2G. I think that's something worth revisiting in the future, though.

When reviewing these changes, please keep in mind that the discard tracker will be removed totally soon.
Attachment #8501923 - Flags: review?(tnikkel)
Attachment #8501923 - Flags: review?(dholbert)
Attachment #8501923 - Flags: feedback?(fabrice)
Here's a new try job (hopefully the last):

https://tbpl.mozilla.org/?tree=Try&rev=9df28777d4f2
Comment on attachment 8501923 [details] [diff] [review]
(Part 3) - Update SurfaceCache prefs to increase the cache size

>Bug 1060869 (Part 3) - Update SurfaceCache prefs to increase the cache size. r=dholbert,tn

Maybe include a brief explanation for the increase in the commit message, to make it clearer that there's a good reason for this change (i.e. we're not just knob-twiddling).

e.g. "because it now stores images that were previously controlled by the DiscardTracker" or something like that.

r=me regardless
Attachment #8501923 - Flags: review?(dholbert) → review+
Comment on attachment 8501923 [details] [diff] [review]
(Part 3) - Update SurfaceCache prefs to increase the cache size

64mb on a 256mb device seems a little high given that effectively we were keeping around 30mb in most cases in the discard tracker. I know that this also includes cached vector images, but we will also be keeping around upto 30mb of animated images in the discard tracker still (at least temporarily until patches land to change that).
(In reply to Timothy Nikkel (:tn) from comment #36)
> Comment on attachment 8501923 [details] [diff] [review]
> (Part 3) - Update SurfaceCache prefs to increase the cache size
> 
> 64mb on a 256mb device seems a little high given that effectively we were
> keeping around 30mb in most cases in the discard tracker. I know that this
> also includes cached vector images, but we will also be keeping around upto
> 30mb of animated images in the discard tracker still (at least temporarily
> until patches land to change that).

Yeah, per our discussion on IRC, I think you're right that it makes sense to view 'image.mem.max_decoded_image_kb' as the effective limit on devices with no image locking - which is to say, on B2G. That suggests that we should aim for 32MB instead of 64MB for a 256MB device. I'll make that change.

Re: animated images, maybe I should ratchet down the discard tracker limit on B2G until we get animated images in the surface cache?
I tweaked the patch to use |min(1/8 of main memory, 128MB)| on B2G, which works out to 32MB on a 256MB device.

I also limited the old discard tracker prefs to 16MB on B2G.

It looks like there are still oranges on some platforms on the old try run, which almost certainly just means the limits need further tuning on those platforms. I'll wait till the previous try job is complete before I submit a new one.
Attachment #8502013 - Flags: review?(dholbert)
Attachment #8502013 - Flags: feedback?(fabrice)
Attachment #8501923 - Attachment is obsolete: true
Attachment #8501923 - Flags: review?(tnikkel)
Attachment #8501923 - Flags: feedback?(fabrice)
Attachment #8502013 - Flags: review?(dholbert) → review?(tnikkel)
Comment on attachment 8502013 [details] [diff] [review]
(Part 3) - Update SurfaceCache prefs to increase the cache size

>+++ b/layout/reftests/bugs/reftest.list
>-== 370629-1.html 370629-1-ref.html
>+skip-if(B2G) == 370629-1.html 370629-1-ref.html
> skip-if(B2G) == 370629-2.html 370629-2-ref.html

If this part is supposed to land with this patch (instead of being a temporary debugging tweak), could you file a bug filed for this new "skip-if" annotation, and annotate it here?

(If this test is going from working to not-reliably-working, we don't want to lose track of that...)
Attachment #8502013 - Flags: feedback?(fabrice) → feedback+
Comment on attachment 8502013 [details] [diff] [review]
(Part 3) - Update SurfaceCache prefs to increase the cache size

I think the hard limit on b2g should be larger than the "max", we should be allowed to go over temporarily.

It looks like the surface cache discards everything when it gets a memory pressure? I'm not sure if they is ideal on b2g. I have the impression that memory pressure is somewhat normal and common on b2g, so to dump all images only to have to redecode most of them again doesn't seem like a good idea. Can we just discard the oldest images until we are under a certain threshold?

You probably also want to adjust the android prefs.

Most desktop's these days probably have at least 4GB of ram, so that means the surface cache size is always going to be 1GB on desktop? That seems really huge to me, especially considering that the similar value on desktop is currently 50mb.
(In reply to Timothy Nikkel (:tn) from comment #40)
> I think the hard limit on b2g should be larger than the "max", we should be
> allowed to go over temporarily.

I'm against this. If we put in such a large image (and keep in mind: what triggers this is *a single image so large it doesn't fit in the cache*), it would be guaranteed to be the next thing to be discarded (since we discard larger images first), and we'd be very likely to have to redecode the very next time we needed to paint it. (Especially if there's more than one image on the page.) The penalty for that is obviously at its worst for very large images. If we do it the way SurfaceCache does it now, then we can avoid even *decoding* the large image. And keep in mind that downscale-during-decode will reduce the number of cases where temporarily going over would be useful, which I think is already very low, even further.

> It looks like the surface cache discards everything when it gets a memory
> pressure? I'm not sure if they is ideal on b2g. I have the impression that
> memory pressure is somewhat normal and common on b2g, so to dump all images
> only to have to redecode most of them again doesn't seem like a good idea.
> Can we just discard the oldest images until we are under a certain threshold?

It might be a good idea to do something smart here, definitely. Perhaps we could attempt to reduce the memory usage of the SurfaceCache by 50% each time we get a memory pressure notification, rather than 100% as we do now.

I suspect this is something better suited for a followup, but I'd be happy to do it. It's not too hard.

> You probably also want to adjust the android prefs.

Agreed.

> Most desktop's these days probably have at least 4GB of ram, so that means
> the surface cache size is always going to be 1GB on desktop? That seems
> really huge to me, especially considering that the similar value on desktop
> is currently 50mb.

Yes, but desktops have image locking. You should compare to the hard limit, which is currently *infinite* on desktops. In other words, I've reduced the limit here, not increased it. What happens now is that we will allow an unlimited amount of locked images on desktop, and we limit unlocked images only. After this change, we'd allow up to 1GB of locked images on desktop, and while unlocked images are subject to the same limit, they also expire rapidly from the surface cache if not touched. I think the practical effects of this change are not going to be significant on desktop, though obviously we can revisit if they are.
(In reply to Daniel Holbert [:dholbert] from comment #39)
> If this part is supposed to land with this patch (instead of being a
> temporary debugging tweak), could you file a bug filed for this new
> "skip-if" annotation, and annotate it here?

I intentionally broke this test as part of the changes in comment 38, yeah. I'm not sure I think this not working should be considered a bug, honestly. Allocating a single 64MB image on a device with <=256MB of total memory is sketchy. (Although downscale-during-decode will fix it anyway.)
(In reply to Seth Fowler [:seth] from comment #42)
> Allocating a single 64MB image on a device with <=256MB of total memory is
> sketchy. (Although downscale-during-decode will fix it anyway.)

OK. If you want to leave it skipped & don't think it's really a bug, could you maybe add a comment saying this (about image-size / available memory) in the manifest?

We have too many mysterious "skip-if(b2g)" annotations sprinkled through our manifests, and I'd rather not add more (unless there's a bug link or a brief explanation to remove the mystery).
I did file it, though, since downscale-during-decode will fix it. Bug 1080241 will turn 370629-1.html back on. If I need to turn any other tests off on B2G for allocating gigantic images, I'll update bug 1080241, because downscale-during-decode will probably fix them all.
Ah, nice. Thanks!
(In reply to Daniel Holbert [:dholbert] from comment #43)
> OK. If you want to leave it skipped & don't think it's really a bug, could
> you maybe add a comment saying this (about image-size / available memory) in
> the manifest?

Sure! I'll link to the bug I just filed.
(We are having a strange conversion here since we keep having mid-air collisions. =)
Comment on attachment 8502013 [details] [diff] [review]
(Part 3) - Update SurfaceCache prefs to increase the cache size

Okay, you convinced me that it wouldn't hurt to try these values. But I do think that not throwing out the whole surface cache on b2g on memory pressure is important to avoid a lot of wasteful redecodes. Do you think it's reasonable to implement that and land before or at the same time as this?
Attachment #8502013 - Flags: review?(tnikkel) → review+
OK, so per our discussions in this bug and on IRC, it seems wise to not necessarily clear the entire SurfaceCache on memory pressure. This patch gives us the ability to do that, by add a pref that sets a "discard factor" controlling how much we'll free when we get a memory pressure notification. It's interpreted as a recriprocal; for example, if the discard factor is 1, then we'll free all the data in the SurfaceCache (except for locked surfaces), if it's 2 then we'll free half of the data, etc.

A discard factor of 1 is the same as the current behavior, and this is still appropriate on desktop systems since image locking means that we'll never free anything visible. (And we are unlikely to see a memory pressure notification anyway.) On B2G, since we don't have image locking, I'll change part 4 (which was part 3 before) to set the discard factor to 2. We can tweak further from there.
Attachment #8526505 - Flags: review?(dholbert)
I noticed a couple of typos in the comments.
Attachment #8526507 - Flags: review?(dholbert)
Attachment #8526505 - Attachment is obsolete: true
Attachment #8526505 - Flags: review?(dholbert)
Here's the updated part 4, which sets the discard factor on B2G to 2.
Attachment #8502013 - Attachment is obsolete: true
Fix an integer overflow bug that could be triggered by users with more than 4GB of memory.
Attachment #8497200 - Attachment is obsolete: true
Bug fixes. Two main issues:

- Apply decode flags before attempting to sync decode in LookupFrame.
- Do not send discard notifications when discarding in LookupFrame. (We
  shouldn't send discard notifications unless the discarding is due to the
  expiration of the surface, really. Tests that listen for the DISCARD
  notification want to know about that, not situations where we had to redecode
  with non-premultipled alpha or the like.)
Attachment #8501913 - Attachment is obsolete: true
Nothing has changed here; this patch just needed a rebase.
Attachment #8526886 - Flags: review?(dholbert)
Attachment #8526507 - Attachment is obsolete: true
Attachment #8526507 - Flags: review?(dholbert)
Removing some SurfaceCache changes that got mistakenly qref'ed into this patch. They belonged in part 1 and they've now moved there.
Attachment #8526509 - Attachment is obsolete: true
Depends on: 1103439
Alright, a tweak for part 2 has resolved the remaining oranges. Although chrome:// and resource:// images still cannot be discarded by the DiscardTracker, I've made them discardable by RasterImage itself. This prevents sync decoding failures because discarding fails, which were the cause of those oranges.

Once I land the next patch in this series, which makes us store animated frames in the SurfaceCache as well, there will be no more need for discarding before redecoding and the issue will be moot.
Attachment #8526882 - Attachment is obsolete: true
Comment on attachment 8526886 [details] [diff] [review]
(Part 3) - Make the SurfaceCache free only a fraction of its data on memory pressure

>+++ b/gfx/thebes/gfxPrefs.h
>@@ -236,16 +236,17 @@ private:
>   DECL_GFX_PREF(Live, "image.mem.max_ms_before_yield",         ImageMemMaxMSBeforeYield, uint32_t, 400);
>   DECL_GFX_PREF(Once, "image.mem.surfacecache.max_size_kb",    ImageMemSurfaceCacheMaxSizeKB, uint32_t, 100 * 1024);
>   DECL_GFX_PREF(Once, "image.mem.surfacecache.min_expiration_ms", ImageMemSurfaceCacheMinExpirationMS, uint32_t, 60*1000);
>+  DECL_GFX_PREF(Once, "image.mem.surfacecache.discard_factor", ImageMemSurfaceCacheDiscardFactor, uint32_t, 1);

Per comment at the top of this list, it's supposed to be in alphabetical order -- so, bump this new line up 2 lines (since "discard" sorts before "max")

>+++ b/image/src/SurfaceCache.cpp
>+  void DiscardForMemoryPressure()
>+  {
[...]
>+    // Our target is to raise our available cost by (1 / mDiscardFactor) of our
>+    // discardable cost - in other words, we want to end up with about (1 /
>+    // mDiscardFactor) bytes stored in the surface cache after we're done.

I think the second half of this is wrong -- the part that says...
  "we want to end up with about (1 / mDiscardFactor) bytes"

Clearly we don't want to end up with "1/2 bytes" or "1/4 bytes". :)

I think maybe you meant to say *fewer* bytes? (and left out the word "fewer")  So e.g. 1/2 fewer bytes, or 1/4 fewer bytes (stored in the cache).

>+    printf("*** targetCost = %zu discardableCost = %zu mAvailableCost = %zu mDiscardFactor = %u mLockedCost = %zu mMaxCost = %zu\n",
>+           targetCost, discardableCost, mAvailableCost, mDiscardFactor, mLockedCost, mMaxCost);

This printf probably doesn't want to be there in the version that lands.

>+    // Discard surfaces until we've reduced our cost to our target cost.
>+    while (targetCost > mAvailableCost) {
>+      MOZ_ASSERT(!mCosts.IsEmpty(), "Removed everything and still not done");
>+      Remove(mCosts.LastElement().GetSurface());
>+    }

This makes me a little uneasy... If we get any of our math wrong here (or massage the target-calculation up above to be a bit more liberal), and we somehow end up in the loop-body with an empty "mCosts", then release builds will crash on mCosts.LastElement() here.

Maybe promote the MOZ_ASSERT into a check, like so:

    if (MOZ_UNLIKELY(mCosts.IsEmpty())) {
      MOZ_ASSERT_UNREACHABLE("Removed everything and still didn't reach target");
      break;
    }

r=me with the above, though I'll trust your judgement about whether I'm being too paranoid about mCosts.IsEmpty().
Attachment #8526886 - Flags: review?(dholbert) → review+
Fixed an uninitialized value pointed out by valgrind - mPendingAnimation should have been initialized in the initializer list, but wasn't.
Attachment #8527345 - Attachment is obsolete: true
Thanks for the review, Daniel!

(In reply to Daniel Holbert [:dholbert] from comment #57)
> I think maybe you meant to say *fewer* bytes? (and left out the word
> "fewer")  So e.g. 1/2 fewer bytes, or 1/4 fewer bytes (stored in the cache).

Yeah. =)

> r=me with the above, though I'll trust your judgement about whether I'm
> being too paranoid about mCosts.IsEmpty().

I can understand why you're concerned! I think it's probably enough to check that |targetCost + mLockedCost <= mMaxCost| before the loop, though. If that's true and the loop still crashes, SurfaceCache's internal data structures are corrupted to the point where I think I'd prefer for us to crash so we find out that it's happening.
Sounds good to me -- thanks.
(In reply to Seth Fowler [:seth] from comment #59)
> (In reply to Daniel Holbert [:dholbert] from comment #57)
> > I think maybe you meant to say *fewer* bytes? (and left out the word
> > "fewer")  So e.g. 1/2 fewer bytes, or 1/4 fewer bytes (stored in the cache).
> 
> Yeah. =)

Actually this is the correct phrasing:

> Our target is to raise our available cost by (1 / mDiscardFactor) of our
> discardable cost - in other words, we want to end up with about
> (discardableCost / mDiscardFactor) bytes stored in the surface cache
> after we're done.

(In other words, my typo was to write '1' instead of 'discardableCost' in the second fraction.)
Argh, plus the word 'fewer', which I somehow left out a second time there!
Attachment #8526886 - Attachment is obsolete: true
Comment on attachment 8527873 [details] [diff] [review]
(Part 3) - Make the SurfaceCache free only a fraction of its data on memory pressure

That new language RE "(discardableCost / mDiscardFactor) fewer bytes" sounds good, thanks!

One nit on the new "if" check & early-return:

>+  void DiscardForMemoryPressure()
>+  {
[...]
>+    if (targetCost > mMaxCost - mLockedCost) {
>+      MOZ_ASSERT_UNREACHABLE("Target cost is more than we can discard");
>+      DiscardAll();
>+      return;
>+    }

This condition should be marked as MOZ_UNLIKELY, since we really expect it to be impossible, barring corruption or arithmetic mistakes higher up in this function.

(Looks like this file doesn't yet have an #include for mozilla/Likely.h yet -- it'll need that, too.)

Feel free to punt to a followup patch if you like.
I've disabled a couple of additional tests on B2G because they trip the new limits. I also disabled one test (test_undisplayed_iframe.html) for rare intermittent failures. I'm not sure whether those failures are caused by this bug or not, since they're so hard to reproduce, but I don't want to hold up landing this because of that test. (I'll definitely try to get it reenabled soon, though.)
Attachment #8526887 - Attachment is obsolete: true
Fixed wrong reftest syntax in the previous version of the patch.
Attachment #8527906 - Attachment is obsolete: true
Here's a try job that doesn't include the fix in comment 66 (and thus has a failure on B2G):

https://tbpl.mozilla.org/?tree=Try&rev=90124f3bc2a0
Here's a B2G-only try job that includes the fix in comment 66 (and thus doesn't have the failure):

https://tbpl.mozilla.org/?tree=Try&rev=8874a333ffc5
Blocks: 1104622
Pushed a followup to fix my mochitest.ini syntax. Apparently you should use |disabled = reason| and not |skip-if = true|.

https://hg.mozilla.org/integration/mozilla-inbound/rev/7fbcaefd5052
sorry seth had to back this out in https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=e04692266f17 since one of this changes caused test failures like https://treeherder.mozilla.org/ui/logviewer.html#?job_id=4191914&repo=mozilla-inbound

Seth i noticed this failure also on https://tbpl.mozilla.org/?tree=Try&rev=90124f3bc2a0 and did a retrigger, maybe this helps
Flags: needinfo?(seth)
The test failure is:
> INFO - 1064 INFO TEST-UNEXPECTED-FAIL | /tests/dom/base/test/test_fileapi_slice.html |
>     uncaught exception - NS_ERROR_NOT_AVAILABLE: at
>     http://mochi.test:8888/tests/dom/base/test/test_fileapi_slice.html:96 
> INFO - JavaScript error: http://mochi.test:8888/tests/dom/base/test/test_fileapi_slice.html,
>     line 96: NS_ERROR_NOT_AVAILABLE

...which points to this line of the test:
> testcanvas.getContext("2d").drawImage(image, 0, 0);

So it appears this bug's patches makes canvas "drawImage" throw NS_ERROR_NOT_AVAILABLE in cases where it previously did not.

(There is a randomorange bug filed for this test -- Bug 1092453 -- but that's for a different sort of test-failure.)
I'm disabling test_fileapi_slice.html. It was already disabled on B2G, Android, and E10S. With my patches its failing on Mulet too (which kinda makes sense, since Mulet is sort-of-B2G). This is in addition to the intermittent orange it's also responsible for, bug 1092453, which Daniel mentions in the previous comment.

It's the only known issue keeping me from landing, and given how many issues there are with this test, I think disabling it is reasonable. I'll file a followup where I'll continue debugging it until I resolve the problem, but I don't think it should delay landing the patch.

(FWIW, I reached this decision after putting significant debugging effort into it already. It's quite slow going since I can only debug this on try, but the results so far make me think that it's not unlikely that this failure is specific to this test, rather than being some sort of general issue.)
Attachment #8528191 - Attachment is obsolete: true
Looks like Tomcat missed rev 82fd2eef7630 when he did this backout. And to the point, something in Seth's push made OSX Gip near-permafail. Those permafails persisted past the backout push, which leads me to suspect that it's the pref changes that did it. We'll find out!
https://hg.mozilla.org/integration/mozilla-inbound/rev/4553524f671f
The Gip issues this caused are:

https://bugzilla.mozilla.org/show_bug.cgi?id=1104699
In this bug when trying to edit/crop an image in the Gaia Gallery app, the image does not load in the canvas so the crop cannot be applied.

https://bugzilla.mozilla.org/show_bug.cgi?id=1105243
In this one the test is trying to pick a photo from the gallery [gallery app] and apply it to the contact [contacts app] but the picture us not making its way to the contacts app within 10 seconds (which is usually more than enough).
So I repushed this before seeing comment 76 and 77. Almost certainly, those failures are the fault of the hard limit pref. I will push a followup in just a sec to fix.

remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/b0f351e99e1a
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/5078e4014e21
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/6bacacfaf339
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/1cbca597b025
Blocks: 1105243
Blocks: 1104699
Target Milestone: mozilla37 → mozilla36
Depends on: 1106252
Depends on: 1106423
Depends on: 1106650
No longer depends on: 1106650
Depends on: 1111041
Depends on: 1113855
Blocks: 622816
Blocks: 739159
Blocks: 739245
(In reply to Seth Fowler [:seth] from comment #73)
> I'm disabling test_fileapi_slice.html. It was already disabled on B2G,
> Android, and E10S. With my patches its failing on Mulet too (which kinda
> makes sense, since Mulet is sort-of-B2G). This is in addition to the
> intermittent orange it's also responsible for, bug 1092453, which Daniel
> mentions in the previous comment.
> 
> It's the only known issue keeping me from landing, and given how many issues
> there are with this test, I think disabling it is reasonable. I'll file a
> followup where I'll continue debugging it until I resolve the problem, but I
> don't think it should delay landing the patch.

Well played - looks like you only caused it to start failing on Mulet, but you disabled it on every platform it still ran on, and didn't actually file any sort of followup that might have told the test's owner that you had disabled it everywhere.

Reenabled except on Mulet in https://hg.mozilla.org/integration/mozilla-inbound/rev/750730cebb40
Depends on: 1122643
Blocks: 622928
Depends on: 1128188
Depends on: 1135411
No longer depends on: 1135411
No longer depends on: 1145284
Blocks: 1026670
Depends on: 1262269
You need to log in before you can comment on or make changes to this bug.