Closed Bug 1284350 Opened 4 years ago Closed 3 years ago

every image locked on huge page

Categories

(Core :: ImageLib, defect)

50 Branch
defect
Not set

Tracking

()

VERIFIED FIXED
mozilla51
Tracking Status
e10s - ---
firefox47 --- unaffected
firefox48 --- unaffected
firefox49 --- unaffected
firefox50 --- fixed
firefox51 --- verified

People

(Reporter: tnikkel, Assigned: tnikkel)

References

(Blocks 1 open bug, )

Details

(Keywords: regression, Whiteboard: [MemShrink:P1])

Attachments

(19 files)

626.41 KB, application/x-zip-compressed
Details
1.59 KB, patch
Details | Diff | Splinter Review
1.91 KB, patch
Details | Diff | Splinter Review
4.47 KB, patch
Details | Diff | Splinter Review
24.95 KB, patch
Details | Diff | Splinter Review
15.15 KB, patch
Details | Diff | Splinter Review
6.81 KB, patch
Details | Diff | Splinter Review
11.84 KB, patch
Details | Diff | Splinter Review
5.69 KB, patch
Details | Diff | Splinter Review
17.19 KB, patch
Details | Diff | Splinter Review
1.59 KB, patch
Details | Diff | Splinter Review
27.18 KB, patch
Details | Diff | Splinter Review
3.95 KB, patch
Details | Diff | Splinter Review
57.67 KB, patch
Details | Diff | Splinter Review
2.21 KB, patch
mstange
: review+
Details | Diff | Splinter Review
3.42 KB, patch
kaku
: review+
Details | Diff | Splinter Review
1.67 KB, patch
kaku
: review+
Details | Diff | Splinter Review
1.38 KB, patch
kaku
: review+
Details | Diff | Splinter Review
50.48 KB, application/x-zip-compressed
Details
Goto https://satoshimatrix.wordpress.com/2012/03/23/top-100-nesfamicom-games-list-all-in-one/
Scroll through the whole page to the bottom.
Goto about:memory in a new window. Minimize memory usage.
Then generate a memory report. We are using 500mb for images. It appears as though every image on the page is kept decoded. And not only that, but locked.

Bisected this to

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=6fe85a187e707bfc3fbd6a3bba71db29aae5c71f&tochange=8b76cbb69bef628b93c165c78a55c361ad3e5109

-> bug 1261554
[Tracking Requested - why for this release]: this is a huge memory regression, we should not ship. especially since there don't appear to be any obvious wins we get from the regressing patch that I can tell from a quick look.
Flags: needinfo?(seth)
Attached file testcase
The page isn't doing anything fancy. Here is a testcase that reproduces the same problem. It's just images and nothing else. We keep them all locked after scrolling through the page. Even though we reach the maximum size allowed by the surface cache (as evidenced when the images at the bottom refuse to decode) we never unlocked any images even after minimizing memory. This is very broken.
When disabled e10s, it seems to be reproduced even on Nightly50.0a1 Windows10.
tracking-e10s: --- → ?
Track this as this is a memory regression.
Seth, Markus, is bug 1261554 something we can back out easily or somehow "disable"?

[Tracking Requested - why for this release]: I would like to consider this blocking 48, rather than just tracking+.
Assignee: nobody → seth
Flags: needinfo?(mstange)
Flags: needinfo?(bugs)
Version: unspecified → 48 Branch
Not sure, I would have to read the patches again. Seth will be able to answer this question :-)
Flags: needinfo?(mstange)
Yeah, I'm taking a look.
I can confirm that it reproduces on Nightly with e10s disabled. Backgrounding the tab unlocks the images, so it doesn't appear that we're locking images an extra time and losing track of them or anything like that. I'm digging in deeper now, but superficially it appears that we consider the entire page visible for some reason.
Perhaps once the cause is located we'll understand this, but I'm really puzzled as to how this got by AWSY. AWSY runs only in non-e10s mode (I confirmed this with erahm) and definitely should've set off alarm bells immediately. How on earth did this get all the way to beta?
The STR require scrolling. Does AWFY just load pages and not scroll? Then it wouldn't see this.
Per comment #5, tagging this as a blocker for Fx48.
OK, I've located the issue, though I haven't had time to fully analyze it. It seems that PresShell::Paint() is not called for each pres shell which gets painted when we run on non-e10s. (And possibly in other circumstances; again, I need to look over the code a bit now that I understand where the issue comes from.) I've confirmed that the following changes fix the issue:

(1) Stop creating an AutoVisibilityUpdate in PresShell::Paint().
(2) If we're painting to the window in nsDisplayListBuilder::EnterPresShell(), create an AutoVisibilityUpdate.
(3) Destroy the AutoVisibilityUpdate created in (2) in nsDisplayListBuilder::LeavePresShell().

Without doing this, we would never remove a frame from the IN_DISPLAYPORT visible frames set during painting if the frame was associated with a nested pres shell. That's what caused all the images to become locked permanently as we scrolled.

Creating a patch to fix this should be straightforward; I'll tackle that tomorrow.
Flags: needinfo?(seth)
Flags: needinfo?(bugs)
(In reply to Seth Fowler [:seth] [:s2h] from comment #12)
> OK, I've located the issue, though I haven't had time to fully analyze it.
> It seems that PresShell::Paint() is not called for each pres shell which
> gets painted when we run on non-e10s. (And possibly in other circumstances;

PresShell::Paint only gets called for toplevel documents in a process. This is true of e10s and non-e10s. So sounds like the bug would appear in e10s if the scrolled page with images was in an iframe (or any non-toplevel document).
(In reply to Timothy Nikkel (:tnikkel) from comment #14)
> PresShell::Paint only gets called for toplevel documents in a process. This
> is true of e10s and non-e10s. So sounds like the bug would appear in e10s if
> the scrolled page with images was in an iframe (or any non-toplevel
> document).

You're right; I can confirm that this occurs. I tested this situation when originally writing the patch, but I must've broken it at some point (unsurprising given that I was operating on a mistaken assumption), and it didn't get noticed because there are no direct tests for this. I've been thinking about this and I've come up with an approach that should make it fairly easy to write tests for the visibility code; I'm going to write some and get them landed ASAP.
As for this bug, I think Timothy is right that it's not important that we have this on beta, and I don't think we need it on aurora either. The benefit of shipping the code in those releases was mainly to shake out any bugs, and it seems we've done that! Nothing depends on these changes in those releases, though. (I'll need to double check for aurora, but I'm pretty sure that's true for beta.) So I'd say let's back this stuff out of beta and aurora and land the fix on central. This will let me write a cleaner fix, since I won't have to worry about keeping the patch minimal to simplify backporting.
Sounds like a good plan, let's back out of both beta and aurora. Hoping that won't cause other issues but we still have some time to catch them in late beta.
Flags: qe-verify+
Whiteboard: [MemShrink]
Duplicate of this bug: 1279760
We'll follow up on the AWSY side to make sure we detect this in the future.
Whiteboard: [MemShrink] → [MemShrink:P1]
Now that this is backed out of 48 and 49, we can call it "regression in 50".
Version: 48 Branch → 50 Branch
It would be good if we can land this in July, to avoid having to uplift.
alright, this seems to have caused an improvement in tps on aurora, but now that we merged v.50 to Aurora, we see a regression:
This regression patch c2ffbe222854 has been merged into Aurora, so please make sure to uplift any fixes to Aurora, thank you.

https://treeherder.mozilla.org/perf.html#/alerts?id=2185

  tps summary linux64 pgo - 7.79% worse
  tps summary linux64 pgo e10s - 4.21% worse 
  tps summary osx-10-10 opt - 26.26% worse
  tps summary windows7-32 pgo - 5.89% worse
  tps summary windows7-32 pgo e10s - 10.64% worse
  tps summary windows8-64 pgo - 9.99% worse
  tps summary windows8-64 pgo e10s - 8.57% worse
  tps summary windowsxp pgo - 12.92% worse

  Reading this bug, it appears expected.
I just wanted to post on this bug to point out that I have not forgotten about it. As I've mentioned to several people CC'd on this bug, I have two projects I'm trying to complete ASAP, and this is one of them. I'm almost done with the other, but unfortunately it's taking up all the time I have to devote to Mozilla stuff at the moment. Sorry we're going to end up having to uplift. =\
If we can't fix this on 50, let's back it out before it goes to beta.
Flags: needinfo?(tnikkel)
Should we just take this out of central until it's fixed properly?
That was my plan.
Should we back this out of Aurora now?
Working on it (the backout.)  It isn't simple, because stuff has subsequently landed that needs to now be rewritten using the "old" way of doing things.  In other words, we can't just back this out, we have to put up another patch with that backout.
Attached patch backout1Splinter Review
Flags: needinfo?(tnikkel)
Attached patch backout2Splinter Review
Attached patch backout3Splinter Review
Assignee: seth.bugzilla → tnikkel
Attached patch backout4Splinter Review
Attached patch backout5Splinter Review
Attached patch backout6Splinter Review
Attached patch backout7Splinter Review
Attached patch backout8Splinter Review
Attached patch backout9Splinter Review
Attached patch backout10Splinter Review
Attached patch backout11Splinter Review
Attached patch backout12Splinter Review
Attached patch backout13Splinter Review
Comment on attachment 8789645 [details] [diff] [review]
backout13

Review of attachment 8789645 [details] [diff] [review]:
-----------------------------------------------------------------

Bug 1282710 replaced a video element's video decoder to a lightweight one while the element becomes invisible.  We relied on the original three states (NONVISIBLE, MAY_BECOME_VISIBLE, IN_DISPLAYPORT) to implement a conservative mechanism. That is only when an element gets into NONVISIBLE state will its video decoder be switched to the lightweight one; otherwise, we keep using the original video decoder whether the element is "visible" or "may become visible".

So, I am wondering how much can I rely on the upcoming two states mechanism?

::: layout/generic/Visibility.h
@@ +22,5 @@
>    UNTRACKED,
>  
> +  // Indicates that the frame is probably nonvisible. Visible frames *may* be
> +  // APPROXIMATELY_NONVISIBLE because approximate visibility is not updated
> +  // synchronously. Some truly nonvisible frames may be marked

Does this means that a visible frame has a chance to be marked as APPROXIMATELY_NONVISIBLE temporarily and will finally be marked as APPROXIMATELY_VISIBLE once certain async event been ran?
@Dan, since I will take a long leave from next Wednesday on, so I would like to keep you informed in case that I didn't settle this issue down.
(In reply to Tzuhao Kuo [:kaku] from comment #50)
> Comment on attachment 8789645 [details] [diff] [review]
> backout13
> 
> Review of attachment 8789645 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Bug 1282710 replaced a video element's video decoder to a lightweight one
> while the element becomes invisible.  We relied on the original three states
> (NONVISIBLE, MAY_BECOME_VISIBLE, IN_DISPLAYPORT) to implement a conservative
> mechanism. That is only when an element gets into NONVISIBLE state will its
> video decoder be switched to the lightweight one; otherwise, we keep using
> the original video decoder whether the element is "visible" or "may become
> visible".
> 
> So, I am wondering how much can I rely on the upcoming two states mechanism?

Before the patches here we have:

(NONVISIBLE, MAY_BECOME_VISIBLE, IN_DISPLAYPORT)

After these patches we have:

(NONVISIBLE, MAY_BECOME_VISIBLE + IN_DISPLAYPORT)

where we have merged MAY_BECOME_VISIBLE and IN_DISPLAYPORT into one state. We name these states

(APPROXIMATELY_NONVISIBLE, APPROXIMATELY_VISIBLE)

Since the video element did nothing different between MAY_BECOME_VISIBLE and IN_DISPLAYPORT this should be equivalent.

> ::: layout/generic/Visibility.h
> @@ +22,5 @@
> >    UNTRACKED,
> >  
> > +  // Indicates that the frame is probably nonvisible. Visible frames *may* be
> > +  // APPROXIMATELY_NONVISIBLE because approximate visibility is not updated
> > +  // synchronously. Some truly nonvisible frames may be marked
> 
> Does this means that a visible frame has a chance to be marked as
> APPROXIMATELY_NONVISIBLE temporarily and will finally be marked as
> APPROXIMATELY_VISIBLE once certain async event been ran?

That comment should probably be updated by the patch "Ensure painted frames are in visible list". Specifically, frames that were IN_DISPLAYPORT will not be marked APPROXIMATELY_NONVISIBLE.
Blocks: 1299718
Thanks for your explanation, :tnikkel, then it should be quite safe! Will review the patches again.
Attachment #8789649 - Flags: review?(kaku) → review+
Attachment #8789650 - Flags: review?(kaku) → review+
Attachment #8789651 - Flags: review?(kaku) → review+
Comment on attachment 8789647 [details] [diff] [review]
Ensure painted frames are in visible list

Review of attachment 8789647 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the explanation in the commit message.
Attachment #8789647 - Flags: review?(mstange) → review+
Pushed by tnikkel@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/76b2bc20a727
Backed out changeset 06bf533a2bdd (Bug 1299065 - invisible elements in the foreground should also be recorded; r=gerald)
https://hg.mozilla.org/integration/mozilla-inbound/rev/f6bd48b932a5
Backed out changeset 103dc4eddacf (Bug 1282710 - part 2 - Plumb the visibility event from nsIFrame to nsIDOMMediaElemnt; r=seth)
https://hg.mozilla.org/integration/mozilla-inbound/rev/92e04b8efcaa
Backed out changeset 1bbb1ab928c7 (Bug 1282710 - Part 1 - implement the suspend and resume logics in HTMLMediaElement.cpp according to visibility events; r=cpearce r=kamidphish)
https://hg.mozilla.org/integration/mozilla-inbound/rev/4964f5d7354b
Backed out changeset 719d6d5d9d21 (Bug 1259281 - Mark frames NONVISIBLE if their pres shell is inactive or frozen. r=mstange)
https://hg.mozilla.org/integration/mozilla-inbound/rev/e36e9de24a09
Backed out changeset d0aa5cf74699 (Bug 1269937 - Manage updating visible frames and regions using RAII. r=mstange)
https://hg.mozilla.org/integration/mozilla-inbound/rev/ade684af2dea
Backed out changeset 45c3308d49c9 (Bug 1269935 - Replace PresShell::DecVisibleCount() with a general map function. r=mstange)
https://hg.mozilla.org/integration/mozilla-inbound/rev/efc8b6da54df
Backed out changeset 4517cddd204e (Bug 1269934 - Handle visible frame sets more generically in PresShell. r=mstange)
https://hg.mozilla.org/integration/mozilla-inbound/rev/8864fa8e88fc
Backed out changeset e9df21facccf (Bug 1269931 - Send visible region updates for pres shells associated with nested views. r=botond)
https://hg.mozilla.org/integration/mozilla-inbound/rev/90ef23c8f057
Backed out changeset d6a286242f2d (Bug 1268348 - Pass the previous visibility state to OnVisibilityChange(). r=mstange)
https://hg.mozilla.org/integration/mozilla-inbound/rev/9230703875bb
Backed out changeset a909de86c183 (Bug 1261554 (Followup) - Fix memory reporting for PresShell::mVisibleRegions. r=me)
https://hg.mozilla.org/integration/mozilla-inbound/rev/a49a5a669a2c
Backed out changeset bb3bf463c0ec (Bug 1261554 (Part 3) - Visualize Visibility::IN_DISPLAYPORT regions in the APZ minimap visibility debugger. r=botond)
https://hg.mozilla.org/integration/mozilla-inbound/rev/7dc535385a4f
Backed out changeset 82c3b4b81d82 (Bug 1261554 (Part 2) - Mark frames which are added to the display list when painting to the window as having Visibility::IN_DISPLAYPORT. r=mstange)
https://hg.mozilla.org/integration/mozilla-inbound/rev/22b0444e1e3a
Backed out changeset 69abdc731a99 (Bug 1261554 (Part 1) - Prepare for implementing in-displayport visibility tracking. r=mstange)
https://hg.mozilla.org/integration/mozilla-inbound/rev/a3bbde2006a0
Ensure that if a frame is painted it is added to the approximately visible list. r=mstange
https://hg.mozilla.org/integration/mozilla-inbound/rev/5187bdfbe760
Reland on top of backouts "Bug 1282710 - Part 1 - implement the suspend and resume logics in HTMLMediaElement.cpp according to visibility events; r=cpearce r=kamidphish" r=kaku
https://hg.mozilla.org/integration/mozilla-inbound/rev/8de973bd95ad
Reland on top of backouts "Bug 1282710 - part 2 - Plumb the visibility event from nsIFrame to nsIDOMMediaElemnt; r=seth" r=kaku
https://hg.mozilla.org/integration/mozilla-inbound/rev/f56deec592bd
Reland on top of backouts "Bug 1299065 - invisible elements in the foreground should also be recorded; r=gerald" r=kaku
Rebasing these patches on aurora is a little complicated because we don't need to backout and reland some things that haven't landed on aurora.

Because I don't want to upload that many patches again I'll just upload a zip with the patches and a series file to indicate the order. (Or I can land this on aurora myself.)

Approval Request Comment
[Feature/regressing bug #]: bug 1261554
[User impact if declined]: too many images held in memory (a regression)
[Describe test coverage new/current, TreeHerder]: none
[Risks and why]: as this is a (mostly) backout is should be low risk
[String/UUID change made/needed]: none
Attachment #8791060 - Flags: approval-mozilla-aurora?
Comment on attachment 8791060 [details]
patches for aurora in a zip

Backout of a pretty big code change to avoid regressing on mem usage, this seems a bit risky but 50 has an entire Beta cycle to go through and shakeout any unwanted regressions. Aurora50+
Attachment #8791060 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Hello! Can you please give me some details about reproducing and verifying this issue? Where exactly in about:memory do I have to notice that every image on the page is kept decoded and locked?
Flags: needinfo?(tnikkel)
(In reply to Iulia Cristescu, QA [:IuliaC] from comment #62)
> Hello! Can you please give me some details about reproducing and verifying
> this issue? Where exactly in about:memory do I have to notice that every
> image on the page is kept decoded and locked?

Search for the images section in about:memory, look for locked and unlocked.
Flags: needinfo?(tnikkel)
(In reply to Timothy Nikkel (:tnikkel) from comment #63)
> (In reply to Iulia Cristescu, QA [:IuliaC] from comment #62)
> > Hello! Can you please give me some details about reproducing and verifying
> > this issue? Where exactly in about:memory do I have to notice that every
> > image on the page is kept decoded and locked?
> 
> Search for the images section in about:memory, look for locked and unlocked.
 
First of all, thank you! 
Yes, this way I tried to reproduce the issue, but I didn't manage to notice a remarkable difference between the about:memory sections. 
For example, using the testcase attachment or https://satoshimatrix.wordpress.com/2012/03/23/top-100-nesfamicom-games-list-all-in-one/, I compared the about:memory sections:
- on 51.0a2 (2016-11-09) 
  - e10s on
    1.46 MB ── imagelib-surface-cache-estimated-locked
          1.46 MB ── imagelib-surface-cache-estimated-total
                0 ── imagelib-surface-cache-overflow-count

    127.30 MB ── imagelib-surface-cache-estimated-locked
        127.30 MB ── imagelib-surface-cache-estimated-total
                0 ── imagelib-surface-cache-overflow-count   

  - e10s off 
    43.06 MB ── imagelib-surface-cache-estimated-locked
         43.06 MB ── imagelib-surface-cache-estimated-total
                0 ── imagelib-surface-cache-overflow-count                

    0.00 MB ── imagelib-surface-cache-estimated-locked
          0.00 MB ── imagelib-surface-cache-estimated-total
                0 ── imagelib-surface-cache-overflow-count

- on 50.0a1 (2016-07-04)
  - e10s on 
    1.51 MB ── imagelib-surface-cache-estimated-locked
          1.51 MB ── imagelib-surface-cache-estimated-total
                0 ── imagelib-surface-cache-overflow-count

    127.29 MB ── imagelib-surface-cache-estimated-locked
        127.29 MB ── imagelib-surface-cache-estimated-total
                0 ── imagelib-surface-cache-overflow-count
  
  - e10s off
    1,017.61 MB ── imagelib-surface-cache-estimated-locked
      1,017.61 MB ── imagelib-surface-cache-estimated-total
               14 ── imagelib-surface-cache-overflow-count 

I didn't find any "unlocked" for image sections.
Flags: needinfo?(tnikkel)
(In reply to Iulia Cristescu, QA [:IuliaC] from comment #64)
>   - e10s off
>     1,017.61 MB ── imagelib-surface-cache-estimated-locked
>       1,017.61 MB ── imagelib-surface-cache-estimated-total
>                14 ── imagelib-surface-cache-overflow-count 

The problem is this 1GB of locked images, which is gone in the newer builds, so considered this verified.
Flags: needinfo?(tnikkel)
Status: RESOLVED → VERIFIED
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.