Closed Bug 845147 Opened 11 years ago Closed 11 years ago

Janky scrolling on pages with many small images as a result of bug 689623

Categories

(Core :: Graphics: ImageLib, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla24

People

(Reporter: justin.lebar+bug, Assigned: tnikkel)

References

(Depends on 1 open bug)

Details

(Whiteboard: [snappy][MemShrink])

Attachments

(3 files, 2 obsolete files)

(njn reported this; I haven't tried it myself)

STR that I expect will work but haven't tried myself:

* http://mpdrolet.tumblr.com/archive
* Scroll down so there are a few pages of images
* Wait 30s
* Grab the scrollbar and drag it to the top of the page.  This is not janky in a nightly without bug 689623, but the assertion is that it is with bug 698623.

I suspect this is because we're sync-decoding all these small images.

Perhaps a heuristic we can apply is not to sync-decode images that have been discarded (or have had their decodes canceled).  The sync decoding is really needed only for our tests, right?

I think this should be a high-priority follow-up to bug 698623, because I suspect it will be whine-worthy (that is, we'll get a lot of people complaining about it).
Blocks: 689623
No longer blocks: 698623
I see this on Linux and Mac, but it's not (to my eye) terrible.
I thought we synchronously did a small amount of decoding for all images? And for small images that small amount decodes the whole image, whereas for larger images it just does a piece of the decoding work?
I guess nothing you said contradicts that.

Anyways I thought we did the sync decoding so that we do get too many images flashing into view on tab switch.
> Anyways I thought we did the sync decoding so that we do get too many images flashing into view on 
> tab switch.

That may have been a justification at one point, but I'd bet money that Taras would trade some image flashing for faster tab switching.  The tests were a real impediment to experimentation here, last time I checked.
I have no idea if this will work, but if it compiles, I'll try it out on my slow netbook.

https://tbpl.mozilla.org/?tree=Try&rev=af9e2f293ae6
Could it be so simple?  Still waiting for test results.

The difference in jank on my crappy netbook with this patch is night-and-day.

There's a visible downside, which is that images take longer to come into view.  I presume this is because we're devoting more CPU to scrolling and less to decoding.  I think this is the right trade-off, though.
Alas this appears to be causing reftest failures.

I'm kind of tired of using reftest as an excuse for not fixing jank.

How about we make this change but hide it behind a pref and set that pref to false for reftest?
Flags: needinfo?(joe)
(In reply to Justin Lebar [:jlebar] from comment #7)
> How about we make this change but hide it behind a pref and set that pref to
> false for reftest?

I guess this patch doesn't necessarily do what you'd expect due to the fact that we share images between windows.  So an image might have been decoded before, but in a different window.

So perhaps the right thing to do is to hide /all/ SOMEWHAT_SYNC decodes behind a pref, instead of all but the first.
It'll be pretty temporary anyways - most of the time we'll probably be switching SOMEWHAT_SYNCHRONOUS to NORMAL after bug 716140. (Switching from StartDecoding() to RequestDecode() in users, that is - I don't want to interpret anything in the depths of RequestDecode().)

Bug 716140 will probably make this a little better regardless, because the non-synchronous part of the decode will be off the main thread, but given that you can basically instantaneously start the decoding on a different thread by calling RequestDecode() in a multithreaded decoding world, there's not much reason to do StartDecoding() unless you actually want to block the main thread.
Flags: needinfo?(joe)
IOW, if this actually helps, a=me to hide this behind a non-reftest pref, and we'll revisit in general as we tune in a multithreaded world.
\o/  I'll give you a patch soon.
You seem to be failing only one reftest, and only on some platforms. Maybe it would be better to disable one test then disable this patch for all reftests.
(In reply to Timothy Nikkel (:tn) from comment #12)
> You seem to be failing only one reftest, and only on some platforms. Maybe
> it would be better to disable one test then disable this patch for all
> reftests.

Sure, it sounds like waiting for full test results would be reasonable.  Another option (if reftest supports it) would be to set the pref only for the relevant tests.
I'm happy to have a look at the tests we're failing to see if I can understand what's going on; I think that's a reasonable burden to meet before hacking prefs.
(In reply to Justin Lebar [:jlebar] from comment #13)
> Another option (if reftest supports it) would be to set the pref only for
> the relevant tests.

It does. http://mxr.mozilla.org/mozilla-central/source/layout/tools/reftest/README.txt#160
I wonder if we could just make reftest wait until all images have been decoded.

What's going on in layout/reftests/bugs/816948-1.html is we set an image to be re-decoded in a "reftest-wait" test, and then we immediately remove "reftest-wait" from the html element.  There's no reason that the image should be decoded at this point.
Yeah, you can use special powers to observe the decode complete notification. http://mxr.mozilla.org/mozilla-central/source/image/test/crashtests/ownerdiscard.html?force=1 does this.
I tried hacking something into reftest so it waits for all images to be decoded.  We'll see if this works.

https://tbpl.mozilla.org/?tree=Try&rev=4d5234c503bc
> We'll see if this works.

> 16:18:33 INFO - REFTEST TEST-UNEXPECTED-FAIL | | program error managing snapshots

Apparently not.  I got this error locally a few times, but then it seemed to stop happening.
Okay, so the attempt at fixing things in comment 21 fell short.

Judging by the logs, reftest is successfully waiting for the decode requests to finish, and no decodes finish after reftest takes its snapshot.  So it appears that when reftest gets notified that the decodes are done, those decodes haven't (yet?) triggered a pending paint event which would otherwise cause reftest to wait.

AFAICT (and I'll push to try to verify this), we will always call FlushInvalidations() on the image before we notify reftest that decoding is done.  (When the decoder shuts down, it calls FlushInvalidations(), and all decoders should shut down before I fire the notification.)

So assuming I'm right and we are calling FlushInvalidations() before we notify reftest, my question is: Should FlushInvalidations() be sufficient to get reftest to wait at [1], or do we need to do something harder than this?  dholbert or dbaron, do you understand this code (perhaps including the changes I made [1]) well enough to say?

[1] https://hg.mozilla.org/try/rev/2496a1254d53#l5.65
Flags: needinfo?(dholbert)
Flags: needinfo?(dbaron)
Yes, afaict we flush invalidations before sending the notification to reftest.
And I don't see anything in FlushInvalidations which interacts with layout, so I wonder how we know to paint the image.  Do we just get around to it when we get around to it?
(In reply to Justin Lebar [:jlebar] from comment #22)
> dholbert or dbaron, do you understand this code (perhaps including the
> changes I made [1]) well enough to say?

Well, the simple answer is just:  no.
Flags: needinfo?(dbaron)
In my (brief) gdb'ing, one thing looked suspicious.  AFAICT, layout is listening to the LOAD_COMPLETE notification but not the FRAME_COMPLETE notification.  Again AFAICT LOAD_COMPLETE means we've finished downloading the image but may not have finished decoding it yet, while FRAME_COMPLETE signals that we're done decoding.  It doesn't make much sense to me that we'd redraw when we're done downloading the image but not when we're done decoding it, so I tried changing it.

https://hg.mozilla.org/try/rev/378fd3add332

I can't reproduce the test failures on my mac, so I guess we'll see on try if this has any effect.

https://tbpl.mozilla.org/?tree=Try&rev=378fd3add332
It seems that ImageLoader::DoRedraw never does anything in my tests.  This is because FrameLayerBuilder::IterateRetainedDataForFrame is iterating over a null array.  As a result, we don't call SchedulePaint() on the image's frame when it's done decoding.

In fact, we never call ImageLoader.cpp::InvalidateImagesCallback when I run reftest 816948-1.html.
(In reply to Justin Lebar [:jlebar] from comment #27)
> In my (brief) gdb'ing, one thing looked suspicious.  AFAICT, layout is
> listening to the LOAD_COMPLETE notification but not the FRAME_COMPLETE
> notification.  Again AFAICT LOAD_COMPLETE means we've finished downloading
> the image but may not have finished decoding it yet, while FRAME_COMPLETE
> signals that we're done decoding.  It doesn't make much sense to me that
> we'd redraw when we're done downloading the image but not when we're done
> decoding it, so I tried changing it.

I can offer some insight here. Imagelib guarantees that at least the first frame is decoded by the time LOAD_COMPLETE was fired. Bug 841579 will remove that requirement in support of OMTD, though. It's probably quite near to landing.
Thanks, Seth.

I guess since all of this is slated to change soon, it makes sense to do the simplest thing that could possibly work and switch to all async decodes except for the tests that need sync decodes.  If my tests ever run, we'll see how green I am this time.
Flags: needinfo?(dholbert)
As it turns out, it seems I now need a "all images finished decoding" notification for B2G, to fix bug 846177.  So maybe we need to go back to that approach.

I could use some help from a layout/gfx person, because I'm not familiar with this stuff at all.  Aha, Seth just pinged me!
Justin, if all you need is a notification that guarantees that an image has downloaded and at least the first frame has decoded, the image's load event will do that for you. Also, if these images were all present when the document first loads, the _window's_ load event (not the document's; my apologies) should only be fired after the first frame has decoded for all images.
(In reply to Seth Fowler [:seth] from comment #36)
> Justin, if all you need is a notification that guarantees that an image has
> downloaded and at least the first frame has decoded, the image's load event
> will do that for you.

But that doesn't work if we're re-decoding the image, which happens in bug 846177.  So I really want a notification which says "all images are done decoding and, if you wait until the next mozAfterPaint, they will be displayed."
khuey asked me to post a summary of where this bug is since he might be able to take it over for me, since I just got stuck with a sizable b2g blocker.

Essentially I think we want to take the approach from comment 21, which adds a function to DOMWindowUtils that lets chrome JS be notified when all image decodes have finished.  We then make the reftest harness wait for this notification.  I think I can make B2G do a similar thing in bug 846177.

So if we want to take this code essentially unchanged, the only remaining task is to fix up the tests that are failing.  (There's a test failure on 816948-1.html and additional assertions on 364066-1.html.  I haven't looked at the latter; they may just go away when you rebase the patch.)

I'll post a copy of the last patch I had in a sec.
See the try build in comment 21; perhaps this just needs two changes to our tests.
I filed bug 848377 , which appears to be a dupe.
Depends on: 695763
This is a slightly altered version of Justin's patch. It uses |mHasBeenDecoded| to decide what kind of decode StartDecoding() should request instead of checking the decode count. This means that it behaves a little differently; in particular, StartDecoding() now passes SOMEWHAT_SYNCHRONOUS to RequestDecodeCore until at least one decode has _finished_, while the original starts passing ASYNCHRONOUS to RequestDecodeCore as soon as one decode _starts_. These two behaviors will almost always appear the same - after all, if a decoder is running, StartDecoding() is ignored anyway - but there are edge cases involving decodes being started and then immediately canceled without finishing where the two would differ.
Assignee: nobody → seth
Comment on attachment 719161 [details] [diff] [review]
Patch, v1: Don't decode images synchronously if we've decoded them once before.

Justin, hopefully you don't find it presumptuous if I mark your version obsolete. =) After thinking about it I think the new version is the way to go, though.
Attachment #719161 - Attachment is obsolete: true
Whoa there. =) Resetting assignee.
Assignee: seth → nobody
(In reply to Seth Fowler [:seth] from comment #44)
> Whoa there. =) Resetting assignee.

I think I liked this bug better that way.
> After thinking about it I think the new version is the way to go, though.

I don't have an opinion either way.

I think ideally we'd never decode somewhat_sync, since that's a recipe for jank.  Maybe we can look at making that change in a separate bug.
It might be that that'll end up happening in bug 854394. I'd support that change as well.
Justin, what are your thoughts on this patch? Should we land it, or look at a different approach, or what?
Flags: needinfo?(justin.lebar+bug)
I'd love to land this, but I thought we were blocked on bug 695763.

If we can't land this before m-c branches to aurora, I think we should consider disabling bug 689623.
Flags: needinfo?(justin.lebar+bug)
We are - sorry, that was poorly phrased. I hope to get us unblocked there ASAP now that that patch's dependencies have landed. I was trying to make sure that we actually wanted to land this patch; wasn't sure if we were still in the discussion stages. It sounds like we do, so I'm going to request review for the patch.
Attachment #728823 - Flags: review?(joe)
Comment on attachment 728823 [details] [diff] [review]
Don't decode images synchronously if we've decoded them once before.

Review of attachment 728823 [details] [diff] [review]:
-----------------------------------------------------------------

you should also see whether it blows up Try :)

::: image/src/RasterImage.cpp
@@ +2776,5 @@
>  /* void startDecode() */
>  NS_IMETHODIMP
>  RasterImage::StartDecoding()
>  {
> +  return RequestDecodeCore(mHasBeenDecoded ? ASYNCHRONOUS : SOMEWHAT_SYNCHRONOUS);

This definitely needs a comment, likely referring to this bug, and saying that we're explicitly trading off flashing for responsiveness in the case that we're redecoding an image.
Attachment #728823 - Flags: review?(joe) → review+
Thanks for the review, Joe. I'll add the comment. As for try, this isn't ready to go until the bug it depends on is ready, but I wanted to get a review from you in case you wanted bigger changes to this patch.
Seth, how's it going here?

I think we should probably disable bug 689623 until we resolve this bug.  If you agree and you can't resolve this bug within a few days, are you able to write and land that patch (on branches if necessary)?
FWIW I tried disabling somewhat-sync decodes on b2g18, and it didn't make any appreciable difference in jank when scrolling pinterest, even after a memory-pressure event knocked out our whole cache of decoded images.
Justin, let's triage the situation on Monday. I want to get bug 695763 in but I can't land it without resolving bug 859377, or else its dependencies will be backed out. Bug 859377 is very high priority for me right now.
Depends on: 878612
We've added a new decode type since, so re-requesting review to be sure.
Attachment #728823 - Attachment is obsolete: true
Attachment #758101 - Flags: review?(joe)
Comment on attachment 758101 [details] [diff] [review]
Don't decode images synchronously if we've decoded them once before.

Review of attachment 758101 [details] [diff] [review]:
-----------------------------------------------------------------

::: image/src/RasterImage.cpp
@@ +2765,5 @@
>  NS_IMETHODIMP
>  RasterImage::StartDecoding()
>  {
> +  return RequestDecodeCore(mHasBeenDecoded ?
> +    SYNCHRONOUS_NOTIFY : SYNCHRONOUS_NOTIFY_AND_SOME_DECODE);

I shall repeat my earlier comment: "This definitely needs a comment, likely referring to this bug, and saying that we're explicitly trading off flashing for responsiveness in the case that we're redecoding an image."
Attachment #758101 - Flags: review?(joe) → review+
(In reply to Joe Drew (:JOEDREW! \o/) from comment #57)
> I shall repeat my earlier comment: "This definitely needs a comment, likely
> referring to this bug, and saying that we're explicitly trading off flashing
> for responsiveness in the case that we're redecoding an image."

Thank you for remembering. This probably would have fallen through the cracks after Justin had to move on from this bug.
I added the comment:
  // Here we are explicitly trading off flashing for responsiveness in the case
  // that we're redecoding an image (see bug 845147).
Let me know if you'd like a change.

And pushed
https://hg.mozilla.org/integration/mozilla-inbound/rev/022bc808586b
This seems like the most likely culprit to have made list-simple-1.html start failing around once every few pushes, with the reference case lot loading its two lime-green-square bullet-point PNG images.

There appears to have been a delay of a few pushes before this started happening, but this is by far the most related-looking push, so philor & I both tend to think this was the cause.
Sample logs:
 https://tbpl.mozilla.org/php/getParsedLog.php?id=23796196&tree=Mozilla-Inbound
 https://tbpl.mozilla.org/php/getParsedLog.php?id=23794047&tree=Mozilla-Inbound
(In reply to Daniel Holbert [:dholbert] from comment #60)
> This seems like the most likely culprit to have made list-simple-1.html
> start failing around once every few pushes, with the reference case lot
> loading its two lime-green-square bullet-point PNG images.

s/reference case lot loading/reference case not rendering/
[So basically, in the failures, it looks like list-simple-1-ref.html's bullet images aren't ready to paint (aren't finished decoding, perhaps?) when onload fires and the reftest-snapshot gets taken, or something.  I believe that image decode used to block onload.]

I backed this out, since I'm pretty sure it's the cause. Apologies if it's not.
 https://hg.mozilla.org/integration/mozilla-inbound/rev/b39437839d53
Retriggering on this push did produce quite a few instances on it.
This is almost certainly the cause. If we *ever* reuse images in a reftest session the first occasion will block onload for the decode, but the second and subsequent won't.

One thought I have is to have a reftest-specific pref that disables this behaviour.
Flags: needinfo?(tnikkel)
Is what happens before/after the load event even relevant in this situation though? For reftests we use the sync decode flag when we are taking the snapshots for comparison so that should trigger a sync decode when we Draw() the image, that should decode it no matter what. The change to StartDecoding shouldn't have any effect on that as Draw() calls SyncDecode, not StartDecoding.
Flags: needinfo?(tnikkel)
Doesn't taking the snapshot just do a glReadPixels(), though?
Nah. CanvasRenderingContext2D::DrawWindow calls nsIPresShell::RenderDocument here:

https://mxr.mozilla.org/mozilla-central/source/content/canvas/src/CanvasRenderingContext2D.cpp#3225

This calls nsLayoutUtils::PaintFrame, which builds a display list and paints it:

https://mxr.mozilla.org/mozilla-central/source/layout/base/nsLayoutUtils.cpp#1847

Note that this failure is exactly what prevented me from landing bug 695763, though given the patches that Timothy has landed recently I would've expected it to be resolved.
We build a display list and use that to determine if we need to invalidate anything in our retained layers. We'll paint anything invalidated, and then composite the retained layers to a resulting surface.

The problem is that if the image is not decoded when the reftest harness takes it's snapshot, even though it asks for sync decodes, no decoding happens because the contents of the retain layers are still marked as valid, so the painting of the display items for those images is optimized away, and hence we don't get any calls into imagelib with a sync decode flag set. The fix is to check if we are sync decoding and we are not decoding and invalidate, which is what the patches in bug 695763 were doing (but they missed some cases).
Relanded with enough patches from bug 695763 that it shouldn't cause problems

https://hg.mozilla.org/integration/mozilla-inbound/rev/ef8d03814470
https://hg.mozilla.org/mozilla-central/rev/ef8d03814470
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla24
tn, can you clarify what the status of 689623 is now?  AIUI it's enabled on trunk but disabled on Beta and maybe Aurora, or something like that?  Thanks!
Whiteboard: [snappy] → [snappy][MemShrink]
comment 72
Flags: needinfo?(nobody)
Sorry, somehow I overlooked the bugmail for your comment.

689623 is present in 22 and up. It is (currently) pref'ed off in 22 and 23. It is currently pref'ed on in 24 and up. Since we just had a merge day that means 689623 is present but pref'ed off in release and beta, and pref'ed on in aurora and nightly.
Flags: needinfo?(nobody)
So as it stands we should be able to keep bug 689623 pref'ed on from now on (knock on wood), resolving this bug was the only thing blocking us from doing that.
There is of course room to improve on the current situation so that we only skip sync decoding when scrolling image into view. Bug 847221 is the first step on that path.
FF24 FTW!  Thanks, tn.
I just noticed that I accidentally landed a version of the patch that I had been using for testing that made us never do any sync decoding when we call RequestDecodeCore (notice the comment markers):

  return RequestDecodeCore(//mHasBeenDecoded ?
    SYNCHRONOUS_NOTIFY /*: SYNCHRONOUS_NOTIFY_AND_SOME_DECODE*/);

It is good news that this is green and survived several days on nightly with no problems reported so far.

Do we want to keep the current behaviour? Or keep the mHasBeenDecoded guard?
Flags: needinfo?(joe)
I think we want to keep the mHasBeenDecoded guard. Otherwise, there's no difference between StartDecoding and RequestDecode. 

Maybe we don't *want* to have a difference between the two — I'd be willing to listen to that argument — but for now you should probably fix that.
Flags: needinfo?(joe)
Assignee: nobody → tnikkel
> Maybe we don't *want* to have a difference between the two — I'd be willing to listen to that  
> argument — but for now you should probably fix that.

Can we file a bug to have this discussion?  I'm with tn in that I think it might be a very nice thing to have.
Attachment #767959 - Flags: review?
Attachment #767959 - Flags: review? → review?(joe)
Attachment #767959 - Flags: review?(joe) → review+
(In reply to Justin Lebar [:jlebar] from comment #80)
> > Maybe we don't *want* to have a difference between the two — I'd be willing to listen to that  
> > argument — but for now you should probably fix that.
> 
> Can we file a bug to have this discussion?  I'm with tn in that I think it
> might be a very nice thing to have.

We have bug 854394 for such a discussion in the context of multithread image decoding. Does that cover it?
Comment on attachment 767959 [details] [diff] [review]
fix landing the wrong patch

[Approval Request Comment]
Bug caused by (feature/regressing bug #): landed the wrong patch in this bug
User impact if declined: possibly more images flashing into view while we decode them
Testing completed (on m-c, etc.): this patch just brings us to what we always intended to land
Risk to taking this patch (and alternatives if risky): the patch that was accidentally landed is more risky than this patch
String or IDL/UUID changes made by this patch: none
Attachment #767959 - Flags: approval-mozilla-aurora?
Attachment #767959 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Whee, my head is spinning now ... just finished reading through all 86 (!!) comments.
Anyways, excellent work tn (et al) on this hard-to-fix issue.

While reading, another fairly old issue came into my mind, that *might* (or might not be related) to the quadruple ('images', 'scrolling', 'jerky', 'image-heavy web page') ... this:

 - bug 629234 -

Though it might look only marginally related, I can't really say that bug 988594 is related any closer related with its Nexus4-typical behavior reported ;-) 

I'm wondering in particular if your fixes would at least minimize the (largely Nvidia-specific) issues reported under #629234 ?
oops, strike the excess 'related' pls
Depends on: 1113855
You need to log in before you can comment on or make changes to this bug.