Zooming in on an image sometimes flashes a smaller image in upper left first

RESOLVED FIXED in Firefox 46

Status

()

defect
P3
normal
RESOLVED FIXED
4 years ago
3 years ago

People

(Reporter: djf, Assigned: seth)

Tracking

(Blocks 1 bug, {regression})

unspecified
mozilla48
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:2.5+, firefox45 affected, firefox46 fixed, firefox47 fixed, firefox48 fixed, b2g-v2.2 unaffected, b2g-v2.5 affected, b2g-master affected)

Details

(Whiteboard: gfx-noted)

Attachments

(3 attachments, 2 obsolete attachments)

On both Aries and Flame (on master, I haven't tested 2.5) certain images sometimes incorrectly display a small version when I first double-tap to zoom in.

I suspect (and hope) that this is a MediaFrame regression caused by my patch in bug 1207792 or by Punam's patch in bug 1188286.  If it is not either of those, then it is probably something to do with Seth's recent gecko changes.

This bug does not seem to occur, or at least not obviously with images from the camera on Aries or Flame. But it is easy to reproduce with the test images in apps/gallery/test/images. 65.jpg, 63.jpg and 31.jpg, for example.
Assignee: nobody → dflanagan
Reverting bug 1188286 does not fix this. Bug 1207792 only affects videos not photos, but I tried reverting it anyway. No luck.

So I'm going to have to actually investigate this one.
I assumed that something was going wrong with the backgroundImage code in media_frame.js and it was the background image that was not being displayed correctly. But if I comment out all the backgroundImage stuff I still see this bug.
If I add more delay between setting the CSS background image and displaying the full-size image then this problem seems to go away.  

So perhaps the attempt to decode the full-size image prevents the CSS background image from displaying, and we're left with only the original preview image, but it is small because we're zoomed in.

But that does not explain why this only seems to affect jpeg images and not png images....

I need to figure out just what kinds of previews are involved for the affected images. If they are jpegs without previews then the fact that we're using #-moz-samplesize for the preview might have something to do with it. An image caching issue, perhaps.
I also need to check whether this reproduces in 2.5 or not.
[Blocking Requested - why for this release]: It seems like a pretty bad graphics glitch, and would be nice to fix if we can.

This does affect 2.5.

And it does affect large png images: I can reproduce it easily with the test image 35.png

This affects jpegs with no EXIF preview, or with invalid EXIF preview. And it affects them if they've got an external preview (like large pngs have) or if they create a preview by downsampling.  I have not been able to reproduce it with jpegs from the Aries camera, even though they are big and don't have usable EXIF previews.
blocking-b2g: --- → 2.5?
Regression with a bad user experience.
blocking-b2g: 2.5? → 2.5+
Keywords: regression
Priority: -- → P3
I've figured out that this is not related to the preview-to-fullsize switching code in _displayImage. If I modifiy the zoom() function so that it does not even call switchToFullsizeImage() I still see this just with the zoom and no image switching.

If I also take out the zoom animation, then the bug does not occur. (But if I take out the zoom animation but leave in the image switching, then the bug still occurs for images where there is actually something to switch.)

So I think what is happening is that the image is being decoded at an original small size, then when we do an animated zoom it needs to be re-decoded at a larger size. But something is going wrong with the gecko code that is supposed to just upscale the existing pixels.

I suspect it should be possible to create a reduced test case that just displays an image and a zoom button, then zooms in on the image using a CSS animated transform.

Seth: I haven't quite narrowed this down, but I think I suspect it is a regression caused by one of your recent downsample-on-decode patches.
Flags: needinfo?(seth)
I think we should get a regression range on this one.
QAwanted to check 2.2 first.
Keywords: qawanted
Confirmed that 2.2 is unaffected. Image zooms in as a whole, and not showing a small image on left first.

For test image 31 (31.jpg), the device was also able to display the image, where as on master it just shows a black image and it will only show after zooming in. This could be another bug.

Device: Flame 2.2
BuildID: 20151119032505
Gaia: 885647d92208fb67574ced44004ab2f29d23cb45
Gecko: e772f343b736
Gonk: bd9cb3af2a0354577a6903917bc826489050b40d
Version: 37.0 (2.2) 
Firmware Version: v18Dv4
User Agent: Mozilla/5.0 (Mobile; rv:37.0) Gecko/37.0 Firefox/37.0

Working on getting the window.
Flags: needinfo?(jmercado)
Keywords: qawanted
QA Contact: pcheng
Posted file reduced test case
I've created a reduced test case for this bug.

Load http://djf.net/1125934/ into the FirefoxOS browser and follow the directions. 

This does not reproduce on my MacOS desktop in Nightly or Developer Edition, but it does in the FirefoxOS browser
(In reply to Pi Wei Cheng [:piwei] from comment #11)
> Confirmed that 2.2 is unaffected. Image zooms in as a whole, and not showing
> a small image on left first.
> 
> For test image 31 (31.jpg), the device was also able to display the image,
> where as on master it just shows a black image and it will only show after
> zooming in. This could be another bug.
> 

That is bug 1124696. There's a patch under review for it. 
> 
> Working on getting the window.

Thanks!
The reduced test case at http://djf.net/1125934/ does not reproduce in Firefox Nightly for Android.
I can reproduce this bug on Aries and Flame, 2.6 and 2.5, but surprisingly, the reduced test case does not reproduce on my Flame, only on my Aries device.
Seth: do you think this is an imagelib bug related to downsampling, or some kind of graphics bug?

I'm going to change the product and component to core:imagelib, but feel free to alter that if you think it is wrong.

Also, I'm going to unassign myself since I don't think I can make any more progress on it.
Assignee: dflanagan → nobody
Component: Gaia::Gallery → ImageLib
Product: Firefox OS → Core
I just realized that my reduced test case uses the wrong bug number in the URL. Sorry for the extra bit of confusion
Whiteboard: gfx-noted
Note:
For the window below, both last working and first broken builds exhibit bug 1201322.

Mozilla-inbound regression window:

Last Working
Device: Flame
BuildID: 20151029162015
Gaia: 91cac94948094cfdcd00cba5c6483e27e80cb3b0
Gecko: 1d3d17144c072d9def39480e4ad4254c6e05b375
Version: 45.0a1 (2.6 Master) 
Firmware Version: v18Dv4
User Agent: Mozilla/5.0 (Mobile; rv:45.0) Gecko/45.0 Firefox/45.0

First Broken
Device: Flame
BuildID: 20151029163913
Gaia: 91cac94948094cfdcd00cba5c6483e27e80cb3b0
Gecko: d3a2e391df48f1c9389bdc132eb72065442dc2db
Version: 45.0a1 (2.6 Master) 
Firmware Version: v18Dv4
User Agent: Mozilla/5.0 (Mobile; rv:45.0) Gecko/45.0 Firefox/45.0

Gaia is the same so it's a Gecko issue.

Gecko pushlog:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=1d3d17144c072d9def39480e4ad4254c6e05b375&tochange=d3a2e391df48f1c9389bdc132eb72065442dc2db

This issue is likely caused by changes made in Bug 1207355.
Blocks: 1207355
QA Whiteboard: [QAnalyst-Triage?]
Seth and David this issue seems to have been caused by the changes for bug 1207355.  Can you please take a look?
Flags: needinfo?(jmercado) → needinfo?(dflanagan)
Thanks Jayme. Is it possible to see if the same last working/first broken gecko builds apply to bug 1225027 as well? That one seems to be Aries only. I don't know where to find a Aries build from before 10/29 to test with.
Flags: needinfo?(dflanagan) → needinfo?(jmercado)
Setting qawanted to check the request in comment 20.
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(jmercado) → needinfo?(pcheng)
Keywords: qawanted
Confirmed that bug 1225027 has the same window as this bug.

Window for bug 1225027:

Last working
Device: Aries
BuildID: 20151030002823
Gaia: 91cac94948094cfdcd00cba5c6483e27e80cb3b0
Gecko: 1d3d17144c072d9def39480e4ad4254c6e05b375
Version: 45.0a1 (2.6 Master) 
Firmware Version: D5803_23.1.A.1.28_NCB.ftf
User Agent: Mozilla/5.0 (Mobile; rv:45.0) Gecko/45.0 Firefox/45.0

First broken
Device: Aries
BuildID: 20151030005155
Gaia: 91cac94948094cfdcd00cba5c6483e27e80cb3b0
Gecko: d3a2e391df48f1c9389bdc132eb72065442dc2db
Version: 45.0a1 (2.6 Master) 
Firmware Version: D5803_23.1.A.1.28_NCB.ftf
User Agent: Mozilla/5.0 (Mobile; rv:45.0) Gecko/45.0 Firefox/45.0
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(pcheng) → needinfo?(jmercado)
Keywords: qawanted
This is the same window David.
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(jmercado) → needinfo?(dflanagan)
It's unlikely that this bug actually has anything at all to do with downscale-during-decode or size prediction, as that code simply does not work in a way that would cause that issue. The size at which we draw an image is determined by layout, not ImageLib, and has nothing to do with the size at which the image is decoded. If the image were drawn at the correct size, but pixellated, that would point to an ImageLib bug, but that's not what's happening here. Also, DDD does not apply to layerized images right now, and the image in the reduced testcase is *probably* getting layerized.. The regression window is almost definitely misleading here.

Here's what I strongly suspect is *actually* happening:

- There was an existing race condition in the CSS animation/transition code. (Or, quite possibly, in the code that paints layerized images.)
- The code in the bug identified in the regression range changed the timing of when we start decoding CSS background images; we now trigger decoding for them a little later than we did before.
- As a result, the race condition is now easy to trigger.

The fact that exactly the same code handles this stuff on every platform increases my confidence that we're looking at a race condition; it's likely that on platforms other than the Aries, things are simply fast enough that we don't trigger the race.

My best hypothesis at this point (this is totally a guess based upon what I know of the code; I haven't debugged this yet as I haven't yet been able to reproduce): the issue lies in either layers::ImageContainer or the code that handles ImageLayers, and the cause is that some code in there is wrongly looking at the size of the texture in the ImageContainer to decide what size to paint at, instead of using the size as determined by layout. We can determine if this is true by modifying the code to *always* hand back an ImageContainer with the wrong size; if the hypothesis is correct, then the result is that the image will always be painted at the wrong size when layerized (i.e., during the transition).
Flags: needinfo?(seth)
Reading back over that comment, I feel I should clarify one thing: when I say that DDD does not apply to layerized images, what I mean by that is that regardless of the size of the layerized image as displayed on the screen, we always decode the image to its intrinsic size and use that version of the image.

There *is* an interaction in the following sense: DDD can mean that we have no decoded a version of the image at the intrinsic size at the time that the transition starts, which was much harder to trigger in the past. Thus, what we're really seeing here (according to my guess, which I think is likely but by no means certain) is what happens when we try to perform the transition and the image effectively isn't decoded yet. Apparently, we had a preexisting bug in that case.
I wouldn't know how to distinguish between a layout issue and an imagelib issue, so go ahead and recategorize the bug as appropriate.

(In reply to Seth Fowler [:seth] [:s2h] from comment #25)
> There *is* an interaction in the following sense: DDD can mean that we have
> no decoded a version of the image at the intrinsic size at the time that the
> transition starts, which was much harder to trigger in the past. Thus, what
> we're really seeing here (according to my guess, which I think is likely but
> by no means certain) is what happens when we try to perform the transition
> and the image effectively isn't decoded yet. Apparently, we had a
> preexisting bug in that case.

The image is decoded at one size before the animated scale begins, of course, since we can see it on the screen.  But the fact that this happens only on the first scale and not on subsequent ones indicates to me that the image is not decoded at the size that is needed by the end of the scale.

When you were telling me about your DDD plans we talked about this case because the Gallery code for zooming in has always been kind of fragile where we have to switch from a preview image to a full-size image.  IIRC, you said that gecko would use whatever image it had already decoded and would use a pixelated version of that until the image was decoded at the larger size that was needed.  I can see that happening when I run my test case on desktop. But it isn't happening on the smartphones.

Maybe that's exactly what you are already saying in comment #24...

Anyway, we need to get someone to fix the issue. If this is a layout issue, do you know who should work on it?  Also, do you have any suggestions for workarounds I should try? Like can I just put a will-change declaration on the window or something to indicate that it will have an animated transform?
Flags: needinfo?(dflanagan) → needinfo?(seth)
I suspect this is related to bug 1204470. Adding a speculative dependency.

(In reply to David Flanagan [:djf] from comment #26)
> IIRC,
> you said that gecko would use whatever image it had already decoded and
> would use a pixelated version of that until the image was decoded at the
> larger size that was needed.  I can see that happening when I run my test
> case on desktop. But it isn't happening on the smartphones.

So ImageLib is correctly doing that or we wouldn't see a smaller version of the image, we'd see nothing at all. That part appears to be working. What's not working is the code that decides the rectangle that we should draw the image into on the screen. ("on the screen" is not quite correct but let's try to keep this as simple as possible =)

> Anyway, we need to get someone to fix the issue. If this is a layout issue,
> do you know who should work on it?  Also, do you have any suggestions for
> workarounds I should try? Like can I just put a will-change declaration on
> the window or something to indicate that it will have an animated transform?

So I think lots of people including me are qualified to write the patch, but Matt Woodrow is probably a good person to triage it (basically suggest where in the code the issue probably lies with more detail than I can do offhand) and suggest a possible workaround.
Blocks: 1204470
Flags: needinfo?(seth)
Matt, is this something you could have a look at and help us find an owner for this bug and the possibly related issue in bug 1204470?  Hopefully it will help that we have a test case.
Flags: needinfo?(matt.woodrow)
Posted image firstframe.png
I can reproduce this on OSX.

When we first layerize the image (for the transform transition), the surface that ImageLib gives us is 2560x2048, but has the content scaled to only cover the top 600x480 pixels (the remainder is transparent).

I've attached this image for reference.

After a while we get a surface with the content covering all the pixels, and the transition continues looking correct.
Flags: needinfo?(matt.woodrow)
Back over to Seth, since this seems like an ImageLib bug to me.

Seth, if using only a subset of the SourceSurface is intentional then you have two choices to inform  layout about the relevant rect:
 * Create a new DataSourceSurface wrapping the same data pointer, but with the smaller size and a massive stride so that lines up.
 * Implement layers::Image::GetPictureRect for CairoImage (nobody has needed it yet for that format), and specify the subset rect in the CairoImage ctor.
Flags: needinfo?(seth)
(In reply to Matt Woodrow (:mattwoodrow) from comment #30)
> Back over to Seth, since this seems like an ImageLib bug to me.

I concur. It's definitely not intentional. In fact this is the bug, right here:

https://dxr.mozilla.org/mozilla-central/source/image/RasterImage.cpp#631

The intrinsic size is hardcoded when we create the layers::CairoImage. <smacks forehead> The size needs to come from |surface| instead.

Thanks for triaging this, Matt. This particular failure mode simply didn't occur to me, since after all the SourceSurfaces that the imgFrames manage definitely do have the right size.
Flags: needinfo?(seth)
It's not actually clear why the CairoImage ctor takes a separate size parameter, since all the callers (that I saw from a quick look) appear to just use the surface size.
Seth: are you working on this bug? It is also implicated in bug 1225027.
Flags: needinfo?(seth)
Posted patch image-size (obsolete) — Splinter Review
Assignee: nobody → matt.woodrow
Attachment #8694984 - Flags: review?(seth)
Flags: needinfo?(seth)
Posted patch unbitrotted patch (obsolete) — Splinter Review
I updated the patch.

It doesn't seem to fix the problem in the testcase here though. Maybe it did when it was posted though?
Flags: needinfo?(matt.woodrow)
Interesting, I see this too :(

It was definitely fixed when I tested it initially.
I tried to bisect to see what broke your patch but I still see the bug with your patch applied to https://hg.mozilla.org/mozilla-central/rev/f6ac392322b3 which is from the day you posted the patch.
(In reply to Timothy Nikkel (:tnikkel) from comment #37)
> I tried to bisect to see what broke your patch but I still see the bug with
> your patch applied to
> https://hg.mozilla.org/mozilla-central/rev/f6ac392322b3 which is from the
> day you posted the patch.

Lots of confusion right now. The patch definitely doesn't fix the problem, and never did.

The problem is in RasterImage::CopyFrame.

We call CreateDataSourceSurface with the full image size (2560,2048), but frameRef->GetRect() returns (0,0,600,480).

The copy from the frameRef into the new surface doesn't scale, so we end up with the content only covering the top-left of the resulting surface.

We take the x/y coordinates of frameRef->GetRect() into account, so it would appear that the frameRef can sometimes return only a subset of the full image and the partial coverage is intended behaviour.

I can't see any easy way to differentiate between the frameRef representing a subset and the frameRef being downscaled (or possibly both?).

Seth, how is this meant to be handled?
Flags: needinfo?(matt.woodrow) → needinfo?(seth)
See Also: → 1240680
Duplicate of this bug: 1235093
I'm going to go ahead and speculatively block bug 1225027 because I think there's good reason to think that this is the same issue.
Blocks: 1225027
Flags: needinfo?(seth)
See Also: 1225027
Depends on: 1251804
Depends on: 1251806
Depends on: 1251807
Depends on: 1251808
OK, this bug uncovered a bunch of issues, and in the interests of simplifying
bisecting and allowing partial backouts I've separated them into different bugs.
This patch is the one that fixes the real, fundamental problem here. (Though the
other bugs are all real problems, and we should land those fixes as well!)

It's as simple as this: we should not allow surface substitution when
FLAG_HIGH_QUALITY_SCALING is disabled, because that means that we shouldn't be
doing downscale-during-decode.

The fact that we *were* substituting here was resulting in us returning a
surface of unexpected size to GetCurrentImage(), which was triggering all sorts
of other interesting issues. (You'll find details in the bugs that block this
one.) The fix is as simple as adding the appropriate check in
RasterImage::LookupFrameInternal().

Tim, I'm going to be unavailable next week, but this issue is really high
priority. Could you please set |checkin-needed| on this bug and its blockers
once they get r+'d? (I'll try to go ahead and land them if you review quickly.)

This patch should fix the problem all by itself, without any of the blocker
bugs, but to confident that no issues remain I'd *really* prefer that those bugs
land, so please only change the blocking relationship if something needs to be
backed out.

(It's worth noting that I've moved Matt's original patch, which is still a good
idea, to bug 1251808.)
Attachment #8724360 - Flags: review?(tnikkel)
Attachment #8694984 - Attachment is obsolete: true
Attachment #8708167 - Attachment is obsolete: true
Attachment #8694984 - Flags: review?(seth)
Assignee: matt.woodrow → seth
The try job in comment 42 includes the entire patch stack - i.e., the patch in this bug plus the patches in all the blockers.
Note that bug 1251808 turned out to have a missing |explicit| on a new constructor, so the try job above will include some static analysis failures related to that. I've already pushed a followup to bug 1251808 that fixes that issue, so it's no longer a problem and shouldn't interfere with landing any of the other patches.
https://hg.mozilla.org/integration/mozilla-inbound/rev/396e7cdc25393a818a8e3a07af49284655ad58a3
Bug 1225934 - Never allow surface substitution when FLAG_HIGH_QUALITY_SCALING is disabled. r=tn
Oh drat, this didn't get reviewed yet, and I just pushed it. Tim, can you do a post-landing review? We can back out if it's r-.
Flags: needinfo?(tnikkel)
I had question about it. It seems like we are overloading the high quality scaling flag with two seperate things. 1) the caller is asking for a decode at the size it requests 2) the caller is willing to accept a surface of a different size.
Flags: needinfo?(tnikkel)
(In reply to Timothy Nikkel (:tnikkel) from comment #47)
> I had question about it. It seems like we are overloading the high quality
> scaling flag with two seperate things. 1) the caller is asking for a decode
> at the size it requests 2) the caller is willing to accept a surface of a
> different size.

Well, right now it doesn't make semantic sense to ask for a surface of a size other than the intrinsic size and not pass FLAG_HIGH_QUALITY_SCALING. We should probably have an assert. But given that no caller which does not pass that flag is going to be able to handle DDD at present, I think this patch is the right thing for now.
https://hg.mozilla.org/mozilla-central/rev/396e7cdc2539
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
[Expanding platform to All/All, since this isn't b2g/gonk-specific, per dupes like bug 1235093.  Also, setting status flags for firefox versions that are affected by this bug -- this goes back to Firefox 45 (just released) since that's where regressing bug 1207355 landed.]
OS: Gonk (Firefox OS) → All
Hardware: ARM → All
Duplicate of this bug: 1246509
Comment on attachment 8724360 [details] [diff] [review]
Never allow surface substitution when FLAG_HIGH_QUALITY_SCALING is disabled.

Approval Request Comment
[Feature/regressing bug #]: Not sure exactly, but it's one of the blockers of |ddd|.
[User impact if declined]: Ugly flashing of images in some situations.
[Describe test coverage new/current, TreeHerder]: It's not easy to write a test for this.
[Risks and why]: Very low risk. Unlikely to introduce a new regression. The change is extremely simple.
[String/UUID change made/needed]: None.
Attachment #8724360 - Flags: review?(tnikkel) → approval-mozilla-aurora?
Heh, it looks like the regressing bug is bug 1235093 per comment 50, though I suspect it's only getting the blame because it changes timing.
(In reply to Seth Fowler [:seth] [:s2h] from comment #53)
> Heh, it looks like the regressing bug is bug 1235093 per comment 50

I think you mis-pasted -- bug 1235093 is a duplicate of this bug, and the regressing bug is bug 1207355. (This regressing bug was also identified/suggested earlier here in comment 18.)
Blocks: 1249495
Comment on attachment 8724360 [details] [diff] [review]
Never allow surface substitution when FLAG_HIGH_QUALITY_SCALING is disabled.

Fix has been in Nightly for a week and change seems well scope, taking it in Aurora47.
Attachment #8724360 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Can you request uplift to beta as well if you think that it's applicable? I'm hoping that would fix bug 1240680. This could make it into beta 4.  Thanks!
Flags: needinfo?(seth)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #58)
> Can you request uplift to beta as well if you think that it's applicable?
> I'm hoping that would fix bug 1240680. This could make it into beta 4. 
> Thanks!

Yeah, I think we can safely push this into beta.
Flags: needinfo?(seth)
(In reply to Daniel Holbert [:dholbert] from comment #54)
> I think you mis-pasted -- bug 1235093 is a duplicate of this bug, and the
> regressing bug is bug 1207355. (This regressing bug was also
> identified/suggested earlier here in comment 18.)

Yeah, I did mispaste. Thanks for the correction.
Comment on attachment 8724360 [details] [diff] [review]
Never allow surface substitution when FLAG_HIGH_QUALITY_SCALING is disabled.

Approval Request Comment
[Feature/regressing bug #]: Not sure exactly, but it's one of the blockers of |ddd|.
[User impact if declined]: Ugly flashing of images in some situations.
[Describe test coverage new/current, TreeHerder]: It's not easy to write a test for this.
[Risks and why]: Very low risk. Unlikely to introduce a new regression. The change is extremely simple.
[String/UUID change made/needed]: None.
Attachment #8724360 - Flags: approval-mozilla-beta?
Comment on attachment 8724360 [details] [diff] [review]
Never allow surface substitution when FLAG_HIGH_QUALITY_SCALING is disabled.

Regression from 45, let's uplift this to beta. 
It should land in time for beta 4 early next week.
Attachment #8724360 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Pulsebot from comment #65)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/55ba885e2a85

comment only change to clarify the high quality scaling flag.
Duplicate of this bug: 1259487
Duplicate of this bug: 1260645
Duplicate of this bug: 1261337
Duplicate of this bug: 1261668
Duplicate of this bug: 1262511
Duplicate of this bug: 1266083
You need to log in before you can comment on or make changes to this bug.