Bug 1292290 (Closed) - Opened 8 years ago, Closed 8 years ago

Panning page causes images to turn black

Categories

(Core :: Graphics: ImageLib, defect, P1)

50 Branch
defect

Tracking


RESOLVED FIXED
Tracking Status
firefox49 --- unaffected
fennec 50+ ---
firefox50 blocking fixed
firefox51 --- fixed

People

(Reporter: rbarker, Assigned: tnikkel)

References

Details

(Keywords: regression, Whiteboard: [gfx-noted])

Attachments

(1 file)

When panning or zooming a page, loaded images will go black. I tried bisecting to find a regression window but was unable to. It may be an existing bug that has recently become much worse.

Steps to reproduce:

1) Go to micechat.com.
2) Scroll to the bottom and toggle to the desktop theme.
3) Scroll and/or zoom the page in and out several times.

Result: certain images will turn black.

Possible image decode race condition?
Attached video black-images.mp4
Sebastian, this was mentioned at the channel meeting as a severe issue with Aurora 50. Could you please help find an owner who can investigate? Thanks!
Flags: needinfo?(s.kaspari)
Adding flags for triage. Keeping NI.
tracking-fennec: --- → ?
Priority: -- → P1
I see this too and it's terrible.
Milan, we need to get an assignee here.
Flags: needinfo?(milan)
Randall, is it a regression?  I understand bisection didn't work, but do we at least know that, say, 48 works, but 51 doesn't, or something narrower than that?

Botond, while Kats is away - can you reproduce this, and what are the chances this is an APZ bug, where we're perhaps telling the image it isn't in use?
Flags: needinfo?(rbarker)
Flags: needinfo?(milan)
Flags: needinfo?(botond)
(In reply to Milan Sreckovic [:milan] from comment #6)
> Randall, is it a regression?  I understand bisection didn't work, but do we
> at least know that, say, 48 works, but 51 doesn't, or something narrower
> than that?

I ran mozregression three times on this. Every time I thought I had found the regression patch, I would run the parent changeset and could still reproduce the bug there. From what I could tell, this bug has been around for a while but very recently got much worse (possibly around the start of August?).
Flags: needinfo?(rbarker)
I am quite sure that this is a Firefox 50 regression. I use Aurora as my daily browser: 49.0a2 did not have this issue, while 50.0a2 is nearly unusable.
Timothy, this does feel like an imagelib issue, like when the image is not in the cache.  Any debugging we can turn on to confirm or rule out some of the causes?
Component: Graphics, Panning and Zooming → ImageLib
Flags: needinfo?(botond) → needinfo?(tnikkel)
Product: Firefox for Android → Core
Whiteboard: [gfx-noted]
Version: unspecified → 50 Branch
(In reply to Milan Sreckovic [:milan] from comment #6)
> Botond, while Kats is away - can you reproduce this, 

I can.

> and what are the chances this is APZ bug, where we're perhaps telling the image it isn't in use?

I'm not aware of a mechanism by which this could happen. The only thing that comes close is seth's recent in-displayport visibility work, but my understanding is that that's just to inform an optimization, and shouldn't cause a correctness problem like this.

I'm currently trying my luck with mozregression; that seems like the best way forward if we are able to find a reliable range.
(In reply to Botond Ballo [:botond] from comment #10)
> I'm currently trying my luck with mozregression; that seems like the best
> way forward if we are able to find a reliable range.

Using mozregression I found the following Nightly regression range:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=db3ed1fdbbeaf5ab1e8fe454780146e7499be3db&tochange=2ea3d51ba1bb9f5c3b6921c43ea63f70b4fdf5d2

However, my attempts to get a more precise range were foiled by the intermittent nature of the problem, and this is making me question the reliability of the Nightly range as well...
tracking-fennec: ? → 50+
(In reply to Milan Sreckovic [:milan] from comment #9)
> Timothy, this does feel like an imagelib issue, like when the image is not
> in the cache.  Any debugging we can turn on to confirm or rule out some of
> the causes?

It certainly feels like an imagelib issue; unless there is a more general graphics issue going around that affects things besides images, I think it's on imagelib to prove the problem is somewhere else.

Since there have been a lot of changes to imagelib lately, finding the regression would be my first instinct, but the intermittent nature of this bug is obviously making that difficult.

When I reproduce the bug I can see something else drawn first (not sure if it is the correct content or not), and then it's replaced with black, so I don't think it could be an invalidation problem.

In a local build I don't see the decoded image getting discarded, so it doesn't seem to be a situation where the image is getting thrown away.

Doing local builds to narrow down Botond's range in comment 11 may help; I'll try that.
Flags: needinfo?(tnikkel)
If the root cause here is the same as in bug 1290907, then bug 1290907 comment 2 might be relevant.
Following up on comment 13, bug 1290907 comment 2, bug 1290907 comment 5 then suggests that we would "fix" this problem with bug 1260324.
Timothy, does that make sense?

Jamie, can you reproduce this bug?
Flags: needinfo?(tnikkel)
Flags: needinfo?(jnicol)
(In reply to Milan Sreckovic [:milan] from comment #15)
> Following up on comment 13, bug 1290907 comment 2, bug 1290907 comment 5
> then suggests that we would "fix" this problem with bug 1260324.

Do we know that we do in fact have the same root cause as in bug 1290907? This bug is *much* harder to reproduce than that one. And that one's underlying cause must have been around since before that change, so it goes back at least as far as March. While I struggle to find an accurate regression range for this, it seems to have started around the end of July. It could indeed be the same underlying cause, but something else must have changed more recently.

> Jamie, can you reproduce this bug?

Yes. (All the time when using my phone for personal use, and rarely when trying to debug this!!) I will try rebasing the patch from bug 1260324 to see if it has an effect.
(In reply to Milan Sreckovic [:milan] from comment #15)
> Following up on comment 13, bug 1290907 comment 2, bug 1290907 comment 5
> then suggests that we would "fix" this problem with bug 1260324.
> Timothy, does that make sense?

I think they are different issues. When I reproduce this bug the image stays black forever, despite zooming in/out, panning around, and other actions that would cause different tiles to be used.
Flags: needinfo?(tnikkel)
I finished bisecting and I'm quite confident in the results because:
1) whenever I reproduced the bug it was always quite easy to get it to appear; there was never a case where it took a while to reproduce
2) good builds never showed the bug, despite my spending much more time trying to reproduce it
3) I tested a lot of builds, so if I had failed to reproduce in a bad build or reproduced in a good build, I would have expected to hit an inconsistent result at some point, but I didn't.
4) I double-checked the final two builds by doing a clobber, making fresh builds, and testing again.

My bisection found that https://hg.mozilla.org/mozilla-central/rev/37340346a89e (bug 1289628) was the cause. I plan to back it out.
Blocks: 1289628
I'm curious to know how much memory the phones that we're reproducing this on have.

The combination of the symptoms and the identification of bug 1289628 as the regressing bug suggests that this is related to unlocked surfaces having their volatile buffers freed, since that's the only actual functional change in that bug.

Prior to that bug, if we retrieved a surface from the surface cache and its volatile buffer had been freed, we tossed that surface and attempted another lookup. The second lookup might still succeed. For example, if we wanted to display an image at 1024x1024, but the corresponding surface's volatile buffer had been freed, the second lookup might retrieve a downscaled version at 64x64. In such a situation, we'd draw the 64x64 version and trigger an async decode to regenerate the 1024x1024 version.
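
To make that concrete, here is a minimal, self-contained sketch of that pre-regression fallback behaviour. Everything in it (Size, Surface, Lookup, LookupWithFallback, the map-based cache) is invented for illustration; it is not the real imagelib/surface-cache code, just the shape of the loop as I understand it:

#include <cstdio>
#include <map>
#include <tuple>

struct Size {
  int w, h;
  bool operator<(const Size& aOther) const {
    return std::tie(w, h) < std::tie(aOther.w, aOther.h);
  }
};

struct Surface {
  Size size;
  bool volatileBufferFreed;  // simulates the OS reclaiming the buffer
};

// Stand-in for a surface cache lookup: returns the largest decoded surface
// that is no bigger than the requested size, or null if there is none.
const Surface* Lookup(const std::map<Size, Surface>& aCache, Size aWanted) {
  const Surface* best = nullptr;
  for (const auto& entry : aCache) {
    if (entry.first.w <= aWanted.w && entry.first.h <= aWanted.h) {
      best = &entry.second;
    }
  }
  return best;
}

// Pre-bug-1289628 behaviour: if the surface we found has lost its volatile
// buffer, toss just that surface and retry; a smaller surface may still work.
const Surface* LookupWithFallback(std::map<Size, Surface>& aCache, Size aWanted) {
  while (const Surface* found = Lookup(aCache, aWanted)) {
    if (!found->volatileBufferFreed) {
      return found;  // draw this (possibly downscaled) surface
    }
    Size dead = found->size;
    aCache.erase(dead);
    // The real code would also trigger an async decode at aWanted here so the
    // full-size surface gets regenerated for later paints.
  }
  return nullptr;  // nothing usable at all; only then is a redecode the only option
}

int main() {
  std::map<Size, Surface> cache = {
      {{1024, 1024}, {{1024, 1024}, /* volatileBufferFreed */ true}},
      {{64, 64}, {{64, 64}, /* volatileBufferFreed */ false}},
  };
  const Surface* s = LookupWithFallback(cache, {1024, 1024});
  std::printf("drawing %dx%d\n", s ? s->size.w : 0, s ? s->size.h : 0);  // 64x64
}

The point of the loop is just that a freed 1024x1024 buffer degrades us to drawing the 64x64 surface (plus an async redecode), rather than leaving us with nothing to draw.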

What changed in bug 1289628 is that we throw out *all* of an image's decoded surfaces if one of them loses its volatile buffer. That means that in the former situation, we'd toss both the 1024x1024 surface and the 64x64 surface and trigger an async decode to regenerate the 1024x1024 surface. This change was made because there is something of an inversion of control going on in bug 1289628 which made it more awkward to implement the former behavior, and it seemed logical in any case to unify this logic with the logic for recovering from other ways we might lose access to surfaces (for example, a GPU driver reset). The underlying assumption was that outside of super-low-memory B2G devices, this was an edge case that users would very, very rarely encounter. This assumption would seem to have proven false; either users are hitting a case where we're quite close to OOM, or Android is more aggressive than expected about freeing volatile buffers.

What's a bit puzzling is that the redecodes of these images cannot be failing. When a decode fails due to OOM, we mark the image as having an error, and we refuse to draw it or decode it from that point forward. This is intended to help the system recover if we're near OOM. If that was happening, we would've seen these black images before the regressing patch as well, because we'd end up decoding the images at the same sizes in both cases.

My best guess as to what is going wrong (and it's just a guess) is that the issue occurs in RasterImage::UpdateImageContainer(). If an image gets an ImageLayer, then every time we redecode it (actually more often than that, but the details aren't important) we'll call UpdateImageContainer() to update the surface that the ImageLayer contains. So what may be happening is this cycle (a toy trace is sketched after step 3):

(1) We try to draw an image with an intrinsic size of, say, 1024x1024, at a size in layer pixels of, say, 64x64. We don't have a 64x64 version available (say we've only decoded it at 1024x1024 previously), so we trigger a decoder for it. Triggering the decoder unlocks the surfaces for all previously-decoded versions of the image, so their volatile buffers can be freed; that includes the existing 1024x1024 version.

(2) The decoder calls NotifyProgress() to announce that it's done decoding, which calls UpdateImageContainer() to capture the latest changes. UpdateImageContainer() always does a lookup at the intrinsic size of the image, so it'll do a lookup for 1024x1024. But the 1024x1024 version's volatile buffer has now been freed. Previously, the lookup would've fallen back to the 64x64 version, and UpdateImageContainer() would've done its job, albeit with a low-res version of the image. But now, we don't fall back, and in fact we throw out the 64x64 version! We do trigger a decode for the 1024x1024 version, but it's async, so we don't get the results right away.

(3) Because NotifyProgress() sent out an invalidation, we try to draw the image again at 64x64 layer pixels. We don't have the 64x64 version anymore; we tossed it! So we trigger an async decoder for it, which implicitly unlocks the 1024x1024 version, and the cycle repeats.
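
If it helps, here is a toy trace of that cycle. All of the names and the "the 1024 buffer always gets freed" assumption are made up purely to illustrate the feedback loop; none of this is actual Gecko code:

#include <cstdio>
#include <set>

struct Image {
  std::set<int> surfaces;    // decoded sizes we hold (widths only, for brevity)
  bool intrinsicBufferDead;  // has the 1024 surface lost its volatile buffer?
};

// Step (1): drawing at 64 finds no 64 surface, so it triggers a decode, which
// unlocks the other surfaces; assume the worst case and the 1024 buffer is freed.
void TriggerDecodeAt64(Image& aImg) {
  aImg.intrinsicBufferDead = true;
  aImg.surfaces.insert(64);  // the async decode eventually completes
}

// Step (2): the decode-complete notification updates the image container, which
// always looks up the intrinsic size (1024). With the post-1289628 policy, a
// dead buffer there discards *every* surface and queues a redecode at 1024 only.
void UpdateImageContainer(Image& aImg) {
  if (aImg.intrinsicBufferDead) {
    aImg.surfaces.clear();
    aImg.surfaces.insert(1024);
    aImg.intrinsicBufferDead = false;
  }
}

int main() {
  Image img{{1024}, false};
  for (int frame = 0; frame < 3; ++frame) {
    if (!img.surfaces.count(64)) {
      TriggerDecodeAt64(img);
    }
    UpdateImageContainer(img);
    // Step (3): the invalidation schedules another draw at 64, but the 64
    // surface we just produced is already gone, so the paint stays black.
    std::printf("frame %d: have 64x64? %s\n", frame,
                img.surfaces.count(64) ? "yes" : "no (black)");
  }
}

Per the analysis above, under the old fallback behaviour step (2) would have fallen back to the surviving 64x64 surface instead of clearing everything, so the loop would settle with something visible on screen while the 1024x1024 redecode finished.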

Now, this *still* doesn't quite clear up everything for me, because in order for this to happen the layers system would need to be holding on to the ImageLayer's ImageContainer, but still ultimately drawing the image to a PaintedLayer. Or, alternatively, we actually are using the ImageContainer for display, but we're calling RequestDecodeForSize() more often than I'd expect, which would have the same effect as a draw call from the point of view of this analysis. Still, this is the best hypothesis I have at the present time.

If this guess is correct, then we can fix this just by restoring the fallback loop (though we have to move it to RasterImage::LookupFrame()). I still feel like this is not the best design in the long term, but it's better to get things working first, and then we can worry about improving the design.
(In reply to Seth Fowler [:seth] [:s2h] from comment #19)
> I'm curious to know how much memory the phones that we're reproducing this
> on have.

The phone on which I reproduced the problem has 1 GB of memory.
I have been seeing this constantly on a Nexus 5x with 2 GB.
Both of those phones sound like they should have enough memory that the system shouldn't need to free any volatile buffers during normal browsing, but of course, it also depends on what else is running on the device.

I can't find any documentation about how aggressive we expect Android to be in freeing volatile buffers, unfortunately.

I'll cook up a patch that implements the speculative fix described above and we can see if it solves the problem.
I've been running into this issue with my Galaxy S6 with 3 GB of memory.
(In reply to Timothy Nikkel (:tnikkel) from comment #18)
>...
> 
> My bisection found that
> https://hg.mozilla.org/mozilla-central/rev/37340346a89e (bug 1289628) was
> the cause. I plan to back it out.

That has now been backed out on inbound, and a request for the aurora backout has been made.
Flags: needinfo?(jnicol)
Assignee: nobody → tnikkel
The backout was merged; the only thing left to do is uplift the backout and verify that this bug is fixed.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(s.kaspari)
Resolution: --- → FIXED
The backout has also been uplifted to 50 now. Marking this bug as fixed on 50 and 51.