Bug 1292290 (Closed) - Opened 8 years ago, Closed 8 years ago

Panning page causes images to turn black

Categories

(Core :: Graphics: ImageLib, defect, P1)

50 Branch
defect

Tracking


RESOLVED FIXED
Tracking Status
firefox49 --- unaffected
fennec 50+ ---
firefox50 blocking fixed
firefox51 --- fixed

People

(Reporter: rbarker, Assigned: tnikkel)

References

Details

(Keywords: regression, Whiteboard: [gfx-noted])

Attachments

(1 file)

When panning or zooming a page, loaded images will go black. I tried bisecting to find a regression window but was unable to. It may be an existing bug that has recently become much worse.

Steps to reproduce:

1) Go to micechat.com.
2) Scroll to the bottom and toggle to the desktop theme.
3) Scroll and/or zoom the page in and out several times.

Result: certain images will turn black.

Possible image decode race condition?
Attached video black-images.mp4
Sebastian, this was mentioned at the channel meeting as a severe issue with Aurora 50. Could you please help find an owner who can investigate? Thanks!
Flags: needinfo?(s.kaspari)
Adding flags for triage. Keeping NI.
tracking-fennec: --- → ?
Priority: -- → P1
I see this too and it's terrible.
Milan, we need to get an assignee here.
Flags: needinfo?(milan)
Randall, is it a regression?  I understand bisection didn't work, but do we at least know that, say, 48 works, but 51 doesn't, or something narrower than that?

Botond, while Kats is away - can you reproduce this, and what are the chances this is an APZ bug, where we're perhaps telling the image it isn't in use?
Flags: needinfo?(rbarker)
Flags: needinfo?(milan)
Flags: needinfo?(botond)
(In reply to Milan Sreckovic [:milan] from comment #6)
> Randall, is it a regression?  I understand bisection didn't work, but do we
> at least know that, say, 48 works, but 51 doesn't, or something narrower
> than that?

I ran mozregression three times on this. Every time I thought I had found the regression patch, I would run the parent changeset and could still reproduce the bug there. From what I could tell, this bug has been around for a while but very recently got much worse (possibly around the start of August?).
Flags: needinfo?(rbarker)
I am quite sure that this is a Firefox 50 regression. I use Aurora as my daily browser: 49.0a2 did not have this issue, while 50.0a2 is nearly unusable.
Timothy, this does feel like an imagelib issue, like when the image is not in the cache.  Any debugging we can turn on to confirm or rule out some of the causes?
Component: Graphics, Panning and Zooming → ImageLib
Flags: needinfo?(botond) → needinfo?(tnikkel)
Product: Firefox for Android → Core
Whiteboard: [gfx-noted]
Version: unspecified → 50 Branch
(In reply to Milan Sreckovic [:milan] from comment #6)
> Botond, while Kats is away - can you reproduce this, 

I can.

> and what are the chances this is APZ bug, where we're perhaps telling the image it isn't in use?

I'm not aware of a mechanism by which this could happen. The only thing that comes close is seth's recent in-displayport visibility work, but my understanding is that that's just to inform an optimization, and shouldn't cause a correctness problem like this.

I'm currently trying my luck with mozregression; that seems like the best way forward if we are able to find a reliable range.
(In reply to Botond Ballo [:botond] from comment #10)
> I'm currently trying my luck with mozregression; that seems like the best
> way forward if we are able to find a reliable range.

Using mozregression I found the following Nightly regression range:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=db3ed1fdbbeaf5ab1e8fe454780146e7499be3db&tochange=2ea3d51ba1bb9f5c3b6921c43ea63f70b4fdf5d2

However, my attempts to get a more precise range were foiled by the intermittent nature of the problem, and this is making me question the reliability of the Nightly range as well...
tracking-fennec: ? → 50+
(In reply to Milan Sreckovic [:milan] from comment #9)
> Timothy, this does feel like an imagelib issue, like when the image is not
> in the cache.  Any debugging we can turn on to confirm or rule out some of
> the causes?

It certainly feels like an imagelib issue; unless there is a more general graphics issue going around that affects things besides images, I think it's on imagelib to prove the problem is somewhere else.

Since there have been a lot of changes to imagelib lately, finding the regression would be my first instinct, but the intermittent nature of this bug is obviously making that difficult.

When I reproduce the bug I can see something else drawn first (not sure if it is the correct content or not), and then it's replaced with black, so I don't think it could be an invalidation problem.

In a local build I don't see the decoded image getting discarded, so it doesn't seem to be a situation where the image is getting thrown away.

Doing local builds to narrow down Botond's range in comment 11 may help; I'll try that.
Flags: needinfo?(tnikkel)
If the root cause here is the same as in bug 1290907, then bug 1290907 comment 2 might be relevant.
Following up on comment 13, bug 1290907 comment 2, bug 1290907 comment 5 then suggests that we would "fix" this problem with bug 1260324.
Timothy, does that make sense?

Jamie, can you reproduce this bug?
Flags: needinfo?(tnikkel)
Flags: needinfo?(jnicol)
(In reply to Milan Sreckovic [:milan] from comment #15)
> Following up on comment 13, bug 1290907 comment 2, bug 1290907 comment 5
> then suggests that we would "fix" this problem with bug 1260324.

Do we know that we do in fact have the same root cause as in bug 1290907? This bug is *much* harder to reproduce than that one. And that one's underlying cause must have been around since before that change, so it goes back at least as far as March. While I struggle to find an accurate regression range for this, it seems to have started around the end of July. It could indeed be the same underlying cause, but something else must have changed more recently.

> Jamie, can you reproduce this bug?

Yes. (All the time when using my phone for personal use, and rarely when trying to debug this!!) I will try rebasing the patch from bug 1260324 to see if it has an effect.
(In reply to Milan Sreckovic [:milan] from comment #15)
> Following up on comment 13, bug 1290907 comment 2, bug 1290907 comment 5
> then suggests that we would "fix" this problem with bug 1260324.
> Timothy, does that make sense?

I think they are different issues. When I reproduce this bug the image stays black forever, despite zooming in/out, panning around, and other actions that would cause different tiles to be used.
Flags: needinfo?(tnikkel)
I finished bisecting and I'm quite confident in the results because:
1) whenever I reproduced the bug it was always quite easy to get it to appear; there was never a case where it took a while to reproduce
2) good builds never showed the bug, despite my spending much more time trying to reproduce it
3) I tested a lot of builds, so if I had failed to reproduce in a bad build or reproduced in a good build, I would have expected to hit an inconsistent result at some point, but I didn't.
4) I double-checked the final two builds by doing a clobber, making fresh builds, and testing again.

My bisection found that https://hg.mozilla.org/mozilla-central/rev/37340346a89e (bug 1289628) was the cause. I plan to back it out.
Blocks: 1289628
I'm curious to know how much memory the phones that we're reproducing this on have.

The combination of the symptoms and the identification of bug 1289628 as the regressing bug suggests that this is related to unlocked surfaces having their volatile buffers freed, since that's the only actual functional change in that bug.

Prior to that bug, if we retrieved a surface from the surface cache and its volatile buffer had been freed, we tossed that surface and attempted another lookup. The second lookup might still succeed. For example, if we wanted to display an image at 1024x1024, but the corresponding surface's volatile buffer had been freed, the second lookup might retrieve a downscaled version at 64x64. In such a situation, we'd draw the 64x64 version and trigger an async decode to regenerate the 1024x1024 version.
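
To make that concrete, here is a minimal, self-contained sketch of that pre-regression fallback behaviour. Everything in it (Size, Surface, Lookup, LookupWithFallback, the map-based cache) is invented for illustration; it is not the real imagelib/surface-cache code, just the shape of the loop as I understand it:

#include <cstdio>
#include <map>
#include <tuple>

struct Size {
  int w, h;
  bool operator<(const Size& aOther) const {
    return std::tie(w, h) < std::tie(aOther.w, aOther.h);
  }
};

struct Surface {
  Size size;
  bool volatileBufferFreed;  // simulates the OS reclaiming the buffer
};

// Stand-in for a surface cache lookup: returns the largest decoded surface
// that is no bigger than the requested size, or null if there is none.
const Surface* Lookup(const std::map<Size, Surface>& aCache, Size aWanted) {
  const Surface* best = nullptr;
  for (const auto& entry : aCache) {
    if (entry.first.w <= aWanted.w && entry.first.h <= aWanted.h) {
      best = &entry.second;
    }
  }
  return best;
}

// Pre-bug-1289628 behaviour: if the surface we found has lost its volatile
// buffer, toss just that surface and retry; a smaller surface may still work.
const Surface* LookupWithFallback(std::map<Size, Surface>& aCache, Size aWanted) {
  while (const Surface* found = Lookup(aCache, aWanted)) {
    if (!found->volatileBufferFreed) {
      return found;  // draw this (possibly downscaled) surface
    }
    Size dead = found->size;
    aCache.erase(dead);
    // The real code would also trigger an async decode at aWanted here so the
    // full-size surface gets regenerated for later paints.
  }
  return nullptr;  // nothing usable at all; only then is a redecode the only option
}

int main() {
  std::map<Size, Surface> cache = {
      {{1024, 1024}, {{1024, 1024}, /* volatileBufferFreed */ true}},
      {{64, 64}, {{64, 64}, /* volatileBufferFreed */ false}},
  };
  const Surface* s = LookupWithFallback(cache, {1024, 1024});
  std::printf("drawing %dx%d\n", s ? s->size.w : 0, s ? s->size.h : 0);  // 64x64
}

The point of the loop is just that a freed 1024x1024 buffer degrades us to drawing the 64x64 surface (plus an async redecode), rather than leaving us with nothing to draw.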

What changed in bug 1289628 is that we throw out *all* of an image's decoded surfaces if one of them loses its volatile buffer. That means that in the former situation, we'd toss both the 1024x1024 surface and the 64x64 surface and trigger an async decode to regenerate the 1024x1024 surface. This change was made because there is something of an inversion of control going on in bug 1289628 which made it more awkward to implement the former behavior, and it seemed logical in any case to unify this logic with the logic for recovering from other ways we might lose access to surfaces (for example, a GPU driver reset). The underlying assumption was that outside of super-low-memory B2G devices, this was an edge case that users would very, very rarely encounter. This assumption would seem to have proven false; either users are hitting a case where we're quite close to OOM, or Android is more aggressive than expected about freeing volatile buffers.

What's a bit puzzling is that the redecodes of these images cannot be failing. When a decode fails due to OOM, we mark the image as having an error, and we refuse to draw it or decode it from that point forward. This is intended to help the system recover if we're near OOM. If that was happening, we would've seen these black images before the regressing patch as well, because we'd end up decoding the images at the same sizes in both cases.

My best guess as to what is going wrong (and it's just a guess) is that the issue occurs in RasterImage::UpdateImageContainer(). If an image gets an ImageLayer, then every time we redecode it (actually more often than that, but the details aren't important) we'll call UpdateImageContainer() to update the surface that the ImageLayer contains. So what may be happening is this cycle (a toy trace is sketched after step 3):

(1) We try to draw an image with an intrinsic size of, say, 1024x1024, at a size in layer pixels of, say, 64x64. We don't have a 64x64 version available (say we've only decoded it at 1024x1024 previously), so we trigger a decoder for it. Triggering the decoder unlocks the surfaces for all previously-decoded versions of the image, so their volatile buffers can be freed; that includes the existing 1024x1024 version.

(2) The decoder calls NotifyProgress() to announce that it's done decoding, which calls UpdateImageContainer() to capture the latest changes. UpdateImageContainer() always does a lookup at the intrinsic size of the image, so it'll do a lookup for 1024x1024. But the 1024x1024 version's volatile buffer has now been freed. Previously, the lookup would've fallen back to the 64x64 version, and UpdateImageContainer() would've done its job, albeit with a low-res version of the image. But now, we don't fall back, and in fact we throw out the 64x64 version! We do trigger a decode for the 1024x1024 version, but it's async, so we don't get the results right away.

(3) Because NotifyProgress() sent out an invalidation, we try to draw the image again at 64x64 layer pixels. We don't have the 64x64 version anymore; we tossed it! So we trigger an async decoder for it, which implicitly unlocks the 1024x1024 version, and the cycle repeats.
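
If it helps, here is a toy trace of that cycle. All of the names and the "the 1024 buffer always gets freed" assumption are made up purely to illustrate the feedback loop; none of this is actual Gecko code:

#include <cstdio>
#include <set>

struct Image {
  std::set<int> surfaces;    // decoded sizes we hold (widths only, for brevity)
  bool intrinsicBufferDead;  // has the 1024 surface lost its volatile buffer?
};

// Step (1): drawing at 64 finds no 64 surface, so it triggers a decode, which
// unlocks the other surfaces; assume the worst case and the 1024 buffer is freed.
void TriggerDecodeAt64(Image& aImg) {
  aImg.intrinsicBufferDead = true;
  aImg.surfaces.insert(64);  // the async decode eventually completes
}

// Step (2): the decode-complete notification updates the image container, which
// always looks up the intrinsic size (1024). With the post-1289628 policy, a
// dead buffer there discards *every* surface and queues a redecode at 1024 only.
void UpdateImageContainer(Image& aImg) {
  if (aImg.intrinsicBufferDead) {
    aImg.surfaces.clear();
    aImg.surfaces.insert(1024);
    aImg.intrinsicBufferDead = false;
  }
}

int main() {
  Image img{{1024}, false};
  for (int frame = 0; frame < 3; ++frame) {
    if (!img.surfaces.count(64)) {
      TriggerDecodeAt64(img);
    }
    UpdateImageContainer(img);
    // Step (3): the invalidation schedules another draw at 64, but the 64
    // surface we just produced is already gone, so the paint stays black.
    std::printf("frame %d: have 64x64? %s\n", frame,
                img.surfaces.count(64) ? "yes" : "no (black)");
  }
}

Per the analysis above, under the old fallback behaviour step (2) would have fallen back to the surviving 64x64 surface instead of clearing everything, so the loop would settle with something visible on screen while the 1024x1024 redecode finished.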

Now, this *still* doesn't quite clear up everything for me, because in order for this to happen the layers system would need to be holding on to the ImageLayer's ImageContainer, but still ultimately drawing the image to a PaintedLayer. Or, alternatively, we actually are using the ImageContainer for display, but we're calling RequestDecodeForSize() more often than I'd expect, which would have the same effect as a draw call from the point of view of this analysis. Still, this is the best hypothesis I have at the present time.

If this guess is correct, then we can fix this just by restoring the fallback loop (though we have to move it to RasterImage::LookupFrame()). I still feel like this is not the best design in the long term, but it's better to get things working first, and then we can worry about improving the design.
(In reply to Seth Fowler [:seth] [:s2h] from comment #19)
> I'm curious to know how much memory the phones that we're reproducing this
> on have.

The phone on which I reproduced the problem has 1 GB of memory.
I have been seeing this constantly on a Nexus 5x with 2 GB.
Both of those phones sound like they should have enough memory that the system shouldn't need to free any volatile buffers during normal browsing, but of course, it also depends on what else is running on the device.

I can't find any documentation about how aggressive we expect Android to be in freeing volatile buffers, unfortunately.

I'll cook up a patch that implements the speculative fix described above and we can see if it solves the problem.
I've been running into this issue with my Galaxy S6 with 3 GB of memory.
(In reply to Timothy Nikkel (:tnikkel) from comment #18)
>...
> 
> My bisection found that
> https://hg.mozilla.org/mozilla-central/rev/37340346a89e (bug 1289628) was
> the cause. I plan to back it out.

That has now been backed out on inbound, and a request for the aurora backout has been made.
Flags: needinfo?(jnicol)
Assignee: nobody → tnikkel
The backout was merged; the only thing left to do is uplift the backout and verify that this bug is fixed.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(s.kaspari)
Resolution: --- → FIXED
The backout has also been uplifted to 50 now. Marking this bug as fixed on 50 and 51.