Open Bug 1478134 Opened Last year Updated 7 months ago

Intermittent image/test/reftest/bmp/bmpsuite/q/wrapper.html?pal8os2sp.bmp == about:blank | image comparison, max difference: 245, number of differing pixels: 8128

Categories

(Core :: ImageLib, defect, P3)

defect

Tracking

()

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, leave-open, Whiteboard: [stockwell disabled])

Attachments

(2 files)

It seems plausible that may patch may have made this more intermittent, yeah, though it'd be just exposing a pre-existing problem.

Does this happen on qr-only? There may be a problem with the sync-decode path on webrender.

Flags: needinfo?(emilio) → needinfo?(apavel)

Seems like it's split maybe 60/40 webrender/non-webrender.

As a start, I pushed to try to see which of the 3 patches landed are responsible (or clear them if they happen to be innocent).

Seems to be caused by

https://hg.mozilla.org/mozilla-central/rev/3f1e9dae2467
Bug 1395964 - Make image loading changes not reframe.

So I think that makes some sense, because we're invalidating less stuff. See bug 1226748 comment 2, whatever race Benoit was talking about now probably happens almost always. Let me try catching it under rr sometime this week...

Flags: needinfo?(apavel) → needinfo?(emilio)

So, I haven't been able to repro yet with various combinations of RR chaos mode and MOZ_CHAOSMODE, but, looking at the reftest screenshots in here, in order to paint the blue rectangle that the reftest analyzer shows, the following has to be true:

  • We're loading (otherwise we'd paint the broken image icon), but not marked as broken yet (since otherwise we'd be an empty inline given the alt=" ").
  • We've the image size available (otherwise we'd be 0x0).

I suspect what's happening is something like:

  • We stop the image load on onload, since the test waits for that (no frame change, just reflow).
  • We get the SIZE_AVAILABLE notification, update our intrinsic size, request a reflow.
  • We reflow, but we keep the intrinsic size we had around. EnsureIntrinsicSizeAndRatio does nothing because we do have a valid intrinsic size, even though imagelib may already know that the image is bad. This is the key difference compared to before my patch. Before my patch, the frame is new, has no intrinsic size, and pokes at imagelib directly asking for the intrinsic size at this point rather than using the size at the time we were notified of it being available.
  • We paint, sync-decode, image is bad, we're done. But we paint the blue background already since we do have an intrinsic size (that will get invalidated at some point later when the image content gets notified and turns into the broken state).

Timothy, does this sound like a reasonable hypothesis?

How does this work in ImageLib? When you have a size but the image is bad, how do you notify clients that the size has "changed" (since a broken image has no intrinsic size I assume)?

Also, this could be a nice thing to debug with Pernosco if roc could make that happen? :)

Flags: needinfo?(tnikkel)
Flags: needinfo?(roc)
Flags: needinfo?(emilio)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #41)

I suspect what's happening is something like:

  • We stop the image load on onload, since the test waits for that (no frame
    change, just reflow).
  • We get the SIZE_AVAILABLE notification, update our intrinsic size,
    request a reflow.
  • We reflow, but we keep the intrinsic size we had around.
    EnsureIntrinsicSizeAndRatio does nothing because we do have a valid
    intrinsic size, even though imagelib may already know that the image is bad.
    This is the key difference compared to before my patch. Before my patch, the
    frame is new, has no intrinsic size, and pokes at imagelib directly asking
    for the intrinsic size at this point rather than using the size at the time
    we were notified of it being available.
  • We paint, sync-decode, image is bad, we're done. But we paint the blue
    background already since we do have an intrinsic size (that will get
    invalidated at some point later when the image content gets notified and
    turns into the broken state).

Timothy, does this sound like a reasonable hypothesis?

Yeah, something like that. I think the problem must be that we can do a metadata decode succesfully (to get the size, and send the load event for the image), but then when we try to decode the image data we get an unrecoverable error. Because if the error comes during the metadata decode nsImageFrame::OnLoadComplete would clear the intrinsic sizes and ratio. So if we get into this state I think it might be possible that the imageframe never gets notified and keeps the non-zero intrinsic size/ratio.

How does this work in ImageLib? When you have a size but the image is bad,
how do you notify clients that the size has "changed" (since a broken image
has no intrinsic size I assume)?

There isn't really anything that notifies about size changes. If my hypothesis in the previous paragraph is correct then this error case falls into a class that we haven't really considered much. The existing imagelib consumers kind of assume that if they get a load/size available without an error on the image will be good. So we could either teach all the consumers how to deal with this situation, or add an ON_ERROR notification type which would seem to make it easier to deal with this. I'll attach a patch I wrote that does the first for nsImageFrame so we can confirm that it would fix the problem.

Flags: needinfo?(tnikkel)

Sheriffs feel free to disable this test for now. The frequency seems quite high.

Hmm, that patch didn't work, my next guess is along the lines of comment 40. We've loaded the image, no errors, decoding hasn't started, reftest does sync decode paint, we decode, encounter error, but it's too late for this paint, we can't change the size of the frame during a paint, we paint as if we haven't encountered the error yet. After the reftest capturing paint the image frame adjusts (hopefully) and renders correctly.

Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]
Whiteboard: [stockwell needswork:owner] → [stockwell needswork]

Timothy is disabling the test still needed here?

Joel, can you help out with and example on how to disable this?

Flags: needinfo?(tnikkel)
Flags: needinfo?(jmaher)

:apavel, this is failing a lot, odd that it seems to be failing across platforms with the same failure at a pretty high rate.

Here is the line to edit:
https://searchfox.org/mozilla-central/source/image/test/reftest/bmp/bmpsuite/q/reftest.list#85

== wrapper.html?pal8os2sp.bmp about:blank

it should be:
fuzzy(0-245,0-8128) == wrapper.html?pal8os2sp.bmp about:blank

Flags: needinfo?(jmaher)
Pushed by apavel@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/81ecd35d0a7d
disabled wrapper.html?pal8os2sp.bmp on all platforms r=jmaher
Keywords: leave-open
Whiteboard: [stockwell disable-recommended] → [stockwell disabled]

Pushed about a million times to try server yesterday, went from "how is this failing?" to "how does this ever work?" and back a few times. Continuing to investigate this.

Flags: needinfo?(tnikkel)

I think I understand the failure now, it's pretty simple actually. The complicated part is that the failure mode is completely possible before Emilio's patch (as shown by the long history of this bug). The

  1. metadata decode (ie size available) completes
  2. if a (full) decode is triggered between step 1 and 3 (by say nsImageFrame::MaybeDecodeForPredictedSize or by normal painting to the screen) then when that decode encounters an error the error can't be reported to the main thread until after step 3.
  3. reftest does drawwindow with syncdecode. the image is an error mode now, but it's too late, we'd have to reflow/frame construct to fix it.

If the decode error gets reported to the main thread then we mark the image frame as needed a frame reconstruct and reftest drawwindow call flushes so it is guaranteed to happen.

This sequence doesn't seem particular unlikely, actually seems like it would be pretty common actually. So why don't we hit it much more often before Emilio's patch? Based on observing the most likely explanation is that we get the load event for the image right after the size available notification (size available blocks load for the image, and it's a small local file so we likely have all the data right away), this would trigger a frame reconstruct before Emilio's patch. Since the test uses reftest-wait the reftest-wait polling would see this as an pending paint and do another poll. This would usually be enough time for the error to reach the main thread and give us the correct rendering. If it wasn't enough time then we would get the existing lower volume intermittent.

(In reply to Timothy Nikkel (:tnikkel) from comment #55)

  1. if a (full) decode is triggered between step 1 and 3 (by say
    nsImageFrame::MaybeDecodeForPredictedSize or by normal painting to the
    screen) then when that decode encounters an error the error can't be
    reported to the main thread until after step 3.

This might be a little confusing, so I'll clarify. The error can be reported to the main thread at any point. However, for the intermittent failure to happen it must be the case that the error doesn't get reported to the main thread until after step 3 happens.

Flags: needinfo?(roc)
Priority: -- → P3
You need to log in before you can comment on or make changes to this bug.