1478134 - Intermittent image/test/reftest/bmp/bmpsuite/q/wrapper.html?pal8os2sp.bmp == about:blank | image comparison, max difference: 245, number of differing pixels: 8128

So I think that makes some sense, because we're invalidating less stuff. See bug 1226748 comment 2: whatever race Benoit was talking about now probably happens almost always. Let me try catching it under rr sometime this week...

Flags: needinfo?(apavel) → needinfo?(emilio)

Emilio Cobos Álvarez (:emilio)

Comment 41

6 years ago

So, I haven't been able to repro yet with various combinations of RR chaos mode and MOZ_CHAOSMODE, but, looking at the reftest screenshots in here, in order to paint the blue rectangle that the reftest analyzer shows, the following has to be true:

We're loading (otherwise we'd paint the broken image icon), but not marked as broken yet (since otherwise we'd be an empty inline given the alt=" ").
We've the image size available (otherwise we'd be 0x0).

I suspect what's happening is something like:

We stop the image load on onload, since the test waits for that (no frame change, just reflow).
We get the SIZE_AVAILABLE notification, update our intrinsic size, request a reflow.
We reflow, but we keep the intrinsic size we had around. EnsureIntrinsicSizeAndRatio does nothing because we do have a valid intrinsic size, even though imagelib may already know that the image is bad. This is the key difference compared to before my patch. Before my patch, the frame is new, has no intrinsic size, and pokes at imagelib directly asking for the intrinsic size at this point rather than using the size at the time we were notified of it being available.
We paint, sync-decode, image is bad, we're done. But we paint the blue background already since we do have an intrinsic size (that will get invalidated at some point later when the image content gets notified and turns into the broken state).
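
The difference between the two behaviours in this hypothesis can be sketched with a toy model. This is not Gecko code: `FakeImage`, `CachingFrame`, `QueryingFrame`, and the 100x50 size are hypothetical names and values chosen for illustration, and the real logic lives in C++ in nsImageFrame and imagelib.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Size = Tuple[int, int]


@dataclass
class FakeImage:
    """Hypothetical stand-in for an imagelib image."""
    size: Size
    has_error: bool = False

    def intrinsic_size(self) -> Optional[Size]:
        # Assumption: a broken image reports no intrinsic size.
        return None if self.has_error else self.size


class CachingFrame:
    """Keeps the size seen at SIZE_AVAILABLE time (the hypothesized post-patch behaviour)."""

    def __init__(self) -> None:
        self.intrinsic_size: Optional[Size] = None

    def on_size_available(self, image: FakeImage) -> None:
        self.intrinsic_size = image.intrinsic_size()

    def reflow(self, image: FakeImage) -> Size:
        # A valid cached size already exists, so the image is not asked again.
        return self.intrinsic_size or (0, 0)


class QueryingFrame:
    """A freshly constructed frame that asks the image at reflow time (pre-patch behaviour)."""

    def on_size_available(self, image: FakeImage) -> None:
        pass  # nothing is cached

    def reflow(self, image: FakeImage) -> Size:
        return image.intrinsic_size() or (0, 0)


def layout_after_late_error(frame) -> Size:
    image = FakeImage(size=(100, 50))  # arbitrary size for illustration
    frame.on_size_available(image)     # metadata decode succeeded
    image.has_error = True             # the full decode fails afterwards
    return frame.reflow(image)
```

In this model the caching frame still lays out at 100x50 after the image has gone bad, while the querying frame collapses to 0x0, which matches the blue rectangle versus empty inline distinction described above.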

Timothy, does this sound like a reasonable hypothesis?

How does this work in ImageLib? When you have a size but the image is bad, how do you notify clients that the size has "changed" (since a broken image has no intrinsic size I assume)?

Also, this could be a nice thing to debug with Pernosco if roc could make that happen? :)

Flags: needinfo?(tnikkel)

Flags: needinfo?(roc)

Flags: needinfo?(emilio)

Timothy Nikkel (:tnikkel)

Comment 43

6 years ago

(In reply to Emilio Cobos Álvarez (:emilio) from comment #41)

I suspect what's happening is something like:

We stop the image load on onload, since the test waits for that (no frame
change, just reflow).

We get the SIZE_AVAILABLE notification, update our intrinsic size,
request a reflow.

We reflow, but we keep the intrinsic size we had around.
EnsureIntrinsicSizeAndRatio does nothing because we do have a valid
intrinsic size, even though imagelib may already know that the image is bad.
This is the key difference compared to before my patch. Before my patch, the
frame is new, has no intrinsic size, and pokes at imagelib directly asking
for the intrinsic size at this point rather than using the size at the time
we were notified of it being available.

We paint, sync-decode, image is bad, we're done. But we paint the blue
background already since we do have an intrinsic size (that will get
invalidated at some point later when the image content gets notified and
turns into the broken state).

Timothy, does this sound like a reasonable hypothesis?

Yeah, something like that. I think the problem must be that we can do a metadata decode successfully (to get the size, and send the load event for the image), but then when we try to decode the image data we get an unrecoverable error. That's because if the error came during the metadata decode, nsImageFrame::OnLoadComplete would clear the intrinsic size and ratio. So if we get into this state, I think it might be possible that the image frame never gets notified and keeps the non-zero intrinsic size/ratio.

How does this work in ImageLib? When you have a size but the image is bad,
how do you notify clients that the size has "changed" (since a broken image
has no intrinsic size I assume)?

There isn't really anything that notifies about size changes. If my hypothesis in the previous paragraph is correct, then this error case falls into a class that we haven't really considered much. The existing imagelib consumers kind of assume that if they get a load/size available notification without an error, the image will be good. So we could either teach all the consumers how to deal with this situation, or add an ON_ERROR notification type, which would seem to make it easier to deal with this. I'll attach a patch I wrote that does the first for nsImageFrame so we can confirm that it would fix the problem.
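
As a rough illustration of the second option, the sketch below models a consumer that reacts to a dedicated error notification. It is a hypothetical toy, not imagelib's API: `Notification`, `ImageFrameModel`, and `notify` are invented names, and `ON_ERROR` is only the notification type proposed above, not something that exists.

```python
from enum import Enum, auto
from typing import Optional, Tuple


class Notification(Enum):
    SIZE_AVAILABLE = auto()
    ON_ERROR = auto()  # hypothetical: the proposed notification type


class ImageFrameModel:
    """Hypothetical consumer that tracks an intrinsic size."""

    def __init__(self) -> None:
        self.intrinsic_size: Optional[Tuple[int, int]] = None
        self.needs_reflow = False

    def notify(self, kind: Notification,
               size: Optional[Tuple[int, int]] = None) -> None:
        if kind is Notification.SIZE_AVAILABLE:
            # Metadata decode succeeded: remember the size and reflow.
            self.intrinsic_size = size
            self.needs_reflow = True
        elif kind is Notification.ON_ERROR:
            # A later decode failed: drop the stale size and reflow again.
            self.intrinsic_size = None
            self.needs_reflow = True
```

With a single notification like this, each consumer would need only one extra branch rather than its own logic for detecting a late decode error.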

Flags: needinfo?(tnikkel)

Timothy Nikkel (:tnikkel)

Comment 44

6 years ago

Attached patch imgfixerrormaybe

Timothy Nikkel (:tnikkel)

Comment 45

6 years ago

Sheriffs feel free to disable this test for now. The frequency seems quite high.

Timothy Nikkel (:tnikkel)

Comment 46

6 years ago

Hmm, that patch didn't work. My next guess is along the lines of comment 40: we've loaded the image with no errors and decoding hasn't started. The reftest does a sync decode paint, we decode and encounter the error, but it's too late for this paint. We can't change the size of the frame during a paint, so we paint as if we haven't encountered the error yet. After the paint that the reftest captures, the image frame adjusts (hopefully) and renders correctly.

Cristina Coroiu [:ccoroiu]

Updated

6 years ago

Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

Cristina Coroiu [:ccoroiu]

Updated

6 years ago

Whiteboard: [stockwell needswork:owner] → [stockwell needswork]

Andreea Pavel [:apavel]

Comment 48

6 years ago

Timothy, is disabling the test still needed here?

Joel, can you help out with an example of how to disable this?

Flags: needinfo?(tnikkel)

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 49

6 years ago

:apavel, this is failing a lot, odd that it seems to be failing across platforms with the same failure at a pretty high rate.

Here is the line to edit:
https://searchfox.org/mozilla-central/source/image/test/reftest/bmp/bmpsuite/q/reftest.list#85

== wrapper.html?pal8os2sp.bmp about:blank

it should be:
fuzzy(0-245,0-8128) == wrapper.html?pal8os2sp.bmp about:blank
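
For readers unfamiliar with the annotation, the two ranges in `fuzzy(0-245,0-8128)` bound the maximum per-channel difference and the number of differing pixels, taken here from the failure summary in this bug's title. The sketch below is a simplified model of that check, assuming both ranges are inclusive; `fuzzy_passes` is a hypothetical helper, not the reftest harness code.

```python
def fuzzy_passes(max_diff: int, differing_pixels: int,
                 diff_range=(0, 245), pixel_range=(0, 8128)) -> bool:
    """Return True if both measurements fall inside their inclusive ranges."""
    return (diff_range[0] <= max_diff <= diff_range[1]
            and pixel_range[0] <= differing_pixels <= pixel_range[1])
```

Because both ranges start at 0, an exact match still passes under this model, so the annotation tolerates the intermittent failure without requiring it.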

Flags: needinfo?(jmaher)

Andreea Pavel [:apavel]

Comment 50

6 years ago

Attached file Bug 1478134 - disabled wrapper.html?pal8os2sp.bmp on all platforms

Pulsebot

Comment 51

6 years ago

Pushed by apavel@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/81ecd35d0a7d disabled wrapper.html?pal8os2sp.bmp on all platforms r=jmaher

Andreea Pavel [:apavel]

Updated

6 years ago

Keywords: leave-open

Whiteboard: [stockwell disable-recommended] → [stockwell disabled]

Timothy Nikkel (:tnikkel)

Comment 52

6 years ago

Pushed about a million times to try server yesterday, went from "how is this failing?" to "how does this ever work?" and back a few times. Continuing to investigate this.

Flags: needinfo?(tnikkel)

Stefan Hindli [:stefan_hindli]

Comment 54

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/81ecd35d0a7d

Timothy Nikkel (:tnikkel)

Comment 55

6 years ago

I think I understand the failure now; it's pretty simple actually. The complicated part is that the failure mode was completely possible before Emilio's patch (as shown by the long history of this bug). The sequence is:

1. The metadata decode (i.e. size available) completes.
2. If a (full) decode is triggered between step 1 and 3 (by, say, nsImageFrame::MaybeDecodeForPredictedSize or by normal painting to the screen), then when that decode encounters an error, the error can't be reported to the main thread until after step 3.
3. The reftest does drawwindow with syncdecode. The image is in an error state now, but it's too late: we'd have to reflow/frame construct to fix it.

If the decode error gets reported to the main thread, then we mark the image frame as needing a frame reconstruct, and the reftest drawwindow call flushes, so the reconstruct is guaranteed to happen.

This sequence doesn't seem particularly unlikely; it actually seems like it would be pretty common. So why didn't we hit it much more often before Emilio's patch? Based on observation, the most likely explanation is that we get the load event for the image right after the size available notification (size available blocks load for the image, and it's a small local file, so we likely have all the data right away), and this would trigger a frame reconstruct before Emilio's patch. Since the test uses reftest-wait, the reftest-wait polling would see this as a pending paint and do another poll. This would usually be enough time for the error to reach the main thread and give us the correct rendering. If it wasn't enough time, then we would get the existing lower-volume intermittent.
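
The timing argument can be made concrete with a small simulation. This is a deliberately simplified model under stated assumptions: time is counted in abstract reftest-wait polls ("ticks"), `run_reftest` and its parameters are hypothetical, and a frame reconstruct on load is modelled as exactly one extra poll.

```python
def run_reftest(error_delivery_tick: int, reconstruct_on_load: bool) -> str:
    """Return what the snapshot shows: 'blank' (correct) or 'blue box' (stale).

    error_delivery_tick: the poll at which the decode error reaches the main thread.
    reconstruct_on_load: True models the pre-patch behaviour, where the load
    event triggers a frame reconstruct that the harness sees as a pending paint.
    """
    frame_knows_error = False
    pending_paint = reconstruct_on_load
    tick = 0
    while True:
        tick += 1
        if tick >= error_delivery_tick:
            # The error reached the main thread; the flush before the
            # snapshot reconstructs the frame in its broken state.
            frame_knows_error = True
        if pending_paint:
            # reftest-wait sees a pending paint and polls once more.
            pending_paint = False
            continue
        return "blank" if frame_knows_error else "blue box"
```

For example, with the error arriving on the second tick, `run_reftest(2, True)` yields the correct blank rendering while `run_reftest(2, False)` captures the stale blue box. A slower error, as in `run_reftest(3, True)`, still fails even with the extra poll, which corresponds to the older lower-volume intermittent.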

Timothy Nikkel (:tnikkel)

Comment 57

6 years ago

(In reply to Timothy Nikkel (:tnikkel) from comment #55)

if a (full) decode is triggered between step 1 and 3 (by say
nsImageFrame::MaybeDecodeForPredictedSize or by normal painting to the
screen) then when that decode encounters an error the error can't be
reported to the main thread until after step 3.

This might be a little confusing, so I'll clarify. The error can be reported to the main thread at any point. However, for the intermittent failure to happen it must be the case that the error doesn't get reported to the main thread until after step 3 happens.

Flags: needinfo?(roc)

Andrew Osmond [:aosmond] (he/him)

Updated

6 years ago

Priority: -- → P3

BMO Automation

Updated

2 years ago

Severity: normal → S3

BugBot [:suhaib / :marco/ :calixte]

Comment 58

5 months ago

https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup
For more information, please visit BugBot documentation.

Status: NEW → RESOLVED

Closed: 5 months ago

Resolution: --- → INCOMPLETE

BugBot (nomail) [:suhaib / :marco/ :calixte]

Updated

4 months ago

Keywords: leave-open

imgfixerrormaybe (patch, 3.77 KB), 6 years ago, Timothy Nikkel (:tnikkel)
Bug 1478134 - disabled wrapper.html?pal8os2sp.bmp on all platforms (text/x-phabricator-request, 47 bytes), 6 years ago, Andreea Pavel [:apavel]