Open Bug 1780687 Opened 3 years ago Updated 11 months ago

Process crash on Android Fission builds (test canvas-display-p3-drawImage-ImageBitmap-video.html)

Categories

(Core :: Graphics: CanvasWebGL, defect)

Unspecified
Android
defect

Tracking

()

REOPENED
Tracking Status
firefox-esr91 --- unaffected
firefox-esr102 --- wontfix
firefox104 --- wontfix
firefox105 --- wontfix
firefox106 --- wontfix
firefox107 --- wontfix

People

(Reporter: intermittent-bug-filer, Unassigned, NeedInfo)

References

(Depends on 1 open bug, Regression)

Details

(Keywords: crash, intermittent-failure, regression, Whiteboard: [fission:android:m2])

Crash Data

Attachments

(1 file)

Filed by: istorozhko [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=384276475&repo=try
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/L_-cv8VoQiKO2foEj2dbng/runs/0/artifacts/public/logs/live_backing.log


We are working on getting web platform tests to run on Android Fission builds, and we noticed these failures on Android 7.0 x86-64 WebRender debug.
Blocks: 1714654
See Also: 1714654
OS: Unspecified → Android
Whiteboard: [fission:android:m2]

Apologies as this is probably me being silly, but I can't see any reference to that crash signature in the log or logcat from comment 0. In fact it doesn't look like it crashes at all?

(In reply to Jamie Nicol [:jnicol] from comment #1)

> Apologies as this is probably me being silly, but I can't see any reference to that crash signature in the log or logcat from comment 0. In fact it doesn't look like it crashes at all?

Irene, do you have a link to another try test run of Android Fission that shows the canvas-display-p3-drawImage-ImageBitmap-video.html failure?

Flags: needinfo?(bugzeeeeee)

Thanks, that run does indeed show the failure.

There's nothing fission-specific about this, you can see the same crashes in the nofis logs. I'm not sure why the nofis jobs are not marked as failing though.

I added some logging and can see the error is due to calling glTexStorage2D with GL_R16 as an internal format. This is not supported on GLES without additional extensions (which the emulator presumably does not support). Somewhere in the stack we should prevent ourselves from attempting to create a texture of this format, but I'm unsure where is the best place.
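One way to picture the guard described above, as a minimal sketch and not Gecko's or WebRender's actual code: choose the internal format from the plane's bit depth and the advertised extension list, and refuse 16-bit single-channel storage on GLES unless `GL_EXT_texture_norm16` is present. The function name `choose_plane_format`, the `is_gles` flag, and the policy of returning `None` so the caller can fall back are all assumptions made for illustration.

```python
from typing import Optional, Set

# Enum values from the OpenGL headers. On GLES the 16-bit format is exposed
# as R16_EXT (same value) by the GL_EXT_texture_norm16 extension.
GL_R8 = 0x8229
GL_R16 = 0x822A

NORM16_EXT = "GL_EXT_texture_norm16"


def choose_plane_format(
    bit_depth: int, is_gles: bool, extensions: Set[str]
) -> Optional[int]:
    """Return an internal format for a single-channel plane.

    Returns None when no suitable format is available, so that the caller
    (hypothetical) can downconvert to 8 bits or reject the frame instead of
    calling glTexStorage2D with an unsupported enum.
    """
    if bit_depth <= 8:
        return GL_R8
    # Desktop GL has R16 in core; GLES needs the norm16 extension.
    if not is_gles or NORM16_EXT in extensions:
        return GL_R16
    return None
```

With an emulator that advertises no extensions, `choose_plane_format(10, True, set())` returns `None` rather than `GL_R16`, which is the case this comment is concerned with.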

Flags: needinfo?(jnicol)

I've bisected this by hand and found it was regressed by Bug 1764478 - "Remove PDM caching from MediaChangeMonitor to allow config changes to switch between GPU and RDD process". It seems intermittent, though, and it might have become more frequent since then.

Prior to this I believe we attempted to hardware decode the video, then give up when that fails. With that change, when hardware decoding fails we switch to software decoding, and since the video has a colour depth of 10 we attempt to create an R16 texture. In the cases where it intermittently doesn't reproduce, we still software decode the video and send the frames to the compositor, but don't get as far as attempting to upload them to the GPU.
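The suspected change in behavior can be modelled as a small decision function. This is only a sketch of the hypothesis in this comment, not of the real decoder code: the function name `needs_r16_texture` and its three parameters are invented for illustration, and `sw_fallback=False` stands for the behavior believed to predate bug 1764478.

```python
def needs_r16_texture(hw_decode_ok: bool, sw_fallback: bool, bit_depth: int) -> bool:
    """Return True if playback ends up uploading a >8-bit software-decoded plane."""
    if hw_decode_ok:
        # Assumption: hardware-decoded frames are not uploaded as CPU buffers.
        return False
    if not sw_fallback:
        # Hypothesized old behavior: give up once hardware decoding fails.
        return False
    # Hypothesized new behavior: software decode, then upload the planes.
    return bit_depth > 8
```

Under this model a 10-bit video only reaches the R16 upload path when hardware decoding fails and the software fallback runs, which matches the sequence described above.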

Set release status flags based on info from the regressing bug 1764478

:Zaggy1024, since you are the author of the regressor, bug 1764478, could you take a look?
For more information, please visit auto_nag documentation.

It seems odd to me that those patches would cause this regression, even if the issue was previously present, since the decoder initialization when no in-band changes are present should be consistent with the previous behavior.

> Prior to this I believe we attempted to hardware decode the video, then give up when that fails. With that change, when hardware decoding fails we switch to software decoding, and since the video has a colour depth of 10 we attempt to create an R16 texture.

I wonder if it was trying to instantiate hardware decode and failing twice previously. That seems like the only case in which the behavior would change, since caching the PDM prevented the same MediaChangeMonitor from switching between processes. Switching to caching the PDMFactory instead may have allowed some logic for switching to software decoding to run, but it's been a while since I looked at that code.

It seems as though the patch only uncovered some bad behavior, though, so I'll clear this NI since I can't personally debug this too easily.

Flags: needinfo?(Zaggy1024)

Thanks, Zaggy. Sorry the Bugzilla bot needinfo'd you.

Sounds like this bug might not be caused by Fission (Site Isolation), but Fission's extra processes might have changed the test's timing so we hit an existing race condition.

(In reply to Chris Peterson [:cpeterson] from comment #9)

> Sounds like this bug might not be caused by Fission (Site Isolation), but Fission's extra processes might have changed the test's timing so we hit an existing race condition.

I'm not even sure whether that's the case, though it is plausible. The non-fission jobs are crashing right now on treeherder, but they just aren't being marked as failing for some reason.

(In reply to Jamie Nicol [:jnicol] from comment #10)

> The non-fission jobs are crashing right now on treeherder, but they just aren't being marked as failing for some reason.

Bug 1770185 (filed four months ago) looks like a non-Fission duplicate bug for this test crash. It has the same crash signature [@ <gleam::gl::ErrorReactingGl<F> as gleam::gl::Gl>::tex_storage_2d] and crash reason Caught GL error 500 at tex_storage_2d.
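For readers unfamiliar with the crash reason string, here is a minimal sketch of how it could be decoded. It assumes the number in "Caught GL error 500" is printed in hexadecimal, in which case 0x0500 is `GL_INVALID_ENUM`, consistent with passing an unsupported internal format. The helper name `describe_crash_reason` is hypothetical; the error values are the standard OpenGL ones.

```python
import re

# Standard OpenGL error codes.
GL_ERRORS = {
    0x0500: "GL_INVALID_ENUM",
    0x0501: "GL_INVALID_VALUE",
    0x0502: "GL_INVALID_OPERATION",
    0x0505: "GL_OUT_OF_MEMORY",
    0x0506: "GL_INVALID_FRAMEBUFFER_OPERATION",
}


def describe_crash_reason(reason: str) -> str:
    """Translate a 'Caught GL error <code> at <call>' string into a readable form."""
    m = re.search(r"Caught GL error ([0-9a-fA-F]+) at (\w+)", reason)
    if m is None:
        return "unrecognized crash reason"
    # Assumption: the code in the message is hexadecimal.
    code = int(m.group(1), 16)
    name = GL_ERRORS.get(code, "unknown GL error")
    return f"{name} (0x{code:04X}) in {m.group(2)}"
```

Calling `describe_crash_reason("Caught GL error 500 at tex_storage_2d")` yields `GL_INVALID_ENUM (0x0500) in tex_storage_2d`.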

However, bug 1770185 is a rare intermittent (only about two test crashes per month), whereas this test crash is 100% reproducible with Fission (well, 5 out of 5 retriggers):

https://treeherder.mozilla.org/jobs?repo=try&revision=912cffe26463c17323a7b00ac5c07ceff88409a7&searchStr=wpt6

See Also: → 1770185

(In reply to Chris Peterson [:cpeterson] from comment #11)

> (In reply to Jamie Nicol [:jnicol] from comment #10)

> > The non-fission jobs are crashing right now on treeherder, but they just aren't being marked as failing for some reason.

> Bug 1770185 (filed four months ago) looks like a non-Fission duplicate bug for this test crash. It has the same crash signature [@ <gleam::gl::ErrorReactingGl<F> as gleam::gl::Gl>::tex_storage_2d] and crash reason Caught GL error 500 at tex_storage_2d.

> However, bug 1770185 is a rare intermittent (only about two test crashes per month), whereas this test crash is 100% reproducible with Fission (well, 5 out of 5 retriggers):

> https://treeherder.mozilla.org/jobs?repo=try&revision=912cffe26463c17323a7b00ac5c07ceff88409a7&searchStr=wpt6

But if you look at the "Failure summary" tab in the W-nofis wpt6 jobs in that try run, you can see that they all reliably hit the same crash. The difference seems to be that the W-fis jobs get reliably marked as failing, but the W-nofis ones do not.

This test failure blocks Fission meta bug 1610822 (gv-fission) transitively through WPT meta bug 1714654.

No longer blocks: gv-fission

Hi Jim, can some investigation on this be done? Is this being exacerbated by Fission process switches? We plan on starting Fission experiment on Nightly 135, so it would be great to check this. Thank you!

Flags: needinfo?(jmathies)
Flags: needinfo?(jmathies)
Severity: S4 → --
Component: Audio/Video → Graphics: Canvas2D

I don't think this is related to the current p3 work. The original age of this bug is older than any of our p3 work.
I think this is just incidentally "p3" in the title of this test.
It's unclear what is actually broken (if anything anymore). Can you give more info or logs about recent test failures of this? It's hard to know what invocation to use to summon the right test failures, and it would help a lot if you could point me in the right direction!

Flags: needinfo?(cpeterson)

This test is disabled on lots of things, see header: https://searchfox.org/mozilla-central/source/testing/web-platform/tests/html/canvas/element/manual/wide-gamut-canvas/canvas-display-p3-drawImage-ImageBitmap-video.html. It is very much related to P3, as in Display P3, the color space.

Maybe we just want to disable those until we're ready here. My two patches in https://bugzilla.mozilla.org/show_bug.cgi?id=1925694 and https://bugzilla.mozilla.org/show_bug.cgi?id=1925699 can probably be copy-pasted and adapted.

Flags: needinfo?(jgilbert)

I'm not sure it's possible to even get this test to be run on android in ci at the current time.

First I removed all annotations from this test so that it would be run everywhere. Then I pushed every general wpt job that seemed like it could be relevant to try with --full. I opened the full log of every chunk and searched for the test name; it never showed up.

I do see this test run on desktop, where it is run in the wpt-canvas jobs.

I tried various edits to the web-platform-tests-canvas section

https://searchfox.org/mozilla-central/rev/552c57cbb4eb9d6ae55a53cff217861f21c3ce6d/taskcluster/kinds/test/web-platform.yml#428

that defines that job to try to get android wpt-canvas tests to show up but I couldn't get it to work (disclaimer, I'm not familiar with this file so I could very easily be missing something).

I also tried removing "--exclude-tag=canvas" from the web-platform-tests section of that file to see if that would get them run in the main wpt jobs but that was also unsuccessful (the test never showed up in the logs on try).

So I'm not sure whether these tests are intentionally not being run on android or whether it's an oversight, and if it's an oversight, how to tweak the config files to fix that.

Attached file log.txt

Timothy, it looks like you were able to reproduce the test crash (a Caught GL error 500 at tex_storage_2d panic) in your try run (the log in comment 19).

Do you plan to work on this test crash? Or should we ignore this test crash because Joel says (comment 18) this specific test is disabled and we don't schedule this test on Android? I can remove this bug from our list of Android Fission blockers and leave the bug open, in case it becomes relevant later.

https://searchfox.org/mozilla-central/rev/ee42ec590725439d33792bc8657d60f080786b2e/gfx/wr/webrender/src/device/gl.rs#1494-1502

Flags: needinfo?(cpeterson) → needinfo?(tnikkel)

I'm not sure I will have time to look into it or if I'm even a good candidate to look into it.

(In reply to Chris Peterson [:cpeterson] from comment #21)

> Do you plan to work on this test crash? Or should we ignore this test crash because Joel says (comment 18) this specific test is disabled and we don't schedule this test on Android? I can remove this bug from our list of Android Fission blockers and leave the bug open, in case it becomes relevant later.

I'm not sure if the reason we don't schedule it is because we don't view it as being important or if it's because android canvas wpt tests accidentally fell through the cracks and we should have the ability to schedule them. My hunch is a bit of both but leaning towards the latter.

Flags: needinfo?(tnikkel)
See Also: → 1789949

(In reply to Timothy Nikkel (:tnikkel) from comment #22)

> I'm not sure if the reason we don't schedule it is because we don't view it as being important or if it's because android canvas wpt tests accidentally fell through the cracks and we should have the ability to schedule them. My hunch is a bit of both but leaning towards the latter.

Hey Timothy, who would you suggest we consult on the importance of this test? And if it is important, would somebody on your team be able to look into this? I am wondering if we should remove it from the list of Fission blockers.

Flags: needinfo?(tnikkel)

It's fairly important to have video painting to canvas working well on Android. This doesn't generally work in the current state of things: https://bugzilla.mozilla.org/show_bug.cgi?id=1526207.

Prior to doing anything here, we should fix 1526207.

Thank you! I added 1526207 as a dependency of this one (feel free to change if I misunderstood!)

Depends on: 1526207

Incidentally, is 1526207 being actively worked on?

Flags: needinfo?(padenot)

If my analysis from comment 6 was correct then this affected software decoded video, so doesn't depend on 1526207. Though it's fair to say these tests won't work anyway if we're now hardware decoding video again, until bug 1526207 is fixed.

I also wrote at the time (comment 12) that I consistently saw the test failures in the non-fission test variants too, but that treeherder wasn't reporting them as failures. It would be interesting to know whether that is still the case.

(In reply to [:owlish] 🦉 PST from comment #26)

> Incidentally, is 1526207 being actively worked on?

I'm currently working on overhauling our decoders and encoders by switching to something based on the NDK, to avoid using a Java process and all that machinery, so I'd say yes, but it'll take some time. It will, however, greatly simplify most of this, and then I can work with Jamie or others to get bug 1526207 working.

This is happening in https://bugzilla.mozilla.org/show_bug.cgi?id=1934009.

Flags: needinfo?(padenot)

Hi folks, thank you both for the update! Do you think that work blocks Fission? Would the browser be broken for users on Fission without all this work done first? Or can this be more of a follow-up sort of thing? Especially in light of jnicol's observation that this test fails (albeit intermittently) without Fission as well?

Flags: needinfo?(padenot)
Flags: needinfo?(jnicol)

This is orthogonal to fission.

Flags: needinfo?(padenot)

Yeah, if I was correct 2 years ago (it might be worth double checking) then this bug has nothing to do with fission.

The logs from Tim's test run have expired, but if he remembers how to retrigger them we can double check.

Flags: needinfo?(jnicol)

ok, I am removing this one from our Fission blockers then! Thank you so much for your help!

No longer blocks: 1714654

Redirect a needinfo that is pending on an inactive user to the triage owner.
:lsalzman, since the bug doesn't have a severity set, could you please set the severity or close the bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(jgilbert) → needinfo?(lsalzman)

This hasn't seen a crash in ages and isn't being actively worked on, so I am going to move this to inactive unless it becomes a problem.

Status: NEW → RESOLVED
Closed: 1 year ago
Component: Graphics: Canvas2D → Graphics: CanvasWebGL
Flags: needinfo?(lsalzman)
Resolution: --- → INACTIVE

I think we should keep this open: the test still fails, we just don't have it enabled to run currently. We should strive to fix it and re-enable it.

Based on my findings in comment 6, I thought this may have been fixed by bug 1970771. I did another try push and we're now hitting a different assertion introduced by that change. John, any ideas why we may be hitting this assertion?

Perhaps related, I notice that assertion may not take into account whether the video is hardware decoded or not. Or perhaps more precisely, whether we use the android NDK decoder (which I guess could still be using software?) and our Image is therefore a SurfaceTextureImage, or whether we software decode ourselves and have an image wrapping a CPU buffer, e.g. PlanarYCbCrImage.

In the former case we can handle > 8 bit colour depth just fine as we import the SurfaceTexture into OpenGL as an external texture. It's only the latter case where we cannot support it as we have to create a regular OpenGL texture, which requires the extension be supported.
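The distinction drawn above could be expressed as a per-image check along the following lines. This is a sketch of the reasoning only, not the real assertion in VideoFrameContainer.cpp: the `Image` dataclass, its `kind` labels, and the `supports_norm16` parameter are hypothetical stand-ins for SurfaceTextureImage, CPU-buffer images, and the GL extension check.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Image:
    kind: str       # hypothetical labels: "surface_texture" or "cpu_buffer"
    bit_depth: int


def image_is_displayable(image: Image, supports_norm16: bool) -> bool:
    """Decide whether an image can be shown given the compositor's capabilities."""
    if image.kind == "surface_texture":
        # Imported into OpenGL as an external texture, so no regular
        # 16-bit texture has to be created for it.
        return True
    # CPU-buffer images need a regular texture; >8-bit needs the extension.
    return image.bit_depth <= 8 or supports_norm16


def all_displayable(images: Iterable[Image], supports_norm16: bool) -> bool:
    return all(image_is_displayable(i, supports_norm16) for i in images)
```

A check of this shape would accept a 10-bit `surface_texture` image even without the extension, while still rejecting a 10-bit `cpu_buffer` image.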

Status: RESOLVED → REOPENED
Flags: needinfo?(jolin)
Resolution: INACTIVE → ---

(In reply to Jamie Nicol [:jnicol] from comment #35)

> Based on my findings in comment 6, I thought this may have been fixed by bug 1970771. I did another try push and we're now hitting a different assertion introduced by that change. John, any ideas why we may be hitting this assertion?

That's odd. logcat shows it failed to decode 10-bit HEVC right before the assertion and there should not be any frames sent to VideoFrameContainer. I'll try to reproduce it with more logging and see why this happens.

07-15 09:27:49.304 18669 19151 I Gecko   : [Child 18669, MediaSupervisor #2] WARNING: Error constructing decoders: file /builds/worker/checkouts/gecko/dom/media/MediaFormatReader.cpp:454
07-15 09:27:49.304 18669 19151 I Gecko   : [Child 18669, MediaSupervisor #2] WARNING: NS_ERROR_DOM_MEDIA_FATAL_ERR (0x806e0005) - Error no decoder found for video/hevc: file /builds/worker/checkouts/gecko/dom/media/MediaFormatReader.cpp:1867
07-15 09:27:49.304 18669 19110 I Gecko   : [Child 18669, MediaDecoderStateMachine #1] WARNING: Decoder=7be47c364d00 Decode error: NS_ERROR_DOM_MEDIA_FATAL_ERR (0x806e0005) - Error no decoder found for video/hevc: file /builds/worker/checkouts/gecko/dom/media/MediaDecoderStateMachineBase.cpp:168
07-15 09:27:49.305 18669 18684 W Isolated Web Content: [JavaScript Warning: "Media resource http://web-platform.test:8000/html/canvas/element/manual/wide-gamut-canvas/resources/Rec2020-3FF000000.mp4 could not be decoded." {file: "http://web-platform.test:8000/html/canvas/element/manual/wide-gamut-canvas/canvas-display-p3-drawImage-ImageBitmap-video.html" line: 0}]
07-15 09:27:49.330 18588 18614 E eglCodecCommon: glUtilsParamSize: unknow param 0x00008caa
07-15 09:27:49.334 18669 18684 D GeckoViewContentDelegateChild[C]: handleEvent: MozFirstContentfulPaint
07-15 09:27:49.336 18496 18521 D GeckoViewContentDelegateParent: receiveMessage: DispatcherMessage
07-15 09:27:49.366 18588 18614 E eglCodecCommon: glUtilsParamSize: unknow param 0x00008caa
07-15 09:27:49.369 18669 19110 F MOZ_Assert: [18669] Assertion failure: !SupportsOnly8BitImage() || std::all_of(aImages.begin(), aImages.end(), Is8BitImage) (Images should be 8-bit), at /builds/worker/checkouts/gecko/dom/media/VideoFrameContainer.cpp:125
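When scanning a long logcat for the failure, a small filter for fatal-priority lines can help. The following is a minimal sketch that assumes the line layout seen in the excerpt above (date, time, PID, TID, priority letter, tag, message); the names `LogcatEntry` and `fatal_entries` are made up for this example.

```python
import re
from typing import Iterable, List, NamedTuple


class LogcatEntry(NamedTuple):
    pid: int
    tid: int
    priority: str
    tag: str
    message: str


# Layout assumed: "MM-DD HH:MM:SS.mmm PID TID PRIORITY TAG: message"
LINE_RE = re.compile(
    r"^\d\d-\d\d \d\d:\d\d:\d\d\.\d{3}\s+(\d+)\s+(\d+)\s+([VDIWEF])\s+(.*?)\s*: (.*)$"
)


def fatal_entries(lines: Iterable[str]) -> List[LogcatEntry]:
    """Return only the entries logged at fatal (F) priority."""
    found = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m and m.group(3) == "F":
            found.append(
                LogcatEntry(int(m.group(1)), int(m.group(2)), "F", m.group(4), m.group(5))
            )
    return found
```

Run over the excerpt above, this would keep only the `MOZ_Assert` line and drop the warning, debug, and error lines around it.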
Flags: needinfo?(jolin)