Open Bug 1831051 Opened 1 year ago Updated 21 days ago

wayland: The first frame on startup is sometimes uninitialized for a moment (also maybe on x11)

Categories

(Core :: Graphics: WebRender, defect)

Firefox 111
Desktop
Linux
defect

Tracking

ASSIGNED

People

(Reporter: lina, Assigned: ahale, NeedInfo)

References

(Blocks 2 open bugs)

Details

Steps to reproduce:

Running Firefox on Asahi Linux (with Mesa GL version overrides or the WIP branch with GLES 3 support to enable WebRender) sometimes shows a magenta window on startup. This happens because Firefox sometimes swaps buffers immediately after creating the window surface, without doing any drawing (zero-initialized compressed textures on Apple GPUs sample as magenta).

The problem is compositor-independent (reproduced on Sway and KWin).

Apitrace shows Firefox creating the surface and almost immediately swapping buffers:

1127 eglGetDisplay(display_id = 0x7f891ba040) = 0x7eefdc7d00
1128 eglCreateWindowSurface(dpy = 0x7eefdc7d00, config = 0x7eefe59040, win = 0x7eddc18700, attrib_list = {}) = 0x7ee0010c00
1129 eglGetCurrentContext() = 0x7eefec0900
1130 eglMakeCurrent(dpy = 0x7eefdc7d00, draw = 0x7ee0010c00, read = 0x7ee0010c00, ctx = 0x7eefec0900) = EGL_TRUE
1131 glViewport(x = 0, y = 0, width = 1280, height = 943) // fake
1132 glScissor(x = 0, y = 0, width = 1280, height = 943) // fake
1133 eglGetCurrentContext() = 0x7eefec0900
1134 eglSwapInterval(dpy = 0x7eefdc7d00, interval = 0) = EGL_TRUE
1135 eglGetCurrentContext() = 0x7eefec0900
1136 eglMakeCurrent(dpy = 0x7eefdc7d00, draw = 0x7ee0010c00, read = 0x7ee0010c00, ctx = 0x7eefec0900) = EGL_TRUE
1137 eglGetCurrentContext() = 0x7eefec0900
1138 eglQuerySurface(dpy = 0x7eefdc7d00, surface = 0x7ee0010c00, attribute = EGL_BUFFER_AGE_KHR, value = &0) = EGL_TRUE
1139 eglSwapBuffersWithDamageKHR(dpy = 0x7eefdc7d00, surface = 0x7ee0010c00, rects = NULL, n_rects = 0) = EGL_TRUE

I believe on other systems the window is more likely to sample as transparent, which is less obvious.

Steps to reproduce:

  1. Run this shell loop in a Wayland session capable of Firefox WebRender acceleration:

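while true; do
    # The trailing backslash keeps both variables in the environment of the traced Firefox.
    MOZ_DISABLE_AUTO_SAFE_MODE=1 MOZ_ENABLE_WAYLAND=1 \
    timeout 2 apitrace trace -v -a egl -o trace firefox
    # Stop once a swap shows up within 30 lines of the surface creation.
    apitrace dump trace | grep -A30 eglCreateWindowSurface | grep SwapBuff && break
done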

Actual results:

The shell loop will terminate at some point, indicating that an eglSwapBuffersWithDamageKHR() call happened shortly after the window surface was created (which can be confirmed by manually inspecting the apitrace dump output).

Visually, the Firefox window is briefly magenta on startup when this happens on Apple Silicon platforms.

Expected results:

The shell loop never terminates.

On Apple Silicon platforms, the window never flashes magenta.

The Bugbug bot thinks this bug should belong to the 'Core::Graphics: WebRender' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Graphics: WebRender
Product: Firefox → Core
Blocks: wayland
OS: Unspecified → Linux
Hardware: Unspecified → Desktop
See Also: → 1778837

Asahi Lina: Just to make sure I understand the stated spread and impact of this bug, can you confirm my assumptions from reading your description?

  1. This is specific to Apple Silicon (N.B., not necessarily an Apple OS 🙂).
  2. This is specific to Asahi Linux, and not other distros, since Asahi Linux is specifically built to work with Apple Silicon.
  3. When this bug happens, it only happens briefly on Firefox's startup, but not while, say, loading tabs during the same session.

  1. This should not be specific to any particular platform, or at least I have no reason to believe it is. However, on Apple Silicon GPUs, uninitialized buffers show up as magenta, which makes the bug very obvious, and I believe that is why we're probably the first platform to notice the issue (on other GPUs you'd probably just get an initially transparent window or something like that). It could theoretically be specific to ARM64 or even AS for other reasons (say, if this is a platform-dependent CPU race condition, even one that only triggers on Apple CPUs), but I have no evidence for or against that idea right now.

  2. As far as the Firefox build is concerned, Asahi Linux uses packages from Arch Linux ARM, so if this is build-specific I would expect it affects at least all Arch ARM users.

  3. Yes, it only seems to happen once on startup (sometimes), literally the first frame sent to the compositor. Since startup can take a brief moment, that actually shows for a very noticeable fraction of a second on the screen until the next frame replaces it.

I don't have any non-AS Wayland machines handy, but it should be pretty easy to try the shell snippet I shared on a similar environment on another platform and see if it reproduces. If it does, then it clearly isn't AS-specific ^^

(on other GPUs you'd probably just get an initially transparent window or something like that)

It's not specific to Asahi, Arch or aarch64 at all. I can easily see it on amd64 under KDE Plasma on Arch and another aarch64 device with Phosh on Debian too - the window is initially transparent. Never bothered to report though; I probably would if it was magenta instead ;)

Tested right now on both 102.10.0esr and nightly.

Assignee: nobody → stransky
Severity: -- → S2

Martin, can you take a look at this?

I will take a look at this.

I think there are two important aspects of this that are more Linux-specific than Wayland-specific. I have some knowledge of the compositor code (albeit on Windows), and have a good idea of where we're failing to clear the new window surface / compositor surfaces so we show garbage (or the window appears weirdly transparent in some cases, which actually happens on all OSes to some degree but with different manifestations).

The other aspect is that I think startup is taking too long on Linux. A bug like this wouldn't be noticeable if it took extremely little time to start up the graphical parts of the browser after the window is created; the fact that it takes a moment is itself an issue I want to look at with a profiler.

Renaming the bug slightly to encompass both parts.

Summary: wayland: The first frame on startup is sometimes uninitialized → wayland: The first frame on startup is sometimes uninitialized for a moment (also maybe on x11)
Flags: needinfo?(ahale)

I debugged this today and found that we do call glClear before eglSwapBuffers, but if glxtest is running slowly it significantly delays our clear-and-swap step relative to window creation, so an empty window (decorated but with nothing in it) appears for a while on Wayland on Ubuntu 23.04. I'm not sure whether what I reproduced is the same issue that causes magenta on Asahi Linux, though.

Flags: needinfo?(ahale)

In the apitrace log I pasted you can see it never does a glClear, it just calls eglSwapBuffers without ever initializing the buffer. This only happens sometimes though, and when it does it happens immediately (and then the slow part of startup follows so you get the magenta window for a while).

That's different from startups where it doesn't call eglSwapBuffers early at all, and then indeed on Asahi you also get an empty window until the first frame (I'm not sure if that one is a compositor issue or a Wayland integration issue, since I would expect a correct implementation would just never draw the window until it has one frame of contents available, but that's a different problem).

I think there must be a race condition somewhere that sometimes (but only sometimes) causes a swap very early, before anything has been drawn or cleared at all.
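To make that check repeatable, here is a minimal Python sketch that scans `apitrace dump` text and reports whether the first swap after eglCreateWindowSurface happens with no clear or draw call in between. It assumes dump lines shaped like the excerpt in this report (call number, then the function name, then an opening parenthesis). The function names and the list of "write" calls are illustrative choices of mine, not part of apitrace or Firefox, and the check is a coarse heuristic: it does not verify that a draw actually targeted the window surface.

```python
import re

# Matches lines shaped like the excerpt in this report, e.g.
#   "1139 eglSwapBuffersWithDamageKHR(dpy = ...) = EGL_TRUE"
CALL_RE = re.compile(r"^\s*(\d+)\s+(\w+)\(")

# Assumption: calls with these name prefixes write to the current draw buffer.
# glClearColor and glDrawBuffers only set state, so they are not listed.
WRITE_PREFIXES = (
    "glClearBuffer",
    "glDrawArrays",
    "glDrawElements",
    "glDrawRangeElements",
    "glBlitFramebuffer",
)


def calls_before_first_swap(dump_text):
    """Return the call names between surface creation and the first swap.

    Returns None if no eglCreateWindowSurface followed by a swap is found.
    """
    names = None
    for line in dump_text.splitlines():
        match = CALL_RE.match(line)
        if not match:
            continue
        name = match.group(2)
        if names is None:
            # Still looking for the surface creation call.
            if name == "eglCreateWindowSurface":
                names = []
        elif name.startswith("eglSwapBuffers"):
            return names
        else:
            names.append(name)
    return None


def swapped_uninitialized(dump_text):
    """True if the first swap follows surface creation with no clear or draw."""
    names = calls_before_first_swap(dump_text)
    if names is None:
        return False
    return not any(n == "glClear" or n.startswith(WRITE_PREFIXES) for n in names)
```

One possible use is to save the output of `apitrace dump trace` to a file and pass its contents to `swapped_uninitialized`; a True result corresponds to the pattern in the excerpt, where only state-setting calls such as glViewport and glScissor appear before the swap.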

I don't know if it's relevant or not, but #1464823 has some discussion that may be related to this issue (though it's pretty old).

Where do I find the apitrace log you pasted? I'm not seeing it in the history of this bug.

Bug 1778837 linked to this bug looks super relevant. I'm not entirely sure how to provoke this wrong initialization order, but I can confirm from my repro attempts that we're not creating the surface when the window is created, and that's breaking assumptions of the window decorators. Regardless of whether we hit a race condition that shows magenta, I think we should be trying to get the window surface fully populated before any animation occurs.

My attempt to cause slow startup was compiling as debug without optimizations, and modifying glxtest to add a couple of half-second delays (it can't be much more than that or it punts to software rendering).

I'm wondering whether the eglSwapBuffers comes from the regular place in gfx/gl/GLContextProviderEGL.cpp (called by WebRender) or if there's a different piece of code involved.

If you're interested, a couple env variables that might be relevant for order of init questions:
MOZ_GL_SPEW=true MOZ_GL_DEBUG_VERBOSE=true ./mach run

Obviously apitrace is more authoritative overall, and seeing the stacks would help.

Flags: needinfo?(lina)

(In reply to Ashley Hale [:ahale] from comment #7)

so an empty window (decorated but nothing in it) appears for a while on Wayland on Ubuntu 23.04, I'm not sure if what I reproed is the same issue that causes magenta on Asahi Linux though.

An empty transparent window is the default window state until we paint something into it.

You can flip 'widget.transparent-windows' to paint the window with the default Gtk background on Linux:
https://searchfox.org/mozilla-central/rev/4a5c56f4aca291802ce27320cd9a752dd5dd955e/widget/gtk/nsWindow.cpp#6142

I will investigate this more, assigning to myself.

The trace snippet in the report shows eglSwapBuffersWithDamageKHR with n_rects = 0, which makes me think that WebRender is being invoked to draw an empty display list: the first display list it receives is expected to contain a background clear, but in this case it does not. I can likely make WebRender handle this case specially and guarantee that a clear occurs before eglSwapBuffers.
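The ordering guarantee described here can be modelled in a few lines. The Python sketch below is purely illustrative and is not WebRender code: `RecordingGL` and `GuardedSurface` are hypothetical names, and the backend only records call names instead of talking to EGL. It shows one way to ensure that presenting after an empty display list falls back to a clear instead of swapping an unwritten buffer.

```python
class RecordingGL:
    """Hypothetical stand-in for a GL/EGL backend; it only records call names."""

    def __init__(self):
        self.calls = []

    def clear(self):
        self.calls.append("glClear")

    def draw(self, item):
        self.calls.append(f"draw:{item}")

    def swap_buffers(self):
        self.calls.append("eglSwapBuffers")


class GuardedSurface:
    """Hypothetical wrapper that never presents a buffer nobody wrote to."""

    def __init__(self, gl):
        self._gl = gl
        self._written = False

    def render(self, display_list):
        # An empty display list issues no draws, so the buffer stays unwritten.
        for item in display_list:
            self._gl.draw(item)
            self._written = True

    def present(self):
        if not self._written:
            # Fall back to a plain clear so the swap never shows garbage.
            self._gl.clear()
        self._gl.swap_buffers()
        # Conservative assumption: treat the next back buffer as unwritten.
        self._written = False
```

With this model, calling `render([])` followed by `present()` records a glClear before the swap, while a non-empty display list is presented without the extra clear. Resetting the flag after every swap is a conservative choice for the sketch; a real implementation could consult the buffer age instead.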

My bigger concern beyond fixing that specific bug is that window decorators are making assumptions about us creating a backing surface at the same time we create the context. It is still a bad look if the browser window just shows a window border with a translucent (and sometimes blurred) view of the desktop inside it for a moment on startup. That is a separate issue from the API call issue reported here, though, so I will prioritize the incorrect API usage first.

Assignee: stransky → ahale
Flags: needinfo?(ahale)
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
No longer blocks: gfx-triage
Severity: S2 → S3

Clear a needinfo that is pending on an inactive user.

Inactive users most likely will not respond; if the missing information is essential and cannot be collected another way, the bug maybe should be closed as INCOMPLETE.

For more information, please visit BugBot documentation.

Flags: needinfo?(lina)