Certain 3D CSS usage crashes WebRenderer process on Windows, leaving Firefox running poorly until the whole process is restarted
Categories
(Core :: Graphics: WebRender, defect)
Tracking
()
People
(Reporter: fiend, Assigned: gw)
References
Details
(Whiteboard: [win:stability])
Attachments
(4 files)
In an environment rendered with 3D HTML/CSS, moving the 'camera' around (transforming the container to simulate movement) can sometimes result in the issue described in title. Attached a gif of it occurring.
This specifically occurs on windows (we've tried to get it to happen on linux/mac, fortunately does not), and I believe it's specifically when there are smaller animations contained within the container that aren't entirely on-screen, i.e. ground tiles spinning and floating.
When the webrender process stops, it does not recover until Firefox is completely closed and re-opened, and all current + future tabs run poorly. It typically results in some weird behavior on other sites with gifs (i.e. the gif I attached to this thread only having a single frame afterwards) or videos.
Relevant links:
- https://b.corru.observer/concepts/firefox/?force&canvasfloor - Can be recreated here - WASD to move, typically happens very quickly if you move around for a bit. Have observed it happens most frequently by turning right after moving forward once or twice.
- https://b.corru.observer/concepts/firefox/?force&canvasfloor&safety - A version without the tile animations that seems to be fine
- https://share.firefox.dev/4d5h55z - Example performance profile shared by someone able to recreate it on reddit (/u/yjuglaret)
- https://share.firefox.dev/3Uvu8Wm - Performance profile captured personally while writing this bug
Comment 1•9 months ago
•
|
||
(In reply to fiend from comment #0)
- https://share.firefox.dev/4d5h55z - Example performance profile shared by someone able to recreate it on reddit (/u/yjuglaret)
That would be me - this profile is from today's Nightly running on Windows. Thanks for submitting a bug! I confirm that I can reproduce this issue in Release 125.0.2 and Nightly 127. Here is a link to the original Reddit thread for reference.
My personal STR:
- Open Firefox.
- Maximize the Firefox window.
- Navigate to https://b.corru.observer/concepts/firefox/?force&canvasfloor
- Click "firefox flicker - PROCEED".
- Wait for the scene to be completely rendered.
- Type ↑↑←↑↑↑→↑←↑
- Keep → pressed until you visually notice that the rendering slows down and the rotation speed goes down as well. In my tests this took between 3 seconds at minimum and up to more than 60 seconds sometimes, so patience is required here but eventually the bug should appear. (Edit: Sometimes alternating 10 seconds of → and 10 seconds of ← works better.)
Expected behavior: the rendering does not slow down, the rotation speed never varies.
Updated•9 months ago
|
Comment 2•9 months ago
•
|
||
The profiles and the description strongly suggest fallback to sw-wr
@Yannis: can you please attach the contents of your about:support before the bug repros, adn after the bug repros?
Comment 3•9 months ago
|
||
Somehow I have trouble reproducing on the same computer today, so here is from my laptop with the newest Nightly. This is about:support
before running into the bug.
Comment 4•9 months ago
|
||
This is about:support
after running into the bug. Parts that seem relevant in the diff:
1101c1105
< "windowLayerManagerType": "WebRender",
---
> "windowLayerManagerType": "WebRender (Software D3D11)",
1159a1164,1199
> "failures": [
> "[GFX1-]: Handling webrender error 4",
> "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
> "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
> "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
> "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
> "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
> "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
> "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
> "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
> "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
> "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
> "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
> "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
> "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
> "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
> "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80"
> ],
> "indices": [
> 0,
> 31,
> 32,
> 33,
> 34,
> 35,
> 36,
> 37,
> 38,
> 24,
> 25,
> 26,
> 27,
> 28,
> 29,
> 30
> ],
1220c1260
< "status": "available",
---
> "status": "unavailable",
1224a1265,1270
> },
> {
> "type": "runtime",
> "status": "unavailable",
> "failureId": "FEATURE_FAILURE_WEBRENDER_BEGIN_DRAW",
> "message": "BeginDraw() failed"
Comment 5•9 months ago
|
||
The call to BeginDraw
in DCLayerTree::CreateEGLSurfaceForCompositionSurface
fails with E_INVALIDARG
(0x80070057). I'll try to locally debug why.
Comment 6•9 months ago
|
||
Yannis, is this a regression? Can you repro if you disable your integrated Intel Xe GPU? Can you repro if you disable the dGPU and only use the iGPU?
(I cant repro on my Win11 AMD single GPU single monitor machine)
Updated•9 months ago
|
Updated•9 months ago
|
Comment 7•9 months ago
|
||
(In reply to Mayank Bansal from comment #6)
Yannis, is this a regression? Can you repro if you disable your integrated Intel Xe GPU? Can you repro if you disable the dGPU and only use the iGPU?
(I cant repro on my Win11 AMD single GPU single monitor machine)
Not a regression as far as I can tell. I can repro in ESR 115.
As long as the Intel iGPU is enabled I can repro (NVIDIA dGPU enabled or not does not matter): when the bug occurs I have slow rendering and some tiles disappear like in the GIF attachment. If I disable the Intel iGPU the rendering is too slow from the start (NVIDIA dGPU enabled or not does not matter) to let me reproduce the bug; all tiles are shown however.
Comment 8•9 months ago
•
|
||
I'm now positive that BeginDraw
fails with E_INVALIDARG
because we are giving it a &update_rect
such that update_rect.left == update_rect.right
and update_rect.top == update_rect.bottom
(a "zero-area rectangle"). [:gw], is this enough information to address this bug? Let me know if there's another path along which you'd like me to do some debugging. (By the way, is it normal that DCLayerTree::CreateEGLSurfaceForCompositionSurface
returns false
when BeginDraw
fails even though the return type is GLuint
?)
Updated•9 months ago
|
Assignee | ||
Comment 9•9 months ago
|
||
Thanks for the detailed debugging so far, this is very helpful.
This will certainly only occur on Windows as the BeginDraw
is part of the DirectComposition integration, where the failure occurs. However, it will be failing (silently) in the same way on Linux / Mac, I would think.
We could in theory detect the zero-rect case in DCLayerTree
and avoid the failure there. However, it would be much better if we could find the root cause of why we end up with a zero-rect case getting that far down the pipeline.
WR is supposed to early out on zero-area rectangles before it gets to that point in the pipeline. So handling it in DCLayerTree
would sort of be a band-aid that hides the underlying issue earlier on, which may cause other unexpected performance or correctness issues.
I may have time later this week to try repro on a Windows box and look into it. If you are interested in doing some more debugging locally, and can repro relatively easily, the place I would start looking is around [1]. This adds a surface to the compositor - we could possibly add some logging here to find out when a zero-area rect is ending up being sent to the compositor interface, and then trace back from there to see how it occurs.
[1] https://searchfox.org/mozilla-central/source/gfx/wr/webrender/src/composite.rs#775
Comment 10•9 months ago
|
||
After a bit more debugging, the reason we get a zero-area rectangle is because wr::DeviceIntRect aDirtyRect
has 0 in all 4 variants of aDirtyRect.{min,max}.{x,y}
. This value comes from the dirty_rect
field of a PictureCacheTarget
, which itself comes from the pic_task.scissor_rect
of a RenderTaskKind::Picture(ref pic_task)
in build_render_pass
, which is originally computed by the path below in PicturePrimitive::take_context
.
let scissor_rect = frame_state.composite_state.get_surface_rect(
tile.local_dirty_rect,
&tile.local_tile_rect,
tile_cache.transform_index,
).to_i32();
This call can return DeviceRect::zero
(and does here, as far as I can tell) if the intersection is empty:
pub fn get_surface_rect<T>(
&self,
local_sub_rect: &Box2D<f32, T>,
local_bounds: &Box2D<f32, T>,
transform_index: CompositorTransformIndex,
) -> DeviceRect {
let transform = &self.transforms[transform_index.0];
let surface_bounds = transform.local_to_surface.map_rect(&local_bounds);
let surface_rect = transform.local_to_surface.map_rect(&local_sub_rect);
surface_rect
.round_out()
.translate(-surface_bounds.min.to_vector())
.round_out()
.intersection(&surface_bounds.size().round().into())
.unwrap_or_else(DeviceRect::zero)
}
PicturePrimitive::take_context
does not treat DeviceRect::zero
as a special case which seems to be why we later end up with the problem mentioned in comment 8.
Assignee | ||
Comment 11•9 months ago
|
||
Interesting findings. The assumption here is that tile.local_dirty_rect
should always intersect with tile.local_tile_rect
as it should be a subset, in this code path.
So I wonder if that invariant is not correct, or if we're dealing with some floating point accuracy issues (I suspect the latter).
Would it be possible to log out the values of tile.local_dirty_rect
, tile.local_tile_rect
and the return value of get_surface_rect
(before the .to_i32()
call) when this case occurs? That would hopefully let us identify whether it's a result of float accuracy / rounding issues, or that the expected invariant in the inputs is not true here.
Comment 12•9 months ago
•
|
||
So digging a bit further, the following happens right before the bug:
- we reach a call to
Tile::invalidate
where theinvalidation_rect
isSome(rect)
(e.g.Box2D((3456.0, 512.0), (3584.0, 1024.0))
); - in that call we update
self.local_dirty_rect
fromBox2D((0.0, 0.0), (0.0, 0.0))
torect
through a union; - in
PicturePrimitive::take_context
we intersect that withtile.current_descriptor.local_valid_rect
(e.g.Box2D((3072.0, 512.0), (3456.0, 1024.0))
); - but the intersection is empty so
self.local_dirty_rect
is nowPictureRect::zero()
, leading to a second empty intersection in the computation from comment 10.
I'll try to continue to track this, e.g. figure out which call to Tile::invalidate
is involved here.
Comment 13•9 months ago
•
|
||
The invalidation occurs in Tile::update_dirty_rects
. The invalidation reason is PrimCount
. It changes the Tile
for example from:
id=210
is_valid=1
local_tile_rect=Box2D((3072.0, 1536.0), (4096.0, 2048.0))
local_valid_rect=Box2D((-13214.886, -169.00008), (3456.0005, 4861.522))
local_dirty_rect=Box2D((0.0, 0.0), (0.0, 0.0))
current_descriptor.local_valid_rect=Box2D((3072.0, 1536.0), (3456.0, 2048.0))
To:
id=210
is_valid=0
local_tile_rect=Box2D((3072.0, 1536.0), (4096.0, 2048.0))
local_valid_rect=Box2D((-13214.886, -169.00008), (3456.0005, 4861.522))
local_dirty_rect=Box2D((3456.0, 1536.0), (3584.0, 2048.0))
current_descriptor.local_valid_rect=Box2D((3072.0, 1536.0), (3456.0, 2048.0))
Below are the lines in PicturePrimitive::take_context
that seem relevant when we later reach it (the intersection is empty and then we take the else
branch):
// Ensure that the dirty rect doesn't extend outside the local valid rect.
tile.local_dirty_rect = tile.local_dirty_rect
.intersection(&tile.current_descriptor.local_valid_rect)
.unwrap_or_else(PictureRect::zero);
// ...
if tile.is_valid {
// ...
} else {
// ...
if let TileSurface::Texture { ref mut descriptor } = tile.surface.as_mut().unwrap() {
// ...
let scissor_rect = frame_state.composite_state.get_surface_rect(
&tile.local_dirty_rect,
&tile.local_tile_rect,
tile_cache.transform_index,
).to_i32();
In another part of PicturePrimitive::take_context
, to reset the dirty rect we not only set tile.local_dirty_rect
to PictureRect::zero()
but we also set tile.is_valid
to true
. The latter seems to be missing here and that could explain why taking the else
branch here would be a mistake. Updating to the code below seems to safely make the issue disappear (assuming I've tested long enough):
tile.local_dirty_rect = tile.local_dirty_rect
.intersection(&tile.current_descriptor.local_valid_rect)
.unwrap_or_else(|| { tile.is_valid = true; PictureRect::zero() });
As an outsider to this code I cannot confirm if that is the correct fix, but it seems like a reasonable explanation.
Assignee | ||
Comment 14•9 months ago
|
||
Great find, thanks for digging in to it. Yes, I agree that the proposed fix seems reasonable.
I did a try run with your patch and it doesn't seem to break anything [1] other than a few unrelated intermittent failures.
Long term, it'd probably be better to handle this slightly earlier than take_context
, perhaps in post_update
of Tile
. However, I think it's fine to merge this as-is, since that change will require a bit more work.
Do you want to create a patch and I'll review it, or would you rather I create a patch? (not entirely sure how we'll automate a test for this yet, will need to think about it a bit more).
Comment 15•9 months ago
|
||
Thanks, I'll push the short-term change for review then.
Comment 16•9 months ago
•
|
||
When we reset the dirty rect to PictureRect::zero() on a Tile, we must
also mark the Tile as valid. This prevents the propagation of zero-area
rectangles which can otherwise ultimately cause BeginDraw failures.
Comment 17•9 months ago
|
||
Thanks for helping out here, Yannis. :)
Comment 18•9 months ago
•
|
||
Try build also appears to fix bug 1886652
But Bug 1750348 still crashes.
Comment 19•9 months ago
|
||
Updated•9 months ago
|
Comment 21•9 months ago
|
||
bugherder |
Updated•9 months ago
|
Assignee | ||
Comment 23•8 months ago
|
||
The bug has (I think) existed for a long time and probably doesn't affect that many pages. On the other hand, it's a severe bug (GPU process crash) when it does occur. If it applies cleanly it's probably worth uplifting I think.
Comment 24•8 months ago
|
||
This grafts cleanly to esr115. Feel free to go ahead and nominate.
Assignee | ||
Updated•8 months ago
|
Assignee | ||
Comment 25•7 months ago
|
||
Comment on attachment 9399943 [details]
Bug 1893205 - Properly reset the dirty rect in PicturePrimitive::take_context. r=gw
ESR Uplift Approval Request
- If this is not a sec:{high,crit} bug, please state case for ESR consideration: Suggested by ryanvm, fixes GPU process crashes on some pages.
- User impact if declined: Certain pages will crash GPU process, falling back to software.
- Fix Landed on Version:
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Has been in m-c for some time without any reported issues. Grafts cleanly.
Comment 26•7 months ago
|
||
Comment on attachment 9399943 [details]
Bug 1893205 - Properly reset the dirty rect in PicturePrimitive::take_context. r=gw
Approved for 115.13esr.
Updated•7 months ago
|
Comment 27•7 months ago
|
||
uplift |
Description
•