Closed Bug 1893205 Opened 9 months ago Closed 9 months ago

Certain 3D CSS usage crashes WebRenderer process on Windows, leaving Firefox running poorly until the whole process is restarted

Tracking

()

Status:

RESOLVED FIXED

Milestone:

127 Branch

Tracking Flags:

Tracking

Status

firefox-esr115

---

fixed

firefox125

---

wontfix

firefox126

---

wontfix

firefox127

---

fixed

People

(Reporter: fiend, Assigned: gw)

References

Details

(Whiteboard: [win:stability])

Attachments

(4 files)

Demonstration of the bug - moving around smoothly and randomly, followed by a flash, then a drop in FPS and the floor canvas losing what was rendered on it. 9 months ago fiend 7.07 MB, image/gif		Details
about_support_before.json 9 months ago Yannis Juglaret [:yannis] 70.80 KB, application/json		Details
about_support_after.json 9 months ago Yannis Juglaret [:yannis] 72.56 KB, application/json		Details
Bug 1893205 - Properly reset the dirty rect in PicturePrimitive::take_context. r=gw 9 months ago Yannis Juglaret [:yannis] 48 bytes, text/x-phabricator-request	RyanVM : approval-mozilla-esr115+	Details \| Review

fiend

Reporter

Description

•

9 months ago

Attached image Demonstration of the bug - moving around smoothly and randomly, followed by a flash, then a drop in FPS and the floor canvas losing what was rendered on it. — Details

In an environment rendered with 3D HTML/CSS, moving the 'camera' around (transforming the container to simulate movement) can sometimes result in the issue described in title. Attached a gif of it occurring.

This specifically occurs on windows (we've tried to get it to happen on linux/mac, fortunately does not), and I believe it's specifically when there are smaller animations contained within the container that aren't entirely on-screen, i.e. ground tiles spinning and floating.

When the webrender process stops, it does not recover until Firefox is completely closed and re-opened, and all current + future tabs run poorly. It typically results in some weird behavior on other sites with gifs (i.e. the gif I attached to this thread only having a single frame afterwards) or videos.

Relevant links:

https://b.corru.observer/concepts/firefox/?force&canvasfloor - Can be recreated here - WASD to move, typically happens very quickly if you move around for a bit. Have observed it happens most frequently by turning right after moving forward once or twice.
https://b.corru.observer/concepts/firefox/?force&canvasfloor&safety - A version without the tile animations that seems to be fine
https://share.firefox.dev/4d5h55z - Example performance profile shared by someone able to recreate it on reddit (/u/yjuglaret)
https://share.firefox.dev/3Uvu8Wm - Performance profile captured personally while writing this bug

Yannis Juglaret [:yannis]

Comment 1

•

9 months ago

•

Edited

(In reply to fiend from comment #0)

https://share.firefox.dev/4d5h55z - Example performance profile shared by someone able to recreate it on reddit (/u/yjuglaret)

That would be me - this profile is from today's Nightly running on Windows. Thanks for submitting a bug! I confirm that I can reproduce this issue in Release 125.0.2 and Nightly 127. Here is a link to the original Reddit thread for reference.

My personal STR:

Open Firefox.
Maximize the Firefox window.
Navigate to https://b.corru.observer/concepts/firefox/?force&canvasfloor
Click "firefox flicker - PROCEED".
Wait for the scene to be completely rendered.
Type ↑↑←↑↑↑→↑←↑
Keep → pressed until you visually notice that the rendering slows down and the rotation speed goes down as well. In my tests this took between 3 seconds at minimum and up to more than 60 seconds sometimes, so patience is required here but eventually the bug should appear. (Edit: Sometimes alternating 10 seconds of → and 10 seconds of ← works better.)

Expected behavior: the rendering does not slow down, the rotation speed never varies.

Has STR: --- → yes

status-firefox125: --- → affected

status-firefox126: --- → affected

status-firefox127: --- → affected

Yannis Juglaret [:yannis]

Updated

•

9 months ago

Status: UNCONFIRMED → NEW

Ever confirmed: true

Mayank Bansal

Comment 2

•

9 months ago

•

Edited

The profiles and the description strongly suggest fallback to sw-wr

@Yannis: can you please attach the contents of your about:support before the bug repros, adn after the bug repros?

Flags: needinfo?(yjuglaret)

Yannis Juglaret [:yannis]

Comment 3

•

9 months ago

Attached file about_support_before.json — Details

Somehow I have trouble reproducing on the same computer today, so here is from my laptop with the newest Nightly. This is about:support before running into the bug.

Flags: needinfo?(yjuglaret)

Yannis Juglaret [:yannis]

Comment 4

•

9 months ago

Attached file about_support_after.json — Details

This is about:support after running into the bug. Parts that seem relevant in the diff:

1101c1105
<       "windowLayerManagerType": "WebRender",
---
>       "windowLayerManagerType": "WebRender (Software D3D11)",
1159a1164,1199
>       "failures": [
>         "[GFX1-]: Handling webrender error 4",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80"
>       ],
>       "indices": [
>         0,
>         31,
>         32,
>         33,
>         34,
>         35,
>         36,
>         37,
>         38,
>         24,
>         25,
>         26,
>         27,
>         28,
>         29,
>         30
>       ],
1220c1260
<             "status": "available",
---
>             "status": "unavailable",
1224a1265,1270
>               },
>               {
>                 "type": "runtime",
>                 "status": "unavailable",
>                 "failureId": "FEATURE_FAILURE_WEBRENDER_BEGIN_DRAW",
>                 "message": "BeginDraw() failed"

Yannis Juglaret [:yannis]

Comment 5

•

9 months ago

The call to BeginDraw in DCLayerTree::CreateEGLSurfaceForCompositionSurface fails with E_INVALIDARG (0x80070057). I'll try to locally debug why.

Flags: needinfo?(yjuglaret)

Mayank Bansal

Comment 6

•

9 months ago

Yannis, is this a regression? Can you repro if you disable your integrated Intel Xe GPU? Can you repro if you disable the dGPU and only use the iGPU?

(I cant repro on my Win11 AMD single GPU single monitor machine)

Yannis Juglaret [:yannis]

Updated

•

9 months ago

Flags: needinfo?(yjuglaret)

Whiteboard: [win:stability]

Yannis Juglaret [:yannis]

Updated

•

9 months ago

Flags: needinfo?(yjuglaret)

Yannis Juglaret [:yannis]

Comment 7

•

9 months ago

(In reply to Mayank Bansal from comment #6)

Yannis, is this a regression? Can you repro if you disable your integrated Intel Xe GPU? Can you repro if you disable the dGPU and only use the iGPU?

(I cant repro on my Win11 AMD single GPU single monitor machine)

Not a regression as far as I can tell. I can repro in ESR 115.

As long as the Intel iGPU is enabled I can repro (NVIDIA dGPU enabled or not does not matter): when the bug occurs I have slow rendering and some tiles disappear like in the GIF attachment. If I disable the Intel iGPU the rendering is too slow from the start (NVIDIA dGPU enabled or not does not matter) to let me reproduce the bug; all tiles are shown however.

status-firefox-esr115: --- → affected

Yannis Juglaret [:yannis]

Comment 8

•

9 months ago

•

Edited

I'm now positive that BeginDraw fails with E_INVALIDARG because we are giving it a &update_rect such that update_rect.left == update_rect.right and update_rect.top == update_rect.bottom (a "zero-area rectangle"). [:gw], is this enough information to address this bug? Let me know if there's another path along which you'd like me to do some debugging. (By the way, is it normal that DCLayerTree::CreateEGLSurfaceForCompositionSurface returns false when BeginDraw fails even though the return type is GLuint?)

Flags: needinfo?(yjuglaret) → needinfo?(gwatson)

Jeff Muizelaar [:jrmuizel]

Updated

•

9 months ago

Severity: -- → S2

Glenn Watson [:gw]

Assignee

Comment 9

•

9 months ago

Thanks for the detailed debugging so far, this is very helpful.

This will certainly only occur on Windows as the BeginDraw is part of the DirectComposition integration, where the failure occurs. However, it will be failing (silently) in the same way on Linux / Mac, I would think.

We could in theory detect the zero-rect case in DCLayerTree and avoid the failure there. However, it would be much better if we could find the root cause of why we end up with a zero-rect case getting that far down the pipeline.

WR is supposed to early out on zero-area rectangles before it gets to that point in the pipeline. So handling it in DCLayerTree would sort of be a band-aid that hides the underlying issue earlier on, which may cause other unexpected performance or correctness issues.

I may have time later this week to try repro on a Windows box and look into it. If you are interested in doing some more debugging locally, and can repro relatively easily, the place I would start looking is around [1]. This adds a surface to the compositor - we could possibly add some logging here to find out when a zero-area rect is ending up being sent to the compositor interface, and then trace back from there to see how it occurs.

[1] https://searchfox.org/mozilla-central/source/gfx/wr/webrender/src/composite.rs#775

Assignee: nobody → gwatson

Flags: needinfo?(gwatson)

Yannis Juglaret [:yannis]

Comment 10

•

9 months ago

After a bit more debugging, the reason we get a zero-area rectangle is because wr::DeviceIntRect aDirtyRect has 0 in all 4 variants of aDirtyRect.{min,max}.{x,y}. This value comes from the dirty_rect field of a PictureCacheTarget, which itself comes from the pic_task.scissor_rect of a RenderTaskKind::Picture(ref pic_task) in build_render_pass, which is originally computed by the path below in PicturePrimitive::take_context.

let scissor_rect = frame_state.composite_state.get_surface_rect(
  tile.local_dirty_rect,
  &tile.local_tile_rect,
  tile_cache.transform_index,
).to_i32();

This call can return DeviceRect::zero (and does here, as far as I can tell) if the intersection is empty:

    pub fn get_surface_rect<T>(
        &self,
        local_sub_rect: &Box2D<f32, T>,
        local_bounds: &Box2D<f32, T>,
        transform_index: CompositorTransformIndex,
    ) -> DeviceRect {
        let transform = &self.transforms[transform_index.0];

        let surface_bounds = transform.local_to_surface.map_rect(&local_bounds);
        let surface_rect = transform.local_to_surface.map_rect(&local_sub_rect);

        surface_rect
            .round_out()
            .translate(-surface_bounds.min.to_vector())
            .round_out()
            .intersection(&surface_bounds.size().round().into())
            .unwrap_or_else(DeviceRect::zero)
    }

PicturePrimitive::take_context does not treat DeviceRect::zero as a special case which seems to be why we later end up with the problem mentioned in comment 8.

Glenn Watson [:gw]

Assignee

Comment 11

•

9 months ago

Interesting findings. The assumption here is that tile.local_dirty_rect should always intersect with tile.local_tile_rect as it should be a subset, in this code path.

So I wonder if that invariant is not correct, or if we're dealing with some floating point accuracy issues (I suspect the latter).

Would it be possible to log out the values of tile.local_dirty_rect, tile.local_tile_rect and the return value of get_surface_rect (before the .to_i32() call) when this case occurs? That would hopefully let us identify whether it's a result of float accuracy / rounding issues, or that the expected invariant in the inputs is not true here.

Flags: needinfo?(yjuglaret)

Yannis Juglaret [:yannis]

Comment 12

•

9 months ago

•

Edited

So digging a bit further, the following happens right before the bug:

we reach a call to Tile::invalidate where the invalidation_rect is Some(rect) (e.g. Box2D((3456.0, 512.0), (3584.0, 1024.0)));
in that call we update self.local_dirty_rect from Box2D((0.0, 0.0), (0.0, 0.0)) to rect through a union;
in PicturePrimitive::take_context we intersect that with tile.current_descriptor.local_valid_rect (e.g. Box2D((3072.0, 512.0), (3456.0, 1024.0)));
but the intersection is empty so self.local_dirty_rect is now PictureRect::zero(), leading to a second empty intersection in the computation from comment 10.

I'll try to continue to track this, e.g. figure out which call to Tile::invalidate is involved here.

Yannis Juglaret [:yannis]

Comment 13

•

9 months ago

•

Edited

The invalidation occurs in Tile::update_dirty_rects. The invalidation reason is PrimCount. It changes the Tile for example from:

id=210
is_valid=1
local_tile_rect=Box2D((3072.0, 1536.0), (4096.0, 2048.0))
local_valid_rect=Box2D((-13214.886, -169.00008), (3456.0005, 4861.522))
local_dirty_rect=Box2D((0.0, 0.0), (0.0, 0.0))
current_descriptor.local_valid_rect=Box2D((3072.0, 1536.0), (3456.0, 2048.0))

To:

id=210
is_valid=0
local_tile_rect=Box2D((3072.0, 1536.0), (4096.0, 2048.0))
local_valid_rect=Box2D((-13214.886, -169.00008), (3456.0005, 4861.522))
local_dirty_rect=Box2D((3456.0, 1536.0), (3584.0, 2048.0))
current_descriptor.local_valid_rect=Box2D((3072.0, 1536.0), (3456.0, 2048.0))

Below are the lines in PicturePrimitive::take_context that seem relevant when we later reach it (the intersection is empty and then we take the else branch):

// Ensure that the dirty rect doesn't extend outside the local valid rect.
tile.local_dirty_rect = tile.local_dirty_rect
    .intersection(&tile.current_descriptor.local_valid_rect)
    .unwrap_or_else(PictureRect::zero);
// ...
if tile.is_valid {
    // ...
} else {
    // ...
    if let TileSurface::Texture { ref mut descriptor } = tile.surface.as_mut().unwrap() {
        // ...
        let scissor_rect = frame_state.composite_state.get_surface_rect(
            &tile.local_dirty_rect,
            &tile.local_tile_rect,
            tile_cache.transform_index,
        ).to_i32();

In another part of PicturePrimitive::take_context, to reset the dirty rect we not only set tile.local_dirty_rect to PictureRect::zero() but we also set tile.is_valid to true. The latter seems to be missing here and that could explain why taking the else branch here would be a mistake. Updating to the code below seems to safely make the issue disappear (assuming I've tested long enough):

tile.local_dirty_rect = tile.local_dirty_rect
    .intersection(&tile.current_descriptor.local_valid_rect)
    .unwrap_or_else(|| { tile.is_valid = true; PictureRect::zero() });

As an outsider to this code I cannot confirm if that is the correct fix, but it seems like a reasonable explanation.

Flags: needinfo?(yjuglaret)

Glenn Watson [:gw]

Assignee

Comment 14

•

9 months ago

Great find, thanks for digging in to it. Yes, I agree that the proposed fix seems reasonable.

I did a try run with your patch and it doesn't seem to break anything [1] other than a few unrelated intermittent failures.

Long term, it'd probably be better to handle this slightly earlier than take_context, perhaps in post_update of Tile. However, I think it's fine to merge this as-is, since that change will require a bit more work.

Do you want to create a patch and I'll review it, or would you rather I create a patch? (not entirely sure how we'll automate a test for this yet, will need to think about it a bit more).

[1] https://treeherder.mozilla.org/jobs?repo=try&revision=fc5b7d309aa9385f749da00c60f794755b9a2d65&selectedTaskRun=d2JAIvNkSrKTBUZd4e4F8w.1

Flags: needinfo?(yjuglaret)

Yannis Juglaret [:yannis]

Comment 15

•

9 months ago

Thanks, I'll push the short-term change for review then.

Flags: needinfo?(yjuglaret)

Yannis Juglaret [:yannis]

Comment 16

•

9 months ago

•

Edited

Attached file Bug 1893205 - Properly reset the dirty rect in PicturePrimitive::take_context. r=gw — Details

When we reset the dirty rect to PictureRect::zero() on a Tile, we must
also mark the Tile as valid. This prevents the propagation of zero-area
rectangles which can otherwise ultimately cause BeginDraw failures.

Bob Hood [:bhood]

Comment 17

•

9 months ago

Thanks for helping out here, Yannis. :)

Mayank Bansal

Comment 18

•

9 months ago

•

Edited

Try build also appears to fix bug 1886652
But Bug 1750348 still crashes.

Comment 19

•

9 months ago

Pushed by yjuglaret@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/37528cd56040 Properly reset the dirty rect in PicturePrimitive::take_context. r=gw

Mayank Bansal

Updated

•

9 months ago

Duplicate of this bug: 1886652

Comment 21

•

9 months ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/37528cd56040

Status: NEW → RESOLVED

Closed: 9 months ago

status-firefox127: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → 127 Branch

Ryan VanderMeulen [:RyanVM]

Updated

•

9 months ago

status-firefox125: affected → wontfix

status-firefox126: affected → wontfix

Yannis Juglaret [:yannis]

Updated

•

8 months ago

Comment 22

•

8 months ago

Should we consider uplifting this to ESR115?

Flags: needinfo?(gwatson)

Glenn Watson [:gw]

Assignee

Comment 23

•

8 months ago

The bug has (I think) existed for a long time and probably doesn't affect that many pages. On the other hand, it's a severe bug (GPU process crash) when it does occur. If it applies cleanly it's probably worth uplifting I think.

Flags: needinfo?(gwatson)

Mayank Bansal

Updated

•

8 months ago

Blocks: 1826177

Ryan VanderMeulen [:RyanVM]

Comment 24

•

8 months ago

This grafts cleanly to esr115. Feel free to go ahead and nominate.

Flags: needinfo?(gwatson)

Glenn Watson [:gw]

Assignee

Updated

•

8 months ago

Flags: needinfo?(gwatson)

Glenn Watson [:gw]

Assignee

Comment 25

•

7 months ago

Comment on attachment 9399943 [details]
Bug 1893205 - Properly reset the dirty rect in PicturePrimitive::take_context. r=gw

ESR Uplift Approval Request

If this is not a sec:{high,crit} bug, please state case for ESR consideration: Suggested by ryanvm, fixes GPU process crashes on some pages.
User impact if declined: Certain pages will crash GPU process, falling back to software.
Fix Landed on Version:
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): Has been in m-c for some time without any reported issues. Grafts cleanly.

Attachment #9399943 - Flags: approval-mozilla-esr115?

Ryan VanderMeulen [:RyanVM]

Comment 26

•

7 months ago

Comment on attachment 9399943 [details]
Bug 1893205 - Properly reset the dirty rect in PicturePrimitive::take_context. r=gw

Approved for 115.13esr.

Attachment #9399943 - Flags: approval-mozilla-esr115? → approval-mozilla-esr115+

Ryan VanderMeulen [:RyanVM]

Updated

•

7 months ago

status-firefox-esr115: affected → fixed

Pulsebot

Comment 27

•

7 months ago

uplift

https://hg.mozilla.org/releases/mozilla-esr115/rev/46db4faaf678

You need to log in before you can comment on or make changes to this bug.