Closed Bug 1893205 Opened 9 months ago Closed 9 months ago

Certain 3D CSS usage crashes WebRenderer process on Windows, leaving Firefox running poorly until the whole process is restarted

Categories

(Core :: Graphics: WebRender, defect)

defect

Tracking

()

RESOLVED FIXED
127 Branch
Tracking Status
firefox-esr115 --- fixed
firefox125 --- wontfix
firefox126 --- wontfix
firefox127 --- fixed

People

(Reporter: fiend, Assigned: gw)

References

Details

(Whiteboard: [win:stability])

Attachments

(4 files)

In an environment rendered with 3D HTML/CSS, moving the 'camera' around (transforming the container to simulate movement) can sometimes result in the issue described in title. Attached a gif of it occurring.

This specifically occurs on windows (we've tried to get it to happen on linux/mac, fortunately does not), and I believe it's specifically when there are smaller animations contained within the container that aren't entirely on-screen, i.e. ground tiles spinning and floating.

When the webrender process stops, it does not recover until Firefox is completely closed and re-opened, and all current + future tabs run poorly. It typically results in some weird behavior on other sites with gifs (i.e. the gif I attached to this thread only having a single frame afterwards) or videos.

Relevant links:

(In reply to fiend from comment #0)

That would be me - this profile is from today's Nightly running on Windows. Thanks for submitting a bug! I confirm that I can reproduce this issue in Release 125.0.2 and Nightly 127. Here is a link to the original Reddit thread for reference.

My personal STR:

  • Open Firefox.
  • Maximize the Firefox window.
  • Navigate to https://b.corru.observer/concepts/firefox/?force&canvasfloor
  • Click "firefox flicker - PROCEED".
  • Wait for the scene to be completely rendered.
  • Type ↑↑←↑↑↑→↑←↑
  • Keep → pressed until you visually notice that the rendering slows down and the rotation speed goes down as well. In my tests this took between 3 seconds at minimum and up to more than 60 seconds sometimes, so patience is required here but eventually the bug should appear. (Edit: Sometimes alternating 10 seconds of → and 10 seconds of ← works better.)

Expected behavior: the rendering does not slow down, the rotation speed never varies.

Has STR: --- → yes
Status: UNCONFIRMED → NEW
Ever confirmed: true

The profiles and the description strongly suggest fallback to sw-wr

@Yannis: can you please attach the contents of your about:support before the bug repros, adn after the bug repros?

Flags: needinfo?(yjuglaret)

Somehow I have trouble reproducing on the same computer today, so here is from my laptop with the newest Nightly. This is about:support before running into the bug.

Flags: needinfo?(yjuglaret)

This is about:support after running into the bug. Parts that seem relevant in the diff:

1101c1105
<       "windowLayerManagerType": "WebRender",
---
>       "windowLayerManagerType": "WebRender (Software D3D11)",
1159a1164,1199
>       "failures": [
>         "[GFX1-]: Handling webrender error 4",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80",
>         "GP+[GFX1]: RenderDXGITextureHost AcquireSync timeout, hr=0x80"
>       ],
>       "indices": [
>         0,
>         31,
>         32,
>         33,
>         34,
>         35,
>         36,
>         37,
>         38,
>         24,
>         25,
>         26,
>         27,
>         28,
>         29,
>         30
>       ],
1220c1260
<             "status": "available",
---
>             "status": "unavailable",
1224a1265,1270
>               },
>               {
>                 "type": "runtime",
>                 "status": "unavailable",
>                 "failureId": "FEATURE_FAILURE_WEBRENDER_BEGIN_DRAW",
>                 "message": "BeginDraw() failed"


The call to BeginDraw in DCLayerTree::CreateEGLSurfaceForCompositionSurface fails with E_INVALIDARG (0x80070057). I'll try to locally debug why.

Flags: needinfo?(yjuglaret)

Yannis, is this a regression? Can you repro if you disable your integrated Intel Xe GPU? Can you repro if you disable the dGPU and only use the iGPU?

(I cant repro on my Win11 AMD single GPU single monitor machine)

Flags: needinfo?(yjuglaret)
Whiteboard: [win:stability]
Flags: needinfo?(yjuglaret)

(In reply to Mayank Bansal from comment #6)

Yannis, is this a regression? Can you repro if you disable your integrated Intel Xe GPU? Can you repro if you disable the dGPU and only use the iGPU?

(I cant repro on my Win11 AMD single GPU single monitor machine)

Not a regression as far as I can tell. I can repro in ESR 115.

As long as the Intel iGPU is enabled I can repro (NVIDIA dGPU enabled or not does not matter): when the bug occurs I have slow rendering and some tiles disappear like in the GIF attachment. If I disable the Intel iGPU the rendering is too slow from the start (NVIDIA dGPU enabled or not does not matter) to let me reproduce the bug; all tiles are shown however.

I'm now positive that BeginDraw fails with E_INVALIDARG because we are giving it a &update_rect such that update_rect.left == update_rect.right and update_rect.top == update_rect.bottom (a "zero-area rectangle"). [:gw], is this enough information to address this bug? Let me know if there's another path along which you'd like me to do some debugging. (By the way, is it normal that DCLayerTree::CreateEGLSurfaceForCompositionSurface returns false when BeginDraw fails even though the return type is GLuint?)

Flags: needinfo?(yjuglaret) → needinfo?(gwatson)
Severity: -- → S2

Thanks for the detailed debugging so far, this is very helpful.

This will certainly only occur on Windows as the BeginDraw is part of the DirectComposition integration, where the failure occurs. However, it will be failing (silently) in the same way on Linux / Mac, I would think.

We could in theory detect the zero-rect case in DCLayerTree and avoid the failure there. However, it would be much better if we could find the root cause of why we end up with a zero-rect case getting that far down the pipeline.

WR is supposed to early out on zero-area rectangles before it gets to that point in the pipeline. So handling it in DCLayerTree would sort of be a band-aid that hides the underlying issue earlier on, which may cause other unexpected performance or correctness issues.

I may have time later this week to try repro on a Windows box and look into it. If you are interested in doing some more debugging locally, and can repro relatively easily, the place I would start looking is around [1]. This adds a surface to the compositor - we could possibly add some logging here to find out when a zero-area rect is ending up being sent to the compositor interface, and then trace back from there to see how it occurs.

[1] https://searchfox.org/mozilla-central/source/gfx/wr/webrender/src/composite.rs#775

Assignee: nobody → gwatson
Flags: needinfo?(gwatson)

After a bit more debugging, the reason we get a zero-area rectangle is because wr::DeviceIntRect aDirtyRect has 0 in all 4 variants of aDirtyRect.{min,max}.{x,y}. This value comes from the dirty_rect field of a PictureCacheTarget, which itself comes from the pic_task.scissor_rect of a RenderTaskKind::Picture(ref pic_task) in build_render_pass, which is originally computed by the path below in PicturePrimitive::take_context.

let scissor_rect = frame_state.composite_state.get_surface_rect(
  tile.local_dirty_rect,
  &tile.local_tile_rect,
  tile_cache.transform_index,
).to_i32();

This call can return DeviceRect::zero (and does here, as far as I can tell) if the intersection is empty:

    pub fn get_surface_rect<T>(
        &self,
        local_sub_rect: &Box2D<f32, T>,
        local_bounds: &Box2D<f32, T>,
        transform_index: CompositorTransformIndex,
    ) -> DeviceRect {
        let transform = &self.transforms[transform_index.0];

        let surface_bounds = transform.local_to_surface.map_rect(&local_bounds);
        let surface_rect = transform.local_to_surface.map_rect(&local_sub_rect);

        surface_rect
            .round_out()
            .translate(-surface_bounds.min.to_vector())
            .round_out()
            .intersection(&surface_bounds.size().round().into())
            .unwrap_or_else(DeviceRect::zero)
    }

PicturePrimitive::take_context does not treat DeviceRect::zero as a special case which seems to be why we later end up with the problem mentioned in comment 8.

Interesting findings. The assumption here is that tile.local_dirty_rect should always intersect with tile.local_tile_rect as it should be a subset, in this code path.

So I wonder if that invariant is not correct, or if we're dealing with some floating point accuracy issues (I suspect the latter).

Would it be possible to log out the values of tile.local_dirty_rect, tile.local_tile_rect and the return value of get_surface_rect (before the .to_i32() call) when this case occurs? That would hopefully let us identify whether it's a result of float accuracy / rounding issues, or that the expected invariant in the inputs is not true here.

Flags: needinfo?(yjuglaret)

So digging a bit further, the following happens right before the bug:

  • we reach a call to Tile::invalidate where the invalidation_rect is Some(rect) (e.g. Box2D((3456.0, 512.0), (3584.0, 1024.0)));
  • in that call we update self.local_dirty_rect from Box2D((0.0, 0.0), (0.0, 0.0)) to rect through a union;
  • in PicturePrimitive::take_context we intersect that with tile.current_descriptor.local_valid_rect (e.g. Box2D((3072.0, 512.0), (3456.0, 1024.0)));
  • but the intersection is empty so self.local_dirty_rect is now PictureRect::zero(), leading to a second empty intersection in the computation from comment 10.

I'll try to continue to track this, e.g. figure out which call to Tile::invalidate is involved here.

The invalidation occurs in Tile::update_dirty_rects. The invalidation reason is PrimCount. It changes the Tile for example from:

id=210
is_valid=1
local_tile_rect=Box2D((3072.0, 1536.0), (4096.0, 2048.0))
local_valid_rect=Box2D((-13214.886, -169.00008), (3456.0005, 4861.522))
local_dirty_rect=Box2D((0.0, 0.0), (0.0, 0.0))
current_descriptor.local_valid_rect=Box2D((3072.0, 1536.0), (3456.0, 2048.0))

To:

id=210
is_valid=0
local_tile_rect=Box2D((3072.0, 1536.0), (4096.0, 2048.0))
local_valid_rect=Box2D((-13214.886, -169.00008), (3456.0005, 4861.522))
local_dirty_rect=Box2D((3456.0, 1536.0), (3584.0, 2048.0))
current_descriptor.local_valid_rect=Box2D((3072.0, 1536.0), (3456.0, 2048.0))

Below are the lines in PicturePrimitive::take_context that seem relevant when we later reach it (the intersection is empty and then we take the else branch):

// Ensure that the dirty rect doesn't extend outside the local valid rect.
tile.local_dirty_rect = tile.local_dirty_rect
    .intersection(&tile.current_descriptor.local_valid_rect)
    .unwrap_or_else(PictureRect::zero);
// ...
if tile.is_valid {
    // ...
} else {
    // ...
    if let TileSurface::Texture { ref mut descriptor } = tile.surface.as_mut().unwrap() {
        // ...
        let scissor_rect = frame_state.composite_state.get_surface_rect(
            &tile.local_dirty_rect,
            &tile.local_tile_rect,
            tile_cache.transform_index,
        ).to_i32();

In another part of PicturePrimitive::take_context, to reset the dirty rect we not only set tile.local_dirty_rect to PictureRect::zero() but we also set tile.is_valid to true. The latter seems to be missing here and that could explain why taking the else branch here would be a mistake. Updating to the code below seems to safely make the issue disappear (assuming I've tested long enough):

tile.local_dirty_rect = tile.local_dirty_rect
    .intersection(&tile.current_descriptor.local_valid_rect)
    .unwrap_or_else(|| { tile.is_valid = true; PictureRect::zero() });

As an outsider to this code I cannot confirm if that is the correct fix, but it seems like a reasonable explanation.

Flags: needinfo?(yjuglaret)

Great find, thanks for digging in to it. Yes, I agree that the proposed fix seems reasonable.

I did a try run with your patch and it doesn't seem to break anything [1] other than a few unrelated intermittent failures.

Long term, it'd probably be better to handle this slightly earlier than take_context, perhaps in post_update of Tile. However, I think it's fine to merge this as-is, since that change will require a bit more work.

Do you want to create a patch and I'll review it, or would you rather I create a patch? (not entirely sure how we'll automate a test for this yet, will need to think about it a bit more).

[1] https://treeherder.mozilla.org/jobs?repo=try&revision=fc5b7d309aa9385f749da00c60f794755b9a2d65&selectedTaskRun=d2JAIvNkSrKTBUZd4e4F8w.1

Flags: needinfo?(yjuglaret)

Thanks, I'll push the short-term change for review then.

Flags: needinfo?(yjuglaret)

When we reset the dirty rect to PictureRect::zero() on a Tile, we must
also mark the Tile as valid. This prevents the propagation of zero-area
rectangles which can otherwise ultimately cause BeginDraw failures.

Thanks for helping out here, Yannis. :)

Try build also appears to fix bug 1886652
But Bug 1750348 still crashes.

See Also: → 1886652
Pushed by yjuglaret@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/37528cd56040 Properly reset the dirty rect in PicturePrimitive::take_context. r=gw
Duplicate of this bug: 1886652
See Also: 1886652
Status: NEW → RESOLVED
Closed: 9 months ago
Resolution: --- → FIXED
Target Milestone: --- → 127 Branch
See Also: → 1750348

Should we consider uplifting this to ESR115?

Flags: needinfo?(gwatson)

The bug has (I think) existed for a long time and probably doesn't affect that many pages. On the other hand, it's a severe bug (GPU process crash) when it does occur. If it applies cleanly it's probably worth uplifting I think.

Flags: needinfo?(gwatson)
Blocks: 1826177

This grafts cleanly to esr115. Feel free to go ahead and nominate.

Flags: needinfo?(gwatson)
Flags: needinfo?(gwatson)

Comment on attachment 9399943 [details]
Bug 1893205 - Properly reset the dirty rect in PicturePrimitive::take_context. r=gw

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Suggested by ryanvm, fixes GPU process crashes on some pages.
  • User impact if declined: Certain pages will crash GPU process, falling back to software.
  • Fix Landed on Version:
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Has been in m-c for some time without any reported issues. Grafts cleanly.
Attachment #9399943 - Flags: approval-mozilla-esr115?

Comment on attachment 9399943 [details]
Bug 1893205 - Properly reset the dirty rect in PicturePrimitive::take_context. r=gw

Approved for 115.13esr.

Attachment #9399943 - Flags: approval-mozilla-esr115? → approval-mozilla-esr115+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: