Closed Bug 1735677 Opened 3 years ago Closed 3 years ago

gfx.webrender.debug.dl.dump-content causes weird crashes (memory corruption?)

Categories

(Core :: Graphics: WebRender, defect)

defect

Tracking

()

RESOLVED FIXED
95 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- disabled
firefox93 --- disabled
firefox94 --- disabled
firefox95 --- fixed

People

(Reporter: botond, Assigned: mikokm)

References

(Regression)

Details

(Keywords: regression)

Attachments

(2 files)

Attached file Example backtrace

STR

In a debug build, run ./mach run --setpref gfx.webrender.debug.dl.dump-content=true https://bug1734007.bmoattachments.org/attachment.cgi?id=9244249.

I don't think the issue is specific to this page, I've seen it on other pages as well.

Actual results

Firefox crashes with debug assertions or segmentation faults in unrelated code. I've attached an example backtrace, which is a segmentation fault in nsTArray::Length() caused by the array's mHdr being null. The calling code in this case is DOM / style related, but that varies.

I've also seen assertions in TimeStamp code and elsewhere.

The assertions / crashes are seemingly unrelated to the WebRender display list dumping, but invariably go away if I run with --setpref gfx.webrender.debug.dl.dump-content=false. I suspect memory corruption.

Can you try with an asan build?

Flags: needinfo?(botond)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #1)

Can you try with an asan build?

I can Try with a capital T: https://treeherder.mozilla.org/jobs?repo=try&revision=915252df71b0645233ddf80cf77627110d2d474b

Flags: needinfo?(botond)

(In reply to Botond Ballo [:botond] from comment #2)

I can Try with a capital T: https://treeherder.mozilla.org/jobs?repo=try&revision=915252df71b0645233ddf80cf77627110d2d474b

Seeing some use-after-poisons.

I can reproduce this locally. Interestingly, I can completely comment out the body of wr_dump_display_list (and emit_display_list) and it still occurs, but only with that debug flag enabled.

Flags: needinfo?(gwatson)

I think this is an existing corruption issue which may be made more visible by the patch above.

I was able to see ASAN failures / crashes with a build as far back as 2020 (only when the flag is enabled though):

commit 9010a032c1a74dc4a45e51fef0a30e8bb17139a2 (HEAD)
Author: Lee Salzman <lsalzman@mozilla.com>
Date:   Fri Oct 23 02:15:59 2020 +0000

    Bug 1672898 - increase precision of SWGL's YUV composition stepping. r=mattwoodrow

In this case, the following partial stack trace from ASAN:

==650278==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61300088a4b2 at pc 0x55a273afce09 bp 0x7ffdacf93190 sp 0x7ffdacf92950
WRITE of size 185 at 0x61300088a4b2 thread T0 (Web Content)
    #0 0x55a273afce08 in memset /builds/worker/fetches/llvm-project/llvm/projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:780:3
    #1 0x7f2aef9870ce in core::intrinsics::write_bytes::h6a1ee2ed9ab6412b /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/core/src/intrinsics.rs:2022:14
    #2 0x7f2aef9871a8 in core::ptr::mut_ptr::_$LT$impl$u20$$BP$mut$u20$T$GT$::write_bytes::h7a4fee190bcc0ee4 /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/core/src/ptr/mut_ptr.rs:995:18
    #3 0x7f2aef9881c6 in peek_poke::strip_red_zone::he77ef14b14274e9c /homecode/work/gecko2/gfx/wr/peek-poke/src/lib.rs:125:9
    #4 0x7f2aef920424 in webrender_api::display_list::DisplayListBuilder::emit_display_list::hd5f5681e22279beb /homecode/work/gecko2/gfx/wr/webrender_api/src/display_list.rs:1084:9
    #5 0x7f2aef901102 in wr_dump_display_list /homecode/work/gecko2/gfx/webrender_bindings/src/bindings.rs:3766:17
    #6 0x7f2ae1e7d14f in mozilla::layers::WebRenderCommandBuilder::BuildWebRenderCommands(mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, nsDisplayList*, nsDisplayListBuilder*, mozilla::layers::WebRenderScrollData&, WrFiltersHolder&&) /homecode/work/gecko2/gfx/layers/wr/WebRenderCommandBuilder.cpp:1577:20
    #7 0x7f2ae1e95a91 in mozilla::layers::WebRenderLayerManager::EndTransactionWithoutLayer(nsDisplayList*, nsDisplayListBuilder*, WrFiltersHolder&&, mozilla::layers::WebRenderBackgroundData*) /homecode/work/gecko2/gfx/layers/wr/WebRenderLayerManager.cpp:349:30
    #8 0x7f2ae9681168 in nsDisplayList::PaintRoot(nsDisplayListBuilder*, gfxContext*, unsigned int) /homecode/work/gecko2/layout/painting/nsDisplayList.cpp:2454:18
    #9 0x7f2ae8d281e0 in nsLayoutUtils::PaintFrame(gfxContext*, nsIFrame*, nsRegion const&, unsigned int, nsDisplayListBuilderMode, nsLayoutUtils::PaintFrameFlags) /homecode/work/gecko2/layout/base/nsLayoutUtils.cpp:3465:13
    #10 0x7f2ae8bfe9bd in mozilla::PresShell::Paint(nsView*, nsRegion const&, mozilla::PaintFlags) /homecode/work/gecko2/layout/base/PresShell.cpp:6361:5
    #11 0x7f2ae8407880 in nsViewManager::ProcessPendingUpdatesPaint(nsIWidget*) /homecode/work/gecko2/view/nsViewManager.cpp:460:18

I can try digging back further tomorrow if no one else gets to it. Perhaps Jeff or Miko have an idea what the issue might be though? Many of the stack traces I saw while doing local builds were in the DL cache code, but that might be unrelated.

Flags: needinfo?(mikokm)
Flags: needinfo?(jmuizelaar)

Set release status flags based on info from the regressing bug 1724344

I ran mozregression with asan build enabled, and I see ASAN violations when this flag is enabled back to July 2020, which seems to be the underlying cause.

mozregression was only able to narrow it down to this pushlog [1] due to missing old builds, and I'm struggling to build locally from that far back.

The commit that stands out to me (based on what I was seeing in the ASAN traces) from that pushlog is [2]. This seems likely to be the underlying cause, but I'm not sure how all the Gecko-side DL caching code works. Perhaps Miko could take a look?

[1] https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=c4186bb32c302ec2c82b3b0323af828ccebe0b48&tochange=a35461d1fc07e6e855d064453363a46d32bb5a3d

[2] https://hg.mozilla.org/mozilla-central/rev/422a055dfbae293ad9277720e5d9a0901e5cdad5

No longer regressed by: 1724344
Assignee: nobody → mikokm
Status: NEW → ASSIGNED
Flags: needinfo?(mikokm)

Modifying the code to

    pub fn emit_display_list<W>(
        &mut self,
        indent: usize,
        range: Range<Option<usize>>,
        mut sink: W,
    ) -> usize
    where
        W: Write
    {
        let mut temp = BuiltDisplayList::default();
        ensure_red_zone::<di::DisplayItem>(&mut temp.payload.items_data);
        ensure_red_zone::<di::DisplayItem>(&mut temp.payload.cache_data);
        strip_red_zone::<di::DisplayItem>(&mut temp.payload.items_data);
        strip_red_zone::<di::DisplayItem>(&mut temp.payload.cache_data);
        0
    }

is enough to trigger ASAN checks (loop would almost certainly work too).

I think the issue is that strip_red_zone() is not doing the right thing.

Pushed by mikokm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/d77b337b99da
Remove bogus write_bytes in strip_red_zone r=jrmuizel
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 95 Branch
Flags: needinfo?(jmuizelaar)

The patch landed in nightly and beta is affected.
:miko, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(mikokm)
Flags: needinfo?(mikokm)
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: