WebRender crashes on aarch64 because of calling convention mismatch

RESOLVED FIXED

Status

()

enhancement
P3
normal
RESOLVED FIXED
4 months ago
10 days ago

People

(Reporter: jrmuizel, Unassigned)

Tracking

(Blocks 4 bugs)

unspecified
ARM64
All
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox66 disabled, firefox67 disabled, firefox68 affected)

Details

(Whiteboard: [geckoview:fenix:p3], crash signature)

(Reporter)

Description

4 months ago
We seem to be crashing in wr_state_new() with a call to HeapAlloc with a size of 0. It doesn't really make sense for this to be happening so there's a mystery to untangle.
David - can you take this one?
Flags: needinfo?(dmajor)
OS: Unspecified → Windows 10
Hardware: Unspecified → ARM64
(In reply to Jeff Muizelaar [:jrmuizel] from comment #0)
> We seem to be crashing in wr_state_new() with a call to HeapAlloc with a
> size of 0. It doesn't really make sense for this to be happening so there's
> a mystery to untangle.

HeapAlloc goes into mozjemalloc, and mozjemalloc still allocates memory when given a size of 0, so this can't be the cause of the problem. Plus, if the allocator was returning a null pointer, we should be ending up in GeckoHandleOOM, not in a fallthrough brk instruction in the wr_state_new function itself. That being said, considering the rust code, the corresponding arm code is weird. I'll dig into this further.
Does this build work any better? Latest nightly rustc, just in case it's been fixed. (I don't have my test machine handy right now.) https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=5b9ffb504fd33893c1056c388b12701efcdea082&selectedJob=217008329
Flags: needinfo?(dmajor) → needinfo?(jmuizelaar)
The crash happens in DisplayListBuilder::with_capacity. Not quite sure why yet.
Oh! You know what, I think I had the wrong signature in mind for HeapAlloc. I saw that the second parameter was getting set to zero (by copying from the `wzr` zero-register) but the second param is dwFlags, not dwBytes. Sorry Jeff!
However, the HeapAlloc still failed, and looking at the right register this time, we tried to alloc some huge number of bytes. The value in that register looks similar to the stack pointer. Maybe there's a missing indirection.
Aha, this may be an ABI issue about passing params.

`wr_state_new` has a signature like

pub extern "C" fn wr_state_new(pipeline_id: WrPipelineId,
                               content_size: LayoutSize,
                               capacity: usize) -> *mut WrState {

Where `WrPipelineId` is a pair of u32, and `LayoutSize` is a pair of floats. I can see 1536x939 being passed in @s0 and @s1, which looks reasonable for my screen size.

The callee is attempting to pull `capacity` out of @x2, which in the repro does not have the expected value. I am not sure whether this is the caller's fault for passing the value in the wrong parameter, or the callee's fault for looking in the wrong place.

(It's interesting that the two floats were passed in @s0 and @s1 rather than as a pointer to a `LayoutSize` structure. Is this a Rust-ism? Is it allowed with `extern "C"`? Maybe the same happened with `pipeline_id` getting split into two registers rather than pass-by-pointer? Maybe that threw off the register count? I'm just making wild guesses.)
(Reporter)

Updated

4 months ago
Flags: needinfo?(jmuizelaar)
(Reporter)

Updated

4 months ago
Summary: WebRender crashes on Win-aarch64 → WebRender crashes on Win-aarch64 because of calling convention mismatch
(Reporter)

Updated

4 months ago
FWIW, I wasn't looking at the right PID in my allocation logs, and indeed, I can see the last allocation from the crashing process being:
  malloc(209654363664)=0x0
As for the weirdness of the crashing code, that's probably because it's an inlined version of the rust OOM handling code, with no custom OOM handler installed because we only register ours when the crash reporter is enabled. Which, in retrospect, is a mistake. Filed bug 1514435 for this.
The mismatch still happens even when the C++ side of the call is compiled by clang-cl, which suggests that rustc is the one in the wrong.
The root cause is the PhantomData in euclid::TypedSize2D (which LayoutSize is TypedSize2D<f32, LayoutPixel>)

Filed https://github.com/rust-lang/rust/issues/56877
(In reply to Mike Hommey [:glandium] from comment #11)
> The root cause is the PhantomData in euclid::TypedSize2D (which LayoutSize
> is TypedSize2D<f32, LayoutPixel>)
> 
> Filed https://github.com/rust-lang/rust/issues/56877

Great find!
(Reporter)

Updated

3 months ago
Priority: -- → P3

Adding geckoview to whiteboard so we can triage priority for GV.

Whiteboard: [geckoview]

David says crash affects ARM64 Android, too. GV hits this crash on Android Reference Browser startup.

Blocks: wr-android
OS: Windows 10 → All
Summary: WebRender crashes on Win-aarch64 because of calling convention mismatch → WebRender crashes on aarch64 because of calling convention mismatch
Whiteboard: [geckoview] → [geckoview:p3]
(Reporter)

Updated

3 months ago
Duplicate of this bug: 1521816
Crash Signature: [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | gkrust_shared::oom_hook::hook]

(In reply to David Major [:dmajor] from comment #16)

Try build with rustc nightly and simd disabled to work around bug 1521249: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=9b14e291eeb3160219795a5e3aa830664ff91467

It works, for some value of work. I /sometimes/ get a white screen at startup, but /sometimes/ not, and in that case, I have a mostly usable browser. about:support says the compositor is webrender. When I'm in this situation, something that systematically leads to failure is to about the "About Nightly" dialog, which shows a white window for it and freezes the main window too.

Oh, but if I wait long enough, the dialog content finally shows up and the main window unfreezes. And while it's all frozen, one of the processes slowly starts consuming more and more memory.

(Reporter)

Updated

3 months ago
(Reporter)

Comment 19

3 months ago

What's the plan for having our actual builds use a rustc that has this bug?

Status: NEW → RESOLVED
Last Resolved: 3 months ago
Flags: needinfo?(mh+mozilla)
Resolution: --- → FIXED

Can we potentially uplift the Rust fix to rust beta, and then temporarily use beta for the Nightly 67 cycle?

(Reporter)

Comment 21

3 months ago

Or just do ARM builds with rust nightly?

(In reply to Jeff Muizelaar [:jrmuizel] from comment #21)

Or just do ARM builds with rust nightly?

But then we have to deal with bug 1521249, right? It seems like we would ideally extend our runway on that and contract the runway for the aarch issue. In general, building with rust nightly dicier than beta, so I'd like to avoid it if we can.

I agree that we could do the compiler override on ARM64 only.

At the moment, there's probably no path forward. Using nightly rust is blocked by bug 1521249, and the rust fix in its current form, is unlikely to be uplifted to beta, because upstream is still figuring out details.

Flags: needinfo?(mh+mozilla)
Depends on: 1524822
(Reporter)

Comment 24

2 months ago

Reopening until we can actually do builds with a fixed rustc.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

As a potential workaround we could just not pass sizes / rects by value, right?

Indeed. The problem is with passing those types by value. Passing them by ref should cause no problem.

Comment 27

2 months ago

Rust 1.33 was released today, which means that the nightly fix should now be in beta.

Perhaps it makes more sense now to use rust beta for the 67 cycle?

Would make it possible to test webrender in reference-browser on modern hardware

I think once bug 1521249 lands, we could do use rust beta specifically for aarch64 builds.

Depends on: 1531986
Crash Signature: [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | gkrust_shared::oom_hook::hook] → [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | gkrust_shared::oom_hook::hook] [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | gkrust_shared::oom_hook::hook | rust_oom]

(In reply to Bobby Holley (:bholley) from comment #28)

I think once bug 1521249 lands, we could do use rust beta specifically for aarch64 builds.

68 Nightly is now using Rust 1.34 beta on Windows (bug 1533010).

These WebRender crashes also affected Android aarch64. Will we need to use Rust Beta on Android, too?

Depends on: 1533010
Whiteboard: [geckoview:p3] → [geckoview:fenix:p3]
(Reporter)

Comment 30

13 days ago

(In reply to Chris Peterson [:cpeterson] from comment #29)

(In reply to Bobby Holley (:bholley) from comment #28)

I think once bug 1521249 lands, we could do use rust beta specifically for aarch64 builds.

68 Nightly is now using Rust 1.34 beta on Windows (bug 1533010).

These WebRender crashes also affected Android aarch64. Will we need to use Rust Beta on Android, too?

1.34 comes out on the 11th. WebRender should be more usable on Android by then so I think we can wait for until at least the 11th.

(Reporter)

Comment 31

10 days ago

This is solved now.

Status: REOPENED → RESOLVED
Last Resolved: 3 months ago10 days ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.