URLPattern latent x86 shippable gtest crash
Categories
(Core :: Networking, defect, P2)
People
(Reporter: edgul, Unassigned)
Details
(Whiteboard: [necko-triaged])
During development of bug 1948330 (see also: bug 1731418 and bug 1948295), one of the WIP patches introduced a gtest which directly called the Rust-defined function urlp_get_protocol_component() through the cbindgen-generated C++ bindings. This invocation would crash when we attempt to clone q_pattern.protocol in the following Rust code on x86 (32-bit) PGO/LTO builds:
#[no_mangle]
pub unsafe extern "C" fn urlp_get_protocol_component(
    pattern: UrlpPattern,
    res: *mut UrlpComponent,
) {
    let q_pattern = &*(pattern.0 as *const Uq::UrlPattern);
    let tmp: UrlpComponent = q_pattern.protocol.clone().into();
    *res = tmp;
}
This crash magically goes away when we replace the direct call to urlp_get_protocol_component() with our C++-defined convenience getter UrlpGetProtocol, which uses the same Rust function in its implementation (scenario 1).
It is also suspected that the crash disappears when the C++ getter UrlpGetProtocol is changed to take UrlpPattern by value instead of by reference, but we are having trouble reliably reproducing the crash (scenario 2). Note that in this scenario the test continues to call the Rust-defined getter directly.
So the current working hypothesis is that this is a compiler bug in PGO, LTO, or both on 32-bit builds.
The crash has also been seen on try in follow-up builds (with Linux 24.04 x86 Shippable gtest-1proc manually added to the suite): https://treeherder.mozilla.org/jobs?repo=try&revision=4062e451a048b7d558c07df537e1b994b4e4531e&selectedTaskRun=J05_NjMxSHSNrhUogWKFdg.0
I've also created some additional test builds (currently still running) to try to narrow down whether PGO or LTO is responsible:
- control (should experience crash): https://treeherder.mozilla.org/jobs?repo=try&revision=9cdb28c01c3d5ab6e96ec482b392695976fc9559
- profile build: https://treeherder.mozilla.org/jobs?repo=try&landoCommitID=138900
- LTO build: https://treeherder.mozilla.org/jobs?repo=try&landoCommitID=139162
The test builds have finished and seem to indicate an LTO-only bug.
Comment 2•2 months ago
I can reproduce this locally.
Comment 3•2 months ago
I think this is caused by Rust and Clang disagreeing on the calling convention of urlp_get_protocol_component.
See:
https://rust.godbolt.org/z/9nhPnEozs
vs
https://godbolt.org/z/xe8Mh9q8a
Normally, it ends up ok, but when using LTO, urlp_get_protocol_component can get inlined into C++ code and then things break.
Comment 4•2 months ago
Gankra, am I correct in seeing a mismatch here?
Comment 5•2 months ago
So the LLVM IR for the C code is
i8* @square(wrapper)(i8* readnone returned %p.0)
while it's one of the two following for Rust:
ptr @square(ptr noalias nocapture noundef readonly byval([4 x i8]) align 4 dereferenceable(4) %num)
ptr @square2(ptr noundef readnone returned %num)
I see our implementation for urlp_get_protocol_component (https://searchfox.org/mozilla-central/source/netwerk/base/urlpattern_glue/src/lib.rs#65) matches the one that leads to square, while square2 seems to be the C-compatible one.
I tend to agree with Jeff's analysis here.
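For readers who cannot open the Godbolt links, here is a minimal sketch of the kind of Rust source that would produce two signatures like the ones quoted above. It is a hypothetical reconstruction, not the contents of those links: the Wrapper type and both function bodies are assumptions chosen only to contrast a repr(C) newtype argument with a bare pointer argument.

```rust
use std::ffi::c_void;

// Hypothetical stand-in for a repr(C) newtype around a pointer
// (the role UrlpPattern plays in the glue code).
#[repr(C)]
#[derive(Clone, Copy)]
pub struct Wrapper(pub *mut c_void);

// Takes the struct by value. Per the IR quoted above, rustc on 32-bit x86
// appears to lower this kind of argument to a `byval` pointer.
pub extern "C" fn square(num: Wrapper) -> *mut c_void {
    num.0
}

// Takes the bare pointer. This corresponds to the form that the comment
// above identifies as the C-compatible one.
pub extern "C" fn square2(num: *mut c_void) -> *mut c_void {
    num
}
```

Both functions behave identically when called from Rust; the difference only shows up in the argument-passing ABI that each compiler assumes, which is why the mismatch can stay hidden until cross-language inlining happens.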
Comment 6•2 months ago
You may want repr(transparent) instead of repr(C) on struct UrlpPattern.
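A minimal sketch of what that suggestion could look like, assuming a single-field newtype around a raw pointer. PatternHandle, Pattern, new_handle, get_protocol_len and free_handle are hypothetical names for illustration; the real UrlpPattern and its getter live in urlpattern_glue and may differ.

```rust
use std::ffi::c_void;

// Hypothetical newtype modeled on UrlpPattern. With repr(transparent) the
// struct is guaranteed to have the same layout and ABI as its single field,
// so it is passed exactly like a bare pointer.
#[repr(transparent)]
#[derive(Clone, Copy)]
pub struct PatternHandle(pub *mut c_void);

// Hypothetical stand-in for the Rust-side pattern object.
pub struct Pattern {
    pub protocol: String,
}

// Allocates a Pattern on the heap and hands out an opaque handle to it.
pub fn new_handle(protocol: &str) -> PatternHandle {
    let boxed = Box::new(Pattern { protocol: protocol.to_owned() });
    PatternHandle(Box::into_raw(boxed) as *mut c_void)
}

/// Writes the length of the protocol string through `res`.
///
/// # Safety
/// `handle` must come from `new_handle` and not have been freed yet;
/// `res` must be valid for writes.
pub unsafe extern "C" fn get_protocol_len(handle: PatternHandle, res: *mut usize) {
    unsafe {
        let pattern = &*(handle.0 as *const Pattern);
        *res = pattern.protocol.len();
    }
}

/// Frees a handle created by `new_handle`.
///
/// # Safety
/// `handle` must come from `new_handle` and must not be used afterwards.
pub unsafe fn free_handle(handle: PatternHandle) {
    unsafe { drop(Box::from_raw(handle.0 as *mut Pattern)) };
}
```

Whether the C++ side should then see a bare pointer or a wrapper struct depends on what cbindgen generates for the transparent type, which this sketch does not cover.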
Comment 8•1 month ago
Redirect a needinfo that is pending on an inactive user to the triage owner.
:jesup, since the bug has recent activity, could you have a look please?
For more information, please visit BugBot documentation.