volatile accesses in MOZ_REALLY_CRASH() are sometimes elided
Categories
(Toolkit :: Crash Reporting, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox137 | --- | fixed |
People
(Reporter: gsvelto, Assigned: gsvelto)
References
(Blocks 1 open bug)
Details
Attachments
(1 file, 2 obsolete files)
Checking AArch64 crashes we've noticed that this line within MOZ_REALLY_CRASH()
is entirely omitted by clang, as it triggers UB. We still crash by hitting the following call to abort()
but the crashes look different, as abort()
requires special handling on macOS.
We should replace that line with something that reliably emits the access we want, possibly via asm()
to be 100% sure we get it.
Comment 1•1 year ago
|
||
The severity field is not set for this bug.
:gsvelto, could you have a look please?
For more information, please visit BugBot documentation.
Assignee | ||
Updated•1 year ago
|
Comment 2•1 year ago
|
||
On Linux and macOS, we currently use an undefined behavior (null pointer
dereference) to trigger crashes, which is not desirable because we have
absolutely no guarantee on how the compiled code will look like.
Instead, this patch proposes to uses breakpoints like we already do on
Windows.
To achieve this, this patch cuts MOZ_REALLY_CRASH() into
MOZ_RECOVERABLE_CRASH() and MOZ_FATAL_CRASH(). MOZ_RECOVERABLE_CRASH()
crashes by triggering breakpoint exceptions when possible, on all
platforms. MOZ_FATAL_CRASH() is fallback code for the case where the
breakpoint exception would have been continued from a debugger, or less
likely handled by an exception handler.
MOZ_FATAL_CRASH() on Windows will use __fastfail() rather than
TerminateProcess(), which should more accurately ensure that we get a
Windows Error Reporting crash report if we should reach that code.
Assignee | ||
Comment 3•1 year ago
|
||
Assignee | ||
Comment 4•1 year ago
|
||
Any thoughts about my patch here?
Comment 5•1 year ago
|
||
It would probably be good to understand under what circumstances the line is optimized out, because I can't reproduce on godbolt.
Comment 6•1 year ago
|
||
When I was disassembling builds (to verify that the patch for bug 1814600 was working), I saw this in ghidra in a shippable macOS aarch64 on various release assertion locations in dom/media
.
Updated•1 year ago
|
Comment 7•1 year ago
|
||
I quite like Yannis approach based on __builtin_debugtrap
, it looks more maintainable than the inline assembly approach (btw I think we could reduce the number and the order of the conditional and always rely on either __builtin_debugtrap
or __builtin_trap
).
But if __builtin_trap
is an option, may be just calling std::abort
is also a viable path?
Considering the compiler optimizing things out, well it's UB so he's allowed to do so. I think that having a volatile pointer initialized to NULL somewhere and just loaded there (instead of casting a random address) would work better, but I prefer the explicit approach mentioned above.
Assignee | ||
Comment 8•1 year ago
|
||
There's a few downsides with that approach:
- We're changing the crash reason, not a huge deal but it will show up in Socorro
- The current way we crash stores the line number into NULL, this is deliberate to allow us to recover the line number by looking at the register values even when the related debug information is incorrect or missing
- Using
__fastfail()
orabort()
on Windows requires using WER to intercept the crashes. Our WER support is not perfect and I've noticed in my testing that the service doesn't pass all the crashes to our runtime exception module, it seems to throttle them. On macOSabort()
also has different semantics compared to a segfault. We support catching it and it seems to work correctly, however the code involved is very fragile and I'm a bit hesitant to rely on it. At least not until we've rewritten the macOS minidump writing machinery & exception handler.
Comment 9•11 months ago
•
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #4)
Any thoughts about my patch here?
Definitely better than what we have at the moment, so I would be fine going with that for now.
For reference here are the pros of __builtin_debugtrap
that I think could make it worth considering for the future:
- support is provided by the compiler rather than us;
- leaves CPU registers untouched, which can be useful for crash analysis (but at the moment, we're already losing registers when we set the crash reason);
- unifies crash reason with Windows crashes which are using
__debugbreak
; - better defined behavior compared to
__builtin_trap
(as far as I understand); - continuable from a debugger ,so the code that follows must remain alive, compilers cannot assume that the code that follows is unreachable (unlike
__builtin_trap
presumably), our Windows code actively uses that (with__debugbreak
) to put line information after the crashing instruction it seems.
And the cons:
- not supported by GCC;
- needs a fallback "second crashing instruction" afterwards because we don't actually want to be able to continue from a crash even in a debugger.
Assignee | ||
Comment 10•4 months ago
|
||
So I was asked about this separately and one of the things that came up is that the current behavior doesn't work for ASAN. I'd retain the existing behavior for how it preserves the line number of the crash in the registers, however I'd switch to __builtin_trap()
when we're using ASAN instead of the illegal access. That would remove the spurious ASAN error we get when we crash. Any thoughts about that?
Assignee | ||
Comment 11•3 months ago
|
||
Updated•3 months ago
|
Updated•3 months ago
|
Assignee | ||
Comment 12•3 months ago
|
||
I've updated my patch with a fix that would also help with bug 1885714 and asked for review. This looks like the path of least resistance to solve this issue as it preserves the existing behavior but let me know if you think otherwise.
Comment 14•3 months ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #8)
- The current way we crash stores the line number into NULL, this is deliberate to allow us to recover the line number by looking at the register values even when the related debug information is incorrect or missing
I think this reason doesn't make sense. Debugging a huge project like Firefox is almost impossible without debug info. And since we have a Mozilla Symbol Server, it's difficult to meet some situations with crash reports without matched debug info. (If Firefox was compiled by some maintainers like distribution maintainers, they will upload debug symbols to MSS. If Firefox was compiled by user themself, they should have the debug info.)
Comment 15•3 months ago
|
||
(In reply to Yannis Juglaret [:yannis] from comment #9)
- not supported by GCC;
C++ standard has std::breakpoint
for standized __builtin_debugtrap
. Unfortunately, it requires C++26.
Assignee | ||
Comment 16•3 months ago
|
||
(In reply to Celeste Liu [:Coelacanthus] from comment #14)
I think this reason doesn't make sense. Debugging a huge project like Firefox is almost impossible without debug info. And since we have a Mozilla Symbol Server, it's difficult to meet some situations with crash reports without matched debug info. (If Firefox was compiled by some maintainers like distribution maintainers, they will upload debug symbols to MSS. If Firefox was compiled by user themself, they should have the debug info.)
Fortunately we're not always missing debug information, but we go to extreme lengths to capture it, it's not a given (see my article from a few years ago). Two problems remain: sometimes the compiler will simply not output the debug information we need, this typically happens when an optimization phase is unable to preserve it. This happens far more often than we would like. Additionally we have concrete cases where the compiler simply cannot emit the appropriate debug information because it's not sufficiently expressive. Bug 1930705 is an example of this. Identical code folding will make the compiler merge several functions into the same piece of code. When executing that code, it's impossible to tell if we're running one or the other function, the only way to figure it out would be to look at the caller, but DWARF debug information doesn't have anything to express that relationship. So the debug information for one function is lost, and whenever we call it we get the name and lines of the other that was merged into it.
That being said I'm curious to know why would you have us call abort()
in this case besides consistency with other C/C++ code.
Comment 17•3 months ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #16)
That being said I'm curious to know why would you have us call
abort()
in this case besides consistency with other C/C++ code.
SIGSEGV is a bit misleading. Every signal has its own purpose. When a SIGSEGV appears, the user will try to check whether some invalid address was used first. But when SIGABRT appears, they will know the program found an error and terminate itself.
Comment hidden (obsolete) |
Comment 19•3 months ago
|
||
One thing that could help here is for example bug 1911044. In general it would be useful to try to get the error out in more situations, which would make it obvious it is not an accidental crash despite the SIGSEGV
.
Comment 20•3 months ago
|
||
The one thing that makes SIGSEGV extra confusing is that MOZ_CRASH doesn't print anything in non-debug builds. We may want to reconsider that, that would clarify things greatly.
Assignee | ||
Comment 21•3 months ago
|
||
(In reply to Mike Hommey [:glandium] from comment #20)
The one thing that makes SIGSEGV extra confusing is that MOZ_CRASH doesn't print anything in non-debug builds. We may want to reconsider that, that would clarify things greatly.
Yes, very good point. We can definitely find a way to stuff the line-number in a known register even when using abort()
but the real deal-breaker for me is macOS support. I'd need to test accurately if we're catching it properly across all versions we support, and it'd be better if I did that after having finished OOP crash reporting given that'll affect how we communicate with the exception handlers.
In the meantime we could open a bug to make MOZ_CRASH()
more chatty on non-debug builds, that would go a long way in clarifying what happened to a user watching the output.
Comment 22•2 months ago
•
|
||
In the meantime we could open a bug to make MOZ_CRASH() more chatty on non-debug build
Do both of you have me shadowbanned or something 😂 See the bug linked in comment 19 which has a few ways to achieve that, I believe even based on your own input.
Assignee | ||
Comment 23•1 month ago
|
||
Before I land this I wanted to make a size comparison to see how much of an impact this has on libXUL, and on some architectures it's somewhat significant. Here's a comparison of libXUL sizes in KiBs.
architecture | volatile | inline asm | difference |
---|---|---|---|
Linux x86-64 | 206259 | 206300 | 41 |
Linux x86 | 201115 | 201252 | 137 |
macOS AArch64 | 188638 | 188733 | 96 |
macOS unified | 389054 | 388829 | -224 |
Windows x86 | 135429 | 135501 | 73 |
Windows x86-64 | 150918 | 150996 | 78 |
Windows AArch64 | 138652 | 138640 | -13 |
Android ARM | 116100 | 116094 | -5 |
Android AArch64 | 151697 | 151732 | 35 |
The differences are not huge and I expected an increase given that we know the compiler sometimes merges this code when it should not, but I expected the diffs to be smaller. Also I don't know what's going on with macOS unified builds but I've double-checked and the reduction in size is indeed significant.
Comment 24•1 month ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #23)
Also I don't know what's going on with macOS unified builds but I've double-checked and the reduction in size is indeed significant.
Do you actually have a number for 64bit-only MacOS builds? Not sure if those can be still build via our pipeline, but it would be good to know which numbers that architecture shows.
Comment 25•1 month ago
|
||
Comment 26•1 month ago
|
||
Assignee | ||
Comment 27•1 month ago
|
||
I've adjusted the patch to account for the failures, I'll reland it soon.
Comment 28•1 month ago
|
||
Comment 29•1 month ago
|
||
Assignee | ||
Comment 30•1 month ago
|
||
I'm a bit puzzled by the failures here, Nicholas can you help me figure out what's wrong?
Comment 31•1 month ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #30)
I'm a bit puzzled by the failures here, Nicholas can you help me figure out what's wrong?
This test case checks that we are able to crash on purpose. It is meant to test that our test suite is fully working and not no-op-ed accidentally.
My guess would be that the exit code changed, causing the test harness logic to not match correctly the error code.
Comment 32•14 days ago
|
||
Comment 33•14 days ago
|
||
bugherder |
Updated•14 days ago
|
Comment 34•5 days ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #10)
... I'd switch to
__builtin_trap()
when we're using ASAN instead of the illegal access. That would remove the spurious ASAN error we get when we crash. Any thoughts about that?
Sorry for the late comment, but why is the ASAN report spurious? Now we get a sigill instead of sigsegv, and fuzzing signatures are broken. I don't follow why the new implementation is a problem for ASAN vs the old one which did the same thing another way.
Updated•5 days ago
|
Assignee | ||
Comment 35•4 days ago
|
||
(In reply to Jesse Schwartzentruber (:truber) from comment #34)
Sorry for the late comment, but why is the ASAN report spurious? Now we get a sigill instead of sigsegv, and fuzzing signatures are broken. I don't follow why the new implementation is a problem for ASAN vs the old one which did the same thing another way.
Because we deliberately accessed a null pointer. This was pointed out to me by someone in Dublin so I rolled the change inside the patch, but it's easy enough to revert in case it's a problem.
Description
•