Closed Bug 1858670 Opened 1 year ago Closed 14 days ago

volatile accesses in MOZ_REALLY_CRASH() are sometimes elided

Categories

(Toolkit :: Crash Reporting, defect)

ARM64
Unspecified
defect

Tracking

()

RESOLVED FIXED
137 Branch
Tracking Status
firefox137 --- fixed

People

(Reporter: gsvelto, Assigned: gsvelto)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 2 obsolete files)

Checking AArch64 crashes we've noticed that this line within MOZ_REALLY_CRASH() is entirely omitted by clang, as it triggers UB. We still crash by hitting the following call to abort() but the crashes look different, as abort() requires special handling on macOS.

We should replace that line with something that reliably emits the access we want, possibly via asm() to be 100% sure we get it.

The severity field is not set for this bug.
:gsvelto, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(gsvelto)
Severity: -- → S3
Flags: needinfo?(gsvelto)

On Linux and macOS, we currently use an undefined behavior (null pointer
dereference) to trigger crashes, which is not desirable because we have
absolutely no guarantee on how the compiled code will look like.
Instead, this patch proposes to uses breakpoints like we already do on
Windows.

To achieve this, this patch cuts MOZ_REALLY_CRASH() into
MOZ_RECOVERABLE_CRASH() and MOZ_FATAL_CRASH(). MOZ_RECOVERABLE_CRASH()
crashes by triggering breakpoint exceptions when possible, on all
platforms. MOZ_FATAL_CRASH() is fallback code for the case where the
breakpoint exception would have been continued from a debugger, or less
likely handled by an exception handler.

MOZ_FATAL_CRASH() on Windows will use __fastfail() rather than
TerminateProcess(), which should more accurately ensure that we get a
Windows Error Reporting crash report if we should reach that code.

Any thoughts about my patch here?

Flags: needinfo?(yjuglaret)
Flags: needinfo?(mh+mozilla)

It would probably be good to understand under what circumstances the line is optimized out, because I can't reproduce on godbolt.

Flags: needinfo?(mh+mozilla)

When I was disassembling builds (to verify that the patch for bug 1814600 was working), I saw this in ghidra in a shippable macOS aarch64 on various release assertion locations in dom/media.

Flags: needinfo?(sguelton)

I quite like Yannis approach based on __builtin_debugtrap, it looks more maintainable than the inline assembly approach (btw I think we could reduce the number and the order of the conditional and always rely on either __builtin_debugtrap or __builtin_trap).

But if __builtin_trap is an option, may be just calling std::abort is also a viable path?

Considering the compiler optimizing things out, well it's UB so he's allowed to do so. I think that having a volatile pointer initialized to NULL somewhere and just loaded there (instead of casting a random address) would work better, but I prefer the explicit approach mentioned above.

Flags: needinfo?(sguelton)

There's a few downsides with that approach:

  • We're changing the crash reason, not a huge deal but it will show up in Socorro
  • The current way we crash stores the line number into NULL, this is deliberate to allow us to recover the line number by looking at the register values even when the related debug information is incorrect or missing
  • Using __fastfail() or abort() on Windows requires using WER to intercept the crashes. Our WER support is not perfect and I've noticed in my testing that the service doesn't pass all the crashes to our runtime exception module, it seems to throttle them. On macOS abort() also has different semantics compared to a segfault. We support catching it and it seems to work correctly, however the code involved is very fragile and I'm a bit hesitant to rely on it. At least not until we've rewritten the macOS minidump writing machinery & exception handler.

(In reply to Gabriele Svelto [:gsvelto] from comment #4)

Any thoughts about my patch here?

Definitely better than what we have at the moment, so I would be fine going with that for now.

For reference here are the pros of __builtin_debugtrap that I think could make it worth considering for the future:

  • support is provided by the compiler rather than us;
  • leaves CPU registers untouched, which can be useful for crash analysis (but at the moment, we're already losing registers when we set the crash reason);
  • unifies crash reason with Windows crashes which are using __debugbreak;
  • better defined behavior compared to __builtin_trap (as far as I understand);
  • continuable from a debugger ,so the code that follows must remain alive, compilers cannot assume that the code that follows is unreachable (unlike __builtin_trap presumably), our Windows code actively uses that (with __debugbreak) to put line information after the crashing instruction it seems.

And the cons:

  • not supported by GCC;
  • needs a fallback "second crashing instruction" afterwards because we don't actually want to be able to continue from a crash even in a debugger.
Flags: needinfo?(yjuglaret)

So I was asked about this separately and one of the things that came up is that the current behavior doesn't work for ASAN. I'd retain the existing behavior for how it preserves the line number of the crash in the registers, however I'd switch to __builtin_trap() when we're using ASAN instead of the illegal access. That would remove the spurious ASAN error we get when we crash. Any thoughts about that?

Attachment #9435835 - Attachment is obsolete: true
Assignee: nobody → gsvelto
Attachment #9371747 - Attachment description: WIP: Bug 1858670 - Replace volatile accesses with inline assembly to ensure MOZ_CRASH() crashing sequence is always emitted → Bug 1858670 - Replace volatile accesses with inline assembly to ensure MOZ_CRASH() crashing sequence is always emitted r=glandium
Status: NEW → ASSIGNED

I've updated my patch with a fix that would also help with bug 1885714 and asked for review. This looks like the path of least resistance to solve this issue as it preserves the existing behavior but let me know if you think otherwise.

Duplicate of this bug: 1931553

(In reply to Gabriele Svelto [:gsvelto] from comment #8)

  • The current way we crash stores the line number into NULL, this is deliberate to allow us to recover the line number by looking at the register values even when the related debug information is incorrect or missing

I think this reason doesn't make sense. Debugging a huge project like Firefox is almost impossible without debug info. And since we have a Mozilla Symbol Server, it's difficult to meet some situations with crash reports without matched debug info. (If Firefox was compiled by some maintainers like distribution maintainers, they will upload debug symbols to MSS. If Firefox was compiled by user themself, they should have the debug info.)

(In reply to Yannis Juglaret [:yannis] from comment #9)

  • not supported by GCC;

C++ standard has std::breakpoint for standized __builtin_debugtrap. Unfortunately, it requires C++26.

(In reply to Celeste Liu [:Coelacanthus] from comment #14)

I think this reason doesn't make sense. Debugging a huge project like Firefox is almost impossible without debug info. And since we have a Mozilla Symbol Server, it's difficult to meet some situations with crash reports without matched debug info. (If Firefox was compiled by some maintainers like distribution maintainers, they will upload debug symbols to MSS. If Firefox was compiled by user themself, they should have the debug info.)

Fortunately we're not always missing debug information, but we go to extreme lengths to capture it, it's not a given (see my article from a few years ago). Two problems remain: sometimes the compiler will simply not output the debug information we need, this typically happens when an optimization phase is unable to preserve it. This happens far more often than we would like. Additionally we have concrete cases where the compiler simply cannot emit the appropriate debug information because it's not sufficiently expressive. Bug 1930705 is an example of this. Identical code folding will make the compiler merge several functions into the same piece of code. When executing that code, it's impossible to tell if we're running one or the other function, the only way to figure it out would be to look at the caller, but DWARF debug information doesn't have anything to express that relationship. So the debug information for one function is lost, and whenever we call it we get the name and lines of the other that was merged into it.

That being said I'm curious to know why would you have us call abort() in this case besides consistency with other C/C++ code.

(In reply to Gabriele Svelto [:gsvelto] from comment #16)

That being said I'm curious to know why would you have us call abort() in this case besides consistency with other C/C++ code.

SIGSEGV is a bit misleading. Every signal has its own purpose. When a SIGSEGV appears, the user will try to check whether some invalid address was used first. But when SIGABRT appears, they will know the program found an error and terminate itself.

One thing that could help here is for example bug 1911044. In general it would be useful to try to get the error out in more situations, which would make it obvious it is not an accidental crash despite the SIGSEGV.

The one thing that makes SIGSEGV extra confusing is that MOZ_CRASH doesn't print anything in non-debug builds. We may want to reconsider that, that would clarify things greatly.

(In reply to Mike Hommey [:glandium] from comment #20)

The one thing that makes SIGSEGV extra confusing is that MOZ_CRASH doesn't print anything in non-debug builds. We may want to reconsider that, that would clarify things greatly.

Yes, very good point. We can definitely find a way to stuff the line-number in a known register even when using abort() but the real deal-breaker for me is macOS support. I'd need to test accurately if we're catching it properly across all versions we support, and it'd be better if I did that after having finished OOP crash reporting given that'll affect how we communicate with the exception handlers.

In the meantime we could open a bug to make MOZ_CRASH() more chatty on non-debug builds, that would go a long way in clarifying what happened to a user watching the output.

In the meantime we could open a bug to make MOZ_CRASH() more chatty on non-debug build

Do both of you have me shadowbanned or something 😂 See the bug linked in comment 19 which has a few ways to achieve that, I believe even based on your own input.

See Also: → 1911044

Before I land this I wanted to make a size comparison to see how much of an impact this has on libXUL, and on some architectures it's somewhat significant. Here's a comparison of libXUL sizes in KiBs.

architecture volatile inline asm difference
Linux x86-64 206259 206300 41
Linux x86 201115 201252 137
macOS AArch64 188638 188733 96
macOS unified 389054 388829 -224
Windows x86 135429 135501 73
Windows x86-64 150918 150996 78
Windows AArch64 138652 138640 -13
Android ARM 116100 116094 -5
Android AArch64 151697 151732 35

The differences are not huge and I expected an increase given that we know the compiler sometimes merges this code when it should not, but I expected the diffs to be smaller. Also I don't know what's going on with macOS unified builds but I've double-checked and the reduction in size is indeed significant.

(In reply to Gabriele Svelto [:gsvelto] from comment #23)

Also I don't know what's going on with macOS unified builds but I've double-checked and the reduction in size is indeed significant.

Do you actually have a number for 64bit-only MacOS builds? Not sure if those can be still build via our pipeline, but it would be good to know which numbers that architecture shows.

Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6342c3c70506 Replace volatile accesses with inline assembly to ensure MOZ_CRASH() crashing sequence is always emitted r=glandium

Backed out for causing build bustages

Backout link

Push with failures

Failure log

Flags: needinfo?(gsvelto)

I've adjusted the patch to account for the failures, I'll reland it soon.

Flags: needinfo?(gsvelto)
Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/298f76b3bbf4 Replace volatile accesses with inline assembly to ensure MOZ_CRASH() crashing sequence is always emitted r=glandium

Backed out for causing SM bustages.

Push with failures

Failure log

Backout link

Flags: needinfo?(gsvelto)

I'm a bit puzzled by the failures here, Nicholas can you help me figure out what's wrong?

Flags: needinfo?(gsvelto) → needinfo?(nicolas.b.pierron)

(In reply to Gabriele Svelto [:gsvelto] from comment #30)

I'm a bit puzzled by the failures here, Nicholas can you help me figure out what's wrong?

This test case checks that we are able to crash on purpose. It is meant to test that our test suite is fully working and not no-op-ed accidentally.

My guess would be that the exit code changed, causing the test harness logic to not match correctly the error code.

Flags: needinfo?(nicolas.b.pierron)
See Also: → 1945507
Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/8f83426a182f Replace volatile accesses with inline assembly to ensure MOZ_CRASH() crashing sequence is always emitted r=glandium
Status: ASSIGNED → RESOLVED
Closed: 14 days ago
Resolution: --- → FIXED
Target Milestone: --- → 137 Branch
See Also: → 1945709
Regressions: 1945745
Attachment #9363664 - Attachment is obsolete: true

(In reply to Gabriele Svelto [:gsvelto] from comment #10)

... I'd switch to __builtin_trap() when we're using ASAN instead of the illegal access. That would remove the spurious ASAN error we get when we crash. Any thoughts about that?

Sorry for the late comment, but why is the ASAN report spurious? Now we get a sigill instead of sigsegv, and fuzzing signatures are broken. I don't follow why the new implementation is a problem for ASAN vs the old one which did the same thing another way.

Flags: needinfo?(gsvelto)

(In reply to Jesse Schwartzentruber (:truber) from comment #34)

Sorry for the late comment, but why is the ASAN report spurious? Now we get a sigill instead of sigsegv, and fuzzing signatures are broken. I don't follow why the new implementation is a problem for ASAN vs the old one which did the same thing another way.

Because we deliberately accessed a null pointer. This was pointed out to me by someone in Dublin so I rolled the change inside the patch, but it's easy enough to revert in case it's a problem.

Flags: needinfo?(gsvelto)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: