Closed Bug 1810249 Opened 1 year ago Closed 1 year ago

Really bad debug information in the macOS dSYM these days

Categories

(Firefox Build System :: General, defect, P3)

defect

Tracking

(firefox-esr102 unaffected, firefox109 fixed, firefox110 fixed, firefox111 fixed)

VERIFIED FIXED
111 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox109 --- fixed
firefox110 --- fixed
firefox111 --- fixed

People

(Reporter: mstange, Assigned: glandium)

References

(Regression)

Details

(Keywords: regression)

Attachments

(1 file)

Filename information on macOS has gotten way worse recently, maybe due to a clang or linker upgrade.

I'm mostly noticing this in the profiler, but it applies to crash stacks, too. For example, here's a recent macOS crash:

https://crash-stats.mozilla.org/report/index/b04e2d77-7b3c-4274-8f06-241d30230113

Look at the filenames around here:

Function File
26 XUL nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*) xpcom/ds/nsTArray-inl.h:195 cfi
27 XUL mozilla::AppShutdown::AdvanceShutdownPhaseInternal(mozilla::ShutdownPhase, bool, char16_t const*, nsCOMPtr<nsISupports> const&) xpcom/base/AvailableMemoryWatcherMac.cpp:420 cfi
28 XUL nsXREDirProvider::DoShutdown() xpcom/base/nsCOMPtr.h:914 cfi
29 XUL ScopedXPCOMStartup::~ScopedXPCOMStartup() toolkit/xre/CmdLineAndEnvUtils.h:98 cfi

None of those source links go to the right line.

If I download the raw XUL.dSYM, and run addr2line with the right addresses, then I get the same bad data, so this is not a dump_syms issue. There are a few exceptions, where the native debug info says "no debug information at all" and dump_syms fills those gaps with "nearby" information; in this example that's what happened for nsXREDirProvider::DoShutdown()

mkdir bad-debug-info && cd bad-debug-info
curl -L "https://symbols.mozilla.org/XUL/307EB34E8F43314EBF556ECBACCE4AE70/XUL.dSYM.tar.bz2" -o XUL.dSYM.tar.bz2
bzip2 -dk XUL.dSYM.tar.bz2
tar -xzf XUL.dSYM.tar
cargo install --examples addr2line
addr2line -pCfi -e ./XUL.dSYM/Contents/Resources/DWARF/XUL 0x399ac0 0x353080 0x17e9644 0x55ac760
nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*) at /builds/worker/checkouts/gecko/xpcom/ds/nsTArray-inl.h:195
mozilla::AppShutdown::AdvanceShutdownPhaseInternal(mozilla::ShutdownPhase, bool, char16_t const*, nsCOMPtr<nsISupports> const&) at /builds/worker/checkouts/gecko/xpcom/base/AppShutdown.cpp:409
nsXREDirProvider::DoShutdown() at ??:0
ScopedXPCOMStartup::~ScopedXPCOMStartup() at /builds/worker/workspace/obj-build/dist/include/mozilla/CmdLineAndEnvUtils.h:98

Here are two more examples which I noticed in the profiler:

I'm currently getting a regression window.

It was indeed the clang 15 upgrade that broke it.

22:32.93 INFO: Last good revision: 99e2e426103b16eaa9331b269fb38be0eae943df
22:32.93 INFO: First bad revision: 1f668a84e0129a84f38bae01b2906212c1b53a40
22:32.93 INFO: Pushlog:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=99e2e426103b16eaa9331b269fb38be0eae943df&tochange=1f668a84e0129a84f38bae01b2906212c1b53a40

mozregression --launch 99e2e426103b (last good): https://share.firefox.dev/3GxnbvW
mozregression --launch 1f668a84e012 (first bad): https://share.firefox.dev/3WfbVKq

Look at the filenames of the 10 functions close to the root of the stack. Before the clang upgrade, all filenames make sense. After the upgrade, they're almost complete nonsense: Mutex.h, atomic, nsDirectoryServiceUtils.h... The only correct filenames are from the "caller file" information of inlined calls - those are stored in a different part of the debug info, not in __debug_line.

Keywords: regression
Regressed by: clang-15

:glandium, since you are the author of the regressor, bug 1784202, could you take a look? Also, could you set the severity field?

For more information, please visit auto_nag documentation.

Flags: needinfo?(mh+mozilla)
Severity: -- → S3
Priority: -- → P3

Maybe something is garbling the __debug_line information after compilation. Maybe our dsymutil version is incompatible with some clang 15 changes?

DWARF information from Linux seems to be fine, for example this profile from Linux has correct information: https://share.firefox.dev/3H4KY7N
I also haven't seen this problem in local builds on macOS - those don't run dsymutil.

I ran ~/.mozbuild/clang/bin/dsymutil on the XUL binary from my local build and the output was fine. So maybe the debug data gets garbled somewhere else. I don't know what the next step to debug this should be.

Did you also check the output of dump_syms?

Flags: needinfo?(mh+mozilla) → needinfo?(mstange.moz)

Set release status flags based on info from the regressing bug 1784202

(In reply to Mike Hommey [:glandium] from comment #6)

Did you also check the output of dump_syms?

No, but as I showed in comment 0 with addr2line, the dSYM itself is already bad for CI builds. This is not a dump_syms problem.

Flags: needinfo?(mstange.moz)

This bug is breaking the source view functionality of the Firefox Profiler on macOS. It's also causing knock-on effects for the symbolication pipeline, causing the profiler to compute NaN symbol addresses in some cases and breaking the assembly view for those functions.

Can we raise the priority of this? It's a pretty major functionality regression for users of the profiler, and it also affects crash report quality.

Flags: needinfo?(mh+mozilla)

So... copy/pasting from comment 0, and with a little change, I get something very different:

curl -L "https://symbols.mozilla.org/XUL/307EB34E8F43314EBF556ECBACCE4AE70/XUL.dSYM.tar.bz2" -o XUL.dSYM.tar.bz2
bzip2 -dk XUL.dSYM.tar.bz2
tar -xzf XUL.dSYM.tar
~/.mozbuild/clang/bin//llvm-addr2line -pCfi -e /tmp/XUL.dSYM/Contents/Resources/DWARF/XUL 0x399ac0 0x353080 0x17e9644 0x55ac760

The output is:

nsTArrayInfallibleAllocator::ResultTypeProxy nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_RelocateUsingMemutils>::EnsureCapacity<nsTArrayInfallibleAllocator>(unsigned long, unsigned long) at /builds/worker/checkouts/gecko/xpcom/ds/nsTArray-inl.h:195
 (inlined by) nsISupports** nsTArray_Impl<nsISupports*, nsTArrayInfallibleAllocator>::InsertElementAtInternal<nsTArrayInfallibleAllocator, nsISupports*&>(unsigned long, nsISupports*&) at /builds/worker/workspace/obj-build/dist/include/nsTArray.h:2609
 (inlined by) mozilla::NotNull<nsISupports**> nsTArray<nsISupports*>::InsertElementAt<nsISupports*&>(unsigned long, nsISupports*&) at /builds/worker/workspace/obj-build/dist/include/nsTArray.h:2882
 (inlined by) nsCOMArray_base::InsertObjectAt(nsISupports*, int) at /builds/worker/checkouts/gecko/xpcom/ds/nsCOMArray.cpp:120
 (inlined by) nsCOMArray_base::AppendObject(nsISupports*) at /builds/worker/checkouts/gecko/xpcom/ds/nsCOMArray.h:76
 (inlined by) nsCOMArray<nsIObserver>::AppendObject(nsIObserver*) at /builds/worker/checkouts/gecko/xpcom/ds/nsCOMArray.h:321
 (inlined by) nsObserverList::AppendStrongObservers(nsCOMArray<nsIObserver>&) at /builds/worker/checkouts/gecko/xpcom/ds/nsObserverList.cpp:60
 (inlined by) nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*) at /builds/worker/checkouts/gecko/xpcom/ds/nsObserverService.cpp:311
mozilla::AppShutdown::AdvanceShutdownPhaseInternal(mozilla::ShutdownPhase, bool, char16_t const*, nsCOMPtr<nsISupports> const&) at /builds/worker/checkouts/gecko/xpcom/base/AppShutdown.cpp:409
nsXREDirProvider::DoShutdown() at ??:0
bool mozilla::internal::strimatch<char>(char const*, char const*) at /builds/worker/checkouts/gecko/toolkit/xre/nsAppRunner.cpp:2062
 (inlined by) mozilla::ArgResult mozilla::CheckArg<char>(int&, char**, char const*, char const**, mozilla::CheckArgFlag) at /builds/worker/workspace/obj-build/dist/include/mozilla/CmdLineAndEnvUtils.h:178
 (inlined by) CheckArg(char const*, char const**, mozilla::CheckArgFlag) at /builds/worker/checkouts/gecko/toolkit/xre/nsAppRunner.cpp:470
 (inlined by) ScopedXPCOMStartup::~ScopedXPCOMStartup() at /builds/worker/checkouts/gecko/toolkit/xre/nsAppRunner.cpp:4457

That seems better, although I don't know whether it's correct or not. At least, it shows a discrepancy between what llvm-addr2line shows and what the rust-based addr2line shows.

Flags: needinfo?(mh+mozilla) → needinfo?(mstange.moz)

That is... remarkably different! So the problem is really that we're missing the inlines! And the locations are correct for inside those inlines.

Okay then it seems the problem is in the rust addr2line crate. Thanks for checking!

Flags: needinfo?(mstange.moz)

I will test the other examples which I supplied before I close this bug.

Never mind, I spoke to soon. The llvm-addr2line results seem more consistent, but they're still wrong. For example strimatch does not call DoShutdown(). AppendStrongObservers is not on the stack when SharedStub is called. And the line numbers are still missing for the functions close to the root in the event loop:

% llvm-addr2line -pCfi -e XUL.dSYM/Contents/Resources/DWARF/XUL 0x17e6de8 0x65a3f8 0x17e7080 0x117bf14 0x115a3a8 0x65a3f8 0x694f90
XRE_InitChildProcess(int, char**, XREChildData const*) at ??:0
MessageLoop::RunInternal() at ??:0
 (inlined by) MessageLoop::RunHandler() at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:374
 (inlined by) MessageLoop::Run() at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:356
XRE_RunAppShell() at ??:0
nsAppShell::Run() at ??:0
nsBaseAppShell::Run() at ??:0
MessageLoop::RunInternal() at ??:0
 (inlined by) MessageLoop::RunHandler() at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:374
 (inlined by) MessageLoop::Run() at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:356
mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) at ??:0

That's way too many ??:0. All these functions had correct line numbers before the clang upgrade. Profile

(In reply to Markus Stange [:mstange] from comment #13)

Never mind, I spoke to soon. The llvm-addr2line results seem more consistent, but they're still wrong. For example strimatch does not call DoShutdown(). AppendStrongObservers is not on the stack when SharedStub is called. And the line numbers are still missing for the functions close to the root in the event loop:

% llvm-addr2line -pCfi -e XUL.dSYM/Contents/Resources/DWARF/XUL 0x17e6de8 0x65a3f8 0x17e7080 0x117bf14 0x115a3a8 0x65a3f8 0x694f90
XRE_InitChildProcess(int, char**, XREChildData const*) at ??:0
MessageLoop::RunInternal() at ??:0
 (inlined by) MessageLoop::RunHandler() at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:374
 (inlined by) MessageLoop::Run() at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:356
XRE_RunAppShell() at ??:0
nsAppShell::Run() at ??:0
nsBaseAppShell::Run() at ??:0
MessageLoop::RunInternal() at ??:0
 (inlined by) MessageLoop::RunHandler() at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:374
 (inlined by) MessageLoop::Run() at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:356
mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) at ??:0

That's way too many ??:0. All these functions had correct line numbers before the clang upgrade. Profile

If I take the symbols of the build corresponding to that profile (https://archive.mozilla.org/pub/firefox/nightly/2023/01/2023-01-12-21-30-33-mozilla-central/firefox-110.0a1.en-US.mac.crashreporter-symbols.zip) and I look at the addresses from your command line, they all fall in places where line numbers are there in the symbols file. So I don't know why you get those ??:0 with llvm-addr2line, but presumably, there shouldn't be a problem with the profiler. If the traces themselves make no sense (as in function A is not calling function B in any way yet that's what the stack says), that would be a problem with the stack unwinding, not the symbols.

Oh wait, the symbols files are "ok" because dump_syms is filling with garbage...

That's right, dump_syms doesn't preserve missing line numbers - it fills dead space with nearby information, last I checked.

To get the truth, you can use llvm-dwarfdump --debug-line or readelf -wl or readelf -wL.

Er, no, not with readelf because dSYMs are not ELF files...

Ok, this looks like a regression in the clang-provided dsymutil. If I recreate the .dSYM with clang 14's dsymutil from the XUL and XUL.lto.o/* files created by clang 15, I get this:

$ ~/fetches/clang/bin/llvm-addr2line -pCfi -e XUL.dSYM/Contents/Resources/DWARF/XUL 0x17e6de8 0x65a3f8 0x17e7080 0x117bf14 0x115a3a8 0x65a3f8 0x694f90
XRE_InitChildProcess(int, char**, XREChildData const*) at /builds/worker/checkouts/gecko/toolkit/xre/nsEmbedFunctions.cpp:676
MessageLoop::RunInternal() at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:381
 (inlined by) MessageLoop::RunHandler() at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:374
 (inlined by) MessageLoop::Run() at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:356
XRE_RunAppShell() at /builds/worker/checkouts/gecko/toolkit/xre/nsEmbedFunctions.cpp:743
nsAppShell::Run() at /builds/worker/checkouts/gecko/widget/cocoa/nsAppShell.mm:804
nsBaseAppShell::Run() at /builds/worker/checkouts/gecko/widget/nsBaseAppShell.cpp:148
MessageLoop::RunInternal() at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:381
 (inlined by) MessageLoop::RunHandler() at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:374
 (inlined by) MessageLoop::Run() at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:356
NS_ProcessNextEvent(nsIThread*, bool) at /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:473
 (inlined by) mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) at /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:107

I'll bisect after lunch.

Assignee: nobody → mh+mozilla
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/b3cefa9fdb65
Revert clang change that broke debug info on mac. r=firefox-build-system-reviewers,andi
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 111 Branch

Markus, can you check that everything looks fine now?

Flags: needinfo?(mstange.moz)

Everything looks great now, thanks a lot for the fix. Example profile from today's macOS arm64 build: https://share.firefox.dev/3DlGB62
The profiler source view works again. No matter which function I double click, it always displays a plausible line of code now.

Status: RESOLVED → VERIFIED
Flags: needinfo?(mstange.moz)

Comment on attachment 9314171 [details]
Bug 1810249 - Revert clang change that broke debug info on mac.

Beta/Release Uplift Approval Request

  • User impact if declined: Less useful crash reports for mac builds. I don't know if this severely affects tracking new crash reports on release, but if it does, that would warrant a dot-release, IMHO.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This reverts a relatively small patch in LLVM that causes debug info post-processing on mac to fail blatantly.
  • String changes made/needed: N/A
  • Is Android affected?: No
Attachment #9314171 - Flags: approval-mozilla-release?
Attachment #9314171 - Flags: approval-mozilla-beta?

Comment on attachment 9314171 [details]
Bug 1810249 - Revert clang change that broke debug info on mac.

Approved for 110 beta 7, thanks.

Attachment #9314171 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9314171 [details]
Bug 1810249 - Revert clang change that broke debug info on mac.

Possibly-broken crash reports sounds bad. Approved for 109.0.1.

Attachment #9314171 - Flags: approval-mozilla-release? → approval-mozilla-release+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: