Closed Bug 1380381 Opened 7 years ago Closed 7 years ago

build symbols missing on macOS/OS X, unhelpful crash signatures like [@ XUL + 0xddb7c]

Categories

(Firefox Build System :: General, defect, P1)

defect

Tracking

(firefox-esr52 unaffected, firefox54 unaffected, firefox55 unaffected, firefox56blocking fixed)

RESOLVED FIXED
mozilla56
Tracking Status
firefox-esr52 --- unaffected
firefox54 --- unaffected
firefox55 --- unaffected
firefox56 blocking fixed

People

(Reporter: aryx, Assigned: mshal)

References

Details

(Keywords: regression)

Attachments

(1 file)

At the moment, OS X crash signatures are cryptic and unhelpful, see e.g. https://treeherder.mozilla.org/logviewer.html#?job_id=113699712&repo=autoland

PROCESS-CRASH | testing/xpcshell/example/unit/test_do_check_null_failing.js | application crashed [@ XUL + 0xddb7c]
See Also: → 1380200
[task 2017-07-12T16:44:40.349700Z] 16:44:40     INFO -  Stack dump:
[task 2017-07-12T16:44:40.349865Z] 16:44:40     INFO -  0.	Program arguments: /home/worker/workspace/build/src/clang/bin/llvm-dsymutil --arch=x86_64 /home/worker/workspace/build/src/obj-firefox/toolkit/library/XUL
[task 2017-07-12T16:44:40.446149Z] 16:44:40     INFO -  Error running dsymutil: Command '['/home/worker/workspace/build/src/clang/bin/llvm-dsymutil', '--arch=x86_64', '/home/worker/workspace/build/src/obj-firefox/toolkit/library/XUL']' returned non-zero exit status -11

So this would have been caught by bug 1304042 if we had it, and this is probably a recurrence of the issue from Bug 1371017. Follow this trail of tears back to bug 1301751 - llvm-dsymutil crashing on XUL. Which was related to a webrender landing and some rust codegen.

This appears to affect nightlies as well and we're getting no useful stacks from mac nightly crashes now. So I suggest we treat this as a blocker issue, and should perhaps close the tree until it's resolved. Do we know whether disabling stylo would fix this?
catlee/coop: I'm not sure with Ted gone who is the right engineering owner for this in the short-term/unblocking the tree kind of way. Thoughts?
Flags: needinfo?(coop)
Flags: needinfo?(catlee)
Closed trees on request from Benjamin
And for the record, this affects nightlies and crash-stats since Monday as well, not just the integration trees.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #1)
>  Do we know whether disabling stylo would fix this?

Bug 1379339 hit m-c on Monday, so it's quite possible that stylo was what tickled this.
(In reply to David Major [:dmajor] from comment #5)
> (In reply to Benjamin Smedberg [:bsmedberg] from comment #1)
> >  Do we know whether disabling stylo would fix this?
> 
> Bug 1379339 hit m-c on Monday, so it's quite possible that stylo was what
> tickled this.

Although, the parent changeset of the stylo landing shows a dsymutil failure too, so maybe not.

https://public-artifacts.taskcluster.net/bc8_L8KSTza4S6D9fEK7Nw/0/public/logs/live_backing.log
mshal is going to help track down the regression window here as a first step.
Assignee: nobody → mshal
Status: NEW → ASSIGNED
Flags: needinfo?(coop)
According to error message in comment #1 this is a bug in the build system. So moving component.
Component: Buildduty → Build Config
Priority: -- → P1
Product: Release Engineering → Core
QA Contact: catlee
Unsure if this is helpful, but I found a stack trace inside a recent cross-compilation OS X build on mozilla-central.

From https://public-artifacts.taskcluster.net/CDC9zxERStunaIAVbeXGdQ/0/public/logs/live_backing.log:

[task 2017-07-13T15:09:16.295282Z] 15:09:16     INFO -  #0 0x00000000007a8945 llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x7a8945)
[task 2017-07-13T15:09:16.295504Z] 15:09:16     INFO -  #1 0x00000000007a8f96 SignalHandler(int) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x7a8f96)
[task 2017-07-13T15:09:16.295689Z] 15:09:16     INFO -  #2 0x00007ff047cd17e0 __restore_rt (/lib64/libpthread.so.0+0xf7e0)
[task 2017-07-13T15:09:16.295978Z] 15:09:16     INFO -  #3 0x000000000042e1d8 llvm::dsymutil::(anonymous namespace)::DwarfLinker::lookForDIEsToKeep(llvm::dsymutil::(anonymous namespace)::DwarfLinker::RelocationManager&, llvm::DWARFDebugInfoEntryMinimal const&, llvm::dsymutil::DebugMapObject const&, llvm::dsymutil::(anonymous namespace)::CompileUnit&, unsigned int) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x42e1d8)
[task 2017-07-13T15:09:16.296193Z] 15:09:16     INFO -  #4 0x000000000042e583 llvm::dsymutil::(anonymous namespace)::DwarfLinker::lookForDIEsToKeep(llvm::dsymutil::(anonymous namespace)::DwarfLinker::RelocationManager&, llvm::DWARFDebugInfoEntryMinimal const&, llvm::dsymutil::DebugMapObject const&, llvm::dsymutil::(anonymous namespace)::CompileUnit&, unsigned int) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x42e583)
[task 2017-07-13T15:09:16.296421Z] 15:09:16     INFO -  #5 0x000000000042e65d llvm::dsymutil::(anonymous namespace)::DwarfLinker::lookForDIEsToKeep(llvm::dsymutil::(anonymous namespace)::DwarfLinker::RelocationManager&, llvm::DWARFDebugInfoEntryMinimal const&, llvm::dsymutil::DebugMapObject const&, llvm::dsymutil::(anonymous namespace)::CompileUnit&, unsigned int) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x42e65d)
[task 2017-07-13T15:09:16.296663Z] 15:09:16     INFO -  #6 0x000000000042e65d llvm::dsymutil::(anonymous namespace)::DwarfLinker::lookForDIEsToKeep(llvm::dsymutil::(anonymous namespace)::DwarfLinker::RelocationManager&, llvm::DWARFDebugInfoEntryMinimal const&, llvm::dsymutil::DebugMapObject const&, llvm::dsymutil::(anonymous namespace)::CompileUnit&, unsigned int) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x42e65d)
[task 2017-07-13T15:09:16.296880Z] 15:09:16     INFO -  #7 0x000000000042e583 llvm::dsymutil::(anonymous namespace)::DwarfLinker::lookForDIEsToKeep(llvm::dsymutil::(anonymous namespace)::DwarfLinker::RelocationManager&, llvm::DWARFDebugInfoEntryMinimal const&, llvm::dsymutil::DebugMapObject const&, llvm::dsymutil::(anonymous namespace)::CompileUnit&, unsigned int) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x42e583)
[task 2017-07-13T15:09:16.297094Z] 15:09:16     INFO -  #8 0x000000000042e65d llvm::dsymutil::(anonymous namespace)::DwarfLinker::lookForDIEsToKeep(llvm::dsymutil::(anonymous namespace)::DwarfLinker::RelocationManager&, llvm::DWARFDebugInfoEntryMinimal const&, llvm::dsymutil::DebugMapObject const&, llvm::dsymutil::(anonymous namespace)::CompileUnit&, unsigned int) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x42e65d)
[task 2017-07-13T15:09:16.297309Z] 15:09:16     INFO -  #9 0x000000000042e583 llvm::dsymutil::(anonymous namespace)::DwarfLinker::lookForDIEsToKeep(llvm::dsymutil::(anonymous namespace)::DwarfLinker::RelocationManager&, llvm::DWARFDebugInfoEntryMinimal const&, llvm::dsymutil::DebugMapObject const&, llvm::dsymutil::(anonymous namespace)::CompileUnit&, unsigned int) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x42e583)
[task 2017-07-13T15:09:16.297525Z] 15:09:16     INFO -  #10 0x000000000042e65d llvm::dsymutil::(anonymous namespace)::DwarfLinker::lookForDIEsToKeep(llvm::dsymutil::(anonymous namespace)::DwarfLinker::RelocationManager&, llvm::DWARFDebugInfoEntryMinimal const&, llvm::dsymutil::DebugMapObject const&, llvm::dsymutil::(anonymous namespace)::CompileUnit&, unsigned int) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x42e65d)
[task 2017-07-13T15:09:16.297739Z] 15:09:16     INFO -  #11 0x000000000042e583 llvm::dsymutil::(anonymous namespace)::DwarfLinker::lookForDIEsToKeep(llvm::dsymutil::(anonymous namespace)::DwarfLinker::RelocationManager&, llvm::DWARFDebugInfoEntryMinimal const&, llvm::dsymutil::DebugMapObject const&, llvm::dsymutil::(anonymous namespace)::CompileUnit&, unsigned int) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x42e583)
[task 2017-07-13T15:09:16.297962Z] 15:09:16     INFO -  #12 0x000000000042e65d llvm::dsymutil::(anonymous namespace)::DwarfLinker::lookForDIEsToKeep(llvm::dsymutil::(anonymous namespace)::DwarfLinker::RelocationManager&, llvm::DWARFDebugInfoEntryMinimal const&, llvm::dsymutil::DebugMapObject const&, llvm::dsymutil::(anonymous namespace)::CompileUnit&, unsigned int) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x42e65d)
[task 2017-07-13T15:09:16.298175Z] 15:09:16     INFO -  #13 0x000000000042e583 llvm::dsymutil::(anonymous namespace)::DwarfLinker::lookForDIEsToKeep(llvm::dsymutil::(anonymous namespace)::DwarfLinker::RelocationManager&, llvm::DWARFDebugInfoEntryMinimal const&, llvm::dsymutil::DebugMapObject const&, llvm::dsymutil::(anonymous namespace)::CompileUnit&, unsigned int) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x42e583)
[task 2017-07-13T15:09:16.298388Z] 15:09:16     INFO -  #14 0x000000000042e583 llvm::dsymutil::(anonymous namespace)::DwarfLinker::lookForDIEsToKeep(llvm::dsymutil::(anonymous namespace)::DwarfLinker::RelocationManager&, llvm::DWARFDebugInfoEntryMinimal const&, llvm::dsymutil::DebugMapObject const&, llvm::dsymutil::(anonymous namespace)::CompileUnit&, unsigned int) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x42e583)
[task 2017-07-13T15:09:16.298600Z] 15:09:16     INFO -  #15 0x000000000042e65d llvm::dsymutil::(anonymous namespace)::DwarfLinker::lookForDIEsToKeep(llvm::dsymutil::(anonymous namespace)::DwarfLinker::RelocationManager&, llvm::DWARFDebugInfoEntryMinimal const&, llvm::dsymutil::DebugMapObject const&, llvm::dsymutil::(anonymous namespace)::CompileUnit&, unsigned int) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x42e65d)
[task 2017-07-13T15:09:16.298826Z] 15:09:16     INFO -  #16 0x000000000042e65d llvm::dsymutil::(anonymous namespace)::DwarfLinker::lookForDIEsToKeep(llvm::dsymutil::(anonymous namespace)::DwarfLinker::RelocationManager&, llvm::DWARFDebugInfoEntryMinimal const&, llvm::dsymutil::DebugMapObject const&, llvm::dsymutil::(anonymous namespace)::CompileUnit&, unsigned int) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x42e65d)
[task 2017-07-13T15:09:16.298978Z] 15:09:16     INFO -  #17 0x000000000042aa4d llvm::dsymutil::(anonymous namespace)::DwarfLinker::link(llvm::dsymutil::DebugMap const&) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x42aa4d)
[task 2017-07-13T15:09:16.299186Z] 15:09:16     INFO -  #18 0x00000000004285b9 llvm::dsymutil::linkDwarf(llvm::StringRef, llvm::dsymutil::DebugMap const&, llvm::dsymutil::LinkOptions const&) (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x4285b9)
[task 2017-07-13T15:09:16.299354Z] 15:09:16     INFO -  #19 0x000000000041b938 main (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x41b938)
[task 2017-07-13T15:09:16.299533Z] 15:09:16     INFO -  #20 0x00007ff046e90d5d __libc_start_main (/lib64/libc.so.6+0x1ed5d)
[task 2017-07-13T15:09:16.299723Z] 15:09:16     INFO -  #21 0x000000000041b1d9 _start (/home/worker/workspace/build/src/clang/bin/llvm-dsymutil+0x41b1d9)
[task 2017-07-13T15:09:16.299883Z] 15:09:16     INFO -  Stack dump:
[task 2017-07-13T15:09:16.300105Z] 15:09:16     INFO -  0.	Program arguments: /home/worker/workspace/build/src/clang/bin/llvm-dsymutil --arch=x86_64 /home/worker/workspace/build/src/obj-firefox/toolkit/library/XUL
We're actively discussing this in #build.

I've written a crawler for TC logs and it is pretty clear that the failure is intermittent. We think we found an inflection point where we went from "always good" to "mostly bad." But we can't explain it by the commits we're seeing. I'm suspicious that a change to the TC platform was made outside of Firefox's version control universe around 2017-06-22 13:04 UTC and it slowly percolated out to workers, introduce intermittent failures.

garndt: would someone from the TC team be willing to go on a wild goose chase to see if anything changed around June 22 that would impact the gecko-3-b-macosx64 worker?
Flags: needinfo?(garndt)
More details for anyone who wants to help.

Here are results from pushes to various repos around the "inflection point" where we went from constantly good to periodically good. Newest on top.

autoland 0a7160e976d82ea77f0cb99554cf08fddb27867e good
autoland 62908c27accf0be42caeee6543e61f413afeea38 good
autoland 0000dd6746c4a64e0c09b757cc0fc5c1dd61a5b8 good
autoland 99a5d811d40fdebd04ed828487970d80fb02b302 good
autoland 49814bb3620325ca2eee9fcd7f85bff4061c2e28 bad
autoland 4ce04b6bf179769a0371fcccddd4f36c80d210b8 good
autoland cde5fccb853804eed02fe04d8c1b8f3b3c3265ed good
autoland 023b96f48df893f82ced2ad2f659ac4f2b4d303a bad (first bad)
autoland dc59646d36a4f959343ccdf155685d7cbe3a06c6 good (last constantly good)
autoland 3605c8cdc4a73788dd4fda3f5a3458771f286179 good
autoland ad8486907154a406e05e50cb21d3a6cde83e8c73 good

For inbound:

mozilla-inbound 768a1255610d1225ce412533da4379f664fe7c9c good
mozilla-inbound b7ab577515658d9acbc974e3a200f872aa93d20e good
mozilla-inbound 8b34a8ccdb17604b9bce0f228f8cab9710224331 good
mozilla-inbound 0298f22db7a5a127c26c1de3b23dcfa303a64c28 good
mozilla-inbound e44f1f7460176d691d8bddf2795831537ec12e96 bad (first bad)
mozilla-inbound 69052c8fa3eb6fa70498891b1fdcd571e1b7d9e8 good (last constantly good)
mozilla-inbound 17261c616f8e2b4c1857d6f4d38cbabfd470f9da good
mozilla-inbound 0bcb56297d0481c4b2221d2499c67acabbf813d7 good
mozilla-inbound 107ad5dc5c01685eda215675af316e3d63984635 good
mozilla-inbound 211e6b6d2abcc79c572671d82dc4e17af95bd342 good

For central:

mozilla-central 92dc60b522d81862e52bff5cdb1b698eb5608658 bad
mozilla-central c03f906e6579fce0b61524c4f5c8445dfa76ffe0 bad
mozilla-central 594cc32b632396a867ef1f98428968b224d82151 bad
mozilla-central 2f10d697c307a09a286faa1f11831644c40f778d bad (first bad)
mozilla-central 3b468193c933806339fa4a8dee73d03a09e635ab good (last constantly good)
mozilla-central 7455c74d833a9db4e02be17eda14588c7ef0de76 good
mozilla-central b1b9129838ade91684574f42219b2010928d7db4 good
mozilla-central 4c6668cbaccb1c207ced3bba18b81cf2be8ca495 good
mozilla-central 61edb6b490719686c8e9f8e3bf4bdd401aa1c2b5 good
The success rate also diminishes significantly over time. That's why I think we introduced "bad state" on TC workers. As workers get recycled, the good ones disappear and the new/bad ones fill in. There also may be a component where incremental vs clobber builds impact success rate.
According to bug 1267425 comment 1, Nightly builds were migrated to TC the day of the inflection. But the transition to cross builds would have been captured in version control. I wonder if something changed on the TC side to support the Nightly migration?
garndt is on PTO. We're getting help from bstack in #taskcluster.
Flags: needinfo?(garndt)
(In reply to Gregory Szorc [:gps] from comment #13)
> According to bug 1267425 comment 1, Nightly builds were migrated to TC the
> day of the inflection. But the transition to cross builds would have been
> captured in version control. I wonder if something changed on the TC side to
> support the Nightly migration?

I suspect this has been always an issue, it went unnoticed so far because we didn't run x-builds on production yet.
https://hg.mozilla.org/integration/autoland/rev/686fa05da73a7b52aa488dc7f6b695cda0e72503 (which is the child of the changeset that started cross builds) was pushed at 2017-06-21 18:33 +0000 and it doesn't exhibit the dsymutil issue. So dsymutil and cross builds did work initially.
(In reply to Gregory Szorc [:gps] from comment #16)
> https://hg.mozilla.org/integration/autoland/rev/
> 686fa05da73a7b52aa488dc7f6b695cda0e72503 (which is the child of the
> changeset that started cross builds) was pushed at 2017-06-21 18:33 +0000
> and it doesn't exhibit the dsymutil issue. So dsymutil and cross builds did
> work initially.

iiuc, it doesn't always fail, does it?
It always worked at first. Then it started failing sporadically. Now it is failing most of the time.

Anyway, we appeared to rule out sccache as the source of the regression.

We're now looking at bug 1305731 and its relation to bug 1301751. This may have been a lingering dsymutil issue all along. If so, the resolution isn't going to be pretty (loss of some Rust debug info). But if that's the cost to reopen the trees, it must be paid. That or we go back to not cross-compiling for OS X.
See Also: 1380200
It seems that reapplying the patch from bug 1301751 will work around the issue for now, which calls rust with debuginfo=1 instead of debuginfo=2 (at least, that's my understanding). I'm testing a slightly different version that will only affect the cross-OSX builds.

We would still need to followup and actually fix the underlying issue.
(In reply to Gregory Szorc [:gps] from comment #18)
> That or we go back to not cross-compiling
> for OS X.

...and as catlee pointed out in IRC, that's a blocker to stylo because we'd need to upgrade the OS on the Mac builders, and that would take some time.
Flags: needinfo?(catlee)
Comment on attachment 8886414 [details]
Bug 1380381 - Set rust debuginfo=1 for cross-OSX builds;

https://reviewboard.mozilla.org/r/157208/#review162284

Please mention in the commit message that debuginfo=1 still means there's enough info for stack traces, although it lacks line info (so we'll know that something crashed in $function, but not at what specific line it did). debuginfo=2 also adds variable information, useful for debugging, but is irrelevant to crash reports.
Attachment #8886414 - Flags: review?(mh+mozilla) → review+
The workaround has my approval. It is unfortunate that we lose temporarily lose line-level debugging info for Rust. But it's that, a big unknown and PITA reverting cross-compiling, keeping the trees closed, or no reliable debug symbols at all on OS X. This seems like the least bad option.

Thanks for working on the patches, Mike's.
Pushed by kwierso@gmail.com:
https://hg.mozilla.org/mozilla-central/rev/67cd1ee26f26
Set rust debuginfo=1 for cross-OSX builds; r=glandium a=kwierso to reopen this CLOSED TREE
Thank you all for the interrupt-driven work! Does it makes sense to resolve this bug FIXED and file followup(s) for debugging the dsymutil problem/re-enabling debuginfo=2?
(In reply to Benjamin Smedberg [:bsmedberg] from comment #26)
> Thank you all for the interrupt-driven work! Does it makes sense to resolve
> this bug FIXED and file followup(s) for debugging the dsymutil
> problem/re-enabling debuginfo=2?

Yes.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Blocks: 1381043
Target Milestone: --- → mozilla56
IT looks like we are running into this on Thunderbird builds on OSX (https://treeherder.mozilla.org/logviewer.html#?job_id=114714845&repo=comm-central&lineNumber=2046 for example). Is there any reason to restrict this to just cross-builds, or would it be reasonable to enable it for all OSX-targeting builds. Applying https://hg.mozilla.org/try-comm-central/file/cc5e767eaee8/mozilla-rust-symbols-no-cross.patch seems to fix the thunderbird failures: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=d35ae23fb16428b33627d3bb6be2a40120d39ce0
(In reply to Tom Prince [:tomprince] from comment #29)
> Is there any reason to restrict this
> to just cross-builds, or would it be reasonable to enable it for all
> OSX-targeting builds. Applying
> https://hg.mozilla.org/try-comm-central/file/cc5e767eaee8/mozilla-rust-
> symbols-no-cross.patch seems to fix the thunderbird failures:
> https://treeherder.mozilla.org/#/jobs?repo=try-comm-
> central&revision=d35ae23fb16428b33627d3bb6be2a40120d39ce0

I don't see why not. We're only doing cross builds for Firefox nowadays, so it should be fine to do it based solely on the target.
The problem is that most developers will be doing non-cross builds if they build on Mac, and this change will make Rust code less debuggable, which is something we'd like to avoid for developer builds.
Depends on: 1381955
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: