Profile initialization | application crashed [@ __llvm_profile_instrument_target + 0x53]
Categories
(Firefox Build System :: Toolchains, defect)
Tracking
(Not tracked)
People
(Reporter: intermittent-bug-filer, Unassigned)
Details
(Keywords: crash)
Filed by: sgiesecke [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=318581365&repo=try
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Kb_S2E05RcmjhT8UGBjL1g/runs/0/artifacts/public/logs/live_backing.log
With some local changes, I get crashes in the generate-profile-macosx64-shippable/opt jobs. Other platforms do not seem to be affected.
Could this be related to the issue mentioned in https://github.com/llvm/llvm-project-staging/commit/27650ec5541cd604a5027ad63895e0badfd35efe? Do we have that fix?
Updated•4 years ago
Could this be related to the issue mentioned in https://github.com/llvm/llvm-project-staging/commit/27650ec5541cd604a5027ad63895e0badfd35efe? Do we have that fix?
The code that that patch reverted didn't land until clang trunk was version 12, so I don't think that's (directly) it.
Comment 2•4 years ago
Ok, but this looks like an issue in llvm, rather than an issue caused by the specific (dom/indexedDB) code changes in my push. Right?
At first glance, yes.
It's unfortunate that it's on Mac. If it were Linux or Windows, I'd be able to debug it much more easily. Any chance you could paste a disassembly of __llvm_profile_instrument_target from the beginning up to the point of failure?
Comment 4•4 years ago
(In reply to :dmajor from comment #3)
At first glance, yes.
It's unfortunate that it's on Mac. If it were Linux or Windows, I'd be able to debug it much more easily. Any chance you could paste a disassembly of __llvm_profile_instrument_target from the beginning up to the point of failure?
I don't have a Mac either. I have no idea how I would do that, unfortunately.
Going by these values from the treeherder log:
[task 2020-10-14T10:50:39.180Z] Thread 32 (crashed)
[task 2020-10-14T10:50:39.180Z] 0 XUL!__llvm_profile_instrument_target + 0x53
...
[task 2020-10-14T10:50:39.180Z] rip = 0x000000010e66f293
...
[task 2020-10-14T10:50:39.203Z] Loaded modules:
...
[task 2020-10-14T10:50:39.203Z] 0x10e65f000 - 0x117dc0fff XUL ???
Then the function should start at XUL offset 0x10240 (0x10e66f293 - 0x10e65f000 - 0x53), so I think this is it:
0000000000010240 pushq %rbp
0000000000010241 movq %rsp, %rbp
0000000000010244 pushq %r15
0000000000010246 pushq %r14
0000000000010248 pushq %r12
000000000001024a pushq %rbx
000000000001024b testq %rsi, %rsi
000000000001024e je 0x103ad
0000000000010254 movl %edx, %r14d
0000000000010257 movq %rsi, %rbx
000000000001025a movq %rdi, %r15
000000000001025d movq 0x20(%rsi), %r12
0000000000010261 testq %r12, %r12
0000000000010264 je 0x102db
0000000000010266 movl %r14d, %r14d
0000000000010269 movq (%r12,%r14,8), %rsi
000000000001026d testq %rsi, %rsi
0000000000010270 je 0x10338
0000000000010276 movq $-0x1, %rdx
000000000001027d xorl %ecx, %ecx
000000000001027f xorl %eax, %eax
0000000000010281 nopw %cs:(%rax,%rax)
000000000001028b nopl (%rax,%rax)
0000000000010290 movq %rsi, %rbx
0000000000010293 movq 0x8(%rsi), %rsi
0000000000010297 cmpq %r15, (%rbx)
000000000001029a je 0x10398
00000000000102a0 cmpq %rdx, %rsi
00000000000102a3 cmovbq %rbx, %rax
00000000000102a7 cmovbq %rsi, %rdx
00000000000102ab incb %cl
00000000000102ad movq 0x10(%rbx), %rsi
00000000000102b1 testq %rsi, %rsi
00000000000102b4 jne 0x10290
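For anyone repeating this on another crash, here is a minimal sketch of the arithmetic behind that offset. It uses only the values from the treeherder log above and assumes the addresses in the disassembly are offsets from the XUL load address; the helper name `module_offset` is made up for illustration.

```python
# Values copied from the treeherder log quoted above.
RIP = 0x000000010E66F293   # instruction pointer of the crashed thread
XUL_BASE = 0x10E65F000     # start of the XUL module in "Loaded modules"
SYMBOL_DELTA = 0x53        # from "__llvm_profile_instrument_target + 0x53"


def module_offset(address, module_base):
    """Return the offset of an absolute address inside its module (hypothetical helper)."""
    return address - module_base


# Offset of the crashing instruction within XUL.
crash_offset = module_offset(RIP, XUL_BASE)

# Offset of the first instruction of the function.
function_start = crash_offset - SYMBOL_DELTA

print(hex(crash_offset), hex(function_start))  # prints: 0x10293 0x10240
```

So the crashing instruction is the one listed at 0x10293, and the function begins at 0x10240.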
It looks like we are crashing reading CurVNode->Next (for context, I believe 10290-102b4 is the while loop) because CurVNode is full of e5e5.
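A rough sketch of how one might check a register or pointer value from a crash dump for that pattern. The poison byte 0xe5 is taken from the "e5e5" observation above; the 48-bit canonical-address check is an assumption about the x86-64 machine, and both function names are made up for illustration.

```python
POISON_BYTE = 0xE5  # assumption: the fill byte implied by "e5e5" in this thread


def looks_poisoned(value, width=8):
    """True if every byte of a width-byte value equals the poison byte."""
    return value.to_bytes(width, "little") == bytes([POISON_BYTE]) * width


def is_canonical_48bit(address):
    """True if bits 63..47 are all equal (assumes 48-bit virtual addresses)."""
    top = address >> 47
    return top == 0 or top == 0x1FFFF


poisoned_pointer = 0xE5E5E5E5E5E5E5E5
print(looks_poisoned(poisoned_pointer))      # prints: True
print(is_canonical_48bit(poisoned_pointer))  # prints: False
```

Under those assumptions, a pointer made entirely of 0xe5 bytes is not a canonical address, so dereferencing it faults immediately, which would match a crash on a load through the node pointer.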
jemalloc shouldn't be poisoning the instrumentation control blocks, of course. Any chance the push might have had some memory unsafety where a bad pointer got passed to free() and jemalloc poisoned the wrong area? Otherwise, if it's a miscompile, it could be anywhere (code, jemalloc, instrumentation, etc.) and this will be terrible to debug.
Comment 6•4 years ago
jemalloc shouldn't be poisoning the instrumentation control blocks, of course. Any chance the push might have had some memory unsafety where a bad pointer got passed to free() and jemalloc poisoned the wrong area?
I cannot completely rule this out, but I think it's rather unlikely, given
- the nature of the changes
- all tests looking fine
- it seems to be deterministically reproducible on OS X, and
- the Linux generate-profile job is ok.
Unfortunately, the Windows generate-profile didn't run because of Bug 1670712.
Otherwise, if it's a miscompile, it could be anywhere (code, jemalloc, instrumentation, etc.) and this will be terrible to debug.
It looks like the push is part of a large stack. Could you try narrowing it down to a specific changeset?
Comment 8•4 years ago
(In reply to :dmajor from comment #7)
It looks like the push is part of a large stack. Could you try narrowing it down to a specific changeset?
Ok, will try to do that.
Comment 9•4 years ago
Oh. I found that this is caused by disabling optimizations in two directories: https://hg.mozilla.org/try/rev/2100b35947620404ea1f2cd78cf8641079cf977f This wasn't intended for landing, of course. Maybe that's expected? It's a bit annoying, but definitely low priority then.
Comment 10•4 years ago
I vaguely recall that there are some footguns around -O0, but I don't recall offhand whether they are related here. If this were on a more convenient OS, I'd like to look deeper just for curiosity's sake, but in practice, if it's not blocking, I probably won't be able to spend time on it.
Comment 11•4 years ago
Fine with me, I will close this as WORKSFORME then. Thanks for looking into it so quickly, and sorry for not noticing this part of the stack earlier.