Closed Bug 1671132 Opened 4 years ago Closed 4 years ago

Profile initialization | application crashed [@ __llvm_profile_instrument_target + 0x53]

Categories

(Firefox Build System :: Toolchains, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: crash)

Crash Data

Filed by: sgiesecke [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=318581365&repo=try
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Kb_S2E05RcmjhT8UGBjL1g/runs/0/artifacts/public/logs/live_backing.log


With some local changes, I get crashes in the generate-profile-macosx64-shippable/opt jobs. Other platforms do not seem to be affected.

Could this be related to the issue mentioned in https://github.com/llvm/llvm-project-staging/commit/27650ec5541cd604a5027ad63895e0badfd35efe? Do we have that fix?
Flags: needinfo?(dmajor)

Could this be related to the issue mentioned in https://github.com/llvm/llvm-project-staging/commit/27650ec5541cd604a5027ad63895e0badfd35efe? Do we have that fix?

The code that that patch reverted didn't land until clang trunk was version 12, so I don't think that's (directly) it.

Flags: needinfo?(dmajor)

Ok, but this looks like an issue in llvm, rather than an issue caused by the specific (dom/indexedDB) code changes in my push. Right?

At first glance, yes.

It's unfortunate that it's on Mac. If it was Linux or Windows, I'd be able to debug it much more easily. Any chance you could paste a disassembly of __llvm_profile_instrument_target from beginning up to the point of failure?

(In reply to :dmajor from comment #3)

At first glance, yes.

It's unfortunate that it's on Mac. If it was Linux or Windows, I'd be able to debug it much more easily. Any chance you could paste a disassembly of __llvm_profile_instrument_target from beginning up to the point of failure?

I don't have a Mac either. I have no idea how I would do that, unfortunately.

Going by these values from the treeherder log

[task 2020-10-14T10:50:39.180Z] Thread 32 (crashed)
[task 2020-10-14T10:50:39.180Z]  0  XUL!__llvm_profile_instrument_target + 0x53
...
[task 2020-10-14T10:50:39.180Z]     rip = 0x000000010e66f293
...
[task 2020-10-14T10:50:39.203Z] Loaded modules:
...
[task 2020-10-14T10:50:39.203Z] 0x10e65f000 - 0x117dc0fff  XUL  ???

Then the xul offset should be 0x10240 so I think this is it:

0000000000010240	pushq	%rbp
0000000000010241	movq	%rsp, %rbp
0000000000010244	pushq	%r15
0000000000010246	pushq	%r14
0000000000010248	pushq	%r12
000000000001024a	pushq	%rbx
000000000001024b	testq	%rsi, %rsi
000000000001024e	je	0x103ad
0000000000010254	movl	%edx, %r14d
0000000000010257	movq	%rsi, %rbx
000000000001025a	movq	%rdi, %r15
000000000001025d	movq	0x20(%rsi), %r12
0000000000010261	testq	%r12, %r12
0000000000010264	je	0x102db
0000000000010266	movl	%r14d, %r14d
0000000000010269	movq	(%r12,%r14,8), %rsi
000000000001026d	testq	%rsi, %rsi
0000000000010270	je	0x10338
0000000000010276	movq	$-0x1, %rdx
000000000001027d	xorl	%ecx, %ecx
000000000001027f	xorl	%eax, %eax
0000000000010281	nopw	%cs:(%rax,%rax)
000000000001028b	nopl	(%rax,%rax)
0000000000010290	movq	%rsi, %rbx
0000000000010293	movq	0x8(%rsi), %rsi
0000000000010297	cmpq	%r15, (%rbx)
000000000001029a	je	0x10398
00000000000102a0	cmpq	%rdx, %rsi
00000000000102a3	cmovbq	%rbx, %rax
00000000000102a7	cmovbq	%rsi, %rdx
00000000000102ab	incb	%cl
00000000000102ad	movq	0x10(%rbx), %rsi
00000000000102b1	testq	%rsi, %rsi
00000000000102b4	jne	0x10290
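
The offset arithmetic behind this listing can be double-checked with a minimal sketch. The numbers are copied from the log excerpt above; the variable and function names are made up for illustration, and the sketch assumes the addresses in the disassembly are relative to the XUL load address.

```python
# Values copied from the treeherder log excerpt quoted above.
RIP = 0x000000010E66F293   # instruction pointer of the crashed thread
XUL_BASE = 0x10E65F000     # start of XUL in the "Loaded modules" list
SYMBOL_DELTA = 0x53        # from "__llvm_profile_instrument_target + 0x53"


def module_offset(address: int, module_base: int) -> int:
    """Return the offset of an absolute address within its module."""
    return address - module_base


# Offset of the faulting instruction inside XUL.
crash_offset = module_offset(RIP, XUL_BASE)
# Offset of the first instruction of the function.
function_start = crash_offset - SYMBOL_DELTA

print(hex(crash_offset))    # 0x10293
print(hex(function_start))  # 0x10240
```

Under that assumption, the function starts at 0x10240 and the faulting instruction is the one listed at 0x10293.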

It looks like we are crashing while reading CurVNode->Next (for context, I believe 10290-102b4 is the while loop) because CurVNode is full of e5e5.

jemalloc shouldn't be poisoning the instrumentation control blocks, of course. Any chance the push might have had some memory unsafety where a bad pointer got passed to free() and jemalloc poisoned the wrong area? Otherwise, if it's a miscompile, it could be anywhere (code, jemalloc, instrumentation, etc.) and this will be terrible to debug.
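
For anyone triaging similar crashes, here is a minimal sketch of how a register or pointer value can be checked for that fill pattern. It assumes 0xe5 is the byte jemalloc writes over freed memory in this build, as the e5e5 values suggest; the helper name is hypothetical.

```python
# Assumption: 0xe5 is the byte the allocator writes over freed memory.
POISON_BYTE = 0xE5


def looks_poisoned(value: int, width: int = 8) -> bool:
    """Return True if every byte of a `width`-byte value equals POISON_BYTE."""
    return value.to_bytes(width, "little") == bytes([POISON_BYTE]) * width


print(looks_poisoned(0xE5E5E5E5E5E5E5E5))  # True
print(looks_poisoned(0x000000010E66F293))  # False
```

A pointer that passes this check was most likely loaded from memory that had already been freed, which is why the question of a stray free() comes up.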

jemalloc shouldn't be poisoning the instrumentation control blocks, of course. Any chance the push might have had some memory unsafety where a bad pointer got passed to free() and jemalloc poisoned the wrong area?

I cannot completely rule this out, but I think it's rather unlikely, given

  • the nature of the changes
  • all tests looking fine
  • it seems to be deterministically reproducible on OS X, and
  • the Linux generate-profile job is ok.
    Unfortunately, the Windows generate-profile didn't run because of Bug 1670712.

Otherwise, if it's a miscompile, it could be anywhere (code, jemalloc, instrumentation, etc.) and this will be terrible to debug.

It looks like the push is part of a large stack. Could you try narrowing it down to a specific changeset?

(In reply to :dmajor from comment #7)

It looks like the push is part of a large stack. Could you try narrowing it down to a specific changeset?

Ok, will try to do that.

Flags: needinfo?(sgiesecke)

Oh. I found that this is caused by disabling optimizations on two directories: https://hg.mozilla.org/try/rev/2100b35947620404ea1f2cd78cf8641079cf977f (this wasn't intended for landing, of course). Maybe that's expected? It's a bit annoying, but definitely low priority then.

Flags: needinfo?(sgiesecke)

I vaguely recall that there are some footguns around -O0 but I don't recall offhand if they are related here. If this was on a more convenient OS then I'd like to look deeper just for curiosity's sake, but in practice if it's not blocking then I probably won't be able to spend time on it.

Fine with me, I will close this as WORKSFORME then. Thanks for looking into it so quickly, and sorry for not noticing this part of the stack earlier.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME