Closed Bug 1037772 Opened 5 years ago Closed 5 years ago

Intermittent toauto-topN-heightN-bottomA.html | application crashed [@ js::jit::AssemblerX86Shared::ToggleToCmp(js::jit::CodeLocationLabel)] after Assertion failure: *ptr == 0xE9

Categories

(Core :: JavaScript: GC, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla33
Tracking Status
firefox31 --- unaffected
firefox32 --- unaffected
firefox33 --- fixed
firefox-esr24 --- unaffected

People

(Reporter: KWierso, Assigned: ehoogeveen)

References

Details

(Keywords: crash, intermittent-failure)

Attachments

(1 file, 2 obsolete files)

https://tbpl.mozilla.org/php/getParsedLog.php?id=43661233&tree=Mozilla-Inbound
Ubuntu VM 12.04 x64 mozilla-inbound debug test reftest on 2014-07-11 18:35:13 PDT for push 7c366c305105

slave: tst-linux64-spot-1047



19:25:33     INFO -  REFTEST TEST-START | file:///builds/slave/test/build/tests/reftest/tests/layout/reftests/position-dynamic-changes/vertical/toauto-topN-heightN-bottomA.html
19:25:33     INFO -  REFTEST TEST-LOAD | file:///builds/slave/test/build/tests/reftest/tests/layout/reftests/position-dynamic-changes/vertical/toauto-topN-heightN-bottomA.html | 7783 / 11072 (70%)
19:25:33     INFO -  ++DOMWINDOW == 84 (0x7f6c6aff4400) [pid = 1719] [serial = 20681] [outer = 0x7f6c82abfc00]
19:25:33     INFO -  Assertion failure: *ptr == 0xE9, at /builds/slave/m-in-l64-d-0000000000000000000/build/js/src/jit/shared/Assembler-x86-shared.h:1694
19:25:33     INFO -  TEST-INFO | Main app process: killed by SIGSEGV
19:25:33  WARNING -  TEST-UNEXPECTED-FAIL | file:///builds/slave/test/build/tests/reftest/tests/layout/reftests/position-dynamic-changes/vertical/toauto-topN-heightN-bottomA.html | Exited with code 11 during test run
19:25:33     INFO -  INFO | automation.py | Application ran for: 0:48:01.964953
19:25:33     INFO -  INFO | zombiecheck | Reading PID log: /tmp/tmp04Nn0Ypidlog
19:25:33     INFO -  ==> process 1719 launched child process 1775
19:25:33     INFO -  ==> process 1719 launched child process 1827
19:25:33     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 1775
19:25:33     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 1827
19:25:41  WARNING -  PROCESS-CRASH | file:///builds/slave/test/build/tests/reftest/tests/layout/reftests/position-dynamic-changes/vertical/toauto-topN-heightN-bottomA.html | application crashed [@ js::jit::AssemblerX86Shared::ToggleToCmp(js::jit::CodeLocationLabel)]
19:25:41     INFO -  Crash dump filename: /tmp/tmp5wQ6ep.mozrunner/minidumps/4b50fbfb-dad9-ddec-53137f53-579fb887.dmp
19:25:41     INFO -  Operating system: Linux
19:25:41     INFO -                    0.0.0 Linux 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64
19:25:41     INFO -  CPU: amd64
19:25:41     INFO -       family 6 model 44 stepping 2
19:25:41     INFO -       1 CPU
19:25:41     INFO -  Crash reason:  SIGSEGV
19:25:41     INFO -  Crash address: 0x0
19:25:41     INFO -  Thread 0 (crashed)
19:25:41     INFO -   0  libxul.so!js::jit::AssemblerX86Shared::ToggleToCmp(js::jit::CodeLocationLabel) [Assembler-x86-shared.h:7c366c305105 : 1694 + 0x5]
19:25:41     INFO -      rbx = 0x00007f6c90b0cec6   r12 = 0x00007f6c90b0cece
19:25:41     INFO -      r13 = 0x00007f6c7ee8b220   r14 = 0x0000000000000001
19:25:41     INFO -      r15 = 0x00007f6c72b52128   rip = 0x00007f6ca69b5b70
19:25:41     INFO -      rsp = 0x00007fff885a7680   rbp = 0x00007fff885a7680
19:25:41     INFO -      Found by: given as instruction pointer in context
19:25:41     INFO -   1  libxul.so!js::jit::JitCode::togglePreBarriers(bool) [CompactBuffer.h:7c366c305105 : 57 + 0xe]
19:25:41     INFO -      rbx = 0x00007f6c90b0cec6   r12 = 0x00007f6c90b0cece
19:25:41     INFO -      r13 = 0x00007f6c7ee8b220   r14 = 0x0000000000000001
19:25:41     INFO -      r15 = 0x00007f6c72b52128   rip = 0x00007f6ca69de04f
19:25:41     INFO -      rsp = 0x00007fff885a7690   rbp = 0x00007fff885a76c0
19:25:41     INFO -      Found by: call frame info
19:25:41     INFO -   2  libxul.so!js::jit::ToggleBarriers(JS::Zone*, bool) [BaselineJIT.h:7c366c305105 : 285 + 0xa]
19:25:41     INFO -      rbx = 0x00007f6c82631800   r12 = 0x0000000000000001
19:25:41     INFO -      r13 = 0x00007fff885a7700   r14 = 0x0000000000000001
19:25:41     INFO -      r15 = 0x00007f6c72b52128   rip = 0x00007f6ca6a25dfb
19:25:41     INFO -      rsp = 0x00007fff885a76d0   rbp = 0x00007fff885a7770
19:25:41     INFO -      Found by: call frame info
19:25:41     INFO -   3  libxul.so!JS::Zone::setNeedsBarrier(bool, JS::Zone::ShouldUpdateIon) [Zone.cpp:7c366c305105 : 75 + 0x8]
19:25:41     INFO -      rbx = 0x00007f6c82631800   r12 = 0x0000000000000001
19:25:41     INFO -      r13 = 0x0000000000000001   r14 = 0x00007f6c8ca69af8
19:25:41     INFO -      r15 = 0x00007f6c82631800   rip = 0x00007f6ca68e4306
19:25:41     INFO -      rsp = 0x00007fff885a7780   rbp = 0x00007fff885a77a0
19:25:41     INFO -      Found by: call frame info
19:25:41     INFO -   4  libxul.so!AutoGCSlice::~AutoGCSlice [jsgc.cpp:7c366c305105 : 4799 + 0x4]
19:25:41     INFO -      rbx = 0x00007f6c8ca69af0   r12 = 0x00007fff885a7810
19:25:41     INFO -      r13 = 0x0000000000000001   r14 = 0x00007f6c8ca69af8
19:25:41     INFO -      r15 = 0x00007f6c82631800   rip = 0x00007f6ca6bc8103
19:25:41     INFO -      rsp = 0x00007fff885a77b0   rbp = 0x00007fff885a77e0
19:25:41     INFO -      Found by: call frame info
See Also: → 1037750
The earliest instance I see of this on inbound is on https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=reftest&rev=7c366c305105

I'm retriggering a lot of debug reftests on the previous push to see if any show up there, but it's looking like it's either bug 1017165 or bug 1034910 that introduced this.

Ccing patch authors from those bugs for investigation.
Yeah, not seeing any of this failure before bug 1034910 and bug 1017165 landed.
Depends on: 1034910, 1017165
This is almost certainly bug 1017165, as bug 1034910 is very simple.
No longer depends on: 1034910
Yeah, this was probably triggered by my patch, which might have changed GC timings (and which objects end up in the same arenas). I haven't been able to reproduce this at all so far locally, though :( Changing the component to Javascript: GC given what's on the stack. Since I'll probably need help from the exact rooting / GGC experts (terrence, jonco, sfink) on this, and they're likely out for the weekend, feel free to back me out if this is happening with any frequency.
Component: JavaScript Engine: JIT → JavaScript: GC
Okay, I can reproduce these crashes running the following jit-tests with JS_GC_ZEAL=10 in a debug-opt build:

self-test\assertDeepEq.js
debug\Environment-variables.js

Neither of these tests crash without the patch from bug 1017165 applied. I'm compiling an unoptimized debug build now to see exactly where things first fail.
So very roughly, this seems to be a situation where we construct an AutoGCSlice while zone->isGCMarking() and call zone->setNeedsBarrier(false, Zone::DontUpdateIon), then destruct the AutoGCSlice with zone->isGCMarking() still, and call zone->setNeedsBarrier(true, Zone::UpdateIon), without updating the Ion code in between.

Calling zone->setNeedsBarrier(false, Zone::UpdateIon) in the constructor instead fixes the crashes, but we don't want to do that because it's slow (according to the comment). So either the assert in AssemblerX86Shared::ToggleToCmp() is invalid, or something is failing to update the ion code before we return. I'll try to figure out what's supposed to be happening here.
Actually, that seems to be too simplistic: Zone::setNeedsBarrier only calls jit::ToggleBarriers if the requested state doesn't match Zone::ionUsingBarriers_. So this must be out-of-date with the actual scripts in the zone, somehow.
Okay, I think I understand what's going on. ArenaIter (used by ZoneCellIterUnderGC) iterates over all arenas in a zone, including those in zone->allocator.arenas.arenaListsToSweep. During incremental GC, we used to finalize an arena from arenaListsToSweep, and immediately insert it into arenaLists - thus allowing ArenaIter to walk the same list each time.

But after my patch, we build up a SortedArenaList first, then merge the swept arenas back into the arenaLists again at the end. This means that swept arenas may be temporarily unavailable to ArenaIter between slices, stopping it from updating their pre-barrier state!

The most reasonable solution that I can think of here is to:
1) Create a RuntimeSweptCellIter that walks the GCRuntime's incrementalSweepList.
2) Use it inside AutoGCSlice::AutoGCSlice and AutoGCSlice::~AutoGCSlice.
This isn't what I said above, but I think it's a more robust approach. This adds members to ArenaLists to track swept arenas and makes ArenaIter walk them as well. This should be pretty much performance neutral unless we go over budget during an incremental slice, but making a SortedArenaList contiguous isn't free (worst case, it involves updating some 256 pointers).

Still, this approach ensures that regardless of *why* we iterate over the arenas during a GC (and I haven't audited all the places where we do this), we'll see the whole set like we did before bug 1017165. I also don't *really* expect it to have much effect on performance, considering we already iterate over all the cells in each arena at the end of each incremental slice. It might have a noticeable impact if the ArenaLists were tiny, but in that case we shouldn't go over budget anyway, so we wouldn't take this path.
Assignee: nobody → emanuel.hoogeveen
Status: NEW → ASSIGNED
Attachment #8454968 - Flags: review?(wmccloskey)
Reasonably constrained try run: https://tbpl.mozilla.org/?tree=Try&rev=10fb6ca9c6a5 (though this was very intermittent, so this is more to see if I broke anything)
The try push actually shows the same failure as before, and with the same stack no less. Going to have to see if I can reproduce it on that reftest I guess.
I still haven't been able to reproduce this failure locally (after trying various things), so I spun up a try build with some additional checks: https://tbpl.mozilla.org/?tree=Try&rev=6d6647941ba6 (they don't compile on ARM, but that's okay)

So far it looks like they shifted the timing to fail on a crashtest instead - I've done some retriggers to see if it still reproduces on anything else, and I'll see if I have more luck reproducing locally using the crashtests.
One interesting detail: in the one crashtest failure so far, the checks I added moved the failure to the *constructor* of AutoGCSlice (assuming the line number in the stack is to be believed), so it looks like things became out-of-sync outside of GC (between slices?).

Could it be that code is generated with barriers at a time when barriers are disabled?
Note that we have Zone::needsBarrier_ and Zone::ionUsingBarriers_. Nothing checks the latter outside Zone::setNeedsBarrier(), so maybe we're generating code (or setting barriers on that code) based on Zone::needsBarrier_ when we should be checking Zone::ionUsingBarriers_? I have to run right now, but I'll see if I can answer that question in a bit.
Attachment #8454968 - Attachment description: Let ArenaLists track swept arenas between slices and extend ArenaIter to walk incrementally swept arenas as well. → Part 1: Let ArenaLists track swept arenas between slices and extend ArenaIter to walk incrementally swept arenas as well.
I think this is the culprit. There are other uses of Zone::needsBarrier_ (including one that gets baked into JITcode [1]) that might need to be audited (by someone with more knowledge than I), but I think this one is probably responsible for the assertion failures we're seeing here.

Try run: https://tbpl.mozilla.org/?tree=Try&rev=75317be18244

[1] http://hg.mozilla.org/integration/mozilla-inbound/annotate/67c13008762f/js/src/jit/IonMacroAssembler.h#l640
Attachment #8455423 - Flags: review?(jdemooij)
Jan pointed out on IRC that BaselineCompiler::compile() isn't the only place making these checks. This version changes all the places in js/src/jit that call Zone::needsBarrier() and then call some function to set the prebarriers. There are other uses which do tracing (e.g. [1]), and I left these alone.

Try (cancelled the other one): https://tbpl.mozilla.org/?tree=Try&rev=886c290ab2e5

[1] http://hg.mozilla.org/integration/mozilla-inbound/annotate/ab2b0fb5cfff/js/src/jit/BaselineIC.cpp#l521
Attachment #8455423 - Attachment is obsolete: true
Attachment #8455423 - Flags: review?(jdemooij)
Attachment #8455444 - Flags: review?(jdemooij)
Comment on attachment 8455444 [details] [diff] [review]
Part 2 v2: Make the JIT barrier checks check the right thing.

Review of attachment 8455444 [details] [diff] [review]:
-----------------------------------------------------------------

ionUsingBarriers_ != needsBarriers_ only when we're in an AutoGCSlice scope, right? In that case we shouldn't be running/compiling JS code.

It'd be great if we could assert ionUsingBarriers_ == needsBarriers_ in some places though, to verify this. Like the AutoGCSlice constructor and destructor..
Attachment #8455444 - Flags: review?(jdemooij)
Bug 1017165 backed out.
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
Thanks Ryan!

(In reply to Jan de Mooij [:jandem] from comment #20)
> ionUsingBarriers_ != needsBarriers_ only when we're in an AutoGCSlice scope,
> right? In that case we shouldn't be running/compiling JS code.

I pushed some assertions to try to verify this: https://tbpl.mozilla.org/?tree=Try&rev=2f5d6cc07633

> It'd be great if we could assert ionUsingBarriers_ == needsBarriers_ in some
> places though, to verify this. Like the AutoGCSlice constructor and
> destructor..

While there *are* some other callers to Zone::setNeedsBarrier(), the only caller that passes in Zone::DontUpdateIon is the AutoGCSlice constructor, so this is pretty much true by definition right now.

That doesn't necessarily mean that ionUsingBarriers_ is always accurate right now though, as the try run from comment #14 showed (the debugging patch I pushed in that try run added some checks for this).

I asked RyanVM to back out bug 1017165 for now, since there are still some questions to answer here and I'd rather take the pressure off.
Attachment #8454968 - Attachment is obsolete: true
Attachment #8454968 - Flags: review?(wmccloskey)
You need to log in before you can comment on or make changes to this bug.