Closed Bug 1037772 Opened 5 years ago Closed 5 years ago
N-height N-bottom A .html | application crashed [@ js::jit::Assembler X86Shared::Toggle To Cmp(js::jit::Code Location Label)] after Assertion failure: *ptr == 0x E9
https://tbpl.mozilla.org/php/getParsedLog.php?id=43661233&tree=Mozilla-Inbound Ubuntu VM 12.04 x64 mozilla-inbound debug test reftest on 2014-07-11 18:35:13 PDT for push 7c366c305105 slave: tst-linux64-spot-1047 19:25:33 INFO - REFTEST TEST-START | file:///builds/slave/test/build/tests/reftest/tests/layout/reftests/position-dynamic-changes/vertical/toauto-topN-heightN-bottomA.html 19:25:33 INFO - REFTEST TEST-LOAD | file:///builds/slave/test/build/tests/reftest/tests/layout/reftests/position-dynamic-changes/vertical/toauto-topN-heightN-bottomA.html | 7783 / 11072 (70%) 19:25:33 INFO - ++DOMWINDOW == 84 (0x7f6c6aff4400) [pid = 1719] [serial = 20681] [outer = 0x7f6c82abfc00] 19:25:33 INFO - Assertion failure: *ptr == 0xE9, at /builds/slave/m-in-l64-d-0000000000000000000/build/js/src/jit/shared/Assembler-x86-shared.h:1694 19:25:33 INFO - TEST-INFO | Main app process: killed by SIGSEGV 19:25:33 WARNING - TEST-UNEXPECTED-FAIL | file:///builds/slave/test/build/tests/reftest/tests/layout/reftests/position-dynamic-changes/vertical/toauto-topN-heightN-bottomA.html | Exited with code 11 during test run 19:25:33 INFO - INFO | automation.py | Application ran for: 0:48:01.964953 19:25:33 INFO - INFO | zombiecheck | Reading PID log: /tmp/tmp04Nn0Ypidlog 19:25:33 INFO - ==> process 1719 launched child process 1775 19:25:33 INFO - ==> process 1719 launched child process 1827 19:25:33 INFO - INFO | zombiecheck | Checking for orphan process with PID: 1775 19:25:33 INFO - INFO | zombiecheck | Checking for orphan process with PID: 1827 19:25:41 WARNING - PROCESS-CRASH | file:///builds/slave/test/build/tests/reftest/tests/layout/reftests/position-dynamic-changes/vertical/toauto-topN-heightN-bottomA.html | application crashed [@ js::jit::AssemblerX86Shared::ToggleToCmp(js::jit::CodeLocationLabel)] 19:25:41 INFO - Crash dump filename: /tmp/tmp5wQ6ep.mozrunner/minidumps/4b50fbfb-dad9-ddec-53137f53-579fb887.dmp 19:25:41 INFO - Operating system: Linux 19:25:41 INFO - 0.0.0 Linux 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 19:25:41 INFO - CPU: amd64 19:25:41 INFO - family 6 model 44 stepping 2 19:25:41 INFO - 1 CPU 19:25:41 INFO - Crash reason: SIGSEGV 19:25:41 INFO - Crash address: 0x0 19:25:41 INFO - Thread 0 (crashed) 19:25:41 INFO - 0 libxul.so!js::jit::AssemblerX86Shared::ToggleToCmp(js::jit::CodeLocationLabel) [Assembler-x86-shared.h:7c366c305105 : 1694 + 0x5] 19:25:41 INFO - rbx = 0x00007f6c90b0cec6 r12 = 0x00007f6c90b0cece 19:25:41 INFO - r13 = 0x00007f6c7ee8b220 r14 = 0x0000000000000001 19:25:41 INFO - r15 = 0x00007f6c72b52128 rip = 0x00007f6ca69b5b70 19:25:41 INFO - rsp = 0x00007fff885a7680 rbp = 0x00007fff885a7680 19:25:41 INFO - Found by: given as instruction pointer in context 19:25:41 INFO - 1 libxul.so!js::jit::JitCode::togglePreBarriers(bool) [CompactBuffer.h:7c366c305105 : 57 + 0xe] 19:25:41 INFO - rbx = 0x00007f6c90b0cec6 r12 = 0x00007f6c90b0cece 19:25:41 INFO - r13 = 0x00007f6c7ee8b220 r14 = 0x0000000000000001 19:25:41 INFO - r15 = 0x00007f6c72b52128 rip = 0x00007f6ca69de04f 19:25:41 INFO - rsp = 0x00007fff885a7690 rbp = 0x00007fff885a76c0 19:25:41 INFO - Found by: call frame info 19:25:41 INFO - 2 libxul.so!js::jit::ToggleBarriers(JS::Zone*, bool) [BaselineJIT.h:7c366c305105 : 285 + 0xa] 19:25:41 INFO - rbx = 0x00007f6c82631800 r12 = 0x0000000000000001 19:25:41 INFO - r13 = 0x00007fff885a7700 r14 = 0x0000000000000001 19:25:41 INFO - r15 = 0x00007f6c72b52128 rip = 0x00007f6ca6a25dfb 19:25:41 INFO - rsp = 0x00007fff885a76d0 rbp = 0x00007fff885a7770 19:25:41 INFO - Found by: call frame info 19:25:41 INFO - 3 libxul.so!JS::Zone::setNeedsBarrier(bool, JS::Zone::ShouldUpdateIon) [Zone.cpp:7c366c305105 : 75 + 0x8] 19:25:41 INFO - rbx = 0x00007f6c82631800 r12 = 0x0000000000000001 19:25:41 INFO - r13 = 0x0000000000000001 r14 = 0x00007f6c8ca69af8 19:25:41 INFO - r15 = 0x00007f6c82631800 rip = 0x00007f6ca68e4306 19:25:41 INFO - rsp = 0x00007fff885a7780 rbp = 0x00007fff885a77a0 19:25:41 INFO - Found by: call frame info 19:25:41 INFO - 4 libxul.so!AutoGCSlice::~AutoGCSlice [jsgc.cpp:7c366c305105 : 4799 + 0x4] 19:25:41 INFO - rbx = 0x00007f6c8ca69af0 r12 = 0x00007fff885a7810 19:25:41 INFO - r13 = 0x0000000000000001 r14 = 0x00007f6c8ca69af8 19:25:41 INFO - r15 = 0x00007f6c82631800 rip = 0x00007f6ca6bc8103 19:25:41 INFO - rsp = 0x00007fff885a77b0 rbp = 0x00007fff885a77e0 19:25:41 INFO - Found by: call frame info
The earliest instance I see of this on inbound is on https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=reftest&rev=7c366c305105 I'm retriggering a lot of debug reftests on the previous push to see if any show up there, but it's looking like it's either bug 1017165 or bug 1034910 that introduced this. Ccing patch authors from those bugs for investigation.
Yeah, not seeing any of this failure before bug 1034910 and bug 1017165 landed.
Okay, I can reproduce these crashes running the following jit-tests with JS_GC_ZEAL=10 in a debug-opt build: self-test\assertDeepEq.js debug\Environment-variables.js Neither of these tests crash without the patch from bug 1017165 applied. I'm compiling an unoptimized debug build now to see exactly where things first fail.
So very roughly, this seems to be a situation where we construct an AutoGCSlice while zone->isGCMarking() and call zone->setNeedsBarrier(false, Zone::DontUpdateIon), then destruct the AutoGCSlice with zone->isGCMarking() still, and call zone->setNeedsBarrier(true, Zone::UpdateIon), without updating the Ion code in between. Calling zone->setNeedsBarrier(false, Zone::UpdateIon) in the constructor instead fixes the crashes, but we don't want to do that because it's slow (according to the comment). So either the assert in AssemblerX86Shared::ToggleToCmp() is invalid, or something is failing to update the ion code before we return. I'll try to figure out what's supposed to be happening here.
Actually, that seems to be too simplistic: Zone::setNeedsBarrier only calls jit::ToggleBarriers if the requested state doesn't match Zone::ionUsingBarriers_. So this must be out-of-date with the actual scripts in the zone, somehow.
Okay, I think I understand what's going on. ArenaIter (used by ZoneCellIterUnderGC) iterates over all arenas in a zone, including those in zone->allocator.arenas.arenaListsToSweep. During incremental GC, we used to finalize an arena from arenaListsToSweep, and immediately insert it into arenaLists - thus allowing ArenaIter to walk the same list each time. But after my patch, we build up a SortedArenaList first, then merge the swept arenas back into the arenaLists again at the end. This means that swept arenas may be temporarily unavailable to ArenaIter between slices, stopping it from updating their pre-barrier state! The most reasonable solution that I can think of here is to: 1) Create a RuntimeSweptCellIter that walks the GCRuntime's incrementalSweepList. 2) Use it inside AutoGCSlice::AutoGCSlice and AutoGCSlice::~AutoGCSlice.
This isn't what I said above, but I think it's a more robust approach. This adds members to ArenaLists to track swept arenas and makes ArenaIter walk them as well. This should be pretty much performance neutral unless we go over budget during an incremental slice, but making a SortedArenaList contiguous isn't free (worst case, it involves updating some 256 pointers). Still, this approach ensures that regardless of *why* we iterate over the arenas during a GC (and I haven't audited all the places where we do this), we'll see the whole set like we did before bug 1017165. I also don't *really* expect it to have much effect on performance, considering we already iterate over all the cells in each arena at the end of each incremental slice. It might have a noticeable impact if the ArenaLists were tiny, but in that case we shouldn't go over budget anyway, so we wouldn't take this path.
Assignee: nobody → emanuel.hoogeveen
Status: NEW → ASSIGNED
Attachment #8454968 - Flags: review?(wmccloskey)
Reasonably constrained try run: https://tbpl.mozilla.org/?tree=Try&rev=10fb6ca9c6a5 (though this was very intermittent, so this is more to see if I broke anything)
The try push actually shows the same failure as before, and with the same stack no less. Going to have to see if I can reproduce it on that reftest I guess.
I still haven't been able to reproduce this failure locally (after trying various things), so I spun up a try build with some additional checks: https://tbpl.mozilla.org/?tree=Try&rev=6d6647941ba6 (they don't compile on ARM, but that's okay) So far it looks like they shifted the timing to fail on a crashtest instead - I've done some retriggers to see if it still reproduces on anything else, and I'll see if I have more luck reproducing locally using the crashtests.
One interesting detail: in the one crashtest failure so far, the checks I added moved the failure to the *constructor* of AutoGCSlice (assuming the line number in the stack is to be believed), so it looks like things became out-of-sync outside of GC (between slices?). Could it be that code is generated with barriers at a time when barriers are disabled?
Note that we have Zone::needsBarrier_ and Zone::ionUsingBarriers_. Nothing checks the latter outside Zone::setNeedsBarrier(), so maybe we're generating code (or setting barriers on that code) based on Zone::needsBarrier_ when we should be checking Zone::ionUsingBarriers_? I have to run right now, but I'll see if I can answer that question in a bit.
Attachment #8454968 - Attachment description: Let ArenaLists track swept arenas between slices and extend ArenaIter to walk incrementally swept arenas as well. → Part 1: Let ArenaLists track swept arenas between slices and extend ArenaIter to walk incrementally swept arenas as well.
I think this is the culprit. There are other uses of Zone::needsBarrier_ (including one that gets baked into JITcode ) that might need to be audited (by someone with more knowledge than I), but I think this one is probably responsible for the assertion failures we're seeing here. Try run: https://tbpl.mozilla.org/?tree=Try&rev=75317be18244  http://hg.mozilla.org/integration/mozilla-inbound/annotate/67c13008762f/js/src/jit/IonMacroAssembler.h#l640
Jan pointed out on IRC that BaselineCompiler::compile() isn't the only place making these checks. This version changes all the places in js/src/jit that call Zone::needsBarrier() and then call some function to set the prebarriers. There are other uses which do tracing (e.g. ), and I left these alone. Try (cancelled the other one): https://tbpl.mozilla.org/?tree=Try&rev=886c290ab2e5  http://hg.mozilla.org/integration/mozilla-inbound/annotate/ab2b0fb5cfff/js/src/jit/BaselineIC.cpp#l521
Comment on attachment 8455444 [details] [diff] [review] Part 2 v2: Make the JIT barrier checks check the right thing. Review of attachment 8455444 [details] [diff] [review]: ----------------------------------------------------------------- ionUsingBarriers_ != needsBarriers_ only when we're in an AutoGCSlice scope, right? In that case we shouldn't be running/compiling JS code. It'd be great if we could assert ionUsingBarriers_ == needsBarriers_ in some places though, to verify this. Like the AutoGCSlice constructor and destructor..
Bug 1017165 backed out.
Thanks Ryan! (In reply to Jan de Mooij [:jandem] from comment #20) > ionUsingBarriers_ != needsBarriers_ only when we're in an AutoGCSlice scope, > right? In that case we shouldn't be running/compiling JS code. I pushed some assertions to try to verify this: https://tbpl.mozilla.org/?tree=Try&rev=2f5d6cc07633 > It'd be great if we could assert ionUsingBarriers_ == needsBarriers_ in some > places though, to verify this. Like the AutoGCSlice constructor and > destructor.. While there *are* some other callers to Zone::setNeedsBarrier(), the only caller that passes in Zone::DontUpdateIon is the AutoGCSlice constructor, so this is pretty much true by definition right now. That doesn't necessarily mean that ionUsingBarriers_ is always accurate right now though, as the try run from comment #14 showed (the debugging patch I pushed in that try run added some checks for this). I asked RyanVM to back out bug 1017165 for now, since there are still some questions to answer here and I'd rather take the pressure off.
You need to log in before you can comment on or make changes to this bug.