Closed Bug 1430743 Opened 2 years ago Closed 2 years ago
Arm64 / aarch64 JS baseline and wasm baseline jit-test debug/ failures
On SoftIron Overdrive 1000 (4 x Cortex-57A) compiled with gcc6, run with --baseline-eager: --enable-debug --disable-optimize: FAILURES: debug/Frame-onPop-10.js debug/Frame-onStep-03.js debug/Frame-onStep-06.js debug/Frame-onStep-07.js debug/Frame-onStep-11.js debug/Frame-onStep-12.js debug/Frame-onStep-14.js debug/Frame-onStep-15.js debug/Frame-onStep-17.js debug/Frame-onStep-18.js debug/Frame-onStep-lines-01.js debug/Frame-onStep-resumption-05.js debug/Frame-this-10.js debug/Frame-this-12.js --baseline-eager --ion-eager debug/RematerializedFrame-retval.js debug/Script-getLineOffsets-03.js debug/Script-getLineOffsets-05.js debug/breakpoint-08.js debug/breakpoint-11.js debug/class-05.js debug/noExecute-04.js debug/noExecute-05.js debug/noExecute-06.js debug/noExecute-07.js TIMEOUTS: basic/bug688939.js basic/bug832203.js coverage/bug1214548.js debug/bug-1248162.js debug/bug-1260725.js debug/bug1251919.js debug/bug1254123.js debug/bug1370905.js gc/bug-1143706.js gc/bug-1215678.js --ion-pgo=on gc/bug-1226896.js gc/bug-1292564.js gc/bug-1303015.js gc/bug-1384047.js gc/oomInNewGlobal.js --disable-debug --enable-optimize (note one GC failure here that was not in the debug run): FAILURES: debug/Frame-onPop-10.js debug/Frame-onStep-06.js debug/Frame-onStep-12.js debug/Frame-onStep-14.js debug/Frame-onStep-16.js debug/Frame-onStep-17.js debug/Frame-onStep-18.js debug/Frame-onStep-lines-01.js debug/Frame-onStep-resumption-01.js debug/Frame-onStep-resumption-02.js debug/Frame-this-10.js debug/Frame-this-12.js debug/Script-getLineOffsets-03.js debug/Script-getLineOffsets-05.js debug/breakpoint-06.js debug/breakpoint-08.js debug/breakpoint-resume-01.js debug/Script-getLineOffsets-06.js debug/bug1368736.js debug/class-05.js debug/noExecute-01.js debug/noExecute-02.js debug/noExecute-04.js debug/noExecute-07.js debug/noExecute-06.js debug/noExecute-05.js gc/bug-1155455.js
These failures only occur on hardware, btw.
(In reply to Lars T Hansen [:lth] from comment #1) > These failures only occur on hardware, btw. I guess this lead to 2 questions: 1. Why is it different? 2. Can we fix the simulator? (P2, as these failures seems to be restricted to debug test cases)
I'm seeing similar failures now for the wasm debug tests, on hardware, both with and without the baseline jit, so it's not clear the baseline jit is at fault here. I should re-run the tests above with --no-baseline to see if any of them reproduce. In the case of the wasm tests I'm seeing an onStep handler that's being called a variable number of times; the tests always fail when the number of calls is low (2) and always succeeds when it is higher, but still variable. Since the wasm tests use a different harness I'm not sure how much to read into that, but it's a lead, maybe. Will follow up more next week.
The failures above are definitely a JS baseline jit problem, there are no failures in a debug build with --no-baseline, but many failures with --baseline-eager.
We're not flushing the icache after patching for debugging. ExecutableAllocator::makeExecutable() must flush the icache, or its caller must - in this case, ~AutoWritableJitCode(). Not sure what's best yet. This is possibly OS-dependent. On Linux, which I'm running, this problem appears to have been promoted to a feature. A StackOverflow thread  observes that some libc variants perform a flush after an mprotect on ARM, and a thread  on the linux-arm-kernel list suggests very strongly that icache invalidation should not be done by mprotect except to ensure cache coherence across CPUs; this thread discusses ARM64 specifically and the conclusion appears to be that icache flushing is the application's responsibility.  https://stackoverflow.com/questions/2777725/does-mprotect-flush-the-instruction-cache-on-arm-linux/4991147  Rooted roughly at https://www.spinics.net/lists/arm-kernel/msg506068.html This will affect us on all platforms other than x86, so we should look for a general solution here, and try to figure out what different OSs need. Presumably Android might differ from stock Linux; iOS and Windows-on-ARM64 will likely do their own thing... Until now we've probably not had issues with debugging failing on ARM so it would have gone unnoticed for good reason. But the times they are a-changing.
Summary: Arm64 / aarch64 JS baseline jit-test failures → AutoWritableJitCode must flush icache when reprotecting (was: Arm64 / aarch64 JS baseline jit-test failures)
Not just debug related anymore.
Assignee: nobody → lhansen
Status: NEW → ASSIGNED
Priority: P2 → P1
If we do this in AutoWritableJitCode, we should check if we can remove some explicit icache flushes elsewhere.
More worms under this rock: Turns out that the JS debug tests fail for a slightly different reason than the Wasm debug tests. In the case of the JS baseline, it looks like missing cache flushing (it uses AutoWritableJitCode which does not flush). In WasmDebug.cpp, however, there's an additional AutoFlushICache object along with each AutoWritableJitCode. But in two cases, that AutoFlushICache object is not given a range to flush, so it flushes nothing. That's kind of a footgun, is it not? I understand why it is that way, but even so it seems brittle.
Final update, now that I've read more code: The debug-test failures are a consequence of a missing implementation of AutoFlushICache::flush() on ARM64 (evil ifdef, see bug 1439333) and failure to use that function in the ARM64 MacroAssembler and Assembler. When those problems are fixed, debug tests pass for both JS and Wasm. The patches that fix this will appear on bug 1439333, bug 1436953 (updated patch to be reviewed), and bug 1313336 (amended already-reviewed patch). From that point of view, the present bug can be closed. I'm a little concerned that the name AutoWritableJitCode implies icache flushing but since I don't have anything actionable I don't think I'll follow up further on that now. Leaving this open until the dependencies have landed.
Fixed because all dependencies are fixed.
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.