Closed Bug 1442583 Opened 7 years ago Closed 7 years ago

ARM64: Must initialize icache flushing machinery

Categories

(Core :: JavaScript Engine: JIT, defect, P1)

ARM64
Linux
defect

Tracking

()

RESOLVED FIXED
mozilla61
Tracking Status
firefox61 --- fixed

People

(Reporter: lth, Assigned: lth)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Compiling tanks.wasm in a JS release shell on SoftIron Overdrive 1000 / OpenSUSE with --no-threads. Fully 52% of the run time - and apparently all of it in a single sequential section following the nicely parallel compilation, if my eyeballing of the htop activity is right - is reported by perf as being spent in vixl::CPU::EnsureIAndDCacheCoherency() which is called by our cacheFlush primitive. perf thinks the culprits are inline asm dc and ic instructions.
The flushing code has two loops which can't be merged: one for pushing dcache contents to memory, the other for invalidating the icache for the appropriate range. These loops operate on variable-sized lines whose sizes are detected during setup or default to 1. It appears that the SetUp() function that determines the actual line size is never called, so that should be fixed first. If we're lucky we get a 64x speedup, which would be nice. There are clearly some simple improvements possible: - if we know that the memory containing new code is not in the icache (it could be freshly mapped) then the second icache syncing loop is not necessary - conceivably it might be faster to unmap and remap than to invalidate, but let's reserve judgement on that - we can unroll both loops to avoid loop overhead, which will matter more once the line sizes are correct - we can generate code for these functions so that line sizes etc are hardcoded, though with unrolling it probably does not matter much
Just to make sure I understand: you're seeing 52% of wasm baseline compile time? How many threads? Another thing to check: are we calling vixl::CPU::EnsureIAndDCache only once here? In the past, sometimes spurious icache flushes have crept in b/c it's easy to sprinkle them about or forget to turn them off.
52% of end-to-end time, the program just reads tanks.wasm (a few ms) and does new WebAssembly.Module() (about 4500 ms). Compile time measured manually inside the JIT is something like 30% of total runtime with --no-threads. Without --no-threads we do get four threads, so far as I can tell, because compile time measured manually speeds up four-fold :) There's just some massive sequential phase at the end, I don't have all data yet. Good point about excessive icache flushing calls, I'll check for that once I've fixed the obvious initialization bug.
Fixing this by calling the SetUp() function first, the iflush time drops down to about 0.5% of the runtime or so, so we can ignore that problem for a while.
Just to wrap this up, we're now seeing a reasonably fast parallel compile (400ms) followed by a mysterious sequential phase (another 500ms); this latter one turns out to be stubs generation. So presumably the base mozilla-inbound code I'm running here has an intermediate state of the lazy stubs work. It also appears that (as I feared) the vixl assembler greatly dominates the overall profile, but I'll reserve final judgement until the stubs generation is out of the picture and we can more clearly see baseline compilation only.
With the lazy stubs code fully in place, the long sequential tail disappears, so we're down to a nice parallel compilation phase of about 400ms for Tanks on this machine, the speedup over --no-threads is about 3.7, which is pretty good for a four-core machine. I'll address the vixl issues in another bug. Patch coming up here for the icache flush initialization.
Blocks: 1443082
Summary: ARM64: 50% of runtime spent in icache flushing → ARM64: Must initialize icache flushing machinery
I opted to put the initialization code under the JS_CODEGEN_ARM64 define since that just works. The obvious alternative is __aarch64__. The non-obvious alternative -- but the one that follows the code in ExecutableAllocator.h -- is JS_CODEGEN_ARM64 && !JS_SIMULATOR_ARM64.
Attachment #8956107 - Flags: review?(sstangl)
Priority: -- → P1
Comment on attachment 8956107 [details] [diff] [review] bug1442583-setup-arm64-icache-flush.patch Review of attachment 8956107 [details] [diff] [review]: ----------------------------------------------------------------- JS_CODEGEN_ARM64 is correct. That's the same include guard used around the cacheFlush() definition that calls EnsureIAndDCacheCoherency().
Attachment #8956107 - Flags: review?(sstangl) → review+
Pushed by lhansen@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/1aa685a4d838 Properly initialize ARM64 icache flushing machinery. r=sstangl
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: