Closed
Bug 1442583
Opened 7 years ago
Closed 7 years ago
ARM64: Must initialize icache flushing machinery
Categories
(Core :: JavaScript Engine: JIT, defect, P1)
Tracking
RESOLVED FIXED
mozilla61

Tracking | Status
---|---
firefox61 | fixed
People
(Reporter: lth, Assigned: lth)
References
(Blocks 1 open bug)
Details
Attachments
(1 file)
959 bytes, patch
sstangl: review+
Details | Diff | Splinter Review
Description
Compiling tanks.wasm in a JS release shell on SoftIron Overdrive 1000 / OpenSUSE with --no-threads.
Fully 52% of the run time is reported by perf as being spent in vixl::CPU::EnsureIAndDCacheCoherency(), which is called by our cacheFlush primitive; apparently all of it falls in a single sequential section following the nicely parallel compilation, if my eyeballing of the htop activity is right. perf attributes the cost to the inline asm dc and ic instructions.
Assignee
Comment 1•7 years ago
The flushing code has two loops which can't be merged: one for pushing dcache contents to memory, the other for invalidating the icache for the appropriate range. These loops operate on variable-sized lines whose sizes are detected during setup or default to 1.
It appears that the SetUp() function that determines the actual line size is never called, so that should be fixed first. If we're lucky we get a 64x speedup, which would be nice.
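For context, a minimal sketch of the canonical ARMv8 sequence these two loops implement; this shows the general shape only, not vixl's exact code, and FlushICache is a hypothetical name:

#include <cstddef>
#include <cstdint>

// Sketch of the two-loop flush: clean the dcache to the point of
// unification, then invalidate the icache over the same range. The line
// sizes come from CTR_EL0; if setup never runs and they default to 1,
// both loops execute once per byte instead of once per line.
static void FlushICache(uintptr_t start, size_t length) {
  uint64_t ctr;
  asm volatile("mrs %0, ctr_el0" : "=r"(ctr));
  // CTR_EL0.DminLine (bits 19:16) and IminLine (bits 3:0) are log2 of the
  // line size in 4-byte words.
  const uintptr_t dsize = uintptr_t(4) << ((ctr >> 16) & 0xf);
  const uintptr_t isize = uintptr_t(4) << (ctr & 0xf);
  const uintptr_t end = start + length;

  // Loop 1: push freshly written code out of the dcache.
  for (uintptr_t p = start & ~(dsize - 1); p < end; p += dsize)
    asm volatile("dc cvau, %0" ::"r"(p) : "memory");
  asm volatile("dsb ish" ::: "memory");

  // Loop 2: invalidate any stale icache lines covering the range.
  for (uintptr_t p = start & ~(isize - 1); p < end; p += isize)
    asm volatile("ic ivau, %0" ::"r"(p) : "memory");
  asm volatile("dsb ish\n\tisb" ::: "memory");
}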
There are clearly some simple improvements possible:
- if we know that the memory containing new code is not in the icache (it could be freshly mapped), then the second icache syncing loop is not necessary
- conceivably it might be faster to unmap and remap than to invalidate, but let's reserve judgement on that
- we can unroll both loops to avoid loop overhead, which will matter more once the line sizes are correct (see the sketch after this list)
- we can generate code for these functions so that line sizes etc. are hardcoded, though with unrolling it probably does not matter much
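To illustrate the unrolling idea, a hypothetical sketch (CleanDCacheUnrolled is an invented name, not vixl code): once the real line size is in effect, striding four lines per iteration trims per-line loop overhead:

#include <cstdint>

static void CleanDCacheUnrolled(uintptr_t start, uintptr_t end,
                                uintptr_t dsize) {
  uintptr_t p = start & ~(dsize - 1);
  // Unrolled-by-4 main body: four cache-maintenance ops per branch.
  for (; p + 4 * dsize <= end; p += 4 * dsize) {
    asm volatile("dc cvau, %0" ::"r"(p) : "memory");
    asm volatile("dc cvau, %0" ::"r"(p + dsize) : "memory");
    asm volatile("dc cvau, %0" ::"r"(p + 2 * dsize) : "memory");
    asm volatile("dc cvau, %0" ::"r"(p + 3 * dsize) : "memory");
  }
  for (; p < end; p += dsize)  // remainder loop
    asm volatile("dc cvau, %0" ::"r"(p) : "memory");
}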
Comment 2•7 years ago
Just to make sure I understand: you're seeing 52% of wasm baseline compile time? How many threads?
Another thing to check: are we calling vixl::CPU::EnsureIAndDCacheCoherency() only once here? In the past, spurious icache flushes have sometimes crept in because it's easy to sprinkle them about or forget to turn them off.
Assignee
Comment 3•7 years ago
52% of end-to-end time; the program just reads tanks.wasm (a few ms) and does new WebAssembly.Module() (about 4500 ms). Compile time measured manually inside the JIT is something like 30% of total runtime with --no-threads. Without --no-threads we do get four threads, so far as I can tell, because manually measured compile time speeds up four-fold :) There's just some massive sequential phase at the end; I don't have all the data yet.
Good point about excessive icache flushing calls; I'll check for that once I've fixed the obvious initialization bug.
Assignee
Comment 4•7 years ago
With this fixed by calling the SetUp() function first, the iflush time drops to about 0.5% of the runtime, so we can ignore that problem for a while.
Assignee
Comment 5•7 years ago
Just to wrap this up: we're now seeing a reasonably fast parallel compile (400ms) followed by a mysterious sequential phase (another 500ms); the latter turns out to be stubs generation. So presumably the base mozilla-inbound code I'm running here has an intermediate state of the lazy stubs work.
It also appears that (as I feared) the vixl assembler greatly dominates the overall profile, but I'll reserve final judgement until stubs generation is out of the picture and we can see baseline compilation alone more clearly.
Assignee
Comment 6•7 years ago
With the lazy stubs code fully in place, the long sequential tail disappears, so we're down to a nice parallel compilation phase of about 400ms for Tanks on this machine. The speedup over --no-threads is about 3.7, which is pretty good for a four-core machine.
I'll address the vixl issues in another bug. Patch coming up here for the icache flush initialization.
Assignee
Updated•7 years ago
Summary: ARM64: 50% of runtime spent in icache flushing → ARM64: Must initialize icache flushing machinery
Assignee
Comment 7•7 years ago
I opted to put the initialization code under the JS_CODEGEN_ARM64 define since that just works.
The obvious alternative is __aarch64__.
The non-obvious alternative -- but the one that follows the code in ExecutableAllocator.h -- is JS_CODEGEN_ARM64 && !JS_SIMULATOR_ARM64.
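For reference, a minimal sketch of what the initialization looks like under that guard; this is not the actual patch, and the hook name and header path are assumptions:

#if defined(JS_CODEGEN_ARM64)
#  include "jit/arm64/vixl/Cpu-vixl.h"  // assumed header location
#endif

// Hypothetical one-time startup hook, guarded the same way as the
// cacheFlush() definition that calls EnsureIAndDCacheCoherency().
static void InitIcacheFlushing() {
#if defined(JS_CODEGEN_ARM64)
  // Reads CTR_EL0 so the flush loops use the real cache line sizes
  // instead of the 1-byte default.
  vixl::CPU::SetUp();
#endif
}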
Attachment #8956107 - Flags: review?(sstangl)
Updated•7 years ago
Priority: -- → P1
Comment 8•7 years ago
Comment on attachment 8956107 [details] [diff] [review]
bug1442583-setup-arm64-icache-flush.patch
Review of attachment 8956107 [details] [diff] [review]:
-----------------------------------------------------------------
JS_CODEGEN_ARM64 is correct. That's the same include guard used around the cacheFlush() definition that calls EnsureIAndDCacheCoherency().
Attachment #8956107 - Flags: review?(sstangl) → review+
Comment 9•7 years ago
Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1aa685a4d838
Properly initialize ARM64 icache flushing machinery. r=sstangl
Comment 10•7 years ago
bugherder
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
status-firefox61:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla61