ARM64 Baseline missing ICache invalidation somewhere
Categories
(Core :: JavaScript Engine: JIT, defect, P1)
Tracking
()
People
(Reporter: cpeterson, Assigned: djvj)
References
(Blocks 1 open bug)
Details
(Keywords: crash, topcrash, Whiteboard: [arm64:m3])
Crash Data
Attachments
(2 files)
47 bytes, text/x-phabricator-request | pascalc : approval-mozilla-beta+ | Details | Review
47 bytes, text/x-phabricator-request | pascalc : approval-mozilla-beta+ | Details | Review
This bug is for crash report bp-716a0508-dc70-48db-9325-8f5bf0190118.
Top 10 frames of crashing thread:
0 @0x7f37c71500
1 @0x7f37c714f4
2 libxul.so js::LiveSavedFrameCache::~LiveSavedFrameCache mfbt/Variant.h:639
3 libxul.so js::CloneFunctionReuseScript js/src/gc/Marking.cpp:553
4 libxul.so JSObject* js::Allocate<JSObject, js/src/gc/Nursery.cpp:281
5 libxul.so JSObject* js::Allocate<JSObject, js/src/gc/Nursery.cpp:281
6 libxul.so js::NewObjectCache::newObjectFromHit js/src/gc/Nursery.cpp:281
7 libxul.so js::NativeObject::create js/src/vm/NativeObject-inl.h:524
8 libxul.so js::NewObjectCache::newObjectFromHit js/src/gc/Nursery.cpp:281
9 libxul.so js::CallObject::createTemplateObject js/src/gc/Barrier.h:720
Over 95% of these ~LiveSavedFrameCache crashes are on Android. About 75% are on ARM and 20% are on ARM64.
Comment 1•6 years ago
Jon, would this be something you could look at, or someone else on the GC team?
Comment 2•6 years ago
This stack doesn't make any sense. The GC doesn't call CloneFunctionReuseScript or touch LiveSavedFrameCache objects.
This stack is a one-off, though. The top signatures that involve ~LiveSavedFrameCache are:
1. js::LiveSavedFrameCache::~LiveSavedFrameCache | js::jit::MaybeEnterJit | js::LiveSavedFrameCache::~LiveSavedFrameCache (81 crashes, 8.52%)
2. js::LiveSavedFrameCache::~LiveSavedFrameCache | js::jit::JitActivation::JitActivation (27 crashes, 2.84%)
3. @0x11dd9015117 | js::LiveSavedFrameCache::~LiveSavedFrameCache (3 crashes, 0.32%)
So the most frequent ones are JIT related.
Jan, any ideas?
Comment 3•6 years ago
I think this is a crash in JIT code unrelated to LiveSavedFrameCache. The one in comment 0 is a SIGILL / ILL_ILLOPC crash in the code below.
Lars, this should be valid code, right? I wonder if there's a problem with invalidating the instruction cache.
7f37c714b8: 52800022 mov w2, #0x1 // #1
7f37c714bc: d2901910 mov x16, #0x80c8 // #32968
7f37c714c0: f2a55fd0 movk x16, #0x2afe, lsl #16
7f37c714c4: f2c00ff0 movk x16, #0x7f, lsl #32
7f37c714c8: f9400209 ldr x9, [x16]
7f37c714cc: f9400120 ldr x0, [x9]
7f37c714d0: d63f0000 blr x0
7f37c714d4: 9100639c add x28, x28, #0x18
7f37c714d8: f81c82e2 stur x2, [x23, #-56]
7f37c714dc: d2ffff53 mov x19, #0xfffa000000000000 // #-1688849860263936
7f37c714e0: d2901b10 mov x16, #0x80d8 // #32984
7f37c714e4: f2a55fd0 movk x16, #0x2afe, lsl #16
7f37c714e8: f2c00ff0 movk x16, #0x7f, lsl #32
7f37c714ec: f9400209 ldr x9, [x16]
7f37c714f0: f9400120 ldr x0, [x9]
7f37c714f4: d63f0000 blr x0
7f37c714f8: d100239f sub sp, x28, #0x8
7f37c714fc: f81f8f82 str x2, [x28, #-8]!
7f37c71500: d2808510 mov x16, #0x428 // #1064
^^^^^^^^^^^^^ ^^^^^^^^^ ^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^
7f37c71504: f2a5c050 movk x16, #0x2e02, lsl #16
7f37c71508: f2c00ff0 movk x16, #0x7f, lsl #32
7f37c7150c: f9400211 ldr x17, [x16]
7f37c71510: 91000631 add x17, x17, #0x1
7f37c71514: f9000211 str x17, [x16]
7f37c71518: f8408782 ldr x2, [x28], #8
7f37c7151c: 936ffc50 asr x16, x2, #47
7f37c71520: 31003a1f cmn w16, #0xe
7f37c71524: 540000e0 b.eq 0x7f37c71540 // b.none
7f37c71528: d2901d10 mov x16, #0x80e8 // #33000
7f37c7152c: f2a55fd0 movk x16, #0x2afe, lsl #16
7f37c71530: f2c00ff0 movk x16, #0x7f, lsl #32
7f37c71534: f9400209 ldr x9, [x16]
7f37c71538: f9400120 ldr x0, [x9]
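
For reference, the usual AArch64 user-space sequence for making freshly written JIT code visible to instruction fetch looks roughly like the sketch below. This is illustrative only, not SpiderMonkey's actual MacroAssembler/flush code; the dLineSize/iLineSize parameters stand in for line sizes derived from CTR_EL0.

#include <cstddef>
#include <cstdint>

// Sketch: clean each D-cache line covering the new code to the point of
// unification, then invalidate the corresponding I-cache lines, with the
// barriers required before the core may refetch the range.
static void FlushICacheSketch(char* start, size_t len,
                              size_t dLineSize, size_t iLineSize) {
  char* end = start + len;
  char* p = reinterpret_cast<char*>(
      reinterpret_cast<uintptr_t>(start) & ~(uintptr_t(dLineSize) - 1));
  for (; p < end; p += dLineSize) {
    __asm__ volatile("dc cvau, %0" ::"r"(p) : "memory");  // clean data line
  }
  __asm__ volatile("dsb ish" ::: "memory");
  p = reinterpret_cast<char*>(
      reinterpret_cast<uintptr_t>(start) & ~(uintptr_t(iLineSize) - 1));
  for (; p < end; p += iLineSize) {
    __asm__ volatile("ic ivau, %0" ::"r"(p) : "memory");  // invalidate insn line
  }
  __asm__ volatile("dsb ish" ::: "memory");
  __asm__ volatile("isb" ::: "memory");
}

If the stride used in these loops is larger than the executing core's real line size, some lines are never cleaned or invalidated, which matches the failure mode discussed in the comments below.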
Comment 4•6 years ago
This is clearly jitted JS code (look at the parallel uses of sp and x28). The address in question is a cache line boundary so a cache flush incident is likely; the instruction we see is valid and is probably not what's being executed by the CPU.
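(For reference: the crashing pc 0x7f37c71500 has its low eight bits clear, so it is 256-byte aligned and therefore falls on a line boundary for either a 64- or 128-byte cache line size.)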
Updated•6 years ago
Comment 5•6 years ago
This has risen to the #3 top browser crash on Fennec nightly (even though the volume is not very large).
Updated•6 years ago
Comment 6•6 years ago
I agree with the above assessment; without STR (steps to reproduce), the best we can do is a full audit of all sites that patch instructions. Luckily this is without Ion, so it's just Baseline and the ICs.
Reporter | ||
Comment 7•6 years ago
[arm64:m3] as a reminder to watch this crash signature as we get closer to release ARM64 Fennec.
Updated•6 years ago
Reporter | ||
Comment 8•6 years ago
This is the top crash for ARM64 Fennec Nightly.
Comment 9•6 years ago
I might fix it by accident while looking at the error reported while running with the instrumentation added in Bug 1441436.
Reporter | ||
Comment 10•6 years ago
Changing whiteboard tag from [arm64:m2] to [arm64:m3]. Sean says this bug should block Ion riding the trains, but doesn't need to block enabling Ion in Nightly.
Comment 11•6 years ago
(In reply to Nicolas B. Pierron [:nbp] from comment #9)
I might fix it by accident while looking at the error reported while running with the instrumentation added in Bug 1441436.
Bug 1441436 did not find anything buggy within the test suite, so our best plan for investigating a missing cache flush (as mentioned in comment 4) is to land Bug 1441436 and let fuzzers find test cases.
Updated•6 years ago
Comment 12•6 years ago
Summary of past comments:
- Land Bug 1441436 to get fuzzers to detect missing icache-flushes. (comment 11)
- Audit the Baseline & CacheIR code (starting from where code like the counter-increment sequence in comment 3 is emitted and relocated).
Comment 13•6 years ago
This signature is spiking suddenly in 67.0b11 (Fennec).
Updated•6 years ago
Updated•6 years ago
Comment 15•6 years ago
Hi Steven, this is a spike in ARM64 crashes since we added the ARM64 APK to Fennec beta on Google Play. Is the team actively looking at these crashes? Is there any indication that this is due to the ARM64 IonMonkey JIT as opposed to natively compiled code?
Comment 16•6 years ago
Nicolas, Sean: could you help answer Ritu's question in comment 15 and look into this if needed?
Comment 17•6 years ago
Kannan, since you know this part of the code best, could you do a preliminary investigation of what is going on?
This is the top crash of Firefox 67.0b13 on Fennec Android.
Assignee | ||
Comment 18•6 years ago
Cache-flush issues still seem to be the most likely culprit here. Looking around the various places in the engine which introduce new jitcode, the ARM64 sections seem to contain the same cache flush guards as the corresponding code for other architectures.
Iain pointed out the following article: https://www.mono-project.com/news/2016/09/12/arm64-icache/
which may describe the cause of the problem (namely, a cache flush performed on one processor not taking effect for another processor with a different cache line size, which can happen in big.LITTLE CPU architectures).
Looking at the machine code that is crashing, there are no clear patchable jumps involved, or other instructions that look like they would be part of the runtime patching infrastructure (e.g. PatchableJump and friends). This is most likely related to new code being compiled - potentially in the same place where old, discarded code had been compiled.
A quick initial test fix for this is to simply change the stride at which we flush cache lines to some lowest common denominator, instead of the current CPU's exact cache line sizes.
Cache flush instructions on ARM do not need to be provided aligned addresses, so this should be fine to do.
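
A sketch of that clamping idea (kMinLineSize and FlushStride are assumed names for illustration, not existing SpiderMonkey identifiers):

#include <algorithm>
#include <cstddef>

// Assumed constant: the smallest D/I-cache line size we expect to encounter
// across big.LITTLE clusters on ARM64 Android devices.
constexpr size_t kMinLineSize = 64;

// Clamp the flush stride so a line size measured on a big core (e.g. 128)
// still causes every line to be touched on a little core with 64-byte lines.
inline size_t FlushStride(size_t lineSizeFromCTR) {
  return std::min(lineSizeFromCTR, kMinLineSize);
}

Since the flush instructions accept unaligned addresses, over-flushing with a smaller stride is safe; the only cost is extra cache maintenance operations.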
Assignee | ||
Updated•6 years ago
Comment 19•6 years ago
(In reply to Kannan Vijayan [:djvj] from comment #18)
A quick initial test fix for this is to simply change the strides at which we are flushing cache lines to some lowest common denominator, instead of the exact CPU's exact cache line sizes.
As far as I know this is already the case, we flush ranges of instructions and do not round to any cache line sizes.
The ARM64 simulator's cache emulation also checks cache sanity on a per-instruction basis, so any missing invalidation should be caught by it.
Comment 20•6 years ago
(In reply to Nicolas B. Pierron [:nbp] from comment #19)
As far as I know this is already the case, we flush ranges of instructions and do not round to any cache line sizes.
The problem is that a single ARM device can have processors with different cache line sizes, but we only check the cache line size on startup. If a background thread running on a processor with smaller cache lines tries to flush the cache, it will stride by the larger size, but only actually clear the smaller size, leaving unflushed cache lines in between.
How expensive is it to access the cache type register? Relative to an actual cache flush, it seems like it shouldn't matter. Can't we just retrieve the cache line size every time we flush, instead of trying to store it? (That is to say, move this code to here.)
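
For reference, reading the cache type register is a single unprivileged instruction (Linux/Android expose CTR_EL0 to user space), so something along these lines per flush should be cheap relative to the maintenance operations themselves. This is a sketch, not the existing jit code:

#include <cstddef>
#include <cstdint>

// Sketch: read CTR_EL0 at flush time. IminLine (bits [3:0]) and DminLine
// (bits [19:16]) are log2 of the minimum line size, counted in 4-byte words.
static inline void ReadCacheLineSizes(size_t* iLine, size_t* dLine) {
  uint64_t ctr;
  __asm__ volatile("mrs %0, ctr_el0" : "=r"(ctr));
  *iLine = size_t(4) << (ctr & 0xf);
  *dLine = size_t(4) << ((ctr >> 16) & 0xf);
}

Even then, as the next comments note, the thread could migrate from a big to a little core in the middle of a flush, so clamping to a conservative stride may still be needed.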
Comment 21•6 years ago
It's worth looking at more crash dumps to see if they're similar to the one in comment 3. Maybe most of the crashes are unrelated. Maybe there's a pattern.
Assignee | ||
Comment 22•6 years ago
jandem: How do I get access to the crash dumps? Ted told me there's some clearance thing to apply for. Currently I don't have access.
I managed to make a functioning arm64 build instrumented to print the cache line sizes. It works on my device but always prints consistent sizes of 64. I double-checked my phone specs and realized it's only got two LITTLE cores (one is clocked slightly higher than the other), so it wouldn't show any variation in sizes anyway. I need to get my hands on an appropriate device in the office - should be able to find one.
I have an old Pixel device that should have the appropriate big.LITTLE processor config, but it doesn't seem to be able to communicate over USB - just charge (I stopped using it because the charger was glitchy, doesn't surprise me that the USB functionality doesn't work).
Comment 23•6 years ago
Kannan, the docs for crash dump access are here: https://crash-stats.mozilla.com/documentation/memory_dump_access/
Comment 24•6 years ago
Comment 25•6 years ago
If Iain's hypothesis is correct, then the patch from comment 24 should remove most or all of the crashes reported with this signature, by simply taking the instruction size as the stride instead of the cache line size. If so, I suggest we go with this patch for beta (which is urgently waiting for us to come up with a solution) and refine our invalidation stride later in Firefox 68 / 69.
Comment 26•6 years ago
(In reply to Iain Ireland [:iain] from comment #20)
Can't we just retrieve the cache line size every time we flush, instead of trying to store it? (That is to say, move this code to here.)
Unfortunately this would not completely work as long as wasm generates code off-thread, and it also might not work if the thread is migrated from a big to a little core in the middle of the invalidation.
Comment 27•6 years ago
nbp, incrementing by 4 bytes instead of the cache line size will probably have bad performance implications. Early in the arm64 port, a bug caused us to flush every byte, and it was incredibly slow.
Assignee | ||
Comment 28•6 years ago
I'd say incrementing by a lowest-common-denominator amount (if the hypothesis is correct, that would be 64 bytes on ARM64) is at least a good way to test the hypothesis on nightly. The crash rates should drop precipitously if it's because of the cache-line size mismatch issue on some big.LITTLE setups. I was planning on going that route after confirming that the cache-line sizes actually differ on an appropriate mobile device.
Comment 29•6 years ago
So, if we do not expect anything below 64, we can try 32 bytes (just to be even more conservative) and avoid the slowness seen with a 1-byte stride, and expected with 4.
Comment 30•6 years ago
The report in comment 0 is a Sony E5823. Looking at specs, it has these CPUs: Octa-core (4x1.5 GHz Cortex-A53 & 4x2.0 GHz Cortex-A57). Both Cortex A53 and A57 have a cache line size of 64.
Comment 31•6 years ago
That said, SM-G960F, SM-G965F, SM-N960F are responsible for 45% of the crashes and these use Exynos CPUs as far as I can see (like the one in the Mono blog post), so clamping to 64 bytes might make sense.
V8 code has a workaround for Cortex A53 errata: https://github.com/v8/v8/commit/fec99c689b8587b863df4a5c4793c601772ef663
Assignee | ||
Comment 32•6 years ago
The commit you linked seems very relevant. I'll roll up a patch for both of these (clamping to 64, civac vs. cvau) - should be cheap enough to test on nightly.
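
For context, my reading of the linked V8 change is that the data-cache step of the flush switches from cleaning to the point of unification to cleaning-and-invalidating to the point of coherency; roughly this shape (illustrative, not the exact patch):

// Cortex-A53 errata workaround as done in the linked V8 commit (as I read it):
// use "dc civac" (clean and invalidate to point of coherency) rather than
// "dc cvau" (clean to point of unification) when cleaning each line.
static inline void CleanDataLine(const void* addr) {
  __asm__ volatile("dc civac, %0" ::"r"(addr) : "memory");  // was: dc cvau, %0
}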
Assignee | ||
Comment 33•6 years ago
Updated•6 years ago
Comment 34•6 years ago
Comment 35•6 years ago
As a reference to better understand the V8 changes imported by Kannan:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/ch11s04.html
Comment 36•6 years ago
bugherder |
Comment 37•6 years ago
Kannan, could you request an uplift to beta today please? Thanks
Assignee | ||
Comment 38•6 years ago
Treeherder job looks ok here (the oranges seem unrelated): https://treeherder.mozilla.org/#/jobs?repo=try&revision=b0002d0be22dc1b542602dfc0fbc95da94defbe2&selectedJob=242639385
Removed the bits that were redundant with nbp's patch. Got review from jandem. Landing on inbound now and will request beta uplift immediately.
Assignee | ||
Updated•6 years ago
Comment 39•6 years ago
Assignee | ||
Comment 40•6 years ago
:pascalc - quick process question - do we need to wait until the push gets merged with central before requesting uplift?
Comment 41•6 years ago
(In reply to Kannan Vijayan [:djvj] from comment #40)
:pascalc - quick process question - do we need to wait until the push gets merged with central before requesting uplift?
Yes please, there is always the risk of a backout. I'll evaluate on Monday whether it's worth taking, but given that we only have 2 betas left, I prefer to have uplift requests already prepared beforehand. Thanks.
Assignee | ||
Comment 42•6 years ago
Comment on attachment 9060750 [details]
Bug 1521158: Robustify cache-line invalidations on AARCH64. r?jandem
Beta/Release Uplift Approval Request
- User impact if declined: High number of crashes on ARM64 android platforms.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The first patch (nbp's) simply clamps the cache-invalidation stride. The second patch (mine) is ported from V8 code: a small one-line change to an assembler instruction related to cache invalidation. Very small scope and easy to reason about.
- String changes made/needed:
Assignee | ||
Updated•6 years ago
Assignee | ||
Comment 43•6 years ago
@pascalc - so, I marked the patches for beta review. Both have landed on inbound, but unfortunately I landed mine directly instead of going through Phabricator/Lando, so it's not marked as published in the Phabricator UI - not sure whether that will cause problems for the uplift.
Comment 44•6 years ago
bugherder |
Comment 45•6 years ago
Pascal said he might consider doing another Fennec 67 beta build this week so the fix from comment 44 can be tested in the 67 Beta on Google Play.
Comment 46•6 years ago
Comment on attachment 9060750 [details]
Bug 1521158: Robustify cache-line invalidations on AARCH64. r?jandem
Uplift accepted for 67 beta 16, thanks.
Updated•6 years ago
Comment 47•6 years ago
bugherder uplift |
Comment 48•6 years ago
One of the Fennec signatures is still crashing in beta even after the uplifted patch, and it looks as if there are still some crashes in nightly with the patch. Should I spin off a new bug?
Comment 49•6 years ago
Bug 1550525 was filed for the signature that remains.
Updated•5 years ago
Comment 50•5 years ago
Please specify a root cause for this bug. See :tmaity for more information.
Assignee | ||
Updated•4 years ago
Comment 51•4 years ago
I disagree with this root cause analysis; I would have set it to "External software affecting Firefox", as this is not a faulty implementation in Firefox but a faulty specification of cache invalidation on big.LITTLE CPUs.
To be clear, Bug 1441436 prevented us from even having a faulty implementation in the first place.