Closed Bug 1521158 Opened 5 years ago Closed 5 years ago

ARM64 Baseline missing ICache invalidation somewhere

Categories

(Core :: JavaScript Engine: JIT, defect, P1)

ARM64
Android
defect

Tracking


RESOLVED FIXED
mozilla68
Tracking Status
firefox-esr60 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 + fixed
firefox68 blocking fixed

People

(Reporter: cpeterson, Assigned: djvj)

References

(Blocks 1 open bug)

Details

(Keywords: crash, topcrash, Whiteboard: [arm64:m3])

Crash Data

Attachments

(2 files)

This bug is for crash report bp-716a0508-dc70-48db-9325-8f5bf0190118.

Top 10 frames of crashing thread:

0  @0x7f37c71500 
1  @0x7f37c714f4 
2 libxul.so js::LiveSavedFrameCache::~LiveSavedFrameCache mfbt/Variant.h:639
3 libxul.so js::CloneFunctionReuseScript js/src/gc/Marking.cpp:553
4 libxul.so JSObject* js::Allocate<JSObject,  js/src/gc/Nursery.cpp:281
5 libxul.so JSObject* js::Allocate<JSObject,  js/src/gc/Nursery.cpp:281
6 libxul.so js::NewObjectCache::newObjectFromHit js/src/gc/Nursery.cpp:281
7 libxul.so js::NativeObject::create js/src/vm/NativeObject-inl.h:524
8 libxul.so js::NewObjectCache::newObjectFromHit js/src/gc/Nursery.cpp:281
9 libxul.so js::CallObject::createTemplateObject js/src/gc/Barrier.h:720

Over 95% of these ~LiveSavedFrameCache crashes are on Android. About 75% are on ARM and 20% are on ARM64.

Jon, would this be something you could look at, or someone else on the GC team?

Flags: needinfo?(jcoppeard)

This stack doesn't make any sense. The GC doesn't call CloneFunctionReuseScript or touch LiveSavedFrameCache objects.

This stack is a one-off, though. The top signatures that involve ~LiveSavedFrameCache are:

1. js::LiveSavedFrameCache::~LiveSavedFrameCache | js::jit::MaybeEnterJit | js::LiveSavedFrameCache::~LiveSavedFrameCache (81 crashes, 8.52%)
2. js::LiveSavedFrameCache::~LiveSavedFrameCache | js::jit::JitActivation::JitActivation (27 crashes, 2.84%)
3. @0x11dd9015117 | js::LiveSavedFrameCache::~LiveSavedFrameCache (3 crashes, 0.32%)

So the most frequent ones are JIT related.

Jan, any ideas?

Component: JavaScript Engine → JavaScript Engine: JIT
Flags: needinfo?(jcoppeard) → needinfo?(jdemooij)

I think this is a crash in JIT code unrelated to LiveSavedFrameCache. The one in comment 0 is a SIGILL / ILL_ILLOPC crash in the code below.

Lars, this should be valid code, right? I wonder if there's a problem with invalidating the instruction cache.

  7f37c714b8:	52800022 	mov	w2, #0x1                   	// #1
  7f37c714bc:	d2901910 	mov	x16, #0x80c8                	// #32968
  7f37c714c0:	f2a55fd0 	movk	x16, #0x2afe, lsl #16
  7f37c714c4:	f2c00ff0 	movk	x16, #0x7f, lsl #32
  7f37c714c8:	f9400209 	ldr	x9, [x16]
  7f37c714cc:	f9400120 	ldr	x0, [x9]
  7f37c714d0:	d63f0000 	blr	x0
  7f37c714d4:	9100639c 	add	x28, x28, #0x18
  7f37c714d8:	f81c82e2 	stur	x2, [x23, #-56]
  7f37c714dc:	d2ffff53 	mov	x19, #0xfffa000000000000    	// #-1688849860263936
  7f37c714e0:	d2901b10 	mov	x16, #0x80d8                	// #32984
  7f37c714e4:	f2a55fd0 	movk	x16, #0x2afe, lsl #16
  7f37c714e8:	f2c00ff0 	movk	x16, #0x7f, lsl #32
  7f37c714ec:	f9400209 	ldr	x9, [x16]
  7f37c714f0:	f9400120 	ldr	x0, [x9]
  7f37c714f4:	d63f0000 	blr	x0
  7f37c714f8:	d100239f 	sub	sp, x28, #0x8
  7f37c714fc:	f81f8f82 	str	x2, [x28, #-8]!
  7f37c71500:	d2808510 	mov	x16, #0x428                 	// #1064
^^^^^^^^^^^^^	^^^^^^^^^	^^^	^^^^^^^^^^^^^^^^^^^^^^^^^^^^	^^^^^^^^^
  7f37c71504:	f2a5c050 	movk	x16, #0x2e02, lsl #16
  7f37c71508:	f2c00ff0 	movk	x16, #0x7f, lsl #32
  7f37c7150c:	f9400211 	ldr	x17, [x16]
  7f37c71510:	91000631 	add	x17, x17, #0x1
  7f37c71514:	f9000211 	str	x17, [x16]
  7f37c71518:	f8408782 	ldr	x2, [x28], #8
  7f37c7151c:	936ffc50 	asr	x16, x2, #47
  7f37c71520:	31003a1f 	cmn	w16, #0xe
  7f37c71524:	540000e0 	b.eq	0x7f37c71540  // b.none
  7f37c71528:	d2901d10 	mov	x16, #0x80e8                	// #33000
  7f37c7152c:	f2a55fd0 	movk	x16, #0x2afe, lsl #16
  7f37c71530:	f2c00ff0 	movk	x16, #0x7f, lsl #32
  7f37c71534:	f9400209 	ldr	x9, [x16]
  7f37c71538:	f9400120 	ldr	x0, [x9]
Flags: needinfo?(jdemooij) → needinfo?(lhansen)

This is clearly jitted JS code (look at the parallel uses of sp and x28). The address in question is at a cache line boundary, so a cache-flush problem is likely; the instruction we see is valid and is probably not what was actually executed by the CPU.

Flags: needinfo?(lhansen)
Hardware: Unspecified → ARM64
Flags: needinfo?(sstangl)

This has risen to the #3 top browser crash on Fennec nightly (even though the volume is not very large).

Keywords: topcrash
Flags: needinfo?(sstangl)
Summary: ARM crash in js::LiveSavedFrameCache::~LiveSavedFrameCache → ARM64 Baseline missing ICache invalidation somewhere

I agree with the above assessment; without STR (steps to reproduce), the best we can do is a full audit of all sites that patch instructions. Luckily this is without Ion, so it's just Baseline and caches.

Flags: needinfo?(sstangl)

[arm64:m3] as a reminder to watch this crash signature as we get closer to releasing ARM64 Fennec.

Blocks: 1526993

This is the top crash for ARM64 Fennec Nightly.

Whiteboard: [arm64:m3] → [arm64:m2]

I might fix it by accident while looking at the error reported while running with the instrumentation added in Bug 1441436.

Depends on: 1441436

Changing whiteboard tag from [arm64:m2] to [arm64:m3]. Sean says this bug should block Ion riding the trains, but doesn't need to block enabling Ion in Nightly.

Whiteboard: [arm64:m2] → [arm64:m3]

(In reply to Nicolas B. Pierron [:nbp] from comment #9)

I might fix it by accident while looking at the error reported while running with the instrumentation added in Bug 1441436.

Bug 1441436 did not find anything buggy within the test suite, so our best plan for investigating a missing cache flush (as mentioned in comment 4) is to land Bug 1441436 and let fuzzers find test cases.

Summary of past comments:

  1. Land Bug 1441436 to get fuzzers to detect missing icache flushes. (comment 11)
  2. Audit Baseline & CacheIR code (in particular, where code like the incremented counters seen in the comment 3 disassembly is generated and relocated).
Priority: -- → P1

This signature is spiking suddenly in 67.0b11 (Fennec).

Tracking because of the spike in beta 11

Hi Steven, this is a spike in ARM64 crashes since we added the ARM64 APK to Fennec beta on the Google Play Store. Is the team actively looking at these crashes? Is there an indication whether this is due to the ARM64 IonMonkey JIT rather than native code?

Flags: needinfo?(sdetar)

Nicolas, Sean: could you help answer Ritu's question in Comment 15 and look into this if needed?

Flags: needinfo?(sdetar) → needinfo?(nicolas.b.pierron)

Kannan, as you know this part of the code best, could you do a preliminary investigation of what is going on?
This is the top crash of Firefox 67.0b13 on Fennec Android.

Flags: needinfo?(kvijayan)

Cache-flush issues still seem to be the most likely culprit here. Looking around the various places in the engine that introduce new jitcode, the ARM64 sections seem to contain the same cache-flush guards as the corresponding code for other architectures.

Iain pointed out the following article, which may describe the cause of the problem (namely: cache flushes issued by one processor not working for another processor with a different cache line size, something that happens in big.LITTLE CPU architectures): https://www.mono-project.com/news/2016/09/12/arm64-icache/

Looking at the machine code that is crashing, there are no clear patchable jumps involved, or other instructions that look like they'd be part of the runtime patching infrastructure (e.g. PatchableJump and friends). This is most likely related to new code being compiled, potentially at the same address where old, discarded code used to live.

A quick initial test fix for this is to simply change the stride at which we flush cache lines to some lowest common denominator, instead of the current CPU's exact cache line size.

Cache flush instructions on ARM do not need to be provided aligned addresses, so this should be fine to do.
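
A minimal sketch of what such a conservative-stride flush could look like (this is not the actual SpiderMonkey FlushICache; the function name, the 32-byte constant, and the inline assembly are assumptions for illustration, valid only if no core has a cache line smaller than the chosen stride):

  #include <cstddef>
  #include <cstdint>

  // Sketch only: flush a freshly written code range using a fixed, conservative
  // stride instead of the line size reported by CTR_EL0 on the current core.
  static void FlushICacheConservative(void* code, size_t size) {
  #if defined(__aarch64__)
    constexpr uintptr_t kConservativeStride = 32;  // hypothetical lowest common denominator
    uintptr_t begin = reinterpret_cast<uintptr_t>(code) & ~(kConservativeStride - 1);
    uintptr_t end = reinterpret_cast<uintptr_t>(code) + size;

    // Clean the data cache over the range so the new instructions reach memory.
    for (uintptr_t p = begin; p < end; p += kConservativeStride) {
      __asm__ volatile("dc cvau, %0" : : "r"(p) : "memory");
    }
    __asm__ volatile("dsb ish" : : : "memory");

    // Invalidate the instruction cache over the same range.
    for (uintptr_t p = begin; p < end; p += kConservativeStride) {
      __asm__ volatile("ic ivau, %0" : : "r"(p) : "memory");
    }
    __asm__ volatile("dsb ish" : : : "memory");
    __asm__ volatile("isb" : : : "memory");
  #else
    // Fall back to the compiler builtin on other architectures.
    __builtin___clear_cache(reinterpret_cast<char*>(code),
                            reinterpret_cast<char*>(code) + size);
  #endif
  }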

Flags: needinfo?(kvijayan)
Assignee: nobody → kvijayan

(In reply to Kannan Vijayan [:djvj] from comment #18)

A quick initial test fix for this is to simply change the stride at which we flush cache lines to some lowest common denominator, instead of the current CPU's exact cache line size.

As far as I know this is already the case: we flush ranges of instructions and do not round to any cache line size.

The ARM64 simulator's cache emulation also checks cache sanity on a per-instruction basis, so any missing invalidation should be caught by it.

(In reply to Nicolas B. Pierron [:nbp] from comment #19)

As far as I know this is already the case: we flush ranges of instructions and do not round to any cache line size.

The problem is that a single ARM device can have processors with different cache line sizes, but we only check the cache line size on startup. If a background thread running on a processor with smaller cache lines tries to flush the cache, it will stride by the larger size, but only actually clear the smaller size, leaving unflushed cache lines in between.

How expensive is it to access the cache type register? Relative to an actual cache flush, it seems like it shouldn't matter. Can't we just retrieve the cache line size every time we flush, instead of trying to store it? (That is to say, move this code to here.)
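
For reference, a minimal sketch of querying CTR_EL0 at flush time rather than at startup (field layout per the ARMv8 architecture manual; the function name is made up for illustration, AArch64 only):

  #include <cstdint>

  // Sketch: read the cache line sizes from CTR_EL0 on the core doing the flush.
  // DminLine is bits [19:16], IminLine is bits [3:0], both log2 of the line size
  // in 4-byte words.
  static inline void ReadCacheLineSizes(uint32_t* dLineBytes, uint32_t* iLineBytes) {
    uint64_t ctr;
    __asm__ volatile("mrs %0, ctr_el0" : "=r"(ctr));
    *dLineBytes = 4u << ((ctr >> 16) & 0xF);  // data cache line size in bytes
    *iLineBytes = 4u << (ctr & 0xF);          // instruction cache line size in bytes
  }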

It's worth looking at more crash dumps to see if they're similar to the one in comment 3. Maybe most of the crashes are unrelated. Maybe there's a pattern.

jandem: How do I get access to the crash dumps? Ted told me there's some clearance thing to apply for. Currently I don't have access.

I managed to make a functioning arm64 build instrumented to print the cache line sizes. Works on my device but always prints consistent sizes of 64. Double-checked my phone specs and realized it's only got two LITTLE cores (one is clocked slightly higher than the others), so it wouldn't show any variation in sizes anyway. I need to get my hands on an appropriate device in the office - should be able to find one.

I have an old Pixel device that should have the appropriate big.LITTLE processor config, but it doesn't seem to be able to communicate over USB - just charge (I stopped using it because the charger was glitchy, doesn't surprise me that the USB functionality doesn't work).

Kannan, the docs for crash dump access are here: https://crash-stats.mozilla.com/documentation/memory_dump_access/

If Iain's hypothesis is correct, then the patch from comment 24 should remove most or all of the crashes reported with this signature, by simply taking the instruction size as the stride instead of the cache line size. If so, I suggest we go with this patch for beta (which is urgently waiting for a fix) and refine our invalidation stride later in Firefox 68 / 69.

Flags: needinfo?(nicolas.b.pierron)

(In reply to Iain Ireland [:iain] from comment #20)

Can't we just retrieve the cache line size every time we flush, instead of trying to store it? (That is to say, move this code to here.)

Unfortunately this would not completely work as long as wasm generates code off-thread, and it also might not work if we switch from a big to a LITTLE core in the middle of the invalidation.

nbp, incrementing by 4 bytes instead of the cache line size will probably have bad performance implications. Early in the arm64 port, a bug caused us to flush every byte, and it was incredibly slow.

Flags: needinfo?(nicolas.b.pierron)

I'd say incrementing by a lowest-common-denominator amount (if the hypothesis is correct, that would be 64 bytes on ARM64) is at least a good way to test the hypothesis on nightly. The crash rates should drop precipitously if it's because of the cache-line size mismatch issue on some big.LITTLE setups. I was planning on going that route after confirming that the cache-line sizes actually differ on an appropriate mobile device.

So, if we do not expect anything below 64 bytes, we can try a 32-byte stride (just to be even more conservative) and avoid the slowness seen with 1 byte, and expected with 4.

Flags: needinfo?(nicolas.b.pierron)

The report in comment 0 is a Sony E5823. Looking at specs, it has these CPUs: Octa-core (4x1.5 GHz Cortex-A53 & 4x2.0 GHz Cortex-A57). Both Cortex A53 and A57 have a cache line size of 64.

That said, SM-G960F, SM-G965F, SM-N960F are responsible for 45% of the crashes and these use Exynos CPUs as far as I can see (like the one in the Mono blog post), so clamping to 64 bytes might make sense.

V8 code has a workaround for Cortex A53 errata: https://github.com/v8/v8/commit/fec99c689b8587b863df4a5c4793c601772ef663

The commit you linked seems very relevant. I'll roll up a patch for both of these (clamping to 64, civac vs. cvau); should be cheap enough to test on nightly.
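
For reference, the gist of the civac-vs-cvau part, as I read the linked V8 commit (the macro and helper below are hypothetical, not the actual patch): on cores affected by the erratum, clean with "dc civac" (clean and invalidate by VA to the point of coherency) instead of "dc cvau" (clean by VA to the point of unification).

  #include <cstdint>

  // Sketch of the cvau-vs-civac switch described above; the configuration macro
  // is made up, the real patch gates this on the affected CPU/erratum.
  static inline void CleanDCacheLine(uintptr_t addr) {
  #if defined(WORKAROUND_CORTEX_A53_ERRATUM)
    // Clean and invalidate to the point of coherency.
    __asm__ volatile("dc civac, %0" : : "r"(addr) : "memory");
  #else
    // Clean to the point of unification (the usual choice).
    __asm__ volatile("dc cvau, %0" : : "r"(addr) : "memory");
  #endif
  }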

Attachment #9060699 - Attachment description: Bug 1521158 - Invalidate ARM64 caches by increments of 4 bytes instead of increments of cache lines. → Bug 1521158 - Invalidate ARM64 caches by increments of at most 32 bytes instead of increments of cache lines.
Pushed by npierron@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ae2f7d5c145e
Invalidate ARM64 caches by increments of at most 32 bytes instead of increments of cache lines. r=sstangl

As a reference to better understand v8 changes imported by Kannan:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/ch11s04.html

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68

Kannan, could you request an uplift to beta today please? Thanks

Flags: needinfo?(kvijayan)

Treeherder job looks ok here (the oranges seem unrelated): https://treeherder.mozilla.org/#/jobs?repo=try&revision=b0002d0be22dc1b542602dfc0fbc95da94defbe2&selectedJob=242639385

Removed the bits made redundant by nbp's patch. Got review from jandem. Landing on inbound now and will request beta uplift immediately.

Flags: needinfo?(kvijayan)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Pushed by kvijayan@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1b18ac7619e3
Robustify cache-line invalidations on AARCH64. r=jandem

:pascalc - quick process question - do we need to wait until the push gets merged with central before requesting uplift?

Flags: needinfo?(pascalc)

(In reply to Kannan Vijayan [:djvj] from comment #40)

:pascalc - quick process question - do we need to wait until the push gets merged with central before requesting uplift?

Yes please, there is always the risk of a backout. I'll evaluate on Monday whether it's worth taking, but given that we only have 2 betas left, I prefer to have uplift requests prepared beforehand. Thanks.

Flags: needinfo?(pascalc)

Comment on attachment 9060750 [details]
Bug 1521158: Robustify cache-line invalidations on AARCH64. r?jandem

Beta/Release Uplift Approval Request

  • User impact if declined: High number of crashes on ARM64 Android platforms.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The first patch (nbp's) simply clamps the cache-invalidation stride. The second patch (mine) is ported from V8: a small one-line change to an assembler instruction related to cache invalidation. Very small scope and easy to reason about.
  • String changes made/needed:
Attachment #9060750 - Flags: approval-mozilla-beta?
Attachment #9060699 - Flags: approval-mozilla-beta?

@pascalc - so, I marked the patches for beta review. Both have landed on inbound, but unfortunately I landed mine directly instead of going through Phabricator/Lando, so it's not marked as published in the Phabricator UI. Not sure if that'll cause problems for uplift.

Flags: needinfo?(pascalc)
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED

Pascal said he might consider doing another Fennec 67 beta build this week so the fix from comment 44 can be tested on the 67 Beta from the Google Play Store.

Comment on attachment 9060750 [details]
Bug 1521158: Robustify cache-line invalidations on AARCH64. r?jandem

Uplift accepted for 67 beta 16, thanks.

Flags: needinfo?(pascalc)
Attachment #9060750 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9060699 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

One of the Fennec signatures is still crashing in beta even after the uplifted patch, and it looks as if there are still some crashes in nightly with the patch. Should I spin off a new bug?

Flags: needinfo?(kvijayan)

Bug 1550525 was filed for the signature that remains.

Flags: needinfo?(kvijayan)
Flags: needinfo?(sstangl)

Please specify a root cause for this bug. See :tmaity for more information.

Root Cause: --- → ?
Root Cause: ? → Coding: Concurrency Issue

I disagree with this root cause analysis; I would have set it to "External software affecting Firefox", as this is not a faulty implementation in Firefox but a faulty specification of cache invalidation on big.LITTLE CPUs.

To be clearer: Bug 1441436 prevented us from even having a faulty implementation in the first place.

Root Cause: Coding: Concurrency Issue → External Software Affecting Firefox
