Closed Bug 1521158 Opened 5 years ago Closed 5 years ago

ARM64 Baseline missing ICache invalidation somewhere

Categories

(Core :: JavaScript Engine: JIT, defect, P1)

ARM64
Android
defect

Tracking


RESOLVED FIXED
mozilla68
Tracking Status
firefox-esr60 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 + fixed
firefox68 blocking fixed

People

(Reporter: cpeterson, Assigned: djvj)

References

(Blocks 1 open bug)

Details

(Keywords: crash, topcrash, Whiteboard: [arm64:m3])

Crash Data

Attachments

(2 files)

This bug is for crash report bp-716a0508-dc70-48db-9325-8f5bf0190118.

Top 10 frames of crashing thread:

0  @0x7f37c71500 
1  @0x7f37c714f4 
2 libxul.so js::LiveSavedFrameCache::~LiveSavedFrameCache mfbt/Variant.h:639
3 libxul.so js::CloneFunctionReuseScript js/src/gc/Marking.cpp:553
4 libxul.so JSObject* js::Allocate<JSObject,  js/src/gc/Nursery.cpp:281
5 libxul.so JSObject* js::Allocate<JSObject,  js/src/gc/Nursery.cpp:281
6 libxul.so js::NewObjectCache::newObjectFromHit js/src/gc/Nursery.cpp:281
7 libxul.so js::NativeObject::create js/src/vm/NativeObject-inl.h:524
8 libxul.so js::NewObjectCache::newObjectFromHit js/src/gc/Nursery.cpp:281
9 libxul.so js::CallObject::createTemplateObject js/src/gc/Barrier.h:720

Over 95% of these ~LiveSavedFrameCache crashes are on Android. About 75% are on ARM and 20% are on ARM64.

Jon, would this be something you could look at, or someone else on the GC team?

Flags: needinfo?(jcoppeard)

This stack doesn't make any sense. The GC doesn't call CloneFunctionReuseScript or touch LiveSavedFrameCache objects.

This stack is a one-off, though. The top signatures that involve ~LiveSavedFrameCache are:

1. js::LiveSavedFrameCache::~LiveSavedFrameCache | js::jit::MaybeEnterJit | js::LiveSavedFrameCache::~LiveSavedFrameCache (81 crashes, 8.52%)
2. js::LiveSavedFrameCache::~LiveSavedFrameCache | js::jit::JitActivation::JitActivation (27 crashes, 2.84%)
3. @0x11dd9015117 | js::LiveSavedFrameCache::~LiveSavedFrameCache (3 crashes, 0.32%)

So the most frequent ones are JIT related.

Jan, any ideas?

Component: JavaScript Engine → JavaScript Engine: JIT
Flags: needinfo?(jcoppeard) → needinfo?(jdemooij)

I think this is a crash in JIT code unrelated to LiveSavedFrameCache. The one in comment 0 is a SIGILL / ILL_ILLOPC crash in the code below.

Lars, this should be valid code, right? I wonder if there's a problem with invalidating the instruction cache.

  7f37c714b8:	52800022 	mov	w2, #0x1                   	// #1
  7f37c714bc:	d2901910 	mov	x16, #0x80c8                	// #32968
  7f37c714c0:	f2a55fd0 	movk	x16, #0x2afe, lsl #16
  7f37c714c4:	f2c00ff0 	movk	x16, #0x7f, lsl #32
  7f37c714c8:	f9400209 	ldr	x9, [x16]
  7f37c714cc:	f9400120 	ldr	x0, [x9]
  7f37c714d0:	d63f0000 	blr	x0
  7f37c714d4:	9100639c 	add	x28, x28, #0x18
  7f37c714d8:	f81c82e2 	stur	x2, [x23, #-56]
  7f37c714dc:	d2ffff53 	mov	x19, #0xfffa000000000000    	// #-1688849860263936
  7f37c714e0:	d2901b10 	mov	x16, #0x80d8                	// #32984
  7f37c714e4:	f2a55fd0 	movk	x16, #0x2afe, lsl #16
  7f37c714e8:	f2c00ff0 	movk	x16, #0x7f, lsl #32
  7f37c714ec:	f9400209 	ldr	x9, [x16]
  7f37c714f0:	f9400120 	ldr	x0, [x9]
  7f37c714f4:	d63f0000 	blr	x0
  7f37c714f8:	d100239f 	sub	sp, x28, #0x8
  7f37c714fc:	f81f8f82 	str	x2, [x28, #-8]!
  7f37c71500:	d2808510 	mov	x16, #0x428                 	// #1064
^^^^^^^^^^^^^	^^^^^^^^^	^^^	^^^^^^^^^^^^^^^^^^^^^^^^^^^^	^^^^^^^^^
  7f37c71504:	f2a5c050 	movk	x16, #0x2e02, lsl #16
  7f37c71508:	f2c00ff0 	movk	x16, #0x7f, lsl #32
  7f37c7150c:	f9400211 	ldr	x17, [x16]
  7f37c71510:	91000631 	add	x17, x17, #0x1
  7f37c71514:	f9000211 	str	x17, [x16]
  7f37c71518:	f8408782 	ldr	x2, [x28], #8
  7f37c7151c:	936ffc50 	asr	x16, x2, #47
  7f37c71520:	31003a1f 	cmn	w16, #0xe
  7f37c71524:	540000e0 	b.eq	0x7f37c71540  // b.none
  7f37c71528:	d2901d10 	mov	x16, #0x80e8                	// #33000
  7f37c7152c:	f2a55fd0 	movk	x16, #0x2afe, lsl #16
  7f37c71530:	f2c00ff0 	movk	x16, #0x7f, lsl #32
  7f37c71534:	f9400209 	ldr	x9, [x16]
  7f37c71538:	f9400120 	ldr	x0, [x9]
Flags: needinfo?(jdemooij) → needinfo?(lhansen)

This is clearly jitted JS code (look at the parallel uses of sp and x28). The address in question is at a cache line boundary, so a cache-flush problem is likely; the instruction we see is valid and is probably not what was actually executed by the CPU.

Flags: needinfo?(lhansen)
Hardware: Unspecified → ARM64
Flags: needinfo?(sstangl)

This has risen to the #3 top browser crash on Fennec nightly (even though the volume is not very large).

Keywords: topcrash
Flags: needinfo?(sstangl)
Summary: ARM crash in js::LiveSavedFrameCache::~LiveSavedFrameCache → ARM64 Baseline missing ICache invalidation somewhere

I agree with the above assessment; without STR (steps to reproduce), the best we can do is a full audit of all sites that patch instructions. Luckily this is without Ion, so it's just Baseline and caches.

Flags: needinfo?(sstangl)

[arm64:m3] as a reminder to watch this crash signature as we get closer to releasing ARM64 Fennec.

Blocks: 1526993

This is the top crash for ARM64 Fennec Nightly.

Whiteboard: [arm64:m3] → [arm64:m2]

I might fix it by accident while looking at the error reported while running with the instrumentation added in Bug 1441436.

Depends on: 1441436

Changing whiteboard tag from [arm64:m2] to [arm64:m3]. Sean says this bug should block Ion riding the trains, but doesn't need to block enabling Ion in Nightly.

Whiteboard: [arm64:m2] → [arm64:m3]

(In reply to Nicolas B. Pierron [:nbp] from comment #9)

I might fix it by accident while looking at the error reported while running with the instrumentation added in Bug 1441436.

Bug 1441436 did not find anything buggy within the test suite, so our best plan for investigating a missing cache flush (as mentioned in comment 4) is to land Bug 1441436 and let fuzzers find test cases.

Summary of past comments:

  1. Land Bug 1441436 to get fuzzers to detect missing icache flushes. (comment 11)
  2. Audit Baseline & CacheIR code (in particular, where code like the incremented counters seen in the comment 3 disassembly is generated and relocated).
Priority: -- → P1

This signature is spiking suddenly in 67.0b11 (Fennec).

Tracking because of the spike in beta 11

Hi Steven, this is a spike in ARM64 crashes since we added the ARM64 APK to Fennec beta on the Google Play Store. Is the team actively looking at these crashes? Is there an indication whether this is due to the ARM64 IonMonkey JIT rather than native code?

Flags: needinfo?(sdetar)

Nicolas, Sean: could you help answer Ritu's question in Comment 15 and look into this if needed?

Flags: needinfo?(sdetar) → needinfo?(nicolas.b.pierron)

Kannan, as you know this part of the code best, could you do a preliminary investigation of what is going on?
This is the top crash of Firefox 67.0b13 on Fennec Android.

Flags: needinfo?(kvijayan)

Cache-flush issues still seem to be the most likely culprit here. Looking around the various places in the engine that introduce new jitcode, the ARM64 sections seem to contain the same cache-flush guards as the corresponding code for other architectures.

Iain pointed out the following article, which may describe the cause of the problem (namely: cache flushes issued by one processor not working for another processor with a different cache line size, something that happens in big.LITTLE CPU architectures): https://www.mono-project.com/news/2016/09/12/arm64-icache/

Looking at the machine code that is crashing, there are no clear patchable jumps involved, or other instructions that look like they'd be part of the runtime patching infrastructure (e.g. PatchableJump and friends). This is most likely related to new code being compiled, potentially at the same address where old, discarded code used to live.

A quick initial test fix for this is to simply change the stride at which we flush cache lines to some lowest common denominator, instead of the current CPU's exact cache line size.

Cache flush instructions on ARM do not need to be provided aligned addresses, so this should be fine to do.
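
A minimal sketch of what such a conservative-stride flush could look like (this is not the actual SpiderMonkey FlushICache; the function name, the 32-byte constant, and the inline assembly are assumptions for illustration, valid only if no core has a cache line smaller than the chosen stride):

  #include <cstddef>
  #include <cstdint>

  // Sketch only: flush a freshly written code range using a fixed, conservative
  // stride instead of the line size reported by CTR_EL0 on the current core.
  static void FlushICacheConservative(void* code, size_t size) {
  #if defined(__aarch64__)
    constexpr uintptr_t kConservativeStride = 32;  // hypothetical lowest common denominator
    uintptr_t begin = reinterpret_cast<uintptr_t>(code) & ~(kConservativeStride - 1);
    uintptr_t end = reinterpret_cast<uintptr_t>(code) + size;

    // Clean the data cache over the range so the new instructions reach memory.
    for (uintptr_t p = begin; p < end; p += kConservativeStride) {
      __asm__ volatile("dc cvau, %0" : : "r"(p) : "memory");
    }
    __asm__ volatile("dsb ish" : : : "memory");

    // Invalidate the instruction cache over the same range.
    for (uintptr_t p = begin; p < end; p += kConservativeStride) {
      __asm__ volatile("ic ivau, %0" : : "r"(p) : "memory");
    }
    __asm__ volatile("dsb ish" : : : "memory");
    __asm__ volatile("isb" : : : "memory");
  #else
    // Fall back to the compiler builtin on other architectures.
    __builtin___clear_cache(reinterpret_cast<char*>(code),
                            reinterpret_cast<char*>(code) + size);
  #endif
  }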

Flags: needinfo?(kvijayan)
Assignee: nobody → kvijayan

(In reply to Kannan Vijayan [:djvj] from comment #18)

A quick initial test fix for this is to simply change the stride at which we flush cache lines to some lowest common denominator, instead of the current CPU's exact cache line size.

As far as I know this is already the case: we flush ranges of instructions and do not round to any cache line size.

The ARM64 simulator's cache emulation also checks cache sanity on a per-instruction basis, so any missing invalidation should be caught by it.

(In reply to Nicolas B. Pierron [:nbp] from comment #19)

As far as I know this is already the case: we flush ranges of instructions and do not round to any cache line size.

The problem is that a single ARM device can have processors with different cache line sizes, but we only check the cache line size on startup. If a background thread running on a processor with smaller cache lines tries to flush the cache, it will stride by the larger size, but only actually clear the smaller size, leaving unflushed cache lines in between.

How expensive is it to access the cache type register? Relative to an actual cache flush, it seems like it shouldn't matter. Can't we just retrieve the cache line size every time we flush, instead of trying to store it? (That is to say, move this code to here.)
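
For reference, a minimal sketch of querying CTR_EL0 at flush time rather than at startup (field layout per the ARMv8 architecture manual; the function name is made up for illustration, AArch64 only):

  #include <cstdint>

  // Sketch: read the cache line sizes from CTR_EL0 on the core doing the flush.
  // DminLine is bits [19:16], IminLine is bits [3:0], both log2 of the line size
  // in 4-byte words.
  static inline void ReadCacheLineSizes(uint32_t* dLineBytes, uint32_t* iLineBytes) {
    uint64_t ctr;
    __asm__ volatile("mrs %0, ctr_el0" : "=r"(ctr));
    *dLineBytes = 4u << ((ctr >> 16) & 0xF);  // data cache line size in bytes
    *iLineBytes = 4u << (ctr & 0xF);          // instruction cache line size in bytes
  }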

It's worth looking at more crash dumps to see if they're similar to the one in comment 3. Maybe most of the crashes are unrelated. Maybe there's a pattern.

jandem: How do I get access to the crash dumps? Ted told me there's some clearance thing to apply for. Currently I don't have access.

I managed to make a functioning arm64 build instrumented to print the cache line sizes. Works on my device but always prints consistent sizes of 64. Double-checked my phone specs and realized it's only got two LITTLE cores (one is clocked slightly higher than the others), so it wouldn't show any variation in sizes anyway. I need to get my hands on an appropriate device in the office - should be able to find one.

I have an old Pixel device that should have the appropriate big.LITTLE processor config, but it doesn't seem to be able to communicate over USB - just charge (I stopped using it because the charger was glitchy, doesn't surprise me that the USB functionality doesn't work).

Kannan, the docs for crash dump access are here: https://crash-stats.mozilla.com/documentation/memory_dump_access/

If Iain's hypothesis is correct, then the patch from comment 24 should remove most or all of the crashes reported with this signature, by simply taking the instruction size as the stride instead of the cache line size. If so, I suggest we go with this patch for beta (which is urgently waiting for a fix) and refine our invalidation stride later in Firefox 68 / 69.

Flags: needinfo?(nicolas.b.pierron)

(In reply to Iain Ireland [:iain] from comment #20)

Can't we just retrieve the cache line size every time we flush, instead of trying to store it? (That is to say, move this code to here.)

Unfortunately this would not completely work as long as wasm generates code off-thread, and it also might not work if we switch from a big to a LITTLE core in the middle of the invalidation.

nbp, incrementing by 4 bytes instead of the cache line size will probably have bad performance implications. Early in the arm64 port, a bug caused us to flush every byte, and it was incredibly slow.

Flags: needinfo?(nicolas.b.pierron)

I'd say incrementing by a lowest-common-denominator amount (if the hypothesis is correct, that would be 64 bytes on ARM64) is at least a good way to test the hypothesis on nightly. The crash rates should drop precipitously if it's because of the cache-line size mismatch issue on some big.LITTLE setups. I was planning on going that route after confirming that the cache-line sizes actually differ on an appropriate mobile device.

So, if we do not expect anything below 64 bytes, we can try a 32-byte stride (just to be even more conservative) and avoid the slowness seen with 1 byte, and expected with 4.

Flags: needinfo?(nicolas.b.pierron)

The report in comment 0 is a Sony E5823. Looking at specs, it has these CPUs: Octa-core (4x1.5 GHz Cortex-A53 & 4x2.0 GHz Cortex-A57). Both Cortex A53 and A57 have a cache line size of 64.

That said, SM-G960F, SM-G965F, SM-N960F are responsible for 45% of the crashes and these use Exynos CPUs as far as I can see (like the one in the Mono blog post), so clamping to 64 bytes might make sense.

V8 code has a workaround for Cortex A53 errata: https://github.com/v8/v8/commit/fec99c689b8587b863df4a5c4793c601772ef663

The commit you linked seems very relevant. I'll roll up a patch for both of these (clamping to 64, civac vs. cvau); should be cheap enough to test on nightly.
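
For reference, the gist of the civac-vs-cvau part, as I read the linked V8 commit (the macro and helper below are hypothetical, not the actual patch): on cores affected by the erratum, clean with "dc civac" (clean and invalidate by VA to the point of coherency) instead of "dc cvau" (clean by VA to the point of unification).

  #include <cstdint>

  // Sketch of the cvau-vs-civac switch described above; the configuration macro
  // is made up, the real patch gates this on the affected CPU/erratum.
  static inline void CleanDCacheLine(uintptr_t addr) {
  #if defined(WORKAROUND_CORTEX_A53_ERRATUM)
    // Clean and invalidate to the point of coherency.
    __asm__ volatile("dc civac, %0" : : "r"(addr) : "memory");
  #else
    // Clean to the point of unification (the usual choice).
    __asm__ volatile("dc cvau, %0" : : "r"(addr) : "memory");
  #endif
  }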

Attachment #9060699 - Attachment description: Bug 1521158 - Invalidate ARM64 caches by increments of 4 bytes instead of increments of cache lines. → Bug 1521158 - Invalidate ARM64 caches by increments of at most 32 bytes instead of increments of cache lines.
Pushed by npierron@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ae2f7d5c145e
Invalidate ARM64 caches by increments of at most 32 bytes instead of increments of cache lines. r=sstangl

As a reference to better understand v8 changes imported by Kannan:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/ch11s04.html

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68

Kannan, could you request an uplift to beta today please? Thanks

Flags: needinfo?(kvijayan)

Treeherder job looks ok here (the oranges seem unrelated): https://treeherder.mozilla.org/#/jobs?repo=try&revision=b0002d0be22dc1b542602dfc0fbc95da94defbe2&selectedJob=242639385

Removed the bits made redundant by nbp's patch. Got review from jandem. Landing on inbound now and will request beta uplift immediately.

Flags: needinfo?(kvijayan)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Pushed by kvijayan@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1b18ac7619e3
Robustify cache-line invalidations on AARCH64. r=jandem

:pascalc - quick process question - do we need to wait until the push gets merged with central before requesting uplift?

Flags: needinfo?(pascalc)

(In reply to Kannan Vijayan [:djvj] from comment #40)

:pascalc - quick process question - do we need to wait until the push gets merged with central before requesting uplift?

Yes please, there is always the risk of a backout. I'll evaluate on Monday whether it's worth taking, but given that we only have 2 betas left, I prefer to have uplift requests prepared beforehand. Thanks.

Flags: needinfo?(pascalc)

Comment on attachment 9060750 [details]
Bug 1521158: Robustify cache-line invalidations on AARCH64. r?jandem

Beta/Release Uplift Approval Request

  • User impact if declined: High number of crashes on ARM64 Android platforms.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The first patch (nbp's) simply clamps the cache-invalidation stride. The second patch (mine) is ported from V8: a small one-line change to an assembler instruction related to cache invalidation. Very small scope and easy to reason about.
  • String changes made/needed:
Attachment #9060750 - Flags: approval-mozilla-beta?
Attachment #9060699 - Flags: approval-mozilla-beta?

@pascalc - so, I marked the patches for beta review. Both have landed on inbound, but unfortunately I landed mine directly instead of going through Phabricator/Lando, so it's not marked as published in the Phabricator UI. Not sure if that'll cause problems for uplift.

Flags: needinfo?(pascalc)
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED

Pascal said he might consider doing another Fennec 67 beta build this week so the fix from comment 44 can be tested on the 67 Beta from the Google Play Store.

Comment on attachment 9060750 [details]
Bug 1521158: Robustify cache-line invalidations on AARCH64. r?jandem

Uplift accepted for 67 beta 16, thanks.

Flags: needinfo?(pascalc)
Attachment #9060750 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9060699 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

One of the Fennec signatures is still crashing in beta even after the uplifted patch, and it looks as if there are still some crashes in nightly with the patch. Should I spin off a new bug?

Flags: needinfo?(kvijayan)

Bug 1550525 was filed for the signature that remains.

Flags: needinfo?(kvijayan)
Flags: needinfo?(sstangl)

Please specify a root cause for this bug. See :tmaity for more information.

Root Cause: --- → ?
Root Cause: ? → Coding: Concurrency Issue

I disagree with this root cause analysis; I would have set it to "External software affecting Firefox", as this is not a faulty implementation in Firefox but a faulty specification of cache invalidation on big.LITTLE CPUs.

To be clearer: Bug 1441436 prevented us from even having a faulty implementation in the first place.

Root Cause: Coding: Concurrency Issue → External Software Affecting Firefox
