Closed Bug 1550525 Opened 7 months ago Closed 5 months ago

Crash in [@ js::LiveSavedFrameCache::~LiveSavedFrameCache]

Categories

(Core :: JavaScript Engine: JIT, defect, P1, critical)

Unspecified
Android
defect

Tracking


RESOLVED DUPLICATE of bug 1461724
Tracking Status
firefox67 --- wontfix
firefox68 --- wontfix
firefox69 --- ?

People

(Reporter: marcia, Assigned: djvj)

References

(Depends on 1 open bug)

Details

(Keywords: crash)

Crash Data

This bug is for crash report bp-0f67d666-107c-4159-91ad-314390190509.

After the fix for Bug 1521158, there is still one signature showing up in both Beta 67 and Nightly 68: https://bit.ly/2JbvEd4

Based on IRC discussion with jcristau, filing this and marking it as blocking 68.

Top 4 frames of crashing thread:

0  @0x71d4f88b5c 
1  @0x71d4f88ae4 
2 libxul.so js::LiveSavedFrameCache::~LiveSavedFrameCache js/src/vm/Stack.h
3  @0x721e10f2ec 

Flags: needinfo?(kvijayan)

This is the top crash in 67.0b18.

I'll take a look at this as soon as I checkpoint some other work today.

Flags: needinfo?(kvijayan)

Tracking as a potential dot release driver for Fennec

Pascal, we're not shipping arm64 fennec in 67.

Flags: needinfo?(pascalc)

True, it's an arm64 crash, no need to track for 67 indeed.

Flags: needinfo?(pascalc)

(In reply to Kannan Vijayan [:djvj] from comment #2)

I'll take a look at this as soon as I checkpoint some other work today.

Did you manage to take a look at the issue, and did you notice anything?

Flags: needinfo?(kvijayan)
Priority: -- → P2

Oops, setting this as P1 as it is blocking the Fx68 release.

Priority: P2 → P1

Yeah. The crash data didn't really provide any additional useful info. I spent most of my time looking carefully at all the places where we might be updating code without invalidating, and looking for ARM64 specifics that differed, with no luck.

We need to discuss some way to capture one of these in a bottle for analysis.

Flags: needinfo?(kvijayan)

No new crashes with this signature in 68 since the May 9 build. Nothing in 68.0b3 yet either. Maybe this is fixed after all?

Hrm, or maybe the signature morphed. Seeing some similarity with the crashes from bug 1551205.

See Also: → 1551205
Depends on: 1554933

:jcristau - what do you mean by similarities? The stack trace doesn't look too similar to me. Are there other attributes you're looking at?

OK, so manual investigation of various points where cache invalidations might be missing didn't yield any results. I'm still working on getting a way to disassemble the raw crash dumps at the current IP.

Summary of what's done so far:

  1. Audited the places where we introduce new code, comparing the ARM64 implementation against the other architectures.
  2. Got WinDbg installed on an ARM laptop and tried to open the DMP files, to no avail.
  3. The dumps open fine in Visual Studio, but only summary information is displayed, with no way to get at an instruction disassembly.
  4. Tried https://hg.mozilla.org/users/tmielczarek_mozilla.com/get-minidump-instructions on jandem's advice. Got it built, but it does not recognize the CPU arch for ARM64 machines and so doesn't print a disassembly.

With some super-hacky changes to tmielczarek's code, I'm getting some instruction dumps for ARM64, but it seems to be dumping garbage:

./get-minidump-instructions --disassemble --address 0x7eb0a45570 ~/Downloads/63a31c18-1b31-412a-8152-c8ca90190528.dmp 
2019-05-28 14:38:08: minidump.cc:5103: INFO: Minidump opened minidump /home/kvijayan/Downloads/63a31c18-1b31-412a-8152-c8ca90190528.dmp
2019-05-28 14:38:08: minidump.cc:5233: INFO: Minidump not byte-swapping minidump
Faulting instruction pointer: 0x7eb0a45570

2019-05-28 14:38:08: minidump.cc:2870: INFO: MinidumpModuleList has no module at 0x0
/tmp/minidump-instructions-vvWxYm:     file format binary


2019-05-28 14:38:08: minidump.cc:2870: INFO: MinidumpModuleList has no module at 0xd
Disassembly of section .data:

b0a454f0 <.data>:
2019-05-28 14:38:08: minidump.cc:2870: INFO: MinidumpModuleList has no module at 0xb0a454f0
b0a454f0:       d4200000        strtle  r0, [r0], #-0
2019-05-28 14:38:08: minidump.cc:2870: INFO: MinidumpModuleList has no module at 0xb0a454f4
b0a454f4:       f9400529                        ; <UNDEFINED> instruction: 0xf9400529
2019-05-28 14:38:08: minidump.cc:2870: INFO: MinidumpModuleList has no module at 0xb0a454f8
b0a454f8:       f9400130                        ; <UNDEFINED> instruction: 0xf9400130
2019-05-28 14:38:08: minidump.cc:2870: INFO: MinidumpModuleList has no module at 0xb0a454fc
b0a454fc:       d61f0200        ldrle   r0, [pc], -r0, lsl #4
2019-05-28 14:38:08: minidump.cc:2870: INFO: MinidumpModuleList has no module at 0xb0a45500
b0a45500:       aa0003e2        bge     0xb0a46490
2019-05-28 14:38:08: minidump.cc:2870: INFO: MinidumpModuleList has no module at 0xb0a45504
b0a45504:       b24f3a73        sublt   r3, pc, #471040 ; 0x73000

Next steps:

I'm going to keep trying to get a working disassembly from these files.

In the meantime, a survey of the processors involved might reveal some information.

I also talked to Jeff Muizelaar about this, and he mentioned that gfx has some code to include a window of recently recorded information in the crash dump. If I can do something similar and keep track of the last K ranges of memory addresses that code was generated into, we can check whether the crashes are occurring at addresses whose jitcode had been changed for one reason or another.
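
As a rough illustration of that idea (not actual SpiderMonkey code; the class and method names below are hypothetical), a fixed-size ring buffer of the last K jitcode ranges could be recorded at code-emission time and later compared against a crashing PC:

  #include <array>
  #include <atomic>
  #include <cstddef>
  #include <cstdint>

  // Hypothetical sketch: remember the last K address ranges that jitcode
  // was written into, so a crash address can be checked against recently
  // (re)generated code. Not the real SpiderMonkey data structures.
  struct JitCodeRange {
    uintptr_t start;
    uintptr_t end;  // one past the last byte
  };

  class RecentJitCodeLog {
    static constexpr size_t K = 64;
    std::array<JitCodeRange, K> ranges_{};
    std::atomic<size_t> next_{0};

   public:
    // Call wherever new code is committed to executable memory.
    void recordAllocation(uintptr_t start, size_t size) {
      size_t slot = next_.fetch_add(1, std::memory_order_relaxed) % K;
      ranges_[slot] = JitCodeRange{start, start + size};
    }

    // Crash-time query: was this PC inside recently written code?
    bool containsAddress(uintptr_t pc) const {
      for (const JitCodeRange& r : ranges_) {
        if (pc >= r.start && pc < r.end) {
          return true;
        }
      }
      return false;
    }
  };

A real version would also have to tolerate races between the recording thread and the crash machinery, and would need to be serialized into the crash dump in whatever way the gfx recording mentioned above is.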

(In reply to Kannan Vijayan [:djvj] from comment #11)

:jcristau - what do you mean by similarities? The stack trace doesn't look too similar to me. Are there other attributes you're looking at?

Basically the breakdown of devices and crash reasons for both signatures. Add to that the fact that this one last showed up in the May 9 build, and SprintfLiteral<T> appeared on May 10.

I didn't really find a pattern as far as stack traces ("proto signature" in crash-stats), there isn't really much to go on there.

Talking to froydnj about getting valid disassemblies. I got a working rust-minidump going after froydnj helpfully updated that code to handle ARM64 dumps, but the get-minidump-instructions tool is still failing. He is fixing up the minidump tool to handle this and will push patches soon.

Many thanks to :froydnj for his changes to the minidump tool; I'm now able to get clean disassemblies of the dumps. There are some new conclusions we can draw from the gleaned information:

This is almost definitely an icache issue. One thing that was bothering me was that a significant fraction (about 20-40%) of the crash addresses were not cache-line aligned. However, code that jumps into the middle of a cache line would explain this behaviour.

This was confirmed by the disassembly of a non-cache-line-aligned dump, which showed that the crashing instruction address was a jump target. The prior instruction was an unconditional jump, which means there was no fall-through path to the crashing instruction address.

Secondly, analysis of the disassembled instruction sequence suggests that the crash address is occurring in baseline mainline jitcode. The pattern of "load the first stub in the IC chain, then load its code pointer, then blr into it" was easy to spot.

Lastly, the fact that the crash address occurs immediately after a blr into an IC chain suggests that we are crashing on the return from an IC chain back into baseline mainline jitcode. This is really interesting because it would imply that we are dealing with cache invalidation for jitcode that should be "on stack". One would expect that this jitcode is not messed with at all: how could it be discarded or overwritten when there's a stack frame referencing it?
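
For readers following along, the call-site shape described above corresponds roughly to the sketch below (the struct and field names are illustrative, not the real Baseline IC data structures); the crashing PC would be the return address immediately after the indirect call:

  // Illustrative sketch of what an emitted baseline IC call site does;
  // names are hypothetical and do not match the actual ICStub layout.
  using ICStubCode = void (*)();

  struct ICStub {
    ICStubCode stubCode;  // "load its code pointer"
  };

  struct ICEntry {
    ICStub* firstStub;    // "load the first stub in the IC chain"
  };

  void icCallSite(ICEntry* entry) {
    ICStub* stub = entry->firstStub;   // first load
    ICStubCode code = stub->stubCode;  // second load
    code();                            // blr into the stub code
    // <-- the return address lands here, back in baseline mainline
    //     jitcode; this is where the crashing PC appears to be.
  }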

So now my thoughts turn to this: what are the situations where we might somehow mess with baseline mainline jitcode while it's on the stack?

I want to firm up the above hypothesis by confirming that the situation holds across several crash dumps. If it does, the next step is to investigate potential causes.

In the meantime, it would be great if someone could independently treat the above as a strawman and add their thoughts. Does the reasoning hang together? Are there other possibilities I'm missing? What are possible ways to isolate the potential causes, and to validate that they are the real cause of the remaining problems?

Given the above line of thought, I'm leaning towards the following components as possible sources of the issue: the profiler (it rewrites baseline jitcode, but this is unlikely because who in the wild is profiling on Android Fx Beta?), major GCs (a wild stab in the dark, but maybe a major GC does something that messes with baseline jitcode?), and the debugger (also feels unlikely, but who knows).

This is an open call for opinions and weigh-in. I'm pinging :jandem, :tcampbell, and :nbp on this directly because I want their input. The time frame for resolving this is short, this is a hard one, and I could use the third-party insight.

Assignee: nobody → kvijayan
Flags: needinfo?(tcampbell)
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(jdemooij)

One word of caution: when we were looking at this yesterday, I remember seeing jumps into baseline IC chains (meaning that it is baseline code), but IIRC they were not immediately prior to the crash address; I seem to remember them coming slightly later in the instruction stream. Wasn't the faulting code incrementing a counter somewhere (probably a hit count)?

In other words: I agree with everything up until "Lastly, ...", but we should double-check to make sure your last bit of evidence is correct before drawing too many conclusions.

This is looking a lot like Bug 1461724, which was an ARM crash on Sony hardware with msm8994 / msm8952 boards. At the time I was wondering whether there were strange Sony kernel configurations or hardware oddities for those Snapdragon SoCs.

Flags: needinfo?(tcampbell)
See Also: → 1461724

I'm starting to wonder if this is a pre-existing problem that simply manifests more now because of ARM64 CPUs being less conservative with memory ordering, and more cores leading to more situations where cache coherency issues can pop up.

(In reply to Kannan Vijayan [:djvj] from comment #19)

I'm starting to wonder if this is a pre-existing problem that simply manifests more now because of ARM64 CPUs being less conservative with memory ordering, and more cores leading to more situations where cache coherency issues can pop up.

Are they less conservative than 32-bit ARM CPUs, though? I'd expect the same cache coherency problems to manifest on ARM CPUs as well. And there are a very small number of crashes for this signature on x86 Android, which could also have cache coherency problems, but x86 has much stronger ordering than ARM{,64}...
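
For context on why the memory-ordering difference matters, here is a generic sketch of cross-core code publication on AArch64. It is not code from this bug (SpiderMonkey's actual flush path is the CPU::EnsureIAndDCacheCoherency routine mentioned later in this thread), and it assumes GCC/Clang builtins and a buffer that is both writable and executable:

  #include <atomic>
  #include <cstddef>
  #include <cstring>

  std::atomic<void*> publishedCode{nullptr};

  // Writer core: copy machine code into the buffer, then perform the
  // required cache maintenance before publishing the pointer.
  void publish(void* rwxBuffer, const void* code, size_t len) {
    std::memcpy(rwxBuffer, code, len);
    // Clean the D-cache and invalidate the I-cache for the range.
    __builtin___clear_cache(static_cast<char*>(rwxBuffer),
                            static_cast<char*>(rwxBuffer) + len);
    publishedCode.store(rwxBuffer, std::memory_order_release);
  }

  // Executing core: even after observing the pointer, it must not run
  // instructions it prefetched earlier; on AArch64 that takes an ISB.
  void execute() {
    void* code;
    while (!(code = publishedCode.load(std::memory_order_acquire))) {
    }
  #if defined(__aarch64__)
    __asm__ volatile("isb" ::: "memory");
  #endif
    reinterpret_cast<void (*)()>(code)();
  }

If any step of this dance is skipped, x86's comparatively strong coherence between stores and instruction fetch tends to hide the bug, which would line up with the near-absence of x86 Android crashes noted above.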

The crashing PC here doesn't seem to match the listed signature and looks like the EnterJit crashes similar to Bug 1461724 (which does affect ARM as well). The 'trust' level for the LiveSavedFrameCache frame is 'scan', which doesn't mean a lot within JIT frames. I am also suspicious that LiveSavedFrameCache may just be the last code symbol and not actually a match, especially since the signature appears to have jumped to Bug 1551205.

Bug 1461480 may also be relevant. It shows the EnterJit frame after the (probably wrong) LiveSavedFrameCache::~LiveSavedFrameCache frame that we see here. That one was ARM64-specific.

See Also: → 1461480

I don't have anything to add, unfortunately, but I think icache invalidation bugs in our shared code are unlikely given the simulator assertions and the limited amount of baseline code patching. Maybe it's worth looking for patterns like crash address alignment, chipsets, etc.

Flags: needinfo?(jdemooij)

(In reply to Kannan Vijayan [:djvj] from comment #16)

This is almost definitely an icache issue.

An icache issue has no reason to crash on a cache line boundary if the cache line is not yet loaded in the instruction cache. The reason to crash on a cache line boundary would be that the instruction cache contains instructions which are invalid. However, this is never supposed to happen, except for constant pools, and it would imply that the code got reused. Maybe we can try an ARM64-specific patch which forces the reuse of discarded code, to see if we can make these issues more frequent?

Is the code contained in the cache line patchable, or is it supposed to be immutable?

Could it be that we change cores while doing the cache invalidation in CPU::EnsureIAndDCacheCoherency? This could explain non-patchable code being invalidated.

Secondly, analysis of the disassembled instruction sequence suggests that the crash address is occurring in baseline mainline jitcode. The pattern of "load the first stub in the IC chain, then load its code pointer, then blr into it" was easy to spot.

Lastly, the fact that the crash address occurs immediately after a blr into an IC chain suggests that we are crashing on the return from an IC chain back into baseline mainline jitcode. […]

Stupid question: how would a misaligned return address be reported? While I do not see any reason for having a misaligned return address (one not aligned on an instruction boundary), I could imagine the CPU rounding the PC before crashing on a non-rounded PC.

[…] how could it be discarded or overwritten when there's a stack frame referencing it?

Looking at when code is discarded, ExecutableAllocator::poisonCode should replace the content with 0xA3. Therefore this code is either not discarded, or it is discarded and then overwritten with new Baseline code.

However, we do not call ExecutableAllocator::cacheFlush after poisoning the code, which I suggest doing, as it would prevent us from executing stale code when returning to discarded code. That said, I do not think this is our issue either, as the memory dump reads data rather than the instruction cache, and poisoning would have scrambled the disassembly.
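
To make the suggestion concrete, the change amounts to something like the sketch below. The helper names and the discard-path shape are hypothetical stand-ins, not the real ExecutableAllocator code; only the 0xA3 poison value and the idea of flushing immediately after poisoning come from the comment above.

  #include <cstddef>
  #include <cstring>

  // Hypothetical stand-ins for the real ExecutableAllocator entry points;
  // the actual js/src signatures are not reproduced here.
  static void PoisonCode(void* start, size_t size) {
    std::memset(start, 0xA3, size);  // scribble over the discarded code
  }

  static void CacheFlush(void* start, size_t size) {
    // In the engine this would be ExecutableAllocator::cacheFlush / the
    // platform icache-invalidation primitive.
    __builtin___clear_cache(static_cast<char*>(start),
                            static_cast<char*>(start) + size);
  }

  // The suggestion: flush right after poisoning, in the same discard path,
  // so no core can keep executing a stale copy of the old instructions.
  static void DiscardJitCode(void* start, size_t size) {
    PoisonCode(start, size);
    CacheFlush(start, size);  // proposed addition
  }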

Flags: needinfo?(nicolas.b.pierron)

So looking at aggregations reveals some interesting info (across both the LiveSavedFrameCache and the SprintfLiteral crash signatures):

For SprintfLiteral, >63% of crashes are coming from Sony devices; for LiveSavedFrameCache, ~47% are coming from Sony devices. Samsung accounts for ~16% of the SprintfLiteral crashes and ~30% of the LiveSavedFrameCache crashes.

Looking at the Android models, the top culprits are as follows.

SprintfLiteral (rank / model / crashes / share of signature):
  1  F5321             39  14.72 %
  2  E5823             24   9.06 %
  3  SAMSUNG-SM-G920A  23   8.68 %
  4  F5121             21   7.92 %
  5  SO-02J            20   7.55 %
  6  E6653             12   4.53 %
  7  SOV32              8   3.02 %

LiveSavedFrameCache (rank / model / crashes / share of signature):
  1  F5321         13  11.30 %
  2  E5823         11   9.57 %
  3  F5121          9   7.83 %
  4  Redmi Note 3   7   6.09 %
  5  SM-G960F       7   6.09 %
  6  SM-G965F       5   4.35 %
  7  SM-T813        5   4.35 %
  8  E6653          4   3.48 %
  9  SM-G920F       4   3.48 %

Checking the processors across all of these, I observe the following: all of the Android models I've checked use a big.LITTLE setup. They all have Cortex-A53 cores on the LITTLE side, with either A-57 or A-72 cores on the big side. These correspond to the Snapdragon 650 and the Snapdragon 810: the 650s have A-72 big cores, and the 810s have A-57 big cores.

Now, this may simply be an artifact of the ecosystem: the most commonly used CPU setups in the average Android phone worldwide might just be A-53 + A-57/A-72, i.e. Snapdragon 810/650.

The exception to this is the Samsung (G920A), which uses an Exynos chipset, but that one too is a Cortex A-57 / A-53 big.LITTLE combo.

I'm going to compare against the general distribution of crashes on ARM64 to see if this is due to underlying distributions or if the distribution is specific to this crash.

The general distribution of ARM64 crashes does not match this distribution:

(rank / model / crashes / share of all ARM64 crashes)
  1  SHIELD Android TV  106  3.45 %
  2  F5321               65  2.12 %
  3  SM-G920F            52  1.69 %
  4  SM-G935F            45  1.47 %
  5  ANE-AL00            43  1.40 %
  6  E5823               41  1.34 %
  7  SM-G950F            39  1.27 %
  8  SM-A520F            36  1.17 %
  9  F5121               34  1.11 %

The Samsung models are much more prevalent there, and the numbers are far more finely distributed, with a much fatter tail. The biggest standout is the prevalence of the Sony models.

The general distribution of manufacturers is as follows (rank / manufacturer / crashes / share):
  1  samsung   818  26.64 %
  2  HUAWEI    490  15.96 %
  3  Sony      355  11.56 %
  4  Xiaomi    301   9.80 %
  5  OnePlus   150   4.88 %

So it seems that Sony devices are heavily over-represented for our bug. I also looked through each of the device models and found that the Sony ones are always Xperia models.

Further analysis of the crash dumps yielded some new information:

  1. The crashes are not exclusively at return addresses (that was just a fluke of the first few readings).

  2. They do not happen exclusively in baseline mainline jitcode; they seem to originate from Ion jitcode and CacheIR code as well.

This signature has pretty much disappeared now, so removing tracking.

Duping in favour of bug 1461724. This signature is misleading, as these are actually JIT crashes. The crashes appear on both 32-bit and 64-bit ARM, but under different signatures. I recommend using super-search across multiple signatures, such as [1].

(The analysis here is still useful, so I'll link to it from the other bug.)

[1] https://crash-stats.mozilla.org/search/?reason=~SIGILL&cpu_arch=arm64&platform=Android&date=%3E%3D2019-06-19T00%3A21%3A00.000Z&date=%3C2019-07-03T00%3A21%3A00.000Z&_facets=signature&_facets=android_manufacturer&_facets=android_board&_facets=cpu_arch&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

Status: NEW → RESOLVED
Closed: 5 months ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1461724