Many thanks to :froydnj: with his changes to the minidump tool, I'm able to get clean disassemblies of the dumps. There are some new conclusions we can draw from the disassembly:
This is almost definitely an icache issue. One thing that was bothering me was that a significant fraction (about 20-40%) of the crash addresses were not cache-line aligned. However, code that jumps into the middle of a cache line explains this: the first instruction fetched from a bad line is then the jump target, not the start of the line.
This is backed up by the disassembly of a non-cache-line-aligned dump, which showed that the crashing instruction address was a jump target. The prior instruction was an unconditional jump, which means that there was no fall-through access to the crashing instruction address.
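To make the alignment argument concrete, here is a minimal sketch of the check applied to a single crash address. The 64-byte line size, the function names, and the idea of passing in a set of known jump targets are all assumptions for illustration; the real line size depends on the CPU, and the jump targets would have to come from the disassembly.

```python
CACHE_LINE_SIZE = 64  # assumption: icache line size in bytes; varies by CPU


def cache_line_offset(addr, line_size=CACHE_LINE_SIZE):
    """Byte offset of addr within its cache line (line_size must be a power of two)."""
    return addr & (line_size - 1)


def classify_crash_address(addr, jump_targets, line_size=CACHE_LINE_SIZE):
    """Classify a crash address under the stale-icache hypothesis.

    jump_targets is a hypothetical set of addresses that the disassembly
    shows are branch destinations.
    """
    if cache_line_offset(addr, line_size) == 0:
        # Fall-through execution enters a line at its first byte.
        return "line-start"
    if addr in jump_targets:
        # A branch can enter a line in the middle.
        return "jump-target"
    # Neither case: this address would not fit the hypothesis.
    return "unexplained"
```

If the hypothesis holds, every crash address should come out as "line-start" or "jump-target"; any "unexplained" result would be worth a closer look.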
Secondly, analysis of the disassembled instruction sequence suggests that the crash address is in baseline mainline jitcode. The pattern of "load IC first stub in IC chain, then load code pointer, then jlr into it" was easy to spot.
Lastly, the fact that the crash address is occurring immediately after a jlr into an IC chain suggests that we are crashing on return from an IC chain back into baseline mainline jitcode. This is really interesting because it would imply that we are dealing with cache invalidations for jitcode that should be "on stack". One would expect that this jitcode is not messed with at all - how could it be discarded or overwritten when there's a stack frame referencing it?
So now my thoughts turn to this: what are the situations where we may somehow mess with baseline mainline jitcode while it's on the stack?
I want to firm up the above hypothesis by confirming that the same pattern shows up across several crashdumps. If that holds up, the next step is to investigate potential causes.
In the meantime, it would be great if someone could independently treat the above as a strawman and add their thoughts. Does the above reasoning hang together? Are there other possibilities I'm missing? How could I go about isolating potential causes, and then validating that they are the real cause of the remaining problems?
Given the above line of thought, I'm leaning in the direction of the following possible components as the source of the issue: Profiler (rewrites baseline jitcode - but unlikely because who in the wild is profiling on Android Fx Beta?), major GCs (wild stab in the dark, but maybe a major GC does some stuff to mess with baseline jitcode?), and debugger (also feels unlikely, but who knows).
Open call for opinions and weigh-in. I'm pinging :jandem, :tcampbell, and :nbp on this directly because I want their input. The time-frame for resolving this is short and this is a hard one, so I could use the third-party insight.
Assignee: nobody → kvijayan