Closed Bug 1583907 Opened 6 years ago Closed 6 years ago

Intermittent [tier 2] Android Jit tests/jit-test/jit-test/* | Segmentation fault (code 139, args "--no-asmjs") [0.3 s]

Categories

(Core :: JavaScript Engine, defect, P5)

Tracking

()

RESOLVED FIXED
mozilla71
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- unaffected
firefox69 --- unaffected
firefox70 --- unaffected
firefox71 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: jandem)

References

(Regression)

Details

(Keywords: crash, intermittent-failure, regression)

Attachments

(2 files)

Filed by: ccoroiu [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=268383886&repo=mozilla-inbound
Full log: https://queue.taskcluster.net/v1/task/UivmY5VxQyyXjEok9aPBEA/runs/0/artifacts/public/logs/live_backing.log


[task 2019-09-25T16:06:07.953Z] 16:06:02 INFO - TEST-PASS | tests/jit-test/jit-test/tests/asm.js/testBug1437534.js | Success (code 0, args "--blinterp-eager") [0.3 s]
[task 2019-09-25T16:06:07.953Z] 16:06:03 INFO - Segmentation faultSegmentation faultExit code: 139
[task 2019-09-25T16:06:07.953Z] 16:06:03 INFO - FAIL - asm.js/testBug1437534.js
[task 2019-09-25T16:06:07.953Z] 16:06:03 WARNING - TEST-UNEXPECTED-FAIL | tests/jit-test/jit-test/tests/asm.js/testBug1437534.js | Segmentation fault (code 139, args "--no-asmjs") [0.3 s]
[task 2019-09-25T16:06:07.953Z] 16:06:03 INFO - INFO exit-status : 139
[task 2019-09-25T16:06:07.953Z] 16:06:03 INFO - INFO timed-out : False
[task 2019-09-25T16:06:07.953Z] 16:06:03 INFO - INFO stdout > Segmentation fault
[task 2019-09-25T16:06:07.953Z] 16:06:03 INFO - INFO stderr 2> Segmentation fault

Flags: needinfo?(dmajor)

The "intermittent" in the title is not quite right: this is a perma fail. I'm too scared to touch the title, though, in case I break the sheriff team's workflow.

Random needinfo victims picked from bug 1555479: how can I debug these "Android 8.0 Pixel2 pgo" failures caused by a compiler upgrade? Is it hopeless without an actual device? How long can I let this tier2 failure sit before upsetting you?

Flags: needinfo?(jmuizelaar)
Flags: needinfo?(gwatson)
Flags: needinfo?(gbrown)

Why did you pick bug 1555479? It looks like the jit tests are failing, and the wrench tests are a separate job.

Flags: needinfo?(jmuizelaar)

I didn't read closely enough.

Flags: needinfo?(gwatson)
Flags: needinfo?(gbrown)
Keywords: crash
Summary: Intermittent [tier 2] Androd Jit tests/jit-test/jit-test/* | Segmentation fault (code 139, args "--no-asmjs") [0.3 s] → Intermittent [tier 2] Android Jit tests/jit-test/jit-test/* | Segmentation fault (code 139, args "--no-asmjs") [0.3 s]

There is a bit more information in the logcat artifact, and some failures also have a tombstone artifact for the crash; they don't look very helpful to me, but maybe there's something useful there for you.

You can run an android arm emulator locally if you have the android sdk installed, with 'mach android-emulator --version 4.3'; I don't know if you can reproduce the failure that way, but it might be worth checking.

:aerickson and :bc know all about the "Android 8.0 Pixel2" environment and might have additional advice?

At this rate, I would expect this to be on the intermittent sheriff's radar within a few days.

The tombstone and the logcat both say "Cause: null pointer dereference".

Looking at the WARNINGs from the clang 9.0 build and the previous, the only thing in js land is

+WARNING - [style 0.0.1] /builds/worker/workspace/build/src/obj-firefox/dist/include/js/Proxy.h:222:43: warning: offset of on non-standard-layout type 'js::BaseProxyHandler' [-Winvalid-offsetof], err: false

(In reply to Geoff Brown [:gbrown] from comment #6)

You can run an android arm emulator locally if you have the android sdk installed, with 'mach android-emulator --version 4.3'; I don't know if you can reproduce the failure that way, but it might be worth checking.

I tried this but it doesn't work - the shell crashes and logcat shows it's a SIGILL in libmozglue.so. What's the simplest way to start the JS shell in the emulator? I tried this zip.

Flags: needinfo?(gbrown)

(In reply to Jan de Mooij [:jandem] from comment #9)

(In reply to Geoff Brown [:gbrown] from comment #6)

You can run an android arm emulator locally if you have the android sdk installed, with 'mach android-emulator --version 4.3'; I don't know if you can reproduce the failure that way, but it might be worth checking.

I tried this but it doesn't work - the shell crashes and logcat shows it's a SIGILL in libmozglue.so. What's the simplest way to start the JS shell in the emulator? I tried this zip.

That reminds me of https://bugzilla.mozilla.org/show_bug.cgi?id=1582838#c3; I don't know what's happening.

(In reply to Geoff Brown [:gbrown] from comment #11)

That reminds me of https://bugzilla.mozilla.org/show_bug.cgi?id=1582838#c3; I don't know what's happening.

Hm, I hit that issue too when I tried to run the GeckoView Example APK in the emulator.

I'm sorry but I can't do anything here if we can't even get things to run in the ARM emulator.

Sorry, it looks like local runs of the arm emulator are not very usable at this time. I've updated bug 1582838 and will continue to try to move that forward.

Flags: needinfo?(gbrown)

(In reply to Geoff Brown [:gbrown] from comment #13)

Sorry, it looks like local runs of the arm emulator are not very usable at this time. I've updated bug 1582838 and will continue to try to move that forward.

Thanks!


For what it's worth, all tests that fail (the browser jsreftest + shell jit-tests) use Function or eval to create/call a function with a ton of arguments. Combined with the jsreftest crash stack (InterpreterStack, LifoAlloc::release), I wonder if it's related to the LifoAlloc's oversized chunk handling (pushing many arguments => large expression stack => large stack frame).

I'm doing some Try debugging but it takes time.

See Also: → 1582838
See Also: → 1584056
See Also: → 1584055

(In reply to Jan de Mooij [:jandem] from comment #14)

I'm doing some Try debugging but it takes time.

We fail the range check assertion in BumpChunk::release(Mark) because we have a BumpChunk::Mark that has:

  • chunk_: 0xd03e6000
  • bump_: 0xd03e6000

They're identical. This shouldn't happen because the chunk has a fixed-size BumpChunkReservedSpace header. I wonder if we're miscompiling BumpChunk::begin() somewhere.
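To make the invariant above concrete, here is a minimal sketch of why a mark whose bump_ equals chunk_ can never pass a range check. ToyChunk, ToyMark, takeMark, markInRange, releaseTo and the 16-byte kToyReservedSpace are all hypothetical stand-ins invented for illustration; this is not SpiderMonkey's actual BumpChunk code, and the real header size is not stated in this bug.

```cpp
#include <cassert>
#include <cstdint>

// Assumed header size for illustration only; the real
// BumpChunkReservedSpace value is not given in this bug.
constexpr std::uintptr_t kToyReservedSpace = 16;

// Hypothetical, simplified stand-in for a bump-allocated chunk.
struct ToyChunk {
    std::uintptr_t base;      // address of the chunk itself
    std::uintptr_t bump;      // current allocation pointer
    std::uintptr_t capacity;  // one past the end of the chunk

    // First address usable for allocations: always past the header.
    std::uintptr_t begin() const { return base + kToyReservedSpace; }
};

// Hypothetical stand-in for a mark: which chunk, and where its bump was.
struct ToyMark {
    std::uintptr_t chunk;
    std::uintptr_t bump;
};

// A correctly taken mark records the chunk's bump pointer, not its base.
ToyMark takeMark(const ToyChunk& c) { return ToyMark{c.base, c.bump}; }

// The range check: a valid mark can never point into the header.
bool markInRange(const ToyChunk& c, const ToyMark& m) {
    return m.chunk == c.base && c.begin() <= m.bump && m.bump <= c.capacity;
}

// Releasing to a mark rewinds the bump pointer after checking the range.
void releaseTo(ToyChunk& c, const ToyMark& m) {
    assert(markInRange(c, m));
    c.bump = m.bump;
}
```

With any non-zero header size, a mark of the form {chunk, chunk} fails markInRange, which matches the shape of the assertion failure described above.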

The BumpChunk itself appears to be valid: the bump_ pointer when we crash is 0xd03f98f0, that's 80112 bytes in - the test pushes 10,000 JS Values of 8 bytes each so considering BumpChunkReservedSpace and InterpreterFrame that looks about right.
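The arithmetic behind "80112 bytes in" can be checked at compile time. This is only a worked check of the numbers quoted above; the constant names are made up for illustration, and how the leftover bytes split between BumpChunkReservedSpace and InterpreterFrame is not something this bug states.

```cpp
#include <cstdint>

// Addresses quoted in the comment above.
constexpr std::uint32_t kChunkAddr = 0xd03e6000u;
constexpr std::uint32_t kBumpAtCrash = 0xd03f98f0u;

// Distance from the start of the chunk to the bump pointer.
constexpr std::uint32_t kBytesUsed = kBumpAtCrash - kChunkAddr;

// The test pushes 10,000 JS Values of 8 bytes each.
constexpr std::uint32_t kArgBytes = 10000u * 8u;

// Whatever is left over has to cover the chunk header and the frame.
constexpr std::uint32_t kRemainder = kBytesUsed - kArgBytes;

static_assert(kBytesUsed == 80112, "0xd03f98f0 - 0xd03e6000 is 0x138f0");
static_assert(kArgBytes == 80000, "10,000 values * 8 bytes");
static_assert(kRemainder == 112, "bytes left for header + frame");
```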

I'll see if I can figure out why bump_ and chunk_ are equal.

(In reply to Jan de Mooij [:jandem] from comment #15)

The BumpChunk itself appears to be valid: the bump_ pointer when we crash is 0xd03f98f0, that's 80112 bytes in - the test pushes 10,000 JS Values of 8 bytes each so considering BumpChunkReservedSpace and InterpreterFrame that looks about right.

For what it's worth, when I run this test locally in a 32-bit opt JS shell, I get the same 80112 number so that's all correct. The only difference is that my BumpChunk::Mark struct has a bump_ pointer that matches BumpChunk::bump_ instead of the BumpChunk itself.

This is an LLVM bug. The compiler ends up inlining pushInlineFrame => LifoAlloc::mark into Interpret but generates incorrect code for it. This shows where/how it goes wrong.

I verified this workaround fixes the jsreftest + jit-test crashes on Try.

Pushed by jdemooij@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/5bd04359efb6 Add MOZ_NEVER_INLINE to LifoAlloc::mark to work around Clang 9 miscompilation on Android. r=nbp
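For readers unfamiliar with this kind of workaround, here is a rough sketch of what marking a method as never-inline looks like. TOY_NEVER_INLINE and ToyLifoAlloc are hypothetical names invented for this example; the macro only mirrors the general idea of a noinline attribute and is not the definition of MOZ_NEVER_INLINE, and the class is not the real LifoAlloc.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical macro for illustration: ask the compiler to keep the
// annotated function as an out-of-line call.
#if defined(__GNUC__) || defined(__clang__)
#  define TOY_NEVER_INLINE __attribute__((noinline))
#elif defined(_MSC_VER)
#  define TOY_NEVER_INLINE __declspec(noinline)
#else
#  define TOY_NEVER_INLINE
#endif

// Toy LIFO allocator over a fixed buffer; not the real LifoAlloc.
class ToyLifoAlloc {
  public:
    struct Mark { std::size_t used; };

    // Kept out of line so the optimizer cannot fold it into callers,
    // which is where the miscompilation in this bug was observed.
    TOY_NEVER_INLINE Mark mark() const { return Mark{used_}; }

    // Bump-allocate n bytes, or return nullptr if they do not fit.
    void* alloc(std::size_t n) {
        if (n > sizeof(storage_) - used_) return nullptr;
        void* p = storage_ + used_;
        used_ += n;
        return p;
    }

    // Rewind to a previously taken mark.
    void release(Mark m) {
        assert(m.used <= used_);
        used_ = m.used;
    }

    std::size_t used() const { return used_; }

  private:
    unsigned char storage_[1024];
    std::size_t used_ = 0;
};
```

The trade-off is a real call on a hot path in exchange for sidestepping the bad codegen; it treats the symptom until the compiler itself is fixed.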

Thanks very much Jan for the investigation and patch!

Flags: needinfo?(dmajor)

I should mention, we'll still file this upstream, but the current form of the repro is not a great thing to attach to a bug report. I'd like to reduce and/or bisect it first, then I'll report back.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla71
Assignee: nobody → jdemooij
Has Regression Range: --- → yes