Closed Bug 1530874 Opened 7 years ago Closed 6 years ago

Some cppunit tests crash on startup on Android 7.0 x86_64 emulator (on modern android?)

Categories

(Firefox for Android Graveyard :: Testing, defect, P2)

Tracking

(firefox66 wontfix, firefox67 wontfix, firefox68 fixed)

RESOLVED FIXED
Firefox 68

People

(Reporter: gbrown, Assigned: glandium)

Attachments

(2 files)

Most cppunit tests run okay on the Android 7.0 x86_64 emulator, but 3 tests crash on startup, apparently before main() starts to run.

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=c0b2cd09739c7954073fe691a7d1853b7de66ecc

The problematic tests are:

mozglue/tests/TestPrintf
mozglue/tests/ShowSSEConfig
js/src/jsapi-tests

All fail with a segfault and adb reports an exit code of 139, but there are no minidumps.

Logcats have a minimal crash report:

https://taskcluster-artifacts.net/NpdugNZRSQWkdVuK__i1XA/0/public/test_info//logcat-emulator-5554.log

02-26 00:45:12.360 2280 2280 F libc : Fatal signal 11 (SIGSEGV), code 1, fault addr 0x544fa in tid 2280 (ShowSSEConfig)
02-26 00:45:12.360 988 988 W : debuggerd: handling request: pid=2280 uid=0 gid=0 tid=2280
02-26 00:45:12.380 2281 2281 F DEBUG : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
02-26 00:45:12.380 2281 2281 F DEBUG : Build fingerprint: 'Android/sdk_phone_x86_64/generic_x86_64:7.0/NYC/4174735:userdebug/test-keys'
02-26 00:45:12.380 2281 2281 F DEBUG : Revision: '0'
02-26 00:45:12.380 2281 2281 F DEBUG : ABI: 'x86_64'
02-26 00:45:12.380 2281 2281 F DEBUG : pid: 2280, tid: 2280, name: ShowSSEConfig >>> /data/local/tests/cppunittests/b/ShowSSEConfig <<<
02-26 00:45:12.380 2281 2281 F DEBUG : signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x544fa
02-26 00:45:12.380 2281 2281 F DEBUG : rax 00007f65280c6400 rbx 0000000000000a6f rcx 00007f6527e742ff rdx 0000000000000000
02-26 00:45:12.380 2281 2281 F DEBUG : rsi 00007f65280c64b0 rdi 0000000000000a6f
02-26 00:45:12.380 2281 2281 F DEBUG : r8 000000000000000a r9 00007fffc57c0410 r10 000000000000004d r11 00007f6527f1b808
02-26 00:45:12.380 2281 2281 F DEBUG : r12 0000000000000003 r13 00007f6527f42c88 r14 0000000000000000 r15 00007f6527f3fb70
02-26 00:45:12.380 2281 2281 F DEBUG : cs 0000000000000033 ss 000000000000002b
02-26 00:45:12.380 2281 2281 F DEBUG : rip 00000000000544fa rbp 0000000000000004 rsp 00007fffc57bed08 eflags 0000000000000202
02-26 00:45:12.380 2281 2281 F DEBUG :
02-26 00:45:12.380 2281 2281 F DEBUG : backtrace:
02-26 00:45:12.380 2281 2281 F DEBUG : #00 pc 00000000000544fa <unknown>
02-26 00:45:12.380 1300 1390 W NativeCrashListener: Couldn't find ProcessRecord for pid 2280
02-26 00:45:12.380 1300 1320 I BootReceiver: Copying /data/tombstones/tombstone_00 to DropBox (SYSTEM_TOMBSTONE)
02-26 00:45:12.380 988 988 W : debuggerd: resuming target 2280
02-26 00:45:12.380 988 988 E : debuggerd: failed to send signal 18 to target: No such process
02-26 00:45:12.680 2291 2291 F libc : Fatal signal 11 (SIGSEGV), code 1, fault addr 0x544fa in tid 2291 (TestPrintf)
02-26 00:45:12.680 988 988 W : debuggerd: handling request: pid=2291 uid=0 gid=0 tid=2291
02-26 00:45:12.700 2292 2292 F DEBUG : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
02-26 00:45:12.700 2292 2292 F DEBUG : Build fingerprint: 'Android/sdk_phone_x86_64/generic_x86_64:7.0/NYC/4174735:userdebug/test-keys'
02-26 00:45:12.700 2292 2292 F DEBUG : Revision: '0'
02-26 00:45:12.700 2292 2292 F DEBUG : ABI: 'x86_64'
02-26 00:45:12.700 2292 2292 F DEBUG : pid: 2291, tid: 2291, name: TestPrintf >>> /data/local/tests/cppunittests/b/TestPrintf <<<
02-26 00:45:12.700 2292 2292 F DEBUG : signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x544fa
02-26 00:45:12.700 2292 2292 F DEBUG : rax 00007923f9d12400 rbx 0000000000000a6f rcx 00007923f9ab32ff rdx 0000000000000000
02-26 00:45:12.700 2292 2292 F DEBUG : rsi 00007923f9d124b0 rdi 0000000000000a6f
02-26 00:45:12.700 2292 2292 F DEBUG : r8 000000000000000a r9 00007ffff941fb40 r10 000000000000004d r11 00007923f9b5a808
02-26 00:45:12.700 2292 2292 F DEBUG : r12 0000000000000003 r13 00007923f9b81c88 r14 0000000000000000 r15 00007923f9b7eb70
02-26 00:45:12.700 2292 2292 F DEBUG : cs 0000000000000033 ss 000000000000002b
02-26 00:45:12.700 2292 2292 F DEBUG : rip 00000000000544fa rbp 0000000000000004 rsp 00007ffff941e438 eflags 0000000000000206
02-26 00:45:12.700 2292 2292 F DEBUG :
02-26 00:45:12.700 2292 2292 F DEBUG : backtrace:
02-26 00:45:12.700 2292 2292 F DEBUG : #00 pc 00000000000544fa <unknown>
02-26 00:45:12.700 1300 1390 W NativeCrashListener: Couldn't find ProcessRecord for pid 2291
02-26 00:45:12.700 1300 1320 I BootReceiver: Copying /data/tombstones/tombstone_01 to DropBox (SYSTEM_TOMBSTONE)
02-26 00:45:12.700 988 988 W : debuggerd: resuming target 2291
02-26 00:45:12.700 988 988 E : debuggerd: failed to send signal 18 to target: No such process
02-26 00:45:13.830 1959 1994 D InitAlarmsService: Clearing and rescheduling alarms.
02-26 00:49:42.570 1300 1315 I GnssLocationProvider: WakeLock acquired by handleInjectNtpTime()
02-26 00:49:42.660 1300 2620 D SntpClient: round trip: 80ms, clock offset: 22493ms
02-26 00:49:42.660 1300 2620 I GnssLocationProvider: WakeLock acquired by sendMessage(10, 0, null)
02-26 00:49:42.660 1300 1315 I GnssLocationProvider: WakeLock released by handleMessage(10, 0, null)
02-26 00:49:42.660 1300 2620 I GnssLocationProvider: WakeLock released by handleInjectNtpTime()
02-26 00:54:41.100 1300 1315 I UsageStatsService: Time changed in UsageStats by 2 seconds
02-26 00:54:41.100 1300 1315 I UsageStatsService: User[0] Flushing usage stats to disk
02-26 00:54:41.120 1300 1315 I UsageStatsDatabase: Time changed by +2s269ms. files deleted: 0 files moved: 5
02-26 00:54:41.130 1300 1315 I UsageStatsService: User[0] Rollover scheduled @ 2019-02-27 00:44:36(1551228276491)
02-26 00:58:53.290 2827 2827 F libc : Fatal signal 11 (SIGSEGV), code 1, fault addr 0x544fa in tid 2827 (jsapi-tests)
02-26 00:58:53.290 988 988 W : debuggerd: handling request: pid=2827 uid=0 gid=0 tid=2827
02-26 00:58:53.290 2828 2828 F DEBUG : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
02-26 00:58:53.290 2828 2828 F DEBUG : Build fingerprint: 'Android/sdk_phone_x86_64/generic_x86_64:7.0/NYC/4174735:userdebug/test-keys'
02-26 00:58:53.290 2828 2828 F DEBUG : Revision: '0'
02-26 00:58:53.290 2828 2828 F DEBUG : ABI: 'x86_64'
02-26 00:58:53.290 2828 2828 F DEBUG : pid: 2827, tid: 2827, name: jsapi-tests >>> /data/local/tests/cppunittests/b/jsapi-tests <<<
02-26 00:58:53.290 2828 2828 F DEBUG : signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x544fa
02-26 00:58:53.290 2828 2828 F DEBUG : rax 00007771def0e400 rbx 0000000000000a6f rcx 00007771decc12ff rdx 0000000000000000
02-26 00:58:53.290 2828 2828 F DEBUG : rsi 00007771def0e4b0 rdi 0000000000000a6f
02-26 00:58:53.290 2828 2828 F DEBUG : r8 000000000000000a r9 00007fffa75dd250 r10 000000000000004d r11 00007771ded68808
02-26 00:58:53.290 2828 2828 F DEBUG : r12 0000000000000003 r13 00007771ded8fc88 r14 0000000000000000 r15 00007771ded8cb70
02-26 00:58:53.290 2828 2828 F DEBUG : cs 0000000000000033 ss 000000000000002b
02-26 00:58:53.290 2828 2828 F DEBUG : rip 00000000000544fa rbp 0000000000000004 rsp 00007fffa75dbb38 eflags 0000000000000206
02-26 00:58:53.290 2828 2828 F DEBUG :
02-26 00:58:53.290 2828 2828 F DEBUG : backtrace:
02-26 00:58:53.290 2828 2828 F DEBUG : #00 pc 00000000000544fa <unknown>

All cppunit tests pass on Android 4.3 armv7.

In a local test with an x86 (32 bit) build, TestPrintf and ShowSSEConfig pass on the Android 4.2 x86 emulator; jsapi-tests do not run - I don't know why!

In a local test with an x86 (32 bit) build, TestPrintf and ShowSSEConfig crash on the Android 7.0 x86_64 emulator; jsapi-tests do not run -- I don't know why!

Priority: -- → P2

(In reply to Geoff Brown [:gbrown] from comment #1)

jsapi-tests do not run - I don't know why!

That just seems to be a fault in the 'mach cppunittests' command.

The crashing tests (TestPrintf, ShowSSEConfig, jsapi-tests) all use mozglue; I cannot find a non-crashing cppunit test that uses mozglue.

Running locally, I removed the body of TestPrintf.cpp, reducing it to an empty main() function -- it still crashed. Then I re-linked without libmozglue.so (I captured the normal link command, then removed "../build/libmozglue.so"), and TestPrintf completed without crashing.

Attached file tombstone_07

These crashes produce tombstones; here's one, from a local run.

:glandium - Can you have a look at this? 3 cppunit tests that normally run fine on Android 4.3 crash on Android 7.0. It looks like mozglue is involved.

Flags: needinfo?(mh+mozilla)
See Also: → 1451930

Can you produce and attach a tombstone using an apk from try? The one you attached doesn't correspond to either the opt or the debug build from the try in comment 0.

Flags: needinfo?(mh+mozilla) → needinfo?(gbrown)
Flags: needinfo?(gbrown)
Flags: needinfo?(mh+mozilla)

btw, we ran cppunit tests on Android 8.0 on real hardware (a Pixel 2 phone) and reproduced the same crashes:

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=6ffffeea5cfca59e57f5e2cb0b3259ea1459697e

(In reply to Geoff Brown [:gbrown] from comment #9)

btw, we ran cppunit tests on Android 8.0 on real hardware (a Pixel 2 phone) and reproduced the same crashes:

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=6ffffeea5cfca59e57f5e2cb0b3259ea1459697e

That may not be the same crash.

(In reply to Geoff Brown [:gbrown] from comment #8)

Here is a new run, reproducing the crashes with tombstones:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=231290593&repo=try&lineNumber=1377

See the tombstone_xx artifact links in the logviewer, like https://queue.taskcluster.net/v1/task/B98NHxKUQ7aMas5Bpjr18A/runs/0/artifacts/public/test_info//tombstone_00.

This one is not even the same as the one in comment 5.

Flags: needinfo?(mh+mozilla)

Can you do another try with LD_DEBUG set to 2 in the environment of the cppunit tests?

(In reply to Mike Hommey [:glandium] from comment #10)

That may not be the same crash.

Good point. I only meant that the same tests crash on Android 8.0 on the Pixel 2 as on the Android 7.0 x86_64 emulator.

(In reply to Mike Hommey [:glandium] from comment #11)

Can you do another try with LD_DEBUG set to 2 in the environment of the cppunit tests?

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f4b26421f6982152d3dcdc4b814e5987affc740b

I was able to get a green task with pthread_atfork removed from mozglue/build/BionicGlue.cpp and disabling mozjemalloc. I have a vague idea why the former causes problems, but not the latter. How can we get this in a real debugger?

I wonder if it's possible to extract all the files in an android emulator image, and use that as a chroot on a linux host. Since those are unit tests that don't involve dalvik, and probably no android-specific system calls, that could work.

After running 'mach cppunittest' on a device, you can run a test manually with

adb shell 'export MOZ_CRASHREPORTER=1&& export XPCOM_DEBUG_BREAK=stack-and-abort&& export MOZ_XRE_DIR=/data/local/tests/cppunittests/b&& export HOME=/data/local/tests/cppunittests/h&& export MOZ_CRASHREPORTER_NO_REPORT=1&& export LD_LIBRARY_PATH=/data/local/tests/cppunittests/b&& export TMPDIR=/data/local/tests/cppunittests/tmp&& cd /data/local/tests/cppunittests/h && /data/local/tests/cppunittests/b/TestPrintf'

Does that help?
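For anyone scripting this, here is a minimal Python sketch that rebuilds the same shell string for an arbitrary test and lets you add extra variables such as the LD_DEBUG=2 requested earlier. The helper name cppunit_shell_command and its parameters are hypothetical, not part of mach or any Mozilla tooling; the environment values are copied from the command above.

```python
import shlex


def cppunit_shell_command(test_name, extra_env=None,
                          root="/data/local/tests/cppunittests"):
    """Build the shell string to pass to `adb shell` (hypothetical helper)."""
    # Environment copied from the manual command in this comment.
    env = {
        "MOZ_CRASHREPORTER": "1",
        "XPCOM_DEBUG_BREAK": "stack-and-abort",
        "MOZ_XRE_DIR": f"{root}/b",
        "HOME": f"{root}/h",
        "MOZ_CRASHREPORTER_NO_REPORT": "1",
        "LD_LIBRARY_PATH": f"{root}/b",
        "TMPDIR": f"{root}/tmp",
    }
    # Extra variables (for example LD_DEBUG) are appended after the defaults.
    env.update(extra_env or {})
    exports = " && ".join(
        f"export {name}={shlex.quote(value)}" for name, value in env.items()
    )
    return f"{exports} && cd {root}/h && {root}/b/{test_name}"


if __name__ == "__main__":
    print(cppunit_shell_command("TestPrintf", {"LD_DEBUG": "2"}))
```

The resulting string could then be handed to something like subprocess.run(["adb", "shell", cmd]), assuming adb is on the PATH and the test files were already pushed by a previous 'mach cppunittest' run.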

The crashing tests started passing for me today, in local runs only, after I updated my local mozconfig. I am trying to determine which specific mozconfig change was responsible.

(In reply to Geoff Brown [:gbrown] from comment #17)

The crashing tests started passing for me today, in local runs only, after I updated my local mozconfig. I am trying to determine which specific mozconfig change was responsible.

MOZILLA_OFFICIAL: If mozconfig does not define MOZILLA_OFFICIAL, all tests pass.

(In reply to Geoff Brown [:gbrown] from comment #18)

(In reply to Geoff Brown [:gbrown] from comment #17)

The crashing tests started passing for me today, in local runs only, after I updated my local mozconfig. I am trying to determine which specific mozconfig change was responsible.

MOZILLA_OFFICIAL: If mozconfig does not define MOZILLA_OFFICIAL, all tests pass.

I tried that on automation, and the build fails in some gradle stuff that doesn't happen in MOZILLA_OFFICIAL builds...

(In reply to Mike Hommey [:glandium] from comment #19)

I tried that on automation, and the build fails in some gradle stuff that doesn't happen in MOZILLA_OFFICIAL builds...

Yes, I got that too. The gradle error can be worked around, but then something else goes wrong... Sigh.

(In reply to Mike Hommey [:glandium] from comment #14)

I was able to get a green task with pthread_atfork removed from mozglue/build/BionicGlue.cpp and disabling mozjemalloc. I have a vague idea why the former causes problems, but not the latter.

That sounds promising. And there is a relation between jemalloc and pthread_atfork, right? https://searchfox.org/mozilla-central/rev/2c912888e3b7ae4baf161d98d7a01434f31830aa/mozglue/build/BionicGlue.cpp#28

How can we get this in a real debugger?

"Real debugger"? Not just gdb? :snorp might be able to help.

(In reply to Mike Hommey [:glandium] from comment #19)

I tried that on automation, and the build fails in some gradle stuff that doesn't happen in MOZILLA_OFFICIAL builds...

glandium, what are the next steps to diagnose this bug? What is the relationship between MOZILLA_OFFICIAL and mozjemalloc/pthread_atfork? That sounds pretty random.

Is this cppunit crash the same issue as the xpcshell crash bug 1451930?

Flags: needinfo?(mh+mozilla)

(In reply to Chris Peterson [:cpeterson] from comment #22)

(In reply to Mike Hommey [:glandium] from comment #19)

I tried that on automation, and the build fails in some gradle stuff that doesn't happen in MOZILLA_OFFICIAL builds...

glandium, what are the next steps to diagnose this bug?

I guess I can try to reproduce it, now that I got an x86 emulator image locally after bug 1535196.

What is the relationship between MOZILLA_OFFICIAL and mozjemalloc/pthread_atfork? That sounds pretty random.

There shouldn't be. And I don't know how to make sense of it, considering the build fails without MOZILLA_OFFICIAL.

Is this cppunit crash the same issue as the xpcshell crash bug 1451930?

Maybe?

Flags: needinfo?(mh+mozilla)

I was able to reproduce locally, even outside the emulator, and a combination of three things is involved on our end:

  • mozjemalloc
  • pthread_atfork
  • elfhack

So here's the deal:

  • Executing e.g. TestPrintf makes the system linker open the libraries it depends on.
  • The first one it depends on is libmozglue.so.
  • Then both TestPrintf and libmozglue.so depend on liblog.so, libstdc++.so, libm.so, libdl.so and libc.so.
  • So, the linker loads all those libraries, and executes their static initializers.
  • The static initializers have to run in an order that makes sense wrt the library dependencies, so the static initializers for e.g. liblog.so need to run before libmozglue.so's.
  • Obviously, libc.so's static initializers run first.
  • The first libc.so static initializer is __libc_preinit which does nothing very interesting besides calling __libc_preinit_impl, which then proceeds to call other functions, one of which is _Z18__libc_init_commonv, which then calls pthread_atfork.
  • This is where things go awry.
  • Because pthread_atfork is provided by libmozglue.so, the pthread_atfork call from _Z18__libc_init_commonv ends up in libmozglue.so's implementation, and that fails for some reason I'm not too sure about. I don't actually get the same kind of error that e.g. comment 0 pointed to, but whatever... it turns out we don't need our pthread_atfork anymore -> bug 1545007.
  • Now, with pthread_atfork out of the way, libc.so's pthread_atfork can be used, and that calls __register_atfork, which calls... malloc.
  • Guess what else libmozglue.so provides? malloc!
  • So here again, we end up in mozglue kind of semi-unexpectedly. But things should work here, right?
  • Well, the problem here is that we're in static initialization code, before libmozglue.so's static initialization code has run. Remember what the last of the three things I listed earlier was? elfhack. Elfhack applies relocations via a static initializer. IOW, libmozglue.so is not relocated by the time malloc is called, so some memory access and/or function call ends up using a non-relocated address, and kaboom.

So, with the patch from bug 1545007 applied, and mozjemalloc disabled, this works: libmozglue.so no longer provides malloc, so libc.so's pthread_atfork call from static initialization doesn't end up in libmozglue.so code before libmozglue.so's own static initialization has run.

And with no other change than disabling elfhack it works too, because libmozglue is fully relocated by the time libc.so's static initializers run.

I'm not entirely sure how to fix this, but one possible workaround is to set LD_PRELOAD to /bionic/lib64/libc.so, which forces libc.so to be used for symbols that also appear in libmozglue.so.
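To make the sequence above easier to follow, here is a toy Python model of it. It is only a sketch under simplifying assumptions: symbol lookup walks a flat global load order and picks the first definition, an elfhacked library counts as unrelocated until its own initializer has run, and bug 1545007 is assumed applied so libmozglue.so no longer provides pthread_atfork. The names Lib, resolve, and scenario are made up for illustration and correspond to nothing in the linker or in mozglue.

```python
from dataclasses import dataclass


@dataclass
class Lib:
    name: str
    defines: set
    elfhacked: bool = False   # relocations applied by the library's own initializer
    relocated: bool = False


def resolve(symbol, load_order):
    """Return the first library in the global load order that defines symbol."""
    for lib in load_order:
        if symbol in lib.defines:
            return lib
    raise LookupError(symbol)


def run_libc_init(load_order):
    """Model libc.so's initializer calling malloc before any other initializer."""
    # The system linker relocates ordinary libraries before running initializers;
    # an elfhacked library only becomes usable once its own initializer has run.
    for lib in load_order:
        lib.relocated = not lib.elfhacked
    target = resolve("malloc", load_order)
    if not target.relocated:
        return f"crash: malloc in unrelocated {target.name}"
    return f"ok: malloc from {target.name}"


def scenario(elfhack=True, mozjemalloc=True, preload_libc=False):
    libc = Lib("libc.so", {"malloc", "pthread_atfork"})
    mozglue = Lib("libmozglue.so", {"malloc"} if mozjemalloc else set(),
                  elfhacked=elfhack)
    order = [mozglue, libc]          # libmozglue.so is the first dependency
    if preload_libc:
        order.insert(0, libc)        # LD_PRELOAD puts libc.so ahead in lookup
    return run_libc_init(order)


if __name__ == "__main__":
    for kwargs in ({}, {"elfhack": False}, {"mozjemalloc": False},
                   {"preload_libc": True}):
        print(kwargs, "->", scenario(**kwargs))
```

In this model only the default configuration crashes; disabling elfhack, disabling mozjemalloc, or preloading libc.so each make the malloc call land in relocated code, which matches the three observations above.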

Assignee: nobody → mh+mozilla
Depends on: 1545007

See comment 24 in the bug for details on what can go wrong without this
change. This change ensures system libraries are not going to pick
symbols from mozglue when running processes outside dalvik.

As a side effect, this makes things kind of closer to what happens when
dalvik is involved, exposing unit tests to possible allocator mismatches
that could happen like bug 1531887.

On the flip side, libraries that link against mozglue explicitly are
going to get a reference to the versioned symbols, so everything is fine
in that regard. The custom linker, however, will ignore the versions
altogether, and its symbol resolution just ends up unchanged. So we're
fine there too.

We use something that is close to what using a SYMBOLS_FILE would
generate as a version script, but we need to do so manually because
SYMBOLS_FILE doesn't support exporting all the symbols.
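As a rough illustration of why versioning the symbols helps, here is a toy Python model of versioned lookup. It is a simplification, not the real linker algorithm: a reference that carries a version only binds to a definition with the same version (or to an unversioned definition), while a reference without a version binds to the first definition found. The version labels "LIBC" and "libmozglue.so" are assumptions chosen for the example; the actual labels used by bionic and by the patch may differ.

```python
def lookup(symbol, wanted_version, load_order):
    """Return the name of the first library whose definition satisfies the reference.

    load_order is a list of (library name, {symbol: version or None}) pairs.
    """
    for lib_name, exports in load_order:
        if symbol not in exports:
            continue
        defined_version = exports[symbol]
        if (wanted_version is None            # reference ignores versions
                or defined_version is None    # definition is unversioned
                or defined_version == wanted_version):
            return lib_name
    return None


# Before the change: mozglue's malloc is unversioned and sits first in lookup order.
before = [("libmozglue.so", {"malloc": None}),
          ("libc.so", {"malloc": "LIBC"})]

# After the change: mozglue's symbols carry their own (hypothetical) version label.
after = [("libmozglue.so", {"malloc": "libmozglue.so"}),
         ("libc.so", {"malloc": "LIBC"})]

if __name__ == "__main__":
    print("libc reference, before:", lookup("malloc", "LIBC", before))
    print("libc reference, after: ", lookup("malloc", "LIBC", after))
    print("mozglue client, after: ", lookup("malloc", "libmozglue.so", after))
    print("custom linker, after:  ", lookup("malloc", None, after))
```

Under these assumptions, libc.so's own reference stops binding to mozglue once mozglue's symbols are versioned, a library linked explicitly against mozglue still gets mozglue's malloc, and a linker that ignores versions behaves as before.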

Pushed by mh@glandium.org: https://hg.mozilla.org/integration/autoland/rev/9bcfda8f31f0 Version the mozglue symbols on Android. r=froydnj
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 68

(In reply to Pulsebot from comment #26)

Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/9bcfda8f31f0
Version the mozglue symbols on Android. r=froydnj

author Geoff Brown <gbrown@mozilla.com>

WTH Lando?

This totally unblocks my efforts to run cppunit and xpcshell tests on modern Android.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f5012919d5faea3840eebb9a9ee6d63c33bc293c

Many thanks!

(In reply to Geoff Brown [:gbrown] from comment #29)

This totally unblocks my efforts to run cppunit and xpcshell tests on modern Android.

Is there a bug tracking this?

Flags: needinfo?(gbrown)
Blocks: 1546553
Flags: needinfo?(gbrown)
Product: Firefox for Android → Firefox for Android Graveyard