Closed Bug 1584976 Opened 2 years ago Closed 2 years ago

LUL on x86_64-{android, linux}: accept .eh_frame with type as either SHT_PROGBITS or SHT_X86_64_UNWIND

Categories

(Core :: Gecko Profiler, defect, P1)

Tracking

()

VERIFIED FIXED
mozilla72
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- unaffected
firefox69 --- unaffected
firefox70 --- unaffected
firefox71 + fixed
firefox72 --- verified

People

(Reporter: dluca, Assigned: jseward)

References

(Regression)

Details

(Keywords: regression)

Attachments

(3 files)

[Tracking Requested - why for this release]:

Central as beta:
https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=269028923&resultStatus=success%2Ctestfailed%2Cbusted%2Cexception%2Cusercancel%2Crunnable&revision=18f827ea7975724deef68b1629513fbc3a8cf496

Failure log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=269028923&repo=try&lineNumber=5203

  INFO -  TEST-START | AccessibleCaretManagerTester.TestScrollInCursorModeWithCaretShownWhenLongTappingOnEmptyContentPref
[task 2019-09-30T11:55:19.961Z] 11:55:19     INFO -  TEST-PASS | AccessibleCaretManagerTester.TestScrollInCursorModeWithCaretShownWhenLongTappingOnEmptyContentPref | test completed (time: 0ms)
[task 2019-09-30T11:55:19.961Z] 11:55:19     INFO -  TEST-START | LulIntegration.unwind_consistency
[task 2019-09-30T11:55:19.962Z] 11:55:19  WARNING -  TEST-UNEXPECTED-FAIL | LulIntegration.unwind_consistency | Value of: nTestsPassed == nTests
[task 2019-09-30T11:55:19.962Z] 11:55:19     INFO -    Actual: false
[task 2019-09-30T11:55:19.962Z] 11:55:19     INFO -  Expected: true
[task 2019-09-30T11:55:19.962Z] 11:55:19     INFO -  Not all tests passed @ /builds/worker/workspace/build/src/tools/profiler/tests/gtest/LulTest.cpp:48
[task 2019-09-30T11:55:19.962Z] 11:55:19  WARNING -  TEST-UNEXPECTED-FAIL | LulIntegration.unwind_consistency | test completed (time: 54ms)
[task 2019-09-30T11:55:19.962Z] 11:55:19     INFO -  TEST-START | LulDwarfCFI.EmptyRegion
[task 2019-09-30T11:55:19.962Z] 11:55:19     INFO -  TEST-PASS | LulDwarfCFI.EmptyRegion | test completed (time: 0ms)
[task 2019-09-30T11:55:19.962Z] 11:55:19     INFO -  TEST-START | LulDwarfCFI.IncompleteLength32
[task 2019-09-30T11:55:19.962Z] 11:55:19     INFO -  TEST-PASS | LulDwarfCFI.IncompleteLength32 | test completed (time: 0ms)
[task 2019-09-30T11:55:19.962Z] 11:55:19     INFO -  TEST-START | LulDwarfCFI.IncompleteLength64
[task 2019-09-30T11:55:19.962Z] 11:55:19     INFO -  TEST-PASS | LulDwarfCFI.IncompleteLength64 | test completed (time: 1ms)

The solution proposed in bug 1583868 (which was due to the no-sampling mode) doesn't fix this one here.

I'll need to investigate further...

Flags: needinfo?(gsquelart)
Priority: -- → P1
See Also: 1583868

Update: (I've learned many ways to fail)

I had a quick look through the pushlog from comment 0, but didn't see anything related to LUL. I'll try to explore further tomorrow.

I made the failing assertion more explicit, and added some more logging:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=269372732&repo=try&lineNumber=5210
This shows that none of the 6 tests pass! But there's no logging appearing, so that's not helping much.

I managed to run the Android simulator (first time!), and to build&run geckoview, but I could not run gtest:

gtest TEST-UNEXPECTED-FAIL | gtest | org.mozilla.geckoview.test failed to start

And I also didn't manage to connect the firefox debugger to geckoview in the simulator.

Reading the test code in LulMain.cpp, I see that it relies on the code not getting too optimized.
Around the time of the failure, clang 9 landed for a short time. Could it be related? That could explain why the failure only happens in the "opt" build.
One of the patches in the pushlog is "Bug 1583907 - Add MOZ_NEVER_INLINE to LifoAlloc::mark to work around Clang 9 miscompilation on Android."
But then, the clang 9 patch was quickly backed out, so that's probably not it. And the test is still failing as of Oct 2, 20:42:10 AEST (3 hours before this comment):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d90ffb989758014ec4219049634172554013fc48&selectedJob=269406767

I'll probably need help, either with running&debugging gtest on Android, or directly with the failure in the profiler code or test.

Started with bug 1577220 (update to the r20 Android NDK) according to bisection.

James, Nathan is away, can you help Gerald (see comment 3)?

Flags: needinfo?(snorp)
Regressed by: 1577220

Nick, can you help with getting the issue resolved by the end of this week (Gecko 71 syncs to the beta branch starting next Monday)? Thank you.

Flags: needinfo?(nalexander)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #6)

Nick, can you help with getting the issue resolved by the end of this week (Gecko 71 syncs to the beta branch starting next Monday)? Thank you.

I can try, but my role will have to be more traffic director than anything else. Nathan and James are both out this week. I expect Markus will have the most context on the interactions between the profiler (LUL) and Android, so I'll NI him. I have never tried to run gtest locally at all, but I may be able to help with that; I vaguely recall a conversation about producing a gtest APK but that never went anywhere (Bug 1544496). I do, however, see some additional issues with LUL and gtest in some work gbrown did: Bug 1558885.

markus: can you see what you can add to Gerald's analysis? This is well outside of my knowledge.

gbrown: does anything here look familiar? This is really about trying to get more eyes on this issue...

Flags: needinfo?(nalexander)

Please see comment 7.

Flags: needinfo?(mstange)
Flags: needinfo?(gbrown)

I'm not familiar with this test or its domain, but I did help set up gtest for android.

Have folks noticed the linker errors in the logcat ("logcat-emulator-5554.log" artifact linked to the test task)? I see stuff like:

10-02 07:43:56.199  3996  4011 I gtest   : TEST-START | LulIntegration.unwind_consistency
10-02 07:43:56.206  3996  4011 W GeckoLinker: /system/bin/app_process64: dynamic header type #15 not handled
10-02 07:43:56.206  3996  4011 W GeckoLinker: /system/bin/app_process64: dynamic header type #6ffffef5 not handled
10-02 07:43:56.206  3996  4011 W GeckoLinker: /system/bin/app_process64: dynamic header type #20 not handled
10-02 07:43:56.206  3996  4011 W GeckoLinker: /system/bin/app_process64: dynamic header type #21 not handled
10-02 07:43:56.206  3996  4011 W GeckoLinker: /system/bin/app_process64: unhandled flags #8 not handled
10-02 07:43:56.206  3996  4011 E GeckoLinker: /system/bin/app_process64: Missing or broken DT_HASH
10-02 07:43:56.206  3996  4011 W GeckoLinker: /system/lib64/libcutils.so: dynamic header type #6ffffef5 not handled
10-02 07:43:56.206  3996  4011 W GeckoLinker: /system/lib64/libcutils.so: unhandled flags #8 not handled
10-02 07:43:56.206  3996  4011 E GeckoLinker: /system/lib64/libcutils.so: Missing or broken DT_HASH
10-02 07:43:56.207  3996  4011 E linker  : library "/system/lib64/libcutils.so" ("/system/lib64/libcutils.so") needed or dlopened by "/data/app/org.mozilla.geckoview.test-1/lib/x86_64/libmozglue.so" is not accessible for the namespace: [name="classloader-namespace", ld_library_paths="", default_library_paths="/data/app/org.mozilla.geckoview.test-1/lib/x86_64:/data/app/org.mozilla.geckoview.test-1/base.apk!/lib/x86_64", permitted_paths="/data:/mnt/expand:/data/data/org.mozilla.geckoview.test"]
...

However, I see similar warnings and errors on mozilla-central, so maybe that's not useful.

In bug 1558885 I have a long-dormant patch which moves the gtest libxul.so to a new path, outside of the apk install directory; that patch generally works, but breaks exactly one test: unwind_consistency. I've never found the time to investigate.

I have verified that 'mach gtest' and 'mach gtest <test>', like 'mach gtest LulIntegration.unwind_consistency' work for me on the x86_64 emulator. It is important to run from an x86_64 build: If run from an arm build, the arm emulator will be used, and gtest will fail because of bug 1558885. Even an x86 build may fail (I'm not sure).

My mozconfig is:

ac_add_options --enable-application=mobile/android
ac_add_options --target=x86_64
ac_add_options --enable-debug
Flags: needinfo?(gbrown)

(In reply to Geoff Brown [:gbrown] from comment #9)

I'm not familiar with this test or its domain, but I did help set up gtest for android.

Have folks noticed the linker errors in the logcat ("logcat-emulator-5554.log" artifact linked to the test task)? I see stuff like:

10-02 07:43:56.199  3996  4011 I gtest   : TEST-START | LulIntegration.unwind_consistency
10-02 07:43:56.206  3996  4011 W GeckoLinker: /system/bin/app_process64: dynamic header type #15 not handled
10-02 07:43:56.206  3996  4011 W GeckoLinker: /system/bin/app_process64: dynamic header type #6ffffef5 not handled
10-02 07:43:56.206  3996  4011 W GeckoLinker: /system/bin/app_process64: dynamic header type #20 not handled
10-02 07:43:56.206  3996  4011 W GeckoLinker: /system/bin/app_process64: dynamic header type #21 not handled
10-02 07:43:56.206  3996  4011 W GeckoLinker: /system/bin/app_process64: unhandled flags #8 not handled
10-02 07:43:56.206  3996  4011 E GeckoLinker: /system/bin/app_process64: Missing or broken DT_HASH
10-02 07:43:56.206  3996  4011 W GeckoLinker: /system/lib64/libcutils.so: dynamic header type #6ffffef5 not handled
10-02 07:43:56.206  3996  4011 W GeckoLinker: /system/lib64/libcutils.so: unhandled flags #8 not handled
10-02 07:43:56.206  3996  4011 E GeckoLinker: /system/lib64/libcutils.so: Missing or broken DT_HASH
10-02 07:43:56.207  3996  4011 E linker  : library "/system/lib64/libcutils.so" ("/system/lib64/libcutils.so") needed or dlopened by "/data/app/org.mozilla.geckoview.test-1/lib/x86_64/libmozglue.so" is not accessible for the namespace: [name="classloader-namespace", ld_library_paths="", default_library_paths="/data/app/org.mozilla.geckoview.test-1/lib/x86_64:/data/app/org.mozilla.geckoview.test-1/base.apk!/lib/x86_64", permitted_paths="/data:/mnt/expand:/data/data/org.mozilla.geckoview.test"]
...

However, I see similar warnings and errors on mozilla-central, so maybe that's not useful.

Indeed, this is a restriction of Android's. It's possible that it will impact this ticket, although since we're not seeing crashes during the test, perhaps not. See Bug 1580999 for issues profiling i686 builds on the x86/x86_64 emulator, which may be relevant here. If profiling x86_64 builds works, that would be great news for me: I will try locally shortly.

I don't have anything I can add here, sorry. Maybe Julian can?

Flags: needinfo?(mstange) → needinfo?(jseward)

Am looking at this now.

Flags: needinfo?(jseward)

My initial reaction to the report is to think that the central-as-beta builds have had some change of compiler, or build flags, or something else that changes the debug information (Dwarf CFI) created. This somehow is causing a problem with LUL, which reads and processes that data, from all the shared objects we create, most notably libxul.so. Was there some change of compiler, flags, or anything related to debug info, that happened just prior to 28 Sept?

I'm also unclear what the actual build target is. I see android-em-7.0-x86_64, but what does that actually mean?

android-em-7.0-x86_64 is more a description of the test environment: An emulator running Android 7.0 with tests running against an --target=x86_64 build of geckoview.

I can't reproduce the failure with a native build on x86_64-linux, using gcc 8.3.1 and -g -Og. I'll try again with clang 8 or 9.

If anybody has a build that shows the failure, it would be useful to set DEBUG_LUL_TEST to 1 in tools/profiler/tests/gtest/LulTest.cpp, to see if that gives any useful info.

I am generally confused by the appearance of two copies of Lul in the tree, at tools/profiler/lul and mozglue/baseprofiler/lul. How are they related? Do we have to bug-fix both of them?

An attempt to build with clang9 on x86_64-linux ended in a link failure. Can anyone advise me on how to build and repro this problem locally?

If anybody has a build that shows the failure, it would be useful to set DEBUG_LUL_TEST to 1 in tools/profiler/tests/gtest/LulTest.cpp, to see if that gives any useful info.

Would that output be shown in automation? https://treeherder.mozilla.org/#/jobs?repo=try&revision=73ad9e1418386368ebdfeb8e269892806b172bcc&selectedJob=270899120 doesn't show more info.

(In reply to Julian Seward [:jseward] from comment #19)

An attempt to build with clang9 on x86_64-linux ended in a link failure. Can anyone advise me on how to build and repro this problem locally?

In the past, I've had to add the following to my mozconfig:

CC="/Users/mstange/.mozbuild/clang/bin/clang"
CXX="/Users/mstange/.mozbuild/clang/bin/clang++"

I don't know if that'll fix your problem, though. I don't have a build that reproduces the problem that this bug is about.

Looking at the emulator adb logcat, I see:

10-11 17:08:05.960  3994  4009 I gtest   : TEST-START | LulIntegration.unwind_consistency
10-11 17:08:05.950  3994  3994 I Gecko   : type=1400 audit(0.0:4): avc: denied { getattr } for path="/data/local/gtest/libxul.so" dev="vdc" ino=21241 scontext=u:r:untrusted_app:s0:c512,c768 tcontext=u:object_r:system_data_file:s0 tclass=file permissive=1
10-11 17:08:05.969  3994  4009 W GeckoLinker: /system/bin/app_process64: dynamic header type #15 not handled
10-11 17:08:05.969  3994  4009 W GeckoLinker: /system/bin/app_process64: dynamic header type #6ffffef5 not handled
10-11 17:08:05.969  3994  4009 W GeckoLinker: /system/bin/app_process64: dynamic header type #20 not handled
10-11 17:08:05.969  3994  4009 W GeckoLinker: /system/bin/app_process64: dynamic header type #21 not handled
10-11 17:08:05.969  3994  4009 W GeckoLinker: /system/bin/app_process64: unhandled flags #8 not handled
10-11 17:08:05.969  3994  4009 E GeckoLinker: /system/bin/app_process64: Missing or broken DT_HASH
10-11 17:08:05.969  3994  4009 W GeckoLinker: /system/lib64/libcutils.so: dynamic header type #6ffffef5 not handled
10-11 17:08:05.969  3994  4009 W GeckoLinker: /system/lib64/libcutils.so: unhandled flags #8 not handled
10-11 17:08:05.969  3994  4009 E GeckoLinker: /system/lib64/libcutils.so: Missing or broken DT_HASH
...
10-11 17:08:06.000  3994  4009 W GeckoLinker: /system/lib64/libstagefright_omx.so: dynamic header type #6ffffef5 not handled
10-11 17:08:06.000  3994  4009 W GeckoLinker: /system/lib64/libstagefright_omx.so: unhandled flags #8 not handled
10-11 17:08:06.000  3994  4009 E GeckoLinker: /system/lib64/libstagefright_omx.so: Missing or broken DT_HASH
10-11 17:08:06.000  3994  4009 E linker  : library "/system/lib64/libstagefright_omx.so" ("/system/lib64/libstagefright_omx.so") needed or dlopened by "/data/app/org.mozilla.geckoview.test-1/lib/x86_64/libmozglue.so" is not accessible for the namespace: [name="classloader-namespace", ld_library_paths="", default_library_paths="/data/app/org.mozilla.geckoview.test-1/lib/x86_64:/data/app/org.mozilla.geckoview.test-1/base.apk!/lib/x86_64", permitted_paths="/data:/mnt/expand:/data/data/org.mozilla.geckoview.test"]
10-11 17:08:06.000  3994  4009 W GeckoLinker: /system/lib64/libstagefright_yuv.so: dynamic header type #6ffffef5 not handled
10-11 17:08:06.000  3994  4009 W GeckoLinker: /system/lib64/libstagefright_yuv.so: unhandled flags #8 not handled
10-11 17:08:06.000  3994  4009 E GeckoLinker: /system/lib64/libstagefright_yuv.so: Missing or broken DT_HASH
10-11 17:08:06.000  3994  4009 E linker  : library "/system/lib64/libstagefright_yuv.so" ("/system/lib64/libstagefright_yuv.so") needed or dlopened by "/data/app/org.mozilla.geckoview.test-1/lib/x86_64/libmozglue.so" is not accessible for the namespace: [name="classloader-namespace", ld_library_paths="", default_library_paths="/data/app/org.mozilla.geckoview.test-1/lib/x86_64:/data/app/org.mozilla.geckoview.test-1/base.apk!/lib/x86_64", permitted_paths="/data:/mnt/expand:/data/data/org.mozilla.geckoview.test"]
10-11 17:08:06.001  3994  4009 W GeckoLinker: /system/lib64/libvorbisidec.so: dynamic header type #6ffffef5 not handled
10-11 17:08:06.001  3994  4009 W GeckoLinker: /system/lib64/libvorbisidec.so: unhandled flags #8 not handled
10-11 17:08:06.001  3994  4009 E GeckoLinker: /system/lib64/libvorbisidec.so: Missing or broken DT_HASH
10-11 17:08:06.001  3994  4009 E linker  : library "/system/lib64/libvorbisidec.so" ("/system/lib64/libvorbisidec.so") needed or dlopened by "/data/app/org.mozilla.geckoview.test-1/lib/x86_64/libmozglue.so" is not accessible for the namespace: [name="classloader-namespace", ld_library_paths="", default_library_paths="/data/app/org.mozilla.geckoview.test-1/lib/x86_64:/data/app/org.mozilla.geckoview.test-1/base.apk!/lib/x86_64", permitted_paths="/data:/mnt/expand:/data/data/org.mozilla.geckoview.test"]
10-11 17:08:06.001  3994  4009 W GeckoLinker: /system/lib64/libpowermanager.so: dynamic header type #6ffffef5 not handled
10-11 17:08:06.001  3994  4009 W GeckoLinker: /system/lib64/libpowermanager.so: unhandled flags #8 not handled
10-11 17:08:06.001  3994  4009 E GeckoLinker: /system/lib64/libpowermanager.so: Missing or broken DT_HASH
<snip>
10-11 17:08:06.010  3994  4009 W GeckoLinker: /system/lib64/libwebviewchromium_loader.so: unhandled flags #8 not handled
10-11 17:08:06.010  3994  4009 E GeckoLinker: /system/lib64/libwebviewchromium_loader.so: Missing or broken DT_HASH
10-11 17:08:06.010  3994  4009 E linker  : library "/system/lib64/libwebviewchromium_loader.so" ("/system/lib64/libwebviewchromium_loader.so") needed or dlopened by "/data/app/org.mozilla.geckoview.test-1/lib/x86_64/libmozglue.so" is not accessible for the namespace: [name="classloader-namespace", ld_library_paths="", default_library_paths="/data/app/org.mozilla.geckoview.test-1/lib/x86_64:/data/app/org.mozilla.geckoview.test-1/base.apk!/lib/x86_64", permitted_paths="/data:/mnt/expand:/data/data/org.mozilla.geckoview.test"]
10-11 17:08:06.010  3994  4009 E GeckoLinker: /system/framework/oat/x86_64/android.test.runner.odex: Failed to mmap
10-11 17:08:06.010  3994  4009 E linker  : library "/system/framework/oat/x86_64/android.test.runner.odex" ("/system/framework/oat/x86_64/android.test.runner.odex") needed or dlopened by "/data/app/org.mozilla.geckoview.test-1/lib/x86_64/libmozglue.so" is not accessible for the namespace: [name="classloader-namespace", ld_library_paths="", default_library_paths="/data/app/org.mozilla.geckoview.test-1/lib/x86_64:/data/app/org.mozilla.geckoview.test-1/base.apk!/lib/x86_64", permitted_paths="/data:/mnt/expand:/data/data/org.mozilla.geckoview.test"]
10-11 17:08:06.010  3994  4009 E GeckoLinker: /data/app/org.mozilla.geckoview.test-1/oat/x86_64/base.odex: Failed to mmap
10-11 17:08:06.010  3994  4009 W GeckoLinker: /data/app/org.mozilla.geckoview.test-1/lib/x86_64/liblgpllibs.so: unhandled flags #8 not handled
10-11 17:08:06.010  3994  4009 W GeckoLinker: /system/lib64/hw/gralloc.ranchu.so: dynamic header type #6ffffef5 not handled
10-11 17:08:06.010  3994  4009 W GeckoLinker: /system/lib64/hw/gralloc.ranchu.so: unhandled flags #8 not handled
10-11 17:08:06.010  3994  4009 E GeckoLinker: /system/lib64/hw/gralloc.ranchu.so: Missing or broken DT_HASH
10-11 17:08:06.010  3994  4009 E linker  : library "/system/lib64/hw/gralloc.ranchu.so" ("/system/lib64/hw/gralloc.ranchu.so") needed or dlopened by "/data/app/org.mozilla.geckoview.test-1/lib/x86_64/libmozglue.so" is not accessible for the namespace: [name="classloader-namespace", ld_library_paths="", default_library_paths="/data/app/org.mozilla.geckoview.test-1/lib/x86_64:/data/app/org.mozilla.geckoview.test-1/base.apk!/lib/x86_64", permitted_paths="/data:/mnt/expand:/data/data/org.mozilla.geckoview.test"]
10-11 17:08:06.010  3994  4009 I gtest   : TEST-UNEXPECTED-FAIL | LulIntegration.unwind_consistency | Value of: nTestsPassed == nTests
10-11 17:08:06.010  3994  4009 I gtest   :   Actual: false
10-11 17:08:06.010  3994  4009 I gtest   : Expected: true
10-11 17:08:06.010  3994  4009 I gtest   : Not all tests passed @ /builds/worker/workspace/build/src/tools/profiler/tests/gtest/LulTest.cpp:48

That all looks like (existing) Android restrictions, see discussion in https://bugzilla.mozilla.org/show_bug.cgi?id=1580999#c5. I don't see anything that looks like extra logging.

Aryx, did you enable the define in that try build? Also, can these tests be run against the debug (not opt) build?

jseward, if we can run against debug builds, then you might get good mileage from an artifact build with the try build artifacts (assuming we can run gtest against artifact builds, which I have never tried). gbrown might be able to confirm. The reason I think that using artifact builds with exactly those artifacts will be fruitful is that we think this is a compiler issue, and therefore the precise toolchain and build environment matters greatly.

I usually build (against mozilla-central) with a mozconfig of only:

ac_add_options --enable-debug
ac_add_options --enable-application=mobile/android
ac_add_options --target=x86_64

Then you can 'mach gtest' or 'mach gtest LulIntegration.unwind_consistency' to run in the emulator.

nalexander's caution about using the exact same artifacts seems wise. I have never tried that with gtest, and I think it might be difficult.

(In reply to Nick Alexander :nalexander [he/him] from comment #22)

Aryx, did you enable the define in that try build? Also, can these tests be run against the debug (not opt) build?

Yes: https://hg.mozilla.org/try/rev/fd153dba45aded1a9594da70dc0f7937f5e0fccb

It has been requested for debug (where it doesn't fail): https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=270899120&revision=73ad9e1418386368ebdfeb8e269892806b172bcc&searchStr=gtest

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #24)

(In reply to Nick Alexander :nalexander [he/him] from comment #22)

Aryx, did you enable the define in that try build? Also, can these tests be run against the debug (not opt) build?

Yes: https://hg.mozilla.org/try/rev/fd153dba45aded1a9594da70dc0f7937f5e0fccb

It has been requested for debug (where it doesn't fail): https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=270899120&revision=73ad9e1418386368ebdfeb8e269892806b172bcc&searchStr=gtest

The plot thickens! This points ever more strongly at a compiler issue.

Sebastian, is it possible to get that logging output for an opt build instead?

It's the log linked by Nick:

Looking at the emulator adb logcat

I've now built natively on x86_64-linux with clang9 at -Og and -O2, and the test won't fail. This inclines me away from thinking this is a compiler problem. Whereas ..

It's the log linked by Nick:

Looking at the emulator adb logcat

.. this makes me think it is a runtime linker problem. Specifically, when LUL starts up, it tries to read CFI from all shared objects in the process. Typically there are around 100 of them. And that log shows what looks like around 100 failures. Also, it shows no evidence that LUL managed to do any unwinding, which is consistent with it failing to load CFI for all objects.
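As a rough way to check the "around 100 failures" impression, the logcat can be tallied per shared object. This is a minimal sketch, not part of LUL or of any Gecko tooling: the function name tally_linker_errors and the regular expression are illustrative, and the sample lines are copied from the logcat excerpts earlier in this bug.

```python
import re
from collections import Counter

# Matches error-level GeckoLinker lines such as:
#   10-11 17:08:05.969  3994  4009 E GeckoLinker: /system/lib64/libcutils.so: Missing or broken DT_HASH
# Warning-level ("W GeckoLinker") lines are deliberately ignored.
LINKER_ERROR = re.compile(r"\sE GeckoLinker: (?P<path>\S+): (?P<reason>.+)$")


def tally_linker_errors(lines):
    """Return (set of objects with at least one error, Counter of error reasons)."""
    objects = set()
    reasons = Counter()
    for line in lines:
        match = LINKER_ERROR.search(line.rstrip("\n"))
        if match:
            objects.add(match.group("path"))
            reasons[match.group("reason")] += 1
    return objects, reasons


sample = [
    "10-11 17:08:05.969  3994  4009 W GeckoLinker: /system/bin/app_process64: unhandled flags #8 not handled",
    "10-11 17:08:05.969  3994  4009 E GeckoLinker: /system/bin/app_process64: Missing or broken DT_HASH",
    "10-11 17:08:05.969  3994  4009 E GeckoLinker: /system/lib64/libcutils.so: Missing or broken DT_HASH",
    "10-11 17:08:06.010  3994  4009 E GeckoLinker: /system/framework/oat/x86_64/android.test.runner.odex: Failed to mmap",
]

objects, reasons = tally_linker_errors(sample)
print(len(objects), "objects with GeckoLinker errors")
for reason, count in reasons.most_common():
    print(count, reason)
```

To run it on a real log, pass the open file instead of the sample list, for example tally_linker_errors(open("logcat-emulator-5554.log")).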

(In reply to Julian Seward [:jseward] from comment #18)

I am generally confused by the appearance of two copies of Lul in the tree, at tools/profiler/lul and mozglue/baseprofiler/lul. How are they related? Do we have to bug-fix both of them?

This is a temporary situation: Base Profiler is new-ish, a subset of the good old Gecko Profiler (tools/profiler) that can work without xpcom; I wanted to implement Base Profiler first, to make sure it was viable before trying to merge both. I plan to start deduplicating this quarter, see meta bug 1557566.

In the meantime, please concentrate on the one in tools/profiler/lul. Don't worry about fixing the one in mozglue, unless there are some build issues of course, or it's easy enough; I'll deal with changes when I start deduplicating...
Thank you all!

Using the mozconfig in comment 23 and mach bootstrap to install the relevant tools, I managed to build it. mach gtest LulIntegration.unwind_consistency did indeed start an emulator, and produced a window and button bar, but that seemed to hang after a while. I control-C'd it and reran; now I don't even get a window, but the run fails after a few seconds. In ~/.mozbuild/android-device/emulator.log, I see:

emulator: ERROR: Not enough disk space to run AVD 'mozemulator-x86-7.0'. Exiting...

How do I move past this?

I think that means you need more disk space on the host (your desktop). You could try some disk cleanup and try again.

But if it worked once, maybe the avd is taking up too much space itself? You can reset the avd by running 'mach android-emulator --force-update'; if that works (the emulator starts), run gtest -- it should automatically use the existing emulator session.

Ah, you're right about the disk space. Fixing that, all 102 Lul tests run successfully, including LulIntegration.unwind_consistency.

I've now tried the Android build with four different mozconfigs, including --enable-optimize, --enable-release, and enablement of LTO. All those builds succeeded, but none of them show the unwind failure. I guess the next thing to try is with the "central as beta" config, so I tried to guess what the mozconfig was based on the "Android 5.0 x86-64 opt" build log from the "Central as beta" link in comment 0. But that build fails to link.

So I'm out of ideas. Can someone point me at the definitive mozconfig as used by the failing test here?

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #34)

build + mozconfig at https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=270996235&repo=try&lineNumber=628

Working from that gives the mozconfig below. But then running mach build produces

ERROR: /home/sewardj/MOZ/BUG-1584976/mozconfig_tmp directly or indirectly includes an in-tree mozconfig.
ERROR: In-tree mozconfigs make strong assumptions about and are only meant to be used by Mozilla automation.
ERROR: Please don't use them.

By repeatedly manually inlining all the . "$topsrcdir/build/blah" statements and then unsetting MOZ_AUTOMATION_MOZCONFIG, I can get past that. But then I'm in a maze of configuration failures.

The initial mozconfig is:

. "$topsrcdir/mobile/android/config/mozconfigs/common"
ac_add_options --with-android-min-sdk=21
ac_add_options --target=x86_64-linux-android
ac_add_options --with-branding=mobile/android/branding/nightly
export FENNEC_NIGHTLY=1
export MOZILLA_OFFICIAL=1

export AR="/home/sewardj/Tools/InstClang900/bin/llvm-ar"
export NM="/home/sewardj/Tools/InstClang900/bin/llvm-nm"
export RANLIB="/home/sewardj/Tools/InstClang900/bin/llvm-ranlib"

export MOZ_LTO=cross
Flags: needinfo?(snorp) → needinfo?(jseward)
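The manual inlining of . "$topsrcdir/..." lines described in the comment above could be automated roughly as follows. This is only a sketch under assumptions: inline_includes and the read_file callback are hypothetical names, the pattern only recognises includes written exactly in the form shown in this bug, and, as the error message says, in-tree mozconfigs are not meant to be used outside automation, so the result may still fail to configure.

```python
import re

# Recognises shell "source" lines of the form:  . "$topsrcdir/some/path"
INCLUDE = re.compile(r'^\s*\.\s+"?\$topsrcdir/([^"\s]+)"?\s*$')


def inline_includes(text, read_file, depth=0):
    """Replace each include line with the (recursively inlined) file contents.

    read_file takes a path relative to the source tree and returns its text.
    """
    if depth > 20:
        raise RecursionError("mozconfig includes nested too deeply (cycle?)")
    out = []
    for line in text.splitlines():
        match = INCLUDE.match(line)
        if match:
            included = read_file(match.group(1))
            out.extend(inline_includes(included, read_file, depth + 1).splitlines())
        else:
            out.append(line)
    return "\n".join(out)


# Tiny in-memory stand-in for the source tree; the file contents are made up.
fake_tree = {
    "mobile/android/config/mozconfigs/common": '. "$topsrcdir/build/blah"\nexport X=1',
    "build/blah": "ac_add_options --example-flag",
}
top = '. "$topsrcdir/mobile/android/config/mozconfigs/common"\nexport MOZ_LTO=cross'
flattened = inline_includes(top, fake_tree.__getitem__)
print(flattened)
```

With a real checkout, read_file could be something like lambda rel: open(os.path.join(topsrcdir, rel)).read().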

RyanVM, do you know how this central-as-beta config works? Or do you know who does?

Flags: needinfo?(ryanvm)

Maybe Tom can help.

Flags: needinfo?(ryanvm) → needinfo?(mozilla)

I'm not very familiar with the android mozconfigs. It looks like we use the nightly mozconfig on all branches for release and opt builds, there does not appear to be any change between branches in the mozconfig. $MOZ_UPDATE_CHANNEL passed in will vary, and beta builds and central-as-beta will have that set to beta.

Note that this failure appears to be on beta as well now: https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=5d743fd9d6bd149589ae42e2de7f08e569b6ed9d&selectedJob=271526433

Flags: needinfo?(mozilla)

Finally, I can reproduce with the attached mozconfig. The host clang version is clang 9. Investigating.

Flags: needinfo?(jseward)

I am seeing the same linker errors as mentioned in comment 12, and it seems
like a fair bet that those are the cause of the problem, as described in
comment 28. In particular, eg

10-21 16:18:19.535 9321 9336 E GeckoLinker: /system/lib64/libutils.so: Missing or broken DT_HASH

10-21 16:18:19.535 9321 9336 E linker : library "/system/lib64/libutils.so" ("/system/lib64/libutils.so") needed or dlopened by "/data/app/org.mozilla.geckoview.test-1/lib/x86_64/libmozglue.so" is not accessible for the namespace: [name="classloader-namespace", ld_library_paths="", default_library_paths="/data/app/org.mozilla.geckoview.test-1/lib/x86_64:/data/app/org.mozilla.geckoview.test-1/base.apk!/lib/x86_64", permitted_paths="/data:/mnt/expand:/data/data/org.mozilla.geckoview.test"]

and specifically the bit library "/system/lib64/libutils.so" .. is not accessible for the namespace: [name="classloader-namespace", ..]

That sounds to me like some kind of permissions problem ("you may only read the following set of files", etc). Who here is the keeper of, or knows of, this mechanism?
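To make the "you may only read the following set of files" reading concrete, here is a small sketch that pulls the path lists out of the linker message and checks a library path against them. It is a deliberately simplified model: the real Android linker namespace rules are more involved, and parse_namespace_error and is_accessible are illustrative names, not Android or Gecko APIs.

```python
import re

message = (
    'library "/system/lib64/libutils.so" ("/system/lib64/libutils.so") needed or dlopened by '
    '"/data/app/org.mozilla.geckoview.test-1/lib/x86_64/libmozglue.so" is not accessible for the '
    'namespace: [name="classloader-namespace", ld_library_paths="", '
    'default_library_paths="/data/app/org.mozilla.geckoview.test-1/lib/x86_64:'
    '/data/app/org.mozilla.geckoview.test-1/base.apk!/lib/x86_64", '
    'permitted_paths="/data:/mnt/expand:/data/data/org.mozilla.geckoview.test"]'
)


def parse_namespace_error(text):
    """Return (library path, list of directory roots named in the message)."""
    library = re.search(r'library "([^"]+)"', text).group(1)
    fields = dict(re.findall(r'(\w+)="([^"]*)"', text))
    roots = []
    for key in ("ld_library_paths", "default_library_paths", "permitted_paths"):
        roots.extend(p for p in fields.get(key, "").split(":") if p)
    return library, roots


def is_accessible(path, roots):
    """Simplified assumption: a path is allowed if it lies under any listed root."""
    return any(path.startswith(root.rstrip("/") + "/") for root in roots)


library, roots = parse_namespace_error(message)
print(library, "accessible:", is_accessible(library, roots))
print("libxul.so accessible:", is_accessible("/data/local/gtest/libxul.so", roots))
```

Under this model everything in /system/lib64 falls outside the listed roots, while the gtest libxul.so under /data does not, which matches the pattern of errors in the log.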

I should add .. I am happy to help fix this, but I am only here this week;
am on PTO the weeks commencing 28 Oct and 4 Nov.

Mike, do you know anything about the linker errors shown in comment 41?

Flags: needinfo?(mh+mozilla)

So. There are two things going on here. Both relate to Lul -- the unwinder
library in use here -- failing to read unwind info.

(1) Per comment 41 and comment 12, the Android linker denies Lul read access
to any of the Android system libraries, so it fails to read unwind info
for them. That limitation doesn't apply to libxul.so. But for libxul.so,
Lul fails to read unwind info for a second reason:

(2) At least for this build configuration, libxul.so's unwind information is
in an ELF section called .eh_frame. Lul expects this section to have type
SHT_PROGBITS, but this appears to have changed to SHT_X86_64_UNWIND, which
means Lul ignores the section. I suspect this change is related to which
linker created libxul.so. I also checked some x86_64-linux builds I have,
and I noticed that some -- perhaps done by clang 9 -- also have this
change. So it's not Android specific.

(2) is the immediate cause for LulIntegration.unwind_consistency failing.
Fixing it is trivial; I'll attach a patch shortly.

However, until such time as (1) is fixed, unwinding (for the Gecko profiler)
is likely to be still largely broken on x86_64-android, since many stack
traces involve system libraries at some point.
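For anyone wanting to check which type their own libxul.so uses for .eh_frame, the section header table can be read directly. The sketch below is an assumption-laden illustration rather than LUL code: it handles only little-endian ELF64 images, ignores the extended section-numbering case, and section_types is a made-up helper name. The two constants are the standard values from the ELF and x86-64 ABI specifications.

```python
import struct

SHT_PROGBITS = 1
SHT_X86_64_UNWIND = 0x70000001

# Layouts of the ELF64 file header and section header (little-endian).
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
SHDR = struct.Struct("<IIQQQQIIQQ")


def section_types(data):
    """Map section name -> sh_type for a little-endian ELF64 image held in bytes."""
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("not a little-endian ELF64 image")
    fields = EHDR.unpack_from(data, 0)
    e_shoff, e_shentsize, e_shnum, e_shstrndx = fields[6], fields[11], fields[12], fields[13]
    headers = [SHDR.unpack_from(data, e_shoff + i * e_shentsize) for i in range(e_shnum)]
    # The section-name string table is itself a section (index e_shstrndx).
    str_off, str_size = headers[e_shstrndx][4], headers[e_shstrndx][5]
    strtab = data[str_off:str_off + str_size]
    result = {}
    for header in headers:
        name_off, sh_type = header[0], header[1]
        end = strtab.index(b"\0", name_off)
        result[strtab[name_off:end].decode("ascii")] = sh_type
    return result
```

Usage would be along the lines of section_types(open("libxul.so", "rb").read())[".eh_frame"], then comparing the result against SHT_PROGBITS and SHT_X86_64_UNWIND.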

Summary: Perma TEST-UNEXPECTED-FAIL | LulIntegration.unwind_consistency | Value of: nTestsPassed == nTests when Gecko 71 merges to Beta on 2019-10-14 → LUL on x86_64-{android, linux}: accept .eh_frame with type as either SHT_PROGBITS or SHT_X86_64_UNWIND
Attachment #9103463 - Flags: review+

The LUL unwinder on x86_64-android and x86_64-linux tries to read Dwarf unwind
information from the ELF section named ".eh_frame", and expects the section to
have type SHT_PROGBITS. It appears that that section might instead have the
type SHT_X86_64_UNWIND, possibly as a result of recent toolchain (linker?)
changes. This patch tracks that change.
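The idea of the change can be summarised in a few lines. This is not the patch itself (the real change is in LUL's C++ ELF-reading code); it is a sketch of the acceptance rule only, and is_usable_eh_frame is a hypothetical name.

```python
SHT_PROGBITS = 0x1
SHT_X86_64_UNWIND = 0x70000001


def is_usable_eh_frame(name, sh_type):
    """Accept .eh_frame whether it is typed SHT_PROGBITS or SHT_X86_64_UNWIND."""
    return name == ".eh_frame" and sh_type in (SHT_PROGBITS, SHT_X86_64_UNWIND)


# Before the fix, only the first of these two cases would have been accepted.
print(is_usable_eh_frame(".eh_frame", SHT_PROGBITS))
print(is_usable_eh_frame(".eh_frame", SHT_X86_64_UNWIND))
```

Keeping the name check alongside the type check means other sections that happen to share one of these types are still ignored.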

Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3cc5605eb738
LUL on x86_64-{android, linux}: accept .eh_frame with type as either SHT_PROGBITS or SHT_X86_64_UNWIND.  r=mstange.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72
Assignee: nobody → jseward

Comment on attachment 9103947 [details]
Bug 1584976 - LUL on x86_64-{android, linux}: accept .eh_frame with type as either SHT_PROGBITS or SHT_X86_64_UNWIND. r=mstange.

Beta/Release Uplift Approval Request

  • User impact if declined: Expected to be minimal, as it affects mostly-developer-only functionality on three non-mainstream targets.

This concerns the Gecko profiler on x86_64-android (presumably a very rare configuration) and x86_64-linux or i686-linux (more common). In some build configurations, depending I think on which linker linked libxul.so, backtrace recovery during profiling will be mostly broken, and so usage of the Gecko profiler requiring native backtraces on those targets will also be impossible.

Patch has been on nightly now for circa 12 hours and hasn't bounced. I'm not sure what "verified in Nightly" means. It is covered by the gtest "LulIntegration.unwind_consistency".

  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce: Build for android-x86_64. Do this using the mozconfig shown at https://bugzilla.mozilla.org/show_bug.cgi?id=1584976#c40. The exact mozconfig is important. I tried many mozconfigs before finding one that showed the failure.

Then run ./mach gtest LulIntegration.unwind_consistency. This should start an android simulator. The test will fail.

  • List of other uplifts needed: none
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It only affects x86_64-android, x86_64-linux and i686-linux, and then only for non-user-facing functionality. Patch is very small.
  • String changes made/needed: none
Attachment #9103947 - Flags: approval-mozilla-beta?

Small note: I am on vacation as of end of today. Markus Stange [:mstange] has
kindly agreed to shepherd this through the rest of the beta uplift process in
my absence.

Flags: needinfo?(mh+mozilla)

Comment on attachment 9103947 [details]
Bug 1584976 - LUL on x86_64-{android, linux}: accept .eh_frame with type as either SHT_PROGBITS or SHT_X86_64_UNWIND. r=mstange.

Fix for broken behaviour in the Gecko profiler. No end-user impact, but we are early in the beta cycle and the patch is minimal and has tests. Uplift approved for 71 beta 5, thanks.

Attachment #9103947 - Flags: approval-mozilla-beta? → approval-mozilla-beta+