Closed Bug 1580234 Opened 1 year ago Closed 10 months ago

Can't load libclang_rt.asan-android.so from APK.

Categories

(Core :: mozglue, defect)

Unspecified
Android
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla72
Tracking Status
firefox71 --- wontfix
firefox72 --- fixed

People

(Reporter: truber, Assigned: glandium)

Attachments

(5 files)

I've been trying to get wrap.sh working to preload libclang_rt.asan-x86_64-android.so as recommended in the NDK. Using the old method (modify /system/bin/app_process64 to always preload ASan) works, but using the same script as wrap.sh doesn't work.

I've attached logs from both with MOZ_DEBUG_LINKER=1. It looks like when ASan is in the JNI lib folder, it uses CustomElf::Load which fails, but when it is in /system/lib64, it uses dlopen which succeeds.

Flags: needinfo?(mh+mozilla)

With this workaround, the wrap.sh version works and loads libclang_rt.asan-x86_64-android.so from the APK.

--- a/mozglue/linker/ElfLoader.cpp
+++ b/mozglue/linker/ElfLoader.cpp
@@ -480,8 +480,11 @@ already_AddRefed<LibHandle> ElfLoader::Load(const char* path, int flags,

   Mappable* mappable = GetMappableFromPath(path);

-  /* Try loading with the custom linker if we have a Mappable */
-  if (mappable) handle = CustomElf::Load(mappable, path, flags);
+  /* Don't use CustomElf for libasan, it won't work */
+  if (strncmp("libclang_rt.asan-", name, 17) || strcmp("-android.so", name + strlen(name) - 11)) {
+    /* Try loading with the custom linker if we have a Mappable */
+    if (mappable) handle = CustomElf::Load(mappable, path, flags);
+  }

   /* Try loading with the system linker if everything above failed */
   if (!handle) handle = SystemElf::Load(path, flags);
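
For readers following the workaround, here is a minimal standalone sketch of the same name test and the resulting loader choice. It is not the mozglue code: `IsAsanRuntime`, `FirstLoaderFor` and the `Loader` enum are hypothetical names for illustration, and the sketch assumes `name` is the leaf name of the library path, as in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical model of which loader is tried first.
enum class Loader { Custom, System };

// True when `name` starts with "libclang_rt.asan-" and ends with
// "-android.so", the same prefix/suffix test as the workaround patch,
// plus an explicit length guard before the suffix comparison.
bool IsAsanRuntime(const char* name) {
  assert(name != nullptr);
  static const char kPrefix[] = "libclang_rt.asan-";
  static const char kSuffix[] = "-android.so";
  const std::size_t prefixLen = sizeof(kPrefix) - 1;  // 17, as in the patch
  const std::size_t suffixLen = sizeof(kSuffix) - 1;  // 11, as in the patch
  const std::size_t len = std::strlen(name);
  if (len < suffixLen) return false;
  return std::strncmp(name, kPrefix, prefixLen) == 0 &&
         std::strcmp(name + len - suffixLen, kSuffix) == 0;
}

// The custom linker is only tried when a Mappable exists and the library
// is not the ASan runtime; everything else goes to the system linker.
Loader FirstLoaderFor(const char* name, bool hasMappable) {
  if (hasMappable && !IsAsanRuntime(name)) return Loader::Custom;
  return Loader::System;
}
```

In the real patch the system linker is also the fallback when the custom linker fails; the sketch only models which loader is attempted first.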

Can you attach the corresponding APK?

Flags: needinfo?(mh+mozilla) → needinfo?(jschwartzentruber)
Flags: needinfo?(jschwartzentruber) → needinfo?(mh+mozilla)

Can you get another failure log with this patch applied:

--- a/mozglue/linker/CustomElf.cpp
+++ b/mozglue/linker/CustomElf.cpp
@@ -594,16 +594,18 @@ bool CustomElf::Relocate() {
     }
     /* Other relocation types need a symbol resolution */
     /* Avoid symbol resolution when it's the same symbol as last iteration */
     if (symtab_index != ELF_R_SYM(rel->r_info)) {
       symtab_index = ELF_R_SYM(rel->r_info);
       const Sym sym = symtab[symtab_index];
       if (sym.st_shndx != SHN_UNDEF) {
         symptr = GetPtr(sym.st_value);
+        DEBUG_LOG("CustomElf::Relocate(%p [\"%s\"], \"%s\") = %p",
+            reinterpret_cast<const void*>(this), GetPath(), strtab.GetStringAt(sym.st_name), symptr);
       } else {
         /* TODO: handle symbol resolving to nullptr vs. being undefined. */
         symptr = GetSymbolPtrInDeps(strtab.GetStringAt(sym.st_name));
       }
     }
 
     if (symptr == nullptr)
       WARN("%s: Relocation to NULL @0x%08" PRIxPTR, GetPath(),
Flags: needinfo?(mh+mozilla) → needinfo?(jschwartzentruber)

Failure with additional logging.

Flags: needinfo?(jschwartzentruber) → needinfo?(mh+mozilla)

This log makes no sense. As in, according to the logs, the linker skips 311 relocations... and I don't see why that would happen from the contents of libclang_rt.asan-x86_64-android.so and the code.

Specifically, following the list of relocations from the output of readelf -rW libclang_rt.asan-x86_64-android.so, and reading the log, one can follow that the linker is doing all the relocations one by one (but skipping symbol resolution for same-symbol) up to __interceptor_strtol, but then skips everything until sem_init, after which it goes on normally.
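
To make the "skipping symbol resolution for same-symbol" behaviour concrete, the following is a simplified sketch of that caching pattern, not the actual CustomElf::Relocate code. `RelSym`, `RelocStats` and `CountResolutions` are hypothetical names, and the sketch assumes 64-bit ELF, where the symbol index is the upper 32 bits of `r_info`. It shows why the number of resolution log lines can legitimately be smaller than the number of relocations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ELF64_R_SYM: the symbol table index is the high 32 bits of r_info.
inline std::uint32_t RelSym(std::uint64_t r_info) {
  return static_cast<std::uint32_t>(r_info >> 32);
}

struct RelocStats {
  std::size_t relocations = 0;  // entries walked
  std::size_t resolutions = 0;  // entries that triggered a symbol lookup
};

// Walks relocation entries in order and resolves a symbol only when its
// index differs from the previous entry's index.
RelocStats CountResolutions(const std::vector<std::uint64_t>& infos) {
  RelocStats stats;
  bool haveLast = false;
  std::uint32_t last = 0;
  for (std::uint64_t info : infos) {
    ++stats.relocations;
    const std::uint32_t sym = RelSym(info);
    if (!haveLast || sym != last) {
      last = sym;
      haveLast = true;
      ++stats.resolutions;  // a DEBUG_LOG here would fire once per run of equal symbols
    }
  }
  assert(stats.resolutions <= stats.relocations);
  return stats;
}
```

With consecutive entries for the same symbol, only the first produces a resolution, so a gap in the log is only suspicious when the skipped entries refer to different symbols, as they do here.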

Would you be able to figure out what's going on with CustomElf::Relocate?

One thing that I spotted, though, is that CustomElf::GetSymbolPtrInDeps will likely need a special case for __libc_stack_end, but that would need a copy of the android device libc.so to be sure.

Flags: needinfo?(mh+mozilla) → needinfo?(jschwartzentruber)

I don't know what is going on yet, but the range(s) of skipped relocations is not constant. I can load the APK and it will skip a range of relocations, then in a fresh emulator load the same APK and it will skip two different ranges of relocations.

I think logcat is dropping messages. When I log only the number of times through the loop in Relocate() I get 525, which is right, but I never see that many log messages.

Is there a reason why forcing the system loader per comment 2 isn't workable?

Flags: needinfo?(jschwartzentruber) → needinfo?(mh+mozilla)

Can you attach the APK to this bug? Or at least put it someplace more permanent than a try link?

Flags: needinfo?(jschwartzentruber)

The logging thing is kind of concerning, but there's also:

09-19 13:29:20.115  5022  5045 W GeckoLinker: /data/app/org.mozilla.fennec_aurora-hiIK6eb9NICk0HmfS8_5Dg==/lib/x86_64/libclang_rt.asan-x86_64-android.so: unhandled flags #8 not handled

8 is DF_BIND_NOW, and it seems plausible that using lazy binding--instead of the requested eager binding--with some of the asan interceptor functions and the like could result in problems? Though our copy of NSS also has DF_BIND_NOW, and that hasn't seemed to cause problems...but asan is a little bit more fragile than NSS code, I suppose.

Actually, that's probably wrong: the documentation for DF_BIND_NOW says that all relocations are processed before loading, which we already do -- specifically, jump slot relocations (i.e. entries in the PLT). So ignoring that flag is fine. Unless we're getting it wrong, somehow?
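
As a reading aid for the "unhandled flags #8" warning, here is a small sketch that decodes a DT_FLAGS value into the flag names defined by the ELF gABI. The constants are defined locally rather than taken from a platform <elf.h>, and `DescribeDtFlags` is a hypothetical helper, not something in mozglue.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// DT_FLAGS bit values from the ELF gABI.
constexpr std::uint64_t kDfOrigin = 0x1;
constexpr std::uint64_t kDfSymbolic = 0x2;
constexpr std::uint64_t kDfTextrel = 0x4;
constexpr std::uint64_t kDfBindNow = 0x8;
constexpr std::uint64_t kDfStaticTls = 0x10;

// Returns a '|'-separated list of known flag names, "unknown" for any
// leftover bits, or "none" when no bits are set.
std::string DescribeDtFlags(std::uint64_t flags) {
  struct Bit {
    std::uint64_t mask;
    const char* name;
  };
  static const Bit kBits[] = {
      {kDfOrigin, "DF_ORIGIN"},   {kDfSymbolic, "DF_SYMBOLIC"},
      {kDfTextrel, "DF_TEXTREL"}, {kDfBindNow, "DF_BIND_NOW"},
      {kDfStaticTls, "DF_STATIC_TLS"},
  };
  std::string out;
  for (const Bit& bit : kBits) {
    if (flags & bit.mask) {
      if (!out.empty()) out += '|';
      out += bit.name;
      flags &= ~bit.mask;
    }
  }
  if (flags != 0) {
    if (!out.empty()) out += '|';
    out += "unknown";
  }
  return out.empty() ? "none" : out;
}
```

Calling `DescribeDtFlags(8)` yields `DF_BIND_NOW`, matching the value in the warning above.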

I'm assuming the APK is too big for bugzilla? I attached it here: https://github.com/jschwartzentruber/gecko-dev/releases/tag/2019.10.30-1, which should be permanent.

Flags: needinfo?(jschwartzentruber)
Attached file asan-2019.10.30-1.txt

Log corresponding to 2019.10.30 APK.

Jesse pointed me to up-to-date patches, and I'm able to build an APK with what I think are all the right options. I had to modify ~/.mozbuild/android-device/avd/mozemulator-avd-x86-7.0/config.ini to set disk.dataPartition.size = 7000M to have enough space to install the APK. I see crashes when starting the APK, but I do not see any of the expected logging from MOZ_DEBUG_LINKER, and I cannot get symbolized stacks from ASan, nor are the stacks nearly as complete as Jesse's.

I'm not quite sure what I'm doing differently. :(

Investigating the crashing addresses says that we're crashing in the initialization of ElfLoader itself (o.O).

OK, so by commenting out the code that we have for detecting broken SIGSEGV handlers:

https://searchfox.org/mozilla-central/source/mozglue/linker/ElfLoader.cpp#1151-1173

I can get everything to launch, but I don't see any ASan stuff happening. I don't think wrap.sh is working, despite following:

https://developer.android.com/ndk/guides/wrap-script#debugging_when_using_wrapsh
https://developer.android.com/ndk/guides/asan

and modifying the application manifest (?). It's not clear to me how Jesse was getting things to work, but maybe launching things through fuzzing was magically setting things up in a way I'm not aware of? I do see that libxul is linked to the ASan runtime library, and so is mozglue, so I'm a little unclear on why wrap.sh is needed in the first place...

Flags: needinfo?(jschwartzentruber)

The problem is the emulator in the tree is 7.0, and wrap.sh was added in 8.1. I understood wrap.sh is needed to load the ASan runtime before libc so interceptors are used, but I could be wrong.

So loading the build in 7.0 is working for you, but crashes aren't handled by ASan?

Flags: needinfo?(jschwartzentruber)

(In reply to Jesse Schwartzentruber (:truber) from comment #18)

> The problem is the emulator in the tree is 7.0, and wrap.sh was added in 8.1. I understood wrap.sh is needed to load the ASan runtime before libc so interceptors are used, but I could be wrong.

Makes sense. Is it possible that you could provide whatever 8.1 system image you're using, or bulletproof instructions on how to build one?

> So loading the build in 7.0 is working for you, but crashes aren't handled by ASan?

They don't seem to be. But I didn't try very hard to trigger crashes.

Flags: needinfo?(jschwartzentruber)

(In reply to Nathan Froyd [:froydnj] from comment #19)

> Is it possible that you could provide whatever 8.1 system image you're using, or bulletproof instructions on how to build one?

The standalone script we use to install everything in docker is here: emulator.py

setup: python3 -m pip install requests six xvfbwrapper; python3 emulator.py install avd
run: python3 emulator.py run

> They don't seem to be. But I didn't try very hard to trigger crashes.

I use about:crashparent and about:crashcontent

Flags: needinfo?(jschwartzentruber)

I can't seem to mach run my android builds (with the patches for this bug) on an emulator with recent-ish trunk. mach run fails with:

:machBuildFaster> 15 actionable tasks: 15 up-to-date
:machBuildFaster> ../../../modules/libpref/Unified_cpp_modules_libpref0.o: In function `pushLabelFrame':
:machBuildFaster> /opt/build/froydnj/build-android-x86/dist/include/js/ProfilingStack.h:396: undefined reference to `ProfilingStack::ensureCapacitySlow()'
:machBuildFaster> ../../../netwerk/protocol/http/Unified_cpp_protocol_http1.o: In function `pushLabelFrame':
:machBuildFaster> /opt/build/froydnj/build-android-x86/dist/include/js/ProfilingStack.h:396: undefined reference to `ProfilingStack::ensureCapacitySlow()'
:machBuildFaster> /opt/build/froydnj/build-android-x86/dist/include/js/ProfilingStack.h:396: undefined reference to `ProfilingStack::ensureCapacitySlow()'
:machBuildFaster> /opt/build/froydnj/build-android-x86/dist/include/js/ProfilingStack.h:396: undefined reference to `ProfilingStack::ensureCapacitySlow()'
:machBuildFaster> /opt/build/froydnj/build-android-x86/dist/include/js/ProfilingStack.h:396: undefined reference to `ProfilingStack::ensureCapacitySlow()'
:machBuildFaster> ../../../netwerk/protocol/http/Unified_cpp_protocol_http1.o:/opt/build/froydnj/build-android-x86/dist/include/js/ProfilingStack.h:396: more undefined references to `ProfilingStack::ensureCapacitySlow()' follow
:machBuildFaster> ../../../netwerk/protocol/http/Unified_cpp_protocol_http2.o: In function `ToUint64':
:machBuildFaster> /opt/build/froydnj/build-android-x86/dist/include/js/Conversions.h:249: undefined reference to `js::ToUint64Slow(JSContext*, JS::Handle<JS::Value>, unsigned long*)'
:machBuildFaster> ../../../media/webrtc/signaling/src/peerconnection/Unified_cpp_src_peerconnection0.o: In function `mozilla::PeerConnectionImpl::DumpPacket_m(unsigned long, mozilla::dom::mozPacketDumpType, bool, mozilla::UniquePtr<unsigned char [], mozilla::DefaultDelete<unsigned char []> >&, unsigned long)':
:machBuildFaster> /home/froydnj/src/gecko/media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp:1717: undefined reference to `JS::NewArrayBufferWithContents(JSContext*, unsigned long, void*)'
:machBuildFaster> ../../../media/webrtc/signaling/src/peerconnection/Unified_cpp_src_peerconnection0.o: In function `Init':
:machBuildFaster> /opt/build/froydnj/build-android-x86/dist/include/mozilla/dom/TypedArray.h:60: undefined reference to `JS::UnwrapArrayBuffer(JSObject*)'
:machBuildFaster> ../../../media/webrtc/signaling/src/peerconnection/Unified_cpp_src_peerconnection0.o: In function `Call':
:machBuildFaster> /opt/build/froydnj/build-android-x86/dist/include/mozilla/dom/WebrtcGlobalInformationBinding.h:140: undefined reference to `JS::UndefinedHandleValue'
:machBuildFaster> /home/froydnj/.mozbuild/android-ndk-r20/toolchains/x86_64-4.9/prebuilt/linux-x86_64/lib/gcc/x86_64-linux-android/4.9.x/../../../../x86_64-linux-android/bin/ld.bfd: ../../../media/webrtc/signaling/src/peerconnection/Unified_cpp_src_peerconnection0.o: relocation R_X86_64_PC32 against undefined hidden symbol `_ZN2JS20UndefinedHandleValueE' can not be used when making a shared object
:machBuildFaster> /home/froydnj/.mozbuild/android-ndk-r20/toolchains/x86_64-4.9/prebuilt/linux-x86_64/lib/gcc/x86_64-linux-android/4.9.x/../../../../x86_64-linux-android/bin/ld.bfd: final link failed: Bad value
:machBuildFaster> clang-9: error: linker command failed with exit code 1 (use -v to see invocation)
:machBuildFaster> /home/froydnj/src/gecko/config/rules.mk:659: recipe for target 'libxul.so' failed
:machBuildFaster> make[1]: *** [libxul.so] Error 1
:machBuildFaster> /home/froydnj/src/gecko/config/faster/rules.mk:77: recipe for target '/opt/build/froydnj/build-android-x86/toolkit/library/build/libxul.so' failed
:machBuildFaster> make: *** [/opt/build/froydnj/build-android-x86/toolkit/library/build/libxul.so] Error 2
:machBuildFaster> 317 compiler warnings present.

> Task :machBuildFaster FAILED

which is just bizarre, because building libxul with the normal recursive make backend succeeds. I suspect that this was regressed by bug 1573560, as I think that's the only thing that's modified the FasterMake backend recently, but I'm kind of at a loss to figure out exactly how that happened.

I think the logging issue of skipping differing amounts of relocations is something to do with logcat's internal buffer (which the emulator sets to 2MB by default); using GetName() instead of GetPath() in the logging messages doesn't skip any messages for me, and I can confirm that all the symbols are being processed by the custom linker.
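
A rough sketch of why the shorter name helps: each message carries the library path, so logging only the final path component shrinks every line. This assumes GetName() returns the leaf name of the path; `LeafNameOf` and `BytesSaved` are hypothetical helpers written for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Final path component, used here as a stand-in for what GetName()
// is assumed to return.
const char* LeafNameOf(const char* path) {
  assert(path != nullptr);
  const char* slash = std::strrchr(path, '/');
  return slash ? slash + 1 : path;
}

// Bytes saved over `messages` log lines when each line prints the leaf
// name instead of the full path.
std::size_t BytesSaved(const char* path, std::size_t messages) {
  return (std::strlen(path) - std::strlen(LeafNameOf(path))) * messages;
}
```

For the install path seen in the earlier warning, the leaf name is libclang_rt.asan-x86_64-android.so, and the saving applies to each of the 525 trips through the relocation loop mentioned above.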

I can see the same crashes Jesse is seeing. Still no good ideas on what's causing them.

OK, so this is not much fun, not least because the fuzzing emulator kernel images (?) have a bug that makes ASan somewhat unreliable when setting up its shadow memory:

https://github.com/google/sanitizers/issues/856

The kernel versions don't seem to match up correctly, but the issue does seem to explain what happens sometimes. Other times it's just fine, and I can't figure out why.

OK, apparently writing that out has caused the kernel to decide to always put the application at an address that conflicts with ASan. :(

The only thing I can think of at the moment is ASan's extensive use of RTLD_NEXT in its library initialization routines to grab handles to all the functions that it's going to intercept. The custom loader just forwards those requests on to the system, but maybe something is going wrong because libasan.so is in the way, or the system is finding ASan's copy of memcpy et al? Which I think ASan will actually detect and error out on. (I haven't been able to fully debug this because of the aforementioned problem...)

Jesse's patch would let the system handle the entirety of loading libasan.so (which would also forward RTLD_NEXT requests to dlsym), and maybe that changes details of how the system is looking up symbols sufficiently that it's not a problem anymore?
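
For context on the RTLD_NEXT pattern discussed above, here is a minimal sketch of how code can ask the dynamic linker for the next definition of a function after the calling object. It illustrates the general technique only and is not ASan's interceptor code: `NextStrtol` and `CallNextStrtol` are hypothetical names, and the sketch assumes a dynamically linked program on a libc that provides RTLD_NEXT (older glibc versions may additionally need -ldl at link time).

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // RTLD_NEXT is an extension on some libcs
#endif
#include <cassert>
#include <dlfcn.h>

using StrtolFn = long (*)(const char*, char**, int);

// Asks the dynamic linker for the next definition of strtol in the
// search order after the object this code lives in.
StrtolFn NextStrtol() {
  void* sym = dlsym(RTLD_NEXT, "strtol");
  return reinterpret_cast<StrtolFn>(sym);
}

// Calls through the looked-up pointer, the way an interceptor forwards
// to the "real" function after doing its own checks.
long CallNextStrtol(const char* text) {
  StrtolFn real = NextStrtol();
  if (real == nullptr) return 0;  // lookup failed; nothing to forward to
  return real(text, nullptr, 10);
}
```

The result of such a lookup depends on where the caller sits in the search order, which is why it matters whether the custom linker or the system linker loaded the library making the request.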

Depends on: 1598196

Long term, we want to remove the custom linker (bug 1291377) but without
more effort than where we're at with bug 1598196, it would break using
mozjemalloc.

However, some builds using sanitizers don't use mozjemalloc already,
and in their case, we can already disable the custom linker.

The patch I attached here + the ones from bug 1598194 and bug 1598196 /should/ make it work.

Flags: needinfo?(mh+mozilla)

This appears to work! https://treeherder.mozilla.org/#/jobs?repo=try&revision=2a513e7faf719d186dff8b603e2ba2833cb21d3d

I'm able to reproduce a heap-buffer-overflow testcase in geckoview_example, with the ASan report in logcat.

Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/92923aba79d4
Disable the custom linker when mozjemalloc is disabled. r=froydnj
Status: NEW → RESOLVED
Closed: 10 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72
Assignee: nobody → mh+mozilla