Closed Bug 1573820 Opened 5 months ago Closed 5 months ago

elfhack failure with gcc7-built clang on x86-64 android

Categories

(Firefox Build System :: General, task)

x86_64
Android
task
Not set

Tracking

(firefox70 fixed)

RESOLVED FIXED
mozilla70
Tracking Status
firefox70 --- fixed

People

(Reporter: froydnj, Assigned: froydnj)

References

Details

Attachments

(1 file)

Attempt to build x86-64 android with clang built with gcc 7 gives:

INFO -  package> elfhack: /builds/worker/workspace/build/src/build/unix/elfhack/elf.cpp:272: Elf::Elf(std::ifstream &): Assertion `segment->getFileSize() == phdr.p_filesz - gold_adjustment' failed.

What in the world.

OK, so there's a couple things going wrong here. We have a dist/bin/libplugin-container.so. elfhack does not have any problems with that file. When you strip that file, however, you wind up with what the packaging step is trying to elfhack, dist/fennec/lib/x86_64/libplugin_container.so. elfhack does have problems with this file.

What does stripping the file do? Well, it removes a bunch of stuff and turns the segments from this:

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x0001f8 0x0001f8 R   0x8
  INTERP         0x000238 0x0000000000000238 0x0000000000000238 0x000015 0x000015 R   0x1
      [Requesting program interpreter: /system/bin/linker64]
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000880 0x000880 R E 0x1000
  LOAD           0x000d88 0x0000000000001d88 0x0000000000001d88 0x000278 0x000280 RW  0x1000
  DYNAMIC        0x000db8 0x0000000000001db8 0x0000000000001db8 0x000200 0x000200 RW  0x8
  NOTE           0x00024e 0x000000000000024e 0x000000000000024e 0x0000be 0x0000be R   0x4
  GNU_EH_FRAME   0x00084c 0x000000000000084c 0x000000000000084c 0x000034 0x000034 R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x000d88 0x0000000000001d88 0x0000000000001d88 0x000278 0x000278 RW  0x8

 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.android.ident .note.gnu.build-id .dynsym .dynstr .hash .gnu.version .gnu.version_r .rela.plt .plt .text .rodata .eh_frame .eh_frame_hdr 
   03     .fini_array .init_array .preinit_array .dynamic .got .got.plt .bss 
   04     .dynamic 
   05     .note.android.ident .note.gnu.build-id 
   06     .eh_frame_hdr 
   07     
   08     .fini_array .init_array .preinit_array .dynamic .got .got.plt 

into this:

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x0001f8 0x0001f8 R   0x8
  INTERP         0x000238 0x0000000000000238 0x0000000000000238 0x000015 0x000015 R   0x1
      [Requesting program interpreter: /system/bin/linker64]
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000880 0x000880 R E 0x1000
  LOAD           0x000d88 0x0000000000001d88 0x0000000000001d88 0x000278 0x000280 RW  0x1000
  DYNAMIC        0x000db8 0x0000000000001db8 0x0000000000001db8 0x000200 0x000200 RW  0x8
  NOTE           0x00024e 0x000000000000024e 0x000000000000024e 0x0000be 0x0000bc R   0x4
  GNU_EH_FRAME   0x00084c 0x000000000000084c 0x000000000000084c 0x000034 0x000034 R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x000d88 0x0000000000001d88 0x0000000000001d88 0x000278 0x000278 RW  0x8

 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.android.ident .note.gnu.build-id .dynsym .dynstr .hash .gnu.version .gnu.version_r .rela.plt .plt .text .rodata .eh_frame .eh_frame_hdr 
   03     .fini_array .init_array .preinit_array .dynamic .got .got.plt .bss 
   04     .dynamic 
   05     .note.android.ident 
   06     .eh_frame_hdr 
   07     
   08     .fini_array .init_array .preinit_array .dynamic .got .got.plt 

Note that we've removed .note.gnu.build-id from the NOTE program header, and that the NOTE program header is now:

  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  NOTE           0x00024e 0x000000000000024e 0x000000000000024e 0x0000be 0x0000bc R   0x4

which says that the in-memory size of the section (0xbc) is smaller than the on-disk size of the section (0xbe). You'll also note that, bizarrely, the alignment of the section is 0x4, which isn't going to happen if your address is 0x24e.

What elfhack is complaining about is the bogusness that strip has introduced into the file: the NOTE header says that it's 0xbc in size, but elfhack has noticed that this seems inconsistent: the sections that the NOTE header contains are only 0x98 in size (0x98 for the .note.android.ident section).

If you included 0x24 for the .note.gnu.build-id section, that would add up to 0xbc if strip hadn't botched things. I suppose it still wouldn't be right, since the on-disk size would still be 0xbe, and elfhack would still complain, but it'd be closer.

And, even weirder, I have a local ~/.mozbuild/clang/bin/clang (version 8.0) that was not compiled with GCC 7, and it exhibits the same problem too! So how are our automation builds even passing currently? They have in their packages a libplugin-container.so that looks like (post-elfhack):

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x0001f8 0x0001f8 R   0x8
  INTERP         0x000238 0x0000000000000238 0x0000000000000238 0x000015 0x000015 R   0x1
      [Requesting program interpreter: /system/bin/linker64]
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000880 0x000880 R E 0x1000
  LOAD           0x000d88 0x0000000000001d88 0x0000000000001d88 0x000278 0x000280 RW  0x1000
  DYNAMIC        0x000db8 0x0000000000001db8 0x0000000000001db8 0x000200 0x000200 RW  0x8
  NOTE           0x00024e 0x000000000000024e 0x000000000000024e 0x000098 0x000098 R   0x4
  GNU_EH_FRAME   0x00084c 0x000000000000084c 0x000000000000084c 0x000034 0x000034 R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x000d88 0x0000000000001d88 0x0000000000001d88 0x000278 0x000278 RW  0x8

 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.android.ident .note.gnu.build-id .dynsym .dynstr .hash .gnu.version .gnu.version_r .rela.plt .plt .text .rodata .eh_frame .eh_frame_hdr 
   03     .fini_array .init_array .preinit_array .dynamic .got .got.plt .bss 
   04     .dynamic 
   05     .note.android.ident 
   06     .eh_frame_hdr 
   07     
   08     .fini_array .init_array .preinit_array .dynamic .got .got.plt

So there's no .note.gnu.build-id in the NOTE program header, and it has the correct size for the contained sections. I wonder if it had a .note.gnu.build-id in it prior to stripping.

Again: what in the world.

I filed a bug against the NDK for gold's behavior, which I think is clearly incorrect:

https://github.com/android-ndk/ndk/issues/1063

What I don't understand is why this bug only shows up with a gcc7-compiled clang 7. AFAICT, it triggers for me locally with the exact same settings that we use in automation (we even use gold in automation, before and after the gcc7 change!). Unless somehow the gcc7-compiled clang 7 triggers using a different gold that we did before, and the previous gold doesn't have a bug (because it's an older version?) and the new gold does? But that feels really unlikely: the gcc6 and gcc7 builds that we have on taskcluster use the exact same version of binutils to build, and the clang toolchains use the same version of binutils as the gcc builds...so that theory doesn't hold water.

If you look at the pre-elfhacked binary for a working build, there's only a single section in the NOTE segment:

Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.android.ident .note.gnu.build-id .dynsym .dynstr .hash .gnu.version .gnu.version_r .rela.plt .plt .text .rodata .eh_frame .eh_frame_hdr 
   03     .fini_array .init_array .preinit_array .dynamic .got .got.plt .bss 
   04     .dynamic 
   05     .note.android.ident 
   06     .eh_frame_hdr 
   07     
   08     .fini_array .init_array .preinit_array .dynamic .got .got.plt 

which I don't understand. Unless gcc7-compiled clang is picking up the gold from the NDK, which is busted in some peculiar way? But that theory doesn't make a lot of sense either.

I'm not sure what to do here. We could add a hack in elfhack to just ignore the bogus NOTE segment, with copious comments, or we could force ld.bfd for Android because bugs. I am leaning towards the latter; WDYT, glandium?

Sigh, needinfo for comment 2.

Flags: needinfo?(mh+mozilla)

Update: the linker command lines between the gcc6-compiled clang and the gcc7-compiled clang libplugin-container.so files are identical. So we are using identical linkers regardless of what we used to compile clang.

I'll start with this: the problem doesn't come from gold, it comes from the NDK's objcopy. So I don't expect BFD ld would help in any way. Now, the complete mystery to me is that elfhack does, in fact, fail for me in an interactive task of a green build, with the same error.

~/workspace/build/src/obj-firefox/ipc/app$ make target
...
~/workspace/build/src/obj-firefox/ipc/app$ readelf -l ../../dist/bin/libplugin-container.so 
Elf file type is DYN (Shared object file)
Entry point 0x620
There are 9 program headers, starting at offset 64
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000001f8 0x00000000000001f8  R      0x8
  INTERP         0x0000000000000238 0x0000000000000238 0x0000000000000238
                 0x0000000000000015 0x0000000000000015  R      0x1
      [Requesting program interpreter: /system/bin/linker64]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000880 0x0000000000000880  R E    0x1000
  LOAD           0x0000000000000d88 0x0000000000001d88 0x0000000000001d88
                 0x0000000000000278 0x0000000000000280  RW     0x1000
  DYNAMIC        0x0000000000000db8 0x0000000000001db8 0x0000000000001db8
                 0x0000000000000200 0x0000000000000200  RW     0x8
  NOTE           0x000000000000024e 0x000000000000024e 0x000000000000024e
                 0x00000000000000be 0x00000000000000be  R      0x4
  GNU_EH_FRAME   0x000000000000084c 0x000000000000084c 0x000000000000084c
                 0x0000000000000034 0x0000000000000034  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x0000000000000d88 0x0000000000001d88 0x0000000000001d88
                 0x0000000000000278 0x0000000000000278  RW     0x8
 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.android.ident .note.gnu.build-id .dynsym .dynstr .hash .gnu.version .gnu.version_r .rela.plt .plt .text .r
odata .eh_frame .eh_frame_hdr 
   03     .fini_array .init_array .preinit_array .dynamic .got .got.plt .bss 
   04     .dynamic 
   05     .note.android.ident .note.gnu.build-id 
   06     .eh_frame_hdr 
   07     
   08     .fini_array .init_array .preinit_array .dynamic .got .got.plt 

~/workspace/build/src/obj-firefox/ipc/app$ ../../build/unix/elfhack/elfhack ../../dist/bin/libplugin-container.so 
../../dist/bin/libplugin-container.so: No relocations
~/workspace/build/src/obj-firefox/ipc/app$ make syms
...
~/workspace/build/src/obj-firefox/ipc/app$ readelf -l ../../dist/bin/libplugin-container.so 
Elf file type is DYN (Shared object file)
Entry point 0x620
There are 9 program headers, starting at offset 64
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000001f8 0x00000000000001f8  R      0x8
  INTERP         0x0000000000000238 0x0000000000000238 0x0000000000000238
                 0x0000000000000015 0x0000000000000015  R      0x1
      [Requesting program interpreter: /system/bin/linker64]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000880 0x0000000000000880  R E    0x1000
  LOAD           0x0000000000000d88 0x0000000000001d88 0x0000000000001d88
                 0x0000000000000278 0x0000000000000280  RW     0x1000
  DYNAMIC        0x0000000000000db8 0x0000000000001db8 0x0000000000001db8
                 0x0000000000000200 0x0000000000000200  RW     0x8
  NOTE           0x000000000000024e 0x000000000000024e 0x000000000000024e
                 0x00000000000000be 0x00000000000000bc  R      0x4
  GNU_EH_FRAME   0x000000000000084c 0x000000000000084c 0x000000000000084c
                 0x0000000000000034 0x0000000000000034  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x0000000000000d88 0x0000000000001d88 0x0000000000001d88
                 0x0000000000000278 0x0000000000000278  RW     0x8
 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.android.ident .note.gnu.build-id .dynsym .dynstr .hash .gnu.version .gnu.version_r .rela.plt .plt .text .rodata .eh_frame .eh_frame_hdr 
   03     .fini_array .init_array .preinit_array .dynamic .got .got.plt .bss 
   04     .dynamic 
   05     .note.android.ident 
   06     .eh_frame_hdr 
   07     
   08     .fini_array .init_array .preinit_array .dynamic .got .got.plt 
~/workspace/build/src/obj-firefox/ipc/app$ ../../build/unix/elfhack/elfhack ../../dist/bin/libplugin-container.so 
elfhack: /builds/worker/workspace/build/src/build/unix/elfhack/elf.cpp:272: Elf::Elf(std::ifstream &): Assertion `segment->getFile
Size() == phdr.p_filesz - gold_adjustment' failed.
Aborted

But what saves the day, actually, is strip, that happens right before running elfhack during packaging:

~/workspace/build/src/obj-firefox/ipc/app$ /builds/worker/fetches/android-ndk/toolchains/x86_64-4.9/prebuilt/linux-x86_64/bin/x86_64-linux-android-strip --strip-debug ../../dist/bin/libplugin-container.so 
~/workspace/build/src/obj-firefox/ipc/app$ ../../build/unix/elfhack/elfhack ../../dist/bin/libplugin-container.so 
../../dist/bin/libplugin-container.so: No relocations
~/workspace/build/src/obj-firefox/ipc/app$ readelf -l ../../dist/bin/libplugin-container.so 
Elf file type is DYN (Shared object file)
Entry point 0x620
There are 9 program headers, starting at offset 64
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000001f8 0x00000000000001f8  R      0x8
  INTERP         0x0000000000000238 0x0000000000000238 0x0000000000000238
                 0x0000000000000015 0x0000000000000015  R      0x1
      [Requesting program interpreter: /system/bin/linker64]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000880 0x0000000000000880  R E    0x1000
  LOAD           0x0000000000000d88 0x0000000000001d88 0x0000000000001d88
                 0x0000000000000278 0x0000000000000280  RW     0x1000
  DYNAMIC        0x0000000000000db8 0x0000000000001db8 0x0000000000001db8
                 0x0000000000000200 0x0000000000000200  RW     0x8
  NOTE           0x000000000000024e 0x000000000000024e 0x000000000000024e
                 0x0000000000000098 0x0000000000000098  R      0x4
  GNU_EH_FRAME   0x000000000000084c 0x000000000000084c 0x000000000000084c
                 0x0000000000000034 0x0000000000000034  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x0000000000000d88 0x0000000000001d88 0x0000000000001d88
                 0x0000000000000278 0x0000000000000278  RW     0x8
 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.android.ident .note.gnu.build-id .dynsym .dynstr .hash .gnu.version .gnu.version_r .rela.plt .plt .text .rodata .eh_frame .eh_frame_hdr 
   03     .fini_array .init_array .preinit_array .dynamic .got .got.plt .bss 
   04     .dynamic 
   05     .note.android.ident 
   06     .eh_frame_hdr 
   07     
   08     .fini_array .init_array .preinit_array .dynamic .got .got.plt 

Which means the problem is not actually new. Interestingly, if I look at what happens with your try, it appears objcopy does keep it in a sane state and strip is the one that kills it, which is the opposite of what happens currently.

What this would tend to indicate is that if we work around your issue via the linker, it might just switch back to the old behavior where objcopy breaks but strip saves the day. Which seems brittle and could fail randomly again in the future.

I tried objcopy from the binutils toolchain artifact, and it still has the problem. I wonder if even more recent versions of binutils have the problem. That should probably be reported to binutils and fixed there.

Flags: needinfo?(mh+mozilla)

Using the BFD linker does fix things:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7495aed5cb2d78c0db80e84fe17c2a2034202a8f

I think the BFD linker is less likely to run into the objcopy brokenness because it creates two separate NOTE program headers, as the notes have different alignments. Each .note.* section being put in its own segment seems less likely to run into the particular bug where objcopy removes a note and then can't get the in-memory size to match the on-disk size. (lld puts them in the same section, but properly aligned without padding, which I think might also turn out OK?)

I understand the desire to fix things in objcopy (or strip), as that's what's screwing things up, but I'm not sure I want to try to get things right for local Android builds in various permutations and automation builds. (Even our automation builds use objcopy and friends out of the NDK, which are not likely to get updated with great speed--maybe we could bootstrap binutils, and fix our local copy of binutils, and make everything look in toolchain_search_paths for our binutils, but that kind of starts to feel like overkill.)

So I return to what I asked earlier:

We could add a hack in elfhack to just ignore the bogus NOTE segment, with copious comments, or we could force ld.bfd for Android because bugs. I am leaning towards the latter; WDYT, glandium?

Flags: needinfo?(mh+mozilla)

It seems BFD ld does produce something that neither objcopy nor strip break, so I guess it'd be fine to switch back to BFD, considering IIRC we haven't even purposefully switched to gold, we're just "victim" of the NDK's default being gold (not even for all platforms, BTW)

Flags: needinfo?(mh+mozilla)

The NDK uses gold as its default linker, and gold has some bugs in how
it handles .note.* sections that lead to strip and objcopy produce
invalid binaries...or at least invalid binaries according to elfhack.
When elfhack complains, the build stops, which is suboptimal.
Instead, let's use bfd ld, which doesn't have these problems.

I hate computers.

It turns out that while we can switch to bfd ld, we'd be giving up ICF to do so, which seems bad. (It also turns out that for whatever reason, our aarch64 android builds already use bfd ld, which means they've been going without ICF all this time, fantastic.) I think since this gold issue is only manifesting on x86-64, which AFAICT we only build to make sure it works, we don't actually ship the build (?), the default will just be bfd on x86-64, and we'll leave other platforms alone.

Attachment #9087461 - Attachment description: Bug 1573820 - default to bfd ld for Android; r=#build → Bug 1573820 - default to bfd ld for x86-64 Android; r=#build
Pushed by nfroyd@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d3e824ec0569
default to bfd ld for x86-64 Android; r=nalexander
Status: NEW → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70
Assignee: nobody → nfroyd
You need to log in before you can comment on or make changes to this bug.