Closed Bug 1423822 Opened 2 years ago Closed Last year

elfhack doesn't work with lld-linked binaries

Categories

(Firefox Build System :: General, defect)

defect
Not set

Tracking

(firefox63 fixed)

RESOLVED FIXED
mozilla63
Tracking Status
firefox63 --- fixed

People

(Reporter: glandium, Assigned: glandium)

References

(Blocks 1 open bug)

Details

Attachments

(7 files)

+++ This bug was initially created as a clone of Bug #1423821 +++

[This is where we should actually fix the issue somehow]

(Found while trying to make elfhack -r work with lld-linked binaries)
STR:
- Take one of the .so files attached to bug 1385783
- Run elfhack on it

Observe the ELF program headers before and after:

Before:
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x000230 0x000230 R   0x8
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x0042e8 0x0042e8 R   0x1000
  LOAD           0x005000 0x0000000000005000 0x0000000000005000 0x003040 0x003040 R E 0x1000
  LOAD           0x009000 0x0000000000009000 0x0000000000009000 0x0021c8 0x003010 RW  0x1000
  TLS            0x00a3a0 0x000000000000a3a0 0x000000000000a3a0 0x000000 0x001010 R   0x10
  DYNAMIC        0x00b010 0x000000000000b010 0x000000000000b010 0x0001a0 0x0001a0 RW  0x8
  GNU_RELRO      0x00b000 0x000000000000b000 0x000000000000b000 0x0001c8 0x001000 R   0x1
  GNU_EH_FRAME   0x004240 0x0000000000004240 0x0000000000004240 0x000024 0x000024 R   0x1
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0
  NOTE           0x00046e 0x000000000000046e 0x000000000000046e 0x000018 0x000018 R   0x1

After:
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x000268 0x000268 R   0x8
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000810 0x000810 R   0x1000
  LOAD           0x001240 0x0000000000004240 0x0000000000004240 0x0000a8 0x0000a8 R   0x1000
  LOAD           0x0012e8 0x00000000000042e8 0x00000000000042e8 0x003d58 0x003d58 R E 0x1000
  LOAD           0x006000 0x0000000000009000 0x0000000000009000 0x0021c8 0x003010 RW  0x1000
  TLS            0x0073a0 0x000000000000a3a0 0x000000000000a3a0 0x000000 0x001010 R   0x10
  DYNAMIC        0x008010 0x000000000000b010 0x000000000000b010 0x0001a0 0x0001a0 RW  0x8
  GNU_RELRO      0x008000 0x000000000000b000 0x000000000000b000 0x0001c8 0x001000 R   0x1
  GNU_EH_FRAME   0x001240 0x0000000000004240 0x0000000000004240 0x000024 0x000024 R   0x1
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0
  LOPROC+0x36968 0x6f00747365740061 0x676e697274730066 0x6600796172726100 0x770065737500726f 0x68666c6500687469     0x73006f74006b6361

The problem is that elfhack is adding a program header item, growing the program header by 56 bytes, but the data section that directly follows that is .rodata, which can't have its address moved (lld is being too smart for its own good).
Ideas:
- Move the program headers to some location where there's space. This could possibly break things that assume that mapping the first page of the binary allows to read the program headers (as a matter of fact, IIRC our custom dynamic linker does this assumption)
- Move the non-relocatable data that follows the program headers by one page, creating a one-page gap, and adjust the PT_LOAD to load it at the right address.
- Somehow take advantage of the fact that .rel.dyn and .text are not in the same PT_LOAD, and place the elfhack sections in a way that doesn't require splitting further. The current problem is that elfhack assumes it can't move the eh_frame data address (which /might/ be wrong) and since it places the elfhack data section before that, it needs to split the PT_LOAD to allow the binary to shrink in size.
Duplicate of this bug: 1425692
Product: Core → Firefox Build System
Blocks: 1457482
I do like the option of fixing lld rather than moving to another linker. I'm not sure it's easier than modifying elfhack, but I'll start somewhere.

Using readelf -l, on my local machine my linker is 'GNU ld (GNU Binutils for Ubuntu) 2.30' (bfd) and produces the following for test-array.so

>  Type           Offset             VirtAddr           PhysAddr
>                 FileSiz            MemSiz              Flags  Align
>  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
>                 0x000000000000af10 0x000000000000af10  R E    0x200000
>  LOAD           0x000000000000be40 0x000000000020be40 0x000000000020be40
>                 0x00000000000015b0 0x00000000000015b8  RW     0x200000
>  DYNAMIC        0x000000000000be50 0x000000000020be50 0x000000000020be50
>                 0x0000000000000190 0x0000000000000190  RW     0x8
>  NOTE           0x0000000000000200 0x0000000000000200 0x0000000000000200
>                 0x0000000000000024 0x0000000000000024  R      0x4
>  TLS            0x000000000000be40 0x000000000020be40 0x000000000020be40
>                 0x0000000000000000 0x0000000000001010  R      0x10
>  GNU_EH_FRAME   0x000000000000ae40 0x000000000000ae40 0x000000000000ae40
>                 0x000000000000002c 0x000000000000002c  R      0x4
>  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
>                 0x0000000000000000 0x0000000000000000  RW     0x10
>  GNU_RELRO      0x000000000000be40 0x000000000020be40 0x000000000020be40
>                 0x00000000000001c0 0x00000000000001c0  R      0x1
>
> Section to Segment mapping:
>  Segment Sections...
>   00     .note.gnu.build-id .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .plt .text .rodata .eh_frame_hdr .eh_frame
>   01     .init_array .dynamic .got .got.plt .data .bss
>   02     .dynamic
>   03     .note.gnu.build-id
>   04     .tbss
>   05     .eh_frame_hdr
>   06
>   07     .init_array .dynamic .got




The lld-produced test-array.so from Bug 1385783 (and a local one I just generated) looks like this:

>  Type           Offset             VirtAddr           PhysAddr
>                 FileSiz            MemSiz              Flags  Align
>  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
>                 0x0000000000000230 0x0000000000000230  R      0x8
>  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
>                 0x00000000000042e8 0x00000000000042e8  R      0x1000
>  LOAD           0x0000000000005000 0x0000000000005000 0x0000000000005000
>                 0x0000000000003040 0x0000000000003040  R E    0x1000
>  LOAD           0x0000000000009000 0x0000000000009000 0x0000000000009000
>                 0x00000000000021c8 0x0000000000003010  RW     0x1000
>  TLS            0x000000000000a3a0 0x000000000000a3a0 0x000000000000a3a0
>                 0x0000000000000000 0x0000000000001010  R      0x10
>  DYNAMIC        0x000000000000b010 0x000000000000b010 0x000000000000b010
>                 0x00000000000001a0 0x00000000000001a0  RW     0x8
>  GNU_RELRO      0x000000000000b000 0x000000000000b000 0x000000000000b000
>                 0x00000000000001c8 0x0000000000001000  R      0x1
>  GNU_EH_FRAME   0x0000000000004240 0x0000000000004240 0x0000000000004240
>                 0x0000000000000024 0x0000000000000024  R      0x1
>  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
>                 0x0000000000000000 0x0000000000000000  RW     0x0
>  NOTE           0x000000000000046e 0x000000000000046e 0x000000000000046e
>                 0x0000000000000018 0x0000000000000018  R      0x1
>
> Section to Segment mapping:
>  Segment Sections...
>   00
>   01     .rodata .note.gnu.build-id .dynsym .gnu.version .gnu.version_r .gnu.hash .hash .dynstr .rela.dyn .rela.plt .eh_frame_hdr .eh_frame
>   02     .text .plt
>   03     .data .got.plt .init_array .dynamic .got .bss
>   04     .tbss
>   05     .dynamic
>   06     .init_array .dynamic .got
>   07     .eh_frame_hdr
>   08
>   09     .note.gnu.build-id


As I understand it (some of this is probably wrong):

elfhack is shrinking the .rela.plt section, and using the reclaimed space to insert it's own two sections. By using the reclaimed space it can avoid relocating anything subsequent to it (like .rodata) which means it doesn't need to relocate something non-relocatable.  When doing so it creates a new segment and moves all subsequent sections to it's sections into the new segment. (I'm not sure why it does this.)

PHDR is not needed; and bfd doesn't produce it.

Because lld does produce it, we need to edit it. Editing it increases its size. That requires something right after it to moved, and that's not possible.

Additionally, we need our .elfhack sections to be in an executable segment.


In theory, possible solutions include:

a) Figuring out how to use less space in PHDR so we don't need to increase its size
b) moving .rela.dyn .rela.plt to the beginning (right after PHDR) so when we shrink them, we can use the reclaimed space
c) Not generating PHDR and generating the first segment as executable
d) others...
More conversation:

d) Using a linker script to order sections more helpfully
e) Using --sort-sections if that will do what we want - it probably won't
f) Using  --symbol-ordering-file which "will map symbols to sections and sort sections in output according to symbol ordering file"

I'm going to investigate (f) first since this need was its intended purpose.
PHDR is that thing you are displaying. LLD is just putting a self-reference in it. So of course, to add a PT_LOAD, there needs to be one entry added, which means it *has* to grow. But, actually, this gives another opportunity:
g) reuse one of the useless program headers so that PHDR doesn't have to grow.

Now, for more explanation wrt comment 3:
elfhack shrinks .rel* as you said, and to reclaim the "freed" space, needs to move the sections that appear after that. But you can only do so if you make relative addresses in the entire binary be the same, so you need to ensure the VADDRs stay the same. Which mean elfhack has to split a PT_LOAD in two so that VADDRs don't change.

Some section, however, don't really care about their VADDR, and maybe eh_frame is one of them, and if it is, then there is no need to split a PT_LOAD with lld, since elfhack could move eh_frame to follow the resized .rel*, and place its own executable section before .text. That's the 3 item in comment 1.

I'd advise against f) because it requires listing all .rodata symbols, or even more.

b) and c) are not valid options.
I'd got two pretty good paths forward on the clang side.  Both involve the --no-rosegment argument for lld to change its segments/sections to behave like bfd's. But that doesn't solve the PHDR problem.

1) Patch clang not to output PHDR

2) Offset the first section after PHDR with a linker script so there is room for it to grow.

Unfortunately; both break rust.
If --no-rosegment makes it behave more like bfd, why is there a phdr problem still?
Ah; there may not be after all.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a5cff261dda30045133faf41e3c0a87b1830153c&selectedJob=179725669

I'm hitting 'No gain, skipping': https://searchfox.org/mozilla-central/source/build/unix/elfhack/elfhack.cpp#796

I thought this might be fine; but the comment there makes me unsure.

Additionally, in the build log there are a lot of 'readelf: Error: bad dynamic symbol' but I'm not sure where it's coming from or if it's a real issue or just a 'readelf can't process the object file output at that point at the job that does it runs unconditionally' type error.
Assignee: nobody → mh+mozilla
Attachment #8996648 - Flags: review?(nfroyd)
The eh_frame thing, at the moment, breaks the profiler.
Comment on attachment 8996643 [details]
Bug 1423822 - Make `elfhack -f` work in all cases where no gain would happen.

https://reviewboard.mozilla.org/r/260730/#review267740
Attachment #8996643 - Flags: review?(nfroyd) → review+
Comment on attachment 8996646 [details]
Bug 1423822 - Check segments overlapping later.

https://reviewboard.mozilla.org/r/260736/#review267792
Attachment #8996646 - Flags: review?(nfroyd) → review+
Comment on attachment 8996647 [details]
Bug 1423822 - Set the address for the elfhack code section based on that of the section it is attached to.

https://reviewboard.mozilla.org/r/260738/#review267794
Attachment #8996647 - Flags: review?(nfroyd) → review+
Comment on attachment 8996644 [details]
Bug 1423822 - Change how elfhack determines it's not worth trying.

https://reviewboard.mozilla.org/r/260732/#review267802
Attachment #8996644 - Flags: review?(nfroyd) → review+
Comment on attachment 8996649 [details]
Bug 1423822 - Stop disabling elfhack on lto builds.

https://reviewboard.mozilla.org/r/260742/#review267804
Attachment #8996649 - Flags: review?(nfroyd) → review+
Comment on attachment 8996645 [details]
Bug 1423822 - Handle more cases of pointer reuse in DT_INIT_ARRAY.

https://reviewboard.mozilla.org/r/260734/#review267806
Attachment #8996645 - Flags: review?(nfroyd) → review+
So, for the record, it turns out llvm-dwarfdump is a little deceptive, in that it presents the raw data, and doesn't care to present "PC"-relative pointers in a resolved fashion, so the previous patch, which just moved the sections, didn't show any difference, but didn't actually work because the "PC"-relative pointers did need to be updated.

Dwarfdump -F, however, *does* show the resolved pointers, and made it more obvious the previous patch was wrong.

With the new patch, there is (almost) no difference in output for `dwarfdump -F` before and after the patch. The only difference comes from the fact that dwarfdump shows LSDA pointers as raw data, rather than interpreting them. They are pointers into the .gcc_except_table section and do need the same adjustments.
> before and after the patch

I mean before and after running elfhack.
Comment on attachment 8996648 [details]
Bug 1423822 - Allow to relocate eh_frame.

https://reviewboard.mozilla.org/r/260740/#review268048

Fun, fun.

::: build/unix/elfhack/elfhack.cpp:1090
(Diff revision 3)
> +    // The .eh_frame section usually follows the eh_frame_hdr section.
> +    ElfSection* eh_frame = eh_frame_hdr ? eh_frame_hdr->getNext() : nullptr;

"Usually follows" doesn't seem like a very strong guarantee.  Should we throw an error here if `eh_frame_hdr && !eh_frame`, just so we can refine this later if need be?
Attachment #8996648 - Flags: review?(nfroyd) → review+
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/2698872fcdef
Make `elfhack -f` work in all cases where no gain would happen. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/a5792a8f6568
Change how elfhack determines it's not worth trying. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/6d74232cf5ef
Handle more cases of pointer reuse in DT_INIT_ARRAY. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/172feaea900c
Check segments overlapping later. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/8993d26433d8
Set the address for the elfhack code section based on that of the section it is attached to. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/83e147e83538
Allow to relocate eh_frame. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/a973e7f82da4
Stop disabling elfhack on lto builds. r=froydnj
Depends on: 1480617
Depends on: 1480654
Depends on: 1499915
You need to log in before you can comment on or make changes to this bug.