Closed Bug 716295 Opened 13 years ago Closed 12 years ago

Elf-hack failing on b2g builds

Categories

(Core :: mozglue, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED FIXED
mozilla17
blocking-basecamp +

People

(Reporter: cjones, Assigned: glandium)

References

Details

Attachments

(6 files, 4 obsolete files)

2.36 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
3.54 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
3.45 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
5.33 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
4.15 KB, patch
glandium
: review+
Details | Diff | Splinter Review
19.01 KB, patch
glandium
: review+
Details | Diff | Splinter Review
I get the following spew before the exception

/home/cjones/mozilla/b2g/glue/gonk/prebuilt/linux-x86/toolchain/arm-eabi-4.4.3/bin/../lib/gcc/arm-eabi/4.4.3/../../../../arm-eabi/bin/ld: warning: /home/cjones/mozilla/b2g/glue/gonk/prebuilt/linux-x86/toolchain/arm-eabi-4.4.3/bin/../lib/gcc/arm-eabi/4.4.3/thumb/android/libgcc.a(unwind-arm.o) uses 32-bit enums yet the output is to use variable-size enums; use of enum values across objects may fail
/home/cjones/mozilla/b2g/glue/gonk/prebuilt/linux-x86/toolchain/arm-eabi-4.4.3/bin/../lib/gcc/arm-eabi/4.4.3/../../../../arm-eabi/bin/ld: warning: /home/cjones/mozilla/b2g/glue/gonk/prebuilt/linux-x86/toolchain/arm-eabi-4.4.3/bin/../lib/gcc/arm-eabi/4.4.3/thumb/android/libgcc.a(pr-support.o) uses 32-bit enums yet the output is to use variable-size enums; use of enum values across objects may fail
/home/cjones/mozilla/b2g/glue/gonk/prebuilt/linux-x86/toolchain/arm-eabi-4.4.3/bin/../lib/gcc/arm-eabi/4.4.3/../../../../arm-eabi/bin/ld: warning: /home/cjones/mozilla/b2g/glue/gonk/out/target/product/galaxys2/obj/lib/liblog.so uses 32-bit enums yet the output is to use variable-size enums; use of enum values across objects may fail
/home/cjones/mozilla/b2g/glue/gonk/prebuilt/linux-x86/toolchain/arm-eabi-4.4.3/bin/../lib/gcc/arm-eabi/4.4.3/../../../../arm-eabi/bin/ld: warning: /home/cjones/mozilla/b2g/glue/gonk/out/target/product/galaxys2/obj/lib/libstdc++.so uses 32-bit enums yet the output is to use variable-size enums; use of enum values across objects may fail
/home/cjones/mozilla/b2g/glue/gonk/prebuilt/linux-x86/toolchain/arm-eabi-4.4.3/bin/../lib/gcc/arm-eabi/4.4.3/../../../../arm-eabi/bin/ld: warning: /home/cjones/mozilla/b2g/glue/gonk/out/target/product/galaxys2/obj/lib/libm.so uses 32-bit enums yet the output is to use variable-size enums; use of enum values across objects may fail
/home/cjones/mozilla/b2g/glue/gonk/prebuilt/linux-x86/toolchain/arm-eabi-4.4.3/bin/../lib/gcc/arm-eabi/4.4.3/../../../../arm-eabi/bin/ld: warning: /home/cjones/mozilla/b2g/glue/gonk/out/target/product/galaxys2/obj/lib/libc.so uses 32-bit enums yet the output is to use variable-size enums; use of enum values across objects may fail
/home/cjones/mozilla/b2g/glue/gonk/prebuilt/linux-x86/toolchain/arm-eabi-4.4.3/bin/../lib/gcc/arm-eabi/4.4.3/../../../../arm-eabi/bin/ld: warning: /home/cjones/mozilla/b2g/glue/gonk/out/target/product/galaxys2/obj/lib/libdl.so uses 32-bit enums yet the output is to use variable-size enums; use of enum values across objects may fail
===
=== If you get failures below, please file a bug describing the error
=== and your environment (compiler and linker versions), and use
=== --disable-elf-hack until this is fixed.
===
test.so: terminate called after throwing an instance of 'std::runtime_error'
  what():  Moving section would require overlapping segments


I'm using the following mozconfig

mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/objdir-prof-gonk

# XXX our build system doesn't seem to respect toplevel -j ...
# hard-coding this sucks
mk_add_options MOZ_MAKE_FLAGS="-s $MAKE_FLAGS"

# XXX need a plan for toolchain and multi-dev-platform support
gonk="$GONK_PATH"
sys=`uname -s | tr "[[:upper:]]" "[[:lower:]]"`

ac_add_options --target=arm-android-eabi
ac_add_options --with-gonk="$gonk"
ac_add_options --with-endian=little

ac_add_options --enable-application=b2g
# ac_add_options --disable-elf-hack

ac_add_options --enable-debug-symbols
# ac_add_options --enable-profiling
ac_add_options --with-ccache

# Enable dump() from JS.
export CXXFLAGS=-DMOZ_ENABLE_JS_DUMP


The environment is https://github.com/andreasgal/B2G/commit/f12f7a7f557c0506d05be75eb7439493b76653d2.  The compiler is included prebuilt in the repository.  It corresponds to the android NDK as of 2.3.5, which I believe is NDK 6.

Note that this mozconfig isn't the default for b2g builds (obviously! :).
It came out in bug 774131 that we're spending ~500ms between fork() and main() when launching content processes.  glandium reminded us that we disabled elfhack, which should win some startup time.

After bug 774131 fixes the biggest user-perceived lag in launching content processes, the fork()-to-main() overhead will be the biggest to seeing app pixels appear on screen.  XPCOM startup is biggest after that.

This should block for figuring out what's required to reenable elfhack and what win we can get from it.
blocking-basecamp: --- → +
Assignee: nobody → mh+mozilla
Component: Build Config → mozglue
OS: Linux → Gonk
Hardware: x86_64 → ARM
glandium, I now see

===
=== If you get failures below, please file a bug describing the error
=== and your environment (compiler and linker versions), and use
/home/cjones/mozilla/new-b2g/objdir-gecko/config/nsinstall -D ../../dist/
=== --disable-elf-hack until this is fixed.
===
cd ../../dist/bin; find . -name "*.so" | xargs ../../build/unix/elfhack/elfhack
./libnssckbi.so: terminate called after throwing an instance of 'std::runtime_error'
  what():  Moving section would require overlapping segments
xargs: ../../build/unix/elfhack/elfhack: terminated by signal 6
make[4]: *** [elfhack] Error 125
make[4]: Leaving directory `/home/cjones/mozilla/new-b2g/objdir-gecko/b2g/installer'
make[3]: *** [make-package] Error 2
make[3]: Leaving directory `/home/cjones/mozilla/new-b2g/objdir-gecko/b2g/installer'
make[2]: *** [default] Error 2
make[2]: Leaving directory `/home/cjones/mozilla/new-b2g/objdir-gecko/b2g/installer'
make[1]: *** [package] Error 2
make[1]: Leaving directory `/home/cjones/mozilla/new-b2g/objdir-gecko'
make: *** [out/target/product/crespo/obj/DATA/gecko_intermediates/gecko] Error 2

with a vanilla b2g build for the Nexus S, with |ac_add_options --disable-elf-hack| commented out in $b2g/gonk-misc/default-gecko-config.  Can you take a look?
Mid-aired, I'm going to assume that's a yes :).
Yeah, I already found the problem. Fix isn't very difficult, but the tricky part is to test the fix properly, considering the problem only happens on cross-compiled environment, where we can't verify that elfhack doesn't break the resulting binaries, so I'd like the fix to be tested on linux builds as well, and that requires forging a proper test case with a linker script, or maybe using -nostartfiles.
Ooc, why doesn't the bug appear in firefox-android builds?
(In reply to Chris Jones [:cjones] [:warhammer] from comment #5)
> Ooc, why doesn't the bug appear in firefox-android builds?

Because ld, on firefox-android builds, leaves empty entries in the .dynamic section, which the b2g ld doesn't. Elfhack adds an entry to .dynamic in both cases, but only on android builds it has room to do so.

The way I want to fix this is to avoid elfhack adding an entry to .dynamic at all.
Hm, ok.  I'm surprised we're setting things up differently from firefox-android.

Do you have a patch I can try?
I have a preliminary patch that addresses the .dynamic issue for rel locations (so, for x86 and ARM, but not x64), but it uncovered another problem...
Attached patch WIP (obsolete) — Splinter Review
This works for me on otoro.
Representative timings with patch applied

I/Gecko   (   78): LAUNCH [Parent] Before fork(), time is 52 sec 517 ms
I/Gecko   (   78): LAUNCH [Parent] LaunchAndWait() --- 12.6202 ms
I/Gecko   (  261): LAUNCH [Child] In XRE_InitChildProcess(), time is 52 sec 879 ms
I/Gecko   (  261): LAUNCH [Child] ContentProcess::Init()
I/Gecko   (  261): LAUNCH [Child]     mContent.Init() --- 1.98187 ms
I/Gecko   (  261): LAUNCH [Child]     mXREEmbed.Start() ---- 396.717 ms
I/Gecko   (  261): LAUNCH [Child]     mContent.InitXPCOM() --- 396.858 ms

So this takes fork()-to-main() time down by about 200ms :).
Priority: -- → P1
Dumb question, shouldn't we be doing relocations only once for the main b2g process and fork()ing post-relocation for the rest?
Attachment #649714 - Attachment is obsolete: true
Comment on attachment 650259 [details] [diff] [review]
part 1 - Fail more gracefully when .dynamic section can't be grown

Review of attachment 650259 [details] [diff] [review]:
-----------------------------------------------------------------

::: build/unix/elfhack/elf.cpp
@@ +656,5 @@
>      // If we get here, this means we didn't match for the given tag
> +    // Most of the time, there are a few DT_NULL entries, that we can
> +    // use to add our value, but if we are on the last entry, we can't.
> +    if (i >= shdr.sh_size / shdr.sh_entsize - 1)
> +        return false;

Given the number of times shdr.sh_size / shdr.sh_entsize is used in this function, I think you should give it a name.

I'd chuck the assert above this given that we're no longer twiddling with the size of the section.

@@ +662,2 @@
>      dyns[i].tag = tag;
>      dyns[i++].value = val;

Nit: just dyns[i] here.
Attachment #650259 - Flags: review?(nfroyd) → review+
Attachment #650260 - Flags: review?(nfroyd) → review+
Attachment #650261 - Flags: review?(nfroyd) → review+
Comment on attachment 650263 [details] [diff] [review]
part 5 - Add support for R_ARM_THM_CALL relocations

Review of attachment 650263 [details] [diff] [review]:
-----------------------------------------------------------------

::: build/unix/elfhack/elfhack.cpp
@@ +248,5 @@
>              s = (tmp & (1 << 24)) >> 24;
>              j1 = ((tmp & (1 << 23)) >> 23) ^ !s;
>              j2 = ((tmp & (1 << 22)) >> 22) ^ !s;
>  
>              return 0xf000 | (s << 10) | ((tmp & (0x3ff << 12)) >> 12) | 

Nit: trailing whitespace.  Please fix while you're here.
Attachment #650263 - Flags: review?(nfroyd) → review+
Attachment #650264 - Flags: review?(nfroyd) → review+
Comment on attachment 650262 [details] [diff] [review]
part 4 - Interpose elfhack injected code in DT_INIT_ARRAY's first entry when possible

Review of attachment 650262 [details] [diff] [review]:
-----------------------------------------------------------------

I think this looks sane enough.  I want to give myself some time to look at this in the morning, though.

::: build/unix/elfhack/elfhack.cpp
@@ +368,5 @@
> +    addr.serialize(const_cast<char *>(loc.getBuffer()), Elf_Addr::size(elf->getClass()), elf->getClass(), elf->getData());
> +}
> +
> +void set_relative_reloc(Elf_Rela *rel, Elf *elf, unsigned int value) {
> +    set_relative_reloc((Elf_Rel *)rel, elf, value);

This looks unnecessary (and a bit ugly), since the reloc holds the addend already.  Am I missing something?

@@ +478,1 @@
>              (relro && (i->r_offset >= relro->getAddr()) &&

Nit: indentation here and below.

@@ +508,5 @@
> +        Rel_Type *rel = &new_rels[init_array_reloc - 1];
> +        unsigned int addend = get_addend(rel, elf);
> +        // Use relocated value of DT_INIT_ARRAY's first entry for the
> +        // function to be called by the injected code.
> +        if (ELF32_R_TYPE(rel->r_info) == rel_type) {

This macro usage (and others throughout) should depend on the class of |elf|, yes?
Attachment #650262 - Flags: review?(nfroyd) → feedback+
(In reply to Nathan Froyd (:froydnj) from comment #22)
> Comment on attachment 650262 [details] [diff] [review]
> part 4 - Interpose elfhack injected code in DT_INIT_ARRAY's first entry when
> possible
> 
> Review of attachment 650262 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I think this looks sane enough.  I want to give myself some time to look at
> this in the morning, though.
> 
> ::: build/unix/elfhack/elfhack.cpp
> @@ +368,5 @@
> > +    addr.serialize(const_cast<char *>(loc.getBuffer()), Elf_Addr::size(elf->getClass()), elf->getClass(), elf->getData());
> > +}
> > +
> > +void set_relative_reloc(Elf_Rela *rel, Elf *elf, unsigned int value) {
> > +    set_relative_reloc((Elf_Rel *)rel, elf, value);
> 
> This looks unnecessary (and a bit ugly), since the reloc holds the addend
> already.  Am I missing something?

It's for consistency with ld, which keeps the value in both places (and makes one wonder why use RELA at all), although that depends on versions. I'm tempted to make it conditional on whether the value is the old addend or not. ld.so doesn't care, though.

> @@ +508,5 @@
> > +        Rel_Type *rel = &new_rels[init_array_reloc - 1];
> > +        unsigned int addend = get_addend(rel, elf);
> > +        // Use relocated value of DT_INIT_ARRAY's first entry for the
> > +        // function to be called by the injected code.
> > +        if (ELF32_R_TYPE(rel->r_info) == rel_type) {
> 
> This macro usage (and others throughout) should depend on the class of
> |elf|, yes?

Once read from disk, everything is actually stored in Elf32 structs.
(In reply to Mike Hommey [:glandium] from comment #23)
> (In reply to Nathan Froyd (:froydnj) from comment #22)
> > Comment on attachment 650262 [details] [diff] [review]
> > part 4 - Interpose elfhack injected code in DT_INIT_ARRAY's first entry when
> > possible
> > 
> > Review of attachment 650262 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > I think this looks sane enough.  I want to give myself some time to look at
> > this in the morning, though.
> > 
> > ::: build/unix/elfhack/elfhack.cpp
> > @@ +368,5 @@
> > > +    addr.serialize(const_cast<char *>(loc.getBuffer()), Elf_Addr::size(elf->getClass()), elf->getClass(), elf->getData());
> > > +}
> > > +
> > > +void set_relative_reloc(Elf_Rela *rel, Elf *elf, unsigned int value) {
> > > +    set_relative_reloc((Elf_Rel *)rel, elf, value);
> > 
> > This looks unnecessary (and a bit ugly), since the reloc holds the addend
> > already.  Am I missing something?
> 
> It's for consistency with ld, which keeps the value in both places (and
> makes one wonder why use RELA at all), although that depends on versions.
> I'm tempted to make it conditional on whether the value is the old addend or
> not. ld.so doesn't care, though.

Actually, looking more closely, for relative relocations all versions of ld I tested are storing the value in both places. What differs is, for R_X86_64_64/R_386_32/..., some versions of ld put the value of the symbol in the relocated data field, while it's (obviously) not the addend, which is 0.
Attachment #650259 - Attachment is obsolete: true
Added a comment, and addressed review comments.
Attachment #650262 - Attachment is obsolete: true
Comment on attachment 650445 [details] [diff] [review]
part 1 - Fail more gracefully when .dynamic section can't be grown

Carry r+ over.
Attachment #650445 - Flags: review+
Attachment #650446 - Flags: review?(nfroyd)
(In reply to Mike Hommey [:glandium] from comment #23)
> > ::: build/unix/elfhack/elfhack.cpp
> > @@ +368,5 @@
> > > +    addr.serialize(const_cast<char *>(loc.getBuffer()), Elf_Addr::size(elf->getClass()), elf->getClass(), elf->getData());
> > > +}
> > > +
> > > +void set_relative_reloc(Elf_Rela *rel, Elf *elf, unsigned int value) {
> > > +    set_relative_reloc((Elf_Rel *)rel, elf, value);
> > 
> > This looks unnecessary (and a bit ugly), since the reloc holds the addend
> > already.  Am I missing something?
> 
> It's for consistency with ld, which keeps the value in both places (and
> makes one wonder why use RELA at all), although that depends on versions.
> I'm tempted to make it conditional on whether the value is the old addend or
> not. ld.so doesn't care, though.

My guess is that ld does what it does for simplicity.  If you always put the value in the relocation's location, then many places won't have to care whether you're using REL or RELA.  And this gracefully handles those architectures that can do either/or (or both simultaneously *shudder*).

> > @@ +508,5 @@
> > > +        Rel_Type *rel = &new_rels[init_array_reloc - 1];
> > > +        unsigned int addend = get_addend(rel, elf);
> > > +        // Use relocated value of DT_INIT_ARRAY's first entry for the
> > > +        // function to be called by the injected code.
> > > +        if (ELF32_R_TYPE(rel->r_info) == rel_type) {
> > 
> > This macro usage (and others throughout) should depend on the class of
> > |elf|, yes?
> 
> Once read from disk, everything is actually stored in Elf32 structs.

Argh, I saw that earlier but didn't make the connection to here.
Comment on attachment 650446 [details] [diff] [review]
part 4 - Interpose elfhack injected code in DT_INIT_ARRAY's first entry when possible

Review of attachment 650446 [details] [diff] [review]:
-----------------------------------------------------------------

::: build/unix/elfhack/elfhack.cpp
@@ +360,5 @@
> +unsigned int get_addend(Elf_Rela *rel, Elf *elf) {
> +    return rel->r_addend;
> +}
> +
> +void set_relative_reloc(Elf_Rel *rel, Elf *elf, unsigned int value, unsigned int old_value) {

old_value looks unused in both overloads of this function.

@@ +370,5 @@
> +
> +void set_relative_reloc(Elf_Rela *rel, Elf *elf, unsigned int value, unsigned int old_value) {
> +    // ld puts the value of relocated relocations both in the addend and
> +    // at r_offset. For consistency, keep it that way.
> +    set_relative_reloc((Elf_Rel *)rel, elf, value);

I appreciate the comment, but there's no three-argument set_relative_reloc...?
Attachment #650446 - Flags: review?(nfroyd) → review+
Doh, I had forgotten to remove these, which remained from an attempt at doing something smarter, which ended up being wrong.
Attachment #650446 - Attachment is obsolete: true
Comment on attachment 650548 [details] [diff] [review]
part 4 - Interpose elfhack injected code in DT_INIT_ARRAY's first entry when possible

Carrying r+ over.
Attachment #650548 - Flags: review+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: