Closed Bug 1260208 Opened 8 years ago Closed 8 years ago

switch Fennec builds to use libc++ by default

Categories

(Firefox Build System :: Android Studio and Gradle Integration, defect)

Platform: All Android
Type: defect
Priority: Not set
Severity: normal

Tracking

(firefox49 fixed)

RESOLVED FIXED
mozilla49

People

(Reporter: froydnj, Assigned: froydnj)

References

Details

Attachments

(7 files)

Newer versions of the NDK have a (presumably) fixed libc++.  We can ditch stlport and move that much closer to having a modern C++ library everywhere.
The libc++ included with the Android NDK does not seem to work correctly
with std::cerr; writing to it (or to std::cout, as confirmed by tests)
causes the process to hang indefinitely, causing test failures.  Using
fprintf and stderr, however, seems to work correctly.
Attachment #8736017 - Flags: review?(bgirard)
Because I failed to check it for the first NDK I built.
Attachment #8736018 - Flags: review?(snorp)
This needs some about:license changes not included here; I'll try uploading
those tomorrow, since there are some conflicts with the libc++/OS X work I'm
doing elsewhere.
Attachment #8736019 - Flags: review?(nalexander)
Attachment #8736017 - Flags: review?(bgirard) → review+
Attachment #8736018 - Flags: review?(snorp) → review+
(In reply to Nathan Froyd [:froydnj] from comment #1)
> Created attachment 8736017 [details] [diff] [review]
> part 1 - use C I/O facilities in TestWebGLElementArrayCache instead of C++
> ones
> 
> The libc++ included with the Android NDK does not seem to work correctly
> with std::cerr; writing to it (or to std::cout, as confirmed by tests)
> causes the process to hang indefinitely, causing test failures.  Using
> fprintf and stderr, however, seems to work correctly.

froydnj: how can we possibly use this?  There's 180 writers in the tree: https://dxr.mozilla.org/mozilla-central/search?tree=mozilla-central&q=%22std%3A%3Acerr%22&redirect=true.  Are you hoping every path through std::cerr is exercised in tests, preventing anybody from burning us?

As for flipping the default, I'd like to see bootstrap updated when we do that: https://dxr.mozilla.org/mozilla-central/source/python/mozboot/mozboot/android.py#261.  But I won't block on that, if you really don't want to test it -- it can be a follow-up ticket.
Flags: needinfo?(nfroyd)
Comment on attachment 8736019 [details] [diff] [review]
part 3 - switch to libc++ when building Fennec

Review of attachment 8736019 [details] [diff] [review]:
-----------------------------------------------------------------

Redirecting to snorp.  I appreciate being looped in but I just don't know enough about the C++ and linking story on Android to review.
Attachment #8736019 - Flags: review?(nalexander) → review?(snorp)
(In reply to Nick Alexander :nalexander from comment #4)
> (In reply to Nathan Froyd [:froydnj] from comment #1)
> > Created attachment 8736017 [details] [diff] [review]
> > part 1 - use C I/O facilities in TestWebGLElementArrayCache instead of C++
> > ones
> > 
> > The libc++ included with the Android NDK does not seem to work correctly
> > with std::cerr; writing to it (or to std::cout, as confirmed by tests)
> > causes the process to hang indefinitely, causing test failures.  Using
> > fprintf and stderr, however, seems to work correctly.
> 
> froydnj: how can we possibly use this?  There's 180 writers in the tree:
> https://dxr.mozilla.org/mozilla-central/search?tree=mozilla-
> central&q=%22std%3A%3Acerr%22&redirect=true.  Are you hoping every path
> through std::cerr is exercised in tests, preventing anybody from burning us?

Those instances fall into the following categories:

* gmp-plugin-openh264: If people want to log on Android, I'd be a little surprised.
* gfx/angle/: I think this is Windows-only code, so it doesn't matter.
* media/mtransport/test/: We actually only run a small number of these tests on Android:

jsep_track_unittest
sdp_unittests
mediaconduit_unittests (gets built, but doesn't actually run any tests, see test logs on TH or search for MOZ_WEBRTC_TESTS)
signaling_unittests (same)

jsep_track_unittest, at least, appears to output to std::cerr just fine, according to the logs.  So I'm not entirely sure what's going on there.

* media/webrtc/trunk/: I don't think we care about any of these.
* security/nss/: We don't run NSS tests.
* testing/gtest/: Android doesn't run gtests, AFAICT (!).
* toolkit/crashreporter/google-breakpad/: I think these are for tools that don't run on Android?

> As for flipping the default, I'd like to see bootstrap updated when we do
> that:
> https://dxr.mozilla.org/mozilla-central/source/python/mozboot/mozboot/
> android.py#261.  But I won't block on that, if you really don't want to test
> it -- it can be a follow-up ticket.

This actually looks pretty easy, as direct downloading from the URL works, despite the popup license you get when clicking on the link in the NDK download webpage.  I'll try it out.
Flags: needinfo?(nfroyd)
Direct downloading worked out OK, and a test run on my Debian system got it
installed to the correct location.
Attachment #8736329 - Flags: review?(nalexander)
We are now statically linking libc++ from the Android NDK into our binaries, so
we need to note its license in about:license.
Attachment #8736337 - Flags: review?(nalexander)
Attachment #8736337 - Flags: review?(gerv)
Comment on attachment 8736329 [details] [diff] [review]
part 4 - update mozboot's android NDK URL for r11b and above

Review of attachment 8736329 [details] [diff] [review]:
-----------------------------------------------------------------

wfm.

::: python/mozboot/mozboot/android.py
@@ +258,5 @@
>          print(MOBILE_ANDROID_MOZCONFIG_TEMPLATE % (sdk_path, ndk_path))
>  
>  
> +def android_ndk_url(os_name, ver='r11b'):
> +    # Produce a URL like 'https://dl.google.com/android/repository/android-ndk-r11b-linux-x86_64.zip

nit: closing quote.

@@ +266,5 @@
>          arch = 'x86_64'
>      else:
>          arch = 'x86'
>  
> +    return '%s-%s-%s-%s.zip' % (base_url, ver, os_name, arch)

Oh hey, they went back to regular packaging.  \o/
Attachment #8736329 - Flags: review?(nalexander) → review+
Comment on attachment 8736337 [details] [diff] [review]
part 5 - add libc++ license to about:license when using it on Android

Review of attachment 8736337 [details] [diff] [review]:
-----------------------------------------------------------------

I'll defer to gerv for the licensing details.

::: toolkit/content/moz.build
@@ +16,5 @@
>      DEFINES['ANDROID_PACKAGE_NAME'] = CONFIG['ANDROID_PACKAGE_NAME']
>  
>  if CONFIG['MOZ_ANDROID_CXX_STL'] == 'mozstlport':
>      DEFINES['USE_STLPORT'] = True
> +if CONFIG['MOZ_ANDROID_CXX_STL'] == 'libc++':

This should be in the cut-over patch that snorp is reviewing, right?  In any case, r=me.
Attachment #8736337 - Flags: review?(nalexander) → review+
(In reply to Nathan Froyd [:froydnj] from comment #6)
> (In reply to Nick Alexander :nalexander from comment #4)
> > (In reply to Nathan Froyd [:froydnj] from comment #1)
> > > Created attachment 8736017 [details] [diff] [review]
> > > part 1 - use C I/O facilities in TestWebGLElementArrayCache instead of C++
> > > ones
> > > 
> > > The libc++ included with the Android NDK does not seem to work correctly
> > > with std::cerr; writing to it (or to std::cout, as confirmed by tests)
> > > causes the process to hang indefinitely, causing test failures.  Using
> > > fprintf and stderr, however, seems to work correctly.
> > 
> > froydnj: how can we possibly use this?  There's 180 writers in the tree:
> > https://dxr.mozilla.org/mozilla-central/search?tree=mozilla-
> > central&q=%22std%3A%3Acerr%22&redirect=true.  Are you hoping every path
> > through std::cerr is exercised in tests, preventing anybody from burning us?
> 
> Those instances fall into the following categories:
> 
> * gmp-plugin-openh264: If people want to log on Android, I'd be a little
> surprised.
> * gfx/angle/: I think this is Windows-only code, so it doesn't matter.
> * media/mtransport/test/: We actually only run a small number of these tests
> on Android:
> 
> jsep_track_unittest
> sdp_unittests
> mediaconduit_unittests (gets built, but doesn't actually run any tests, see
> test logs on TH or search for MOZ_WEBRTC_TESTS)
> signaling_unittests (same)
> 
> jsep_track_unittest, at least, appears to output to std::cerr just fine,
> according to the logs.  So I'm not entirely sure what's going on there.
> 
> * media/webrtc/trunk/: I don't think we care about any of these.
> * security/nss/: We don't run NSS tests.
> * testing/gtest/: Android doesn't run gtests, AFAICT (!).
> * toolkit/crashreporter/google-breakpad/: I think these are for tools that
> don't run on Android?

It just seems incredibly fragile.  That is, you see bad behaviour that you can't explain; you work around the one case in our tests; and then you argue that it's not actually an issue, with no guarantee that it won't become one in the future.

However, making such calls is life.  If you and snorp are comfortable with this, roll on!
(In reply to Nathan Froyd [:froydnj] from comment #1)
> Created attachment 8736017 [details] [diff] [review]
> part 1 - use C I/O facilities in TestWebGLElementArrayCache instead of C++
> ones
> 
> The libc++ included with the Android NDK does not seem to work correctly
> with std::cerr; writing to it (or to std::cout, as confirmed by tests)
> causes the process to hang indefinitely, causing test failures.  Using
> fprintf and stderr, however, seems to work correctly.

Weird. Do we know what it's trying to do? I'm not very comfortable with this, as there is no chance anyone would expect this type of failure. I don't want to randomly start getting hangs because someone logs with std::cerr. I didn't see this patch when I r+ed the other one :/
I'm building with libc++ now to debug the cerr/cout hang
Attachment #8736337 - Flags: review?(gerv) → review+
Comment on attachment 8736019 [details] [diff] [review]
part 3 - switch to libc++ when building Fennec

Review of attachment 8736019 [details] [diff] [review]:
-----------------------------------------------------------------

I looked into the std::cout/cerr hangs and can't make heads or tails of it. I guess I'm ok with working around it as you've done here for now.
Attachment #8736019 - Flags: review?(snorp) → review+
Recently added code fails when compiled with libc++.  This is a temporary
workaround, the same as the one we used in gfx/skia/skia/include/private/SkTLogic.h.
Attachment #8740724 - Flags: review?(kinetik)
I compiled things with try and everything looked good except rc1:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3ca0b3d62c83
https://treeherder.mozilla.org/logviewer.html#?job_id=19359370&repo=try

snorp, jchen, do either of you have special insight here?  The failure doesn't look libc++ related....
Flags: needinfo?(snorp)
Flags: needinfo?(nchen)
Attachment #8740724 - Flags: review?(kinetik) → review+
(In reply to Nathan Froyd [:froydnj] from comment #16)
> I compiled things with try and everything looked good except rc1:
> 
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=3ca0b3d62c83
> https://treeherder.mozilla.org/logviewer.html#?job_id=19359370&repo=try
> 
> snorp, jchen, do either of you have special insight here?  The failure
> doesn't look libc++ related....

Looks like some kind of memory corruption? If you can reproduce it locally, I think it'll be a lot easier to diagnose.
Flags: needinfo?(nchen)
Nathan and I both looked into this robocop crash. I couldn't seem to repro locally for some reason. Between this and the weird std::cerr issue (which looked like a bogus vptr to me), we think there might be a linker bug being tickled by libc++. Mike, do you have any ideas about things we could try to debug this further?
Flags: needinfo?(snorp) → needinfo?(mh+mozilla)
From the crash:
 08:40:56     INFO -   0  libmozglue.so!jemalloc_crash [jemalloc.c:3ca0b3d62c83 : 1625 + 0x2]
 08:40:56     INFO -       r0 = 0x00000000    r1 = 0x00000000    r2 = 0xe5e5e5e5    r3 = 0x00000000
 08:40:56     INFO -       r4 = 0x54f001f0    r5 = 0x62688000    r6 = 0x00000000    r7 = 0x00000030
 08:40:56     INFO -       r8 = 0x54f00044    r9 = 0x54f00040   r10 = 0x00000000   r12 = 0x528d6e84
 08:40:56     INFO -       fp = 0x000052b5    sp = 0x52b4aec0    lr = 0x528636cd    pc = 0x5285eb56
 08:40:56     INFO -      Found by: given as instruction pointer in context
 08:40:56     INFO -   1  libmozglue.so!arena_malloc [jemalloc.c:3ca0b3d62c83 : 4300 + 0x3]
 08:40:56     INFO -       r4 = 0x54f001f0    r5 = 0x62688000    r6 = 0x00000000    r7 = 0x00000030
 08:40:56     INFO -       r8 = 0x54f00044    r9 = 0x54f00040   r10 = 0x00000000    fp = 0x000052b5
 08:40:56     INFO -       sp = 0x52b4aec0    pc = 0x528636cd
 08:40:56     INFO -      Found by: call frame info
 08:40:56     INFO -   2  libmozglue.so!je_malloc [jemalloc.c:3ca0b3d62c83 : 4317 + 0x7]
 08:40:56     INFO -       r4 = 0x00000029    r5 = 0x00000028    r6 = 0x544770f0    r7 = 0x544770f0
 08:40:56     INFO -       r8 = 0x00000029    r9 = 0x642ec800   r10 = 0x00000000    fp = 0x000052b5
 08:40:56     INFO -       sp = 0x52b4aee0    pc = 0x528639c7
 08:40:56     INFO -      Found by: call frame info

jemalloc.c:4317 is:
    return (arena_malloc(choose_arena(), size, false));

jemalloc.c:4300 is:
    RELEASE_ASSERT(arena->magic == ARENA_MAGIC);

which suggests choose_arena() returns something broken.

And choose_arena, on Android, looks like (when removing irrelevant #ifdefed code):
    if (isthreaded && narenas > 1) {
         ...
    } else
         ret = arenas[0];

and narenas is always 1.

So something is overwriting jemalloc's arena struct, which doesn't look really good.

You could try something like adding a page-sized dummy field (minus 4 bytes) following arena_s's magic field, and mprotect the first page of the arena, so that whatever is doing the overwrite crashes. Which would then tell you what is doing the overwriting... assuming that still happens.
Flags: needinfo?(mh+mozilla)
I have caught this on my Android device, and GDB tells me arena->magic is ARENA_MAGIC, even though jemalloc insists that it isn't.  Which is always good.  Maybe some sort of race?
(In reply to Nathan Froyd [:froydnj] from comment #20)
> I have caught this on my Android device, and GDB tells me arena->magic is
> ARENA_MAGIC, even though jemalloc insists that it isn't.  Which is always
> good.  Maybe some sort of race?

Are you saying you caught the crash but gdb was telling you jemalloc shouldn't have crashed in the first place?
(In reply to Mike Hommey [:glandium] from comment #21)
> (In reply to Nathan Froyd [:froydnj] from comment #20)
> > I have caught this on my Android device, and GDB tells me arena->magic is
> > ARENA_MAGIC, even though jemalloc insists that it isn't.  Which is always
> > good.  Maybe some sort of race?
> 
> Are you saying you caught the crash but gdb was telling you jemalloc
> shouldn't have crashed in the first place?

That's correct.
(In reply to Mike Hommey [:glandium] from comment #19)
> You could try something like adding a page-sized dummy field (minus 4 bytes)
> following arena_s's magic field, and mprotect the first page of the arena,
> so that whatever is doing the overwrite crashes. Which would then tell you
> what is doing the overwriting... assuming that still happens.

I did this, and the crash moved to a different assert.  No segfault from the page protection.  Examining things in GDB again suggests the crash should not have happened.
Let's see if this works or not, https://treeherder.mozilla.org/#/jobs?repo=try&revision=e887f439393b

operator< for EHTable compares the LHS start PC with the RHS *end* PC.
Because the ranges are non-overlapping, this works fine for two distinct
EHTables. However, the comparison doesn't work if LHS and RHS refer to
the same EHTable; in that case operator< returns true, even though it
should return false because the two operands are identical.

The operator is used to sort a std::vector using std::sort [1].  I think
the libc++ std::sort implementation has a quirk where, if the comparison
function has the above bug, sort will sometimes get confused, and start
sorting "values" outside of the memory range that it's given.  This
results in memory corruption and subsequent unpredictable behavior.

The fix is simply to compare only the start PCs in EHTable, so that
std::sort can work on it correctly.

[1] http://mxr.mozilla.org/mozilla-central/source/tools/profiler/core/EHABIStackWalk.cpp?rev=86730d0a8209#485
(In reply to Jim Chen [:jchen] [:darchons] from comment #24)
> operator< for EHTable compares the LHS start PC with the RHS *end* PC.
> Because the ranges are non-overlapping, this works fine for two distinct
> EHTables. However, the comparison doesn't work if LHS and RHS refer to
> the same EHTable; in that case operator< returns true, even though it
> should return false because the two operands are identical.
> 
> The operator is used to sort a std::vector using std::sort [1].  I think
> the libc++ std::sort implementation has a quirk where, if the comparison
> function has the above bug, sort will sometimes get confused, and start
> sorting "values" outside of the memory range that it's given.  This
> results in memory corruption and subsequent unpredictable behavior.
> 
> The fix is simply to compare only the start PCs in EHTable, so that
> std::sort can work on it correctly.

I wonder whether we should just check if &lhs == &rhs and return false in that case before moving on to the start < end check.
Comment on attachment 8747124 [details] [diff] [review]
Correctly compare EHTable when sorting (v1)

(In reply to Nathan Froyd [:froydnj] from comment #25)
> I wonder whether we should just check if &lhs == &rhs and return false in
> that case before moving on to the start < end check.

Checking start < start seems simpler. The rc1 try run did pass testANRReporter (it failed later in testDistribution, but that looks unrelated to this specific problem).
Attachment #8747124 - Flags: review?(nfroyd)
Comment on attachment 8747124 [details] [diff] [review]
Correctly compare EHTable when sorting (v1)

Review of attachment 8747124 [details] [diff] [review]:
-----------------------------------------------------------------

Excellent.
Attachment #8747124 - Flags: review?(nfroyd) → review+
Blocks: 1269251
Perfherder graphs registered a notable change in libxul size from these commits:

https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=3dd2dae0c4020efec0263c28e2aacbc0a669cd0e&newProject=mozilla-inbound&newRevision=4e848a9f85c4751449f4f072fbb3b2801a8b6d3d&originalSignature=c5af03f2e8b46331d479333c6b66802580f1be8a&newSignature=c5af03f2e8b46331d479333c6b66802580f1be8a&framework=2

an increase of about 120K in libxul.so size.  This was expected; what was not expected was a much larger regression in the APK size:

https://treeherder.mozilla.org/perf.html#/alerts?id=1090

I investigated, and this turns out to be the fault of our non-release builds adding --enable-profiling as a configure option.  This configure option turns on -funwind-tables for ARM:

http://dxr.mozilla.org/mozilla-central/source/build/autoconf/frameptr.m4#12

which in turn causes a bunch of extra symbols to be referenced, presumably from the unwind information added by the option.  These extra symbols include the C++ demangler, for pretty error messages from exception handling (which we don't use, but such is life).  The important thing is that these extra symbols are referenced by a bunch of different shared objects that we create: libmozglue.so and the plugin-container{,-pie}.so files, each of which grows by 200-300K due to these extra symbols.

We are not passing -Wl,--gc-sections to remove unused symbols for the plugin-container binaries, but given that we *are* passing -Wl,--gc-sections for libmozglue and still getting the extra symbols, I think it's safe to say that trying to remove them from plugin-container wouldn't have any effect.

The good news is that compiling without --enable-profiling makes these numbers go down significantly; plugin-container no longer includes unnecessary symbols, and mozglue gets somewhat smaller as well.  I verified this locally (indeed, it took me a little while to figure out that --enable-profiling was the culprit!) and pushed a job to try:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b546b1f063df&selectedJob=20416061

without --enable-profiling on top of a mozilla-central revision that contains the libc++ patches.  The sizes recorded are:

Size of fennec-49.0a1.en-US.android-arm.apk: 37626921 bytes
Size of libxul.so: 24752751 bytes
Size of omni.ja: 6032727 bytes

Compare the numbers from a central revision without the libc++ patches:

Size of fennec-49.0a1.en-US.android-arm.apk: 37427851 bytes
Size of libxul.so: 24658568 bytes
Size of omni.ja: 6033161 bytes

There are a lot of other revisions in there, so the numbers aren't strictly comparable, but at least they are much closer to the same ballpark than we were before.  I think we might be able to win back the mozglue size regressions if we figure out some way of not linking in the libc++ demangler implementation.

mfinkle is the person that I have seen posting the most about APK sizes, so given the above analysis, are you OK with calling this seemingly large change benign?
Flags: needinfo?(mark.finkle)
Seems reasonable. Given the desire to move to a modern C++ lib, these increases are not out of line.

Thankfully we have other work happening to further reduce the APK size.
Flags: needinfo?(mark.finkle)
Product: Firefox for Android → Firefox Build System
Target Milestone: Firefox 49 → mozilla49