Closed
Bug 1260208
Opened 8 years ago
Closed 8 years ago
switch Fennec builds to use libc++ by default
Categories
(Firefox Build System :: Android Studio and Gradle Integration, defect)
Firefox Build System
Android Studio and Gradle Integration
All
Android
Tracking
(firefox49 fixed)
RESOLVED
FIXED
mozilla49
Tracking | Status | |
---|---|---|
firefox49 | --- | fixed |
People
(Reporter: froydnj, Assigned: froydnj)
References
Details
Attachments
(7 files)
1.89 KB,
patch
|
BenWa
:
review+
|
Details | Diff | Splinter Review |
1017 bytes,
patch
|
snorp
:
review+
|
Details | Diff | Splinter Review |
729 bytes,
patch
|
snorp
:
review+
|
Details | Diff | Splinter Review |
9.34 KB,
patch
|
nalexander
:
review+
|
Details | Diff | Splinter Review |
4.28 KB,
patch
|
gerv
:
review+
nalexander
:
review+
|
Details | Diff | Splinter Review |
1.43 KB,
patch
|
kinetik
:
review+
|
Details | Diff | Splinter Review |
1.65 KB,
patch
|
froydnj
:
review+
|
Details | Diff | Splinter Review |
Newer versions of the NDK have a (presumably) fixed libc++. We can ditch stlport and move that much closer to having a modern C++ library everywhere.
Assignee | ||
Comment 1•8 years ago
|
||
The libc++ included with the Android NDK does not seem to work correctly with std::cerr; writing to it (or to std::cout, as confirmed by tests) causes the process to hang indefinitely, causing test failures. Using fprintf and stderr, however, seems to work correctly.
Attachment #8736017 -
Flags: review?(bgirard)
Assignee | ||
Comment 2•8 years ago
|
||
Because I failed to check it for the first NDK I built.
Attachment #8736018 -
Flags: review?(snorp)
Assignee | ||
Comment 3•8 years ago
|
||
This needs some about:licenses changes not included here; I'll try uploading those tomorrow, since there are some conflicts with the libc++/OS X work I'm doing elsewhere.
Attachment #8736019 -
Flags: review?(nalexander)
Updated•8 years ago
|
Attachment #8736017 -
Flags: review?(bgirard) → review+
Attachment #8736018 -
Flags: review?(snorp) → review+
Comment 4•8 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #1) > Created attachment 8736017 [details] [diff] [review] > part 1 - use C I/O facilities in TestWebGLElementArrayCache instead of C++ > ones > > The libc++ included with the Android NDK does not seem to work correctly > with std::cerr; writing to it (or to std::cout, as confirmed by tests) > causes the process to hang indefinitely, causing test failures. Using > fprintf and stderr, however, seems to work correctly. froydnj: how can we possibly use this? There's 180 writers in the tree: https://dxr.mozilla.org/mozilla-central/search?tree=mozilla-central&q=%22std%3A%3Acerr%22&redirect=true. Are you hoping every path through std::cerr is exercised in tests, preventing anybody from burning us? As for flipping the default, I'd like to see bootstrap updated when we do that: https://dxr.mozilla.org/mozilla-central/source/python/mozboot/mozboot/android.py#261. But I won't block on that, if you really don't want to test it -- it can be a follow-up ticket.
Flags: needinfo?(nfroyd)
Comment 5•8 years ago
|
||
Comment on attachment 8736019 [details] [diff] [review] part 3 - switch to libc++ when building Fennec Review of attachment 8736019 [details] [diff] [review]: ----------------------------------------------------------------- Redirecting to snorp. I appreciate being looped in but I just don't know enough about the C++ and linking story on Android to review.
Attachment #8736019 -
Flags: review?(nalexander) → review?(snorp)
Assignee | ||
Comment 6•8 years ago
|
||
(In reply to Nick Alexander :nalexander from comment #4) > (In reply to Nathan Froyd [:froydnj] from comment #1) > > Created attachment 8736017 [details] [diff] [review] > > part 1 - use C I/O facilities in TestWebGLElementArrayCache instead of C++ > > ones > > > > The libc++ included with the Android NDK does not seem to work correctly > > with std::cerr; writing to it (or to std::cout, as confirmed by tests) > > causes the process to hang indefinitely, causing test failures. Using > > fprintf and stderr, however, seems to work correctly. > > froydnj: how can we possibly use this? There's 180 writers in the tree: > https://dxr.mozilla.org/mozilla-central/search?tree=mozilla- > central&q=%22std%3A%3Acerr%22&redirect=true. Are you hoping every path > through std::cerr is exercised in tests, preventing anybody from burning us? Those instances fall into the following categories: * gmp-plugin-openh264: If people want to log on Android, I'd be a little surprised. * gfx/angle/: I think this is Windows-only code, so it doesn't matter. * media/mtransport/test/: We actually only run a small number of these tests on Android: jsep_track_unittest sdp_unittests mediaconduit_unittests (gets built, but doesn't actually run any tests, see test logs on TH or search for MOZ_WEBRTC_TESTS) signaling_unittests (same) jsep_track_unittest, at least, appears to output to std::cerr just fine, according to the logs. So I'm not entirely sure what's going on there. * media/webrtc/trunk/: I don't think we care about any of these. * security/nss/: We don't run NSS tests. * testing/gtest/: Android doesn't run gtests, AFAICT (!). * toolkit/crashreporter/google-breakpad/: I think these are for tools that don't run on Android? > As for flipping the default, I'd like to see bootstrap updated when we do > that: > https://dxr.mozilla.org/mozilla-central/source/python/mozboot/mozboot/ > android.py#261. But I won't block on that, if you really don't want to test > it -- it can be a follow-up ticket. This actually looks pretty easy, as direct downloading from the URL works, despite the popup license you get when clicking on the link in the NDK download webpage. I'll try it out.
Flags: needinfo?(nfroyd)
Assignee | ||
Comment 7•8 years ago
|
||
Direct downloading worked out OK, and a test run on my Debian system got it installed to the correct location.
Attachment #8736329 -
Flags: review?(nalexander)
Assignee | ||
Comment 8•8 years ago
|
||
We are now statically linking libc++ from the Android NDK into our binaries, so we need to note its license in about:license.
Attachment #8736337 -
Flags: review?(nalexander)
Attachment #8736337 -
Flags: review?(gerv)
Comment 9•8 years ago
|
||
Comment on attachment 8736329 [details] [diff] [review] part 4 - update mozboot's android NDK URL for r11b and above Review of attachment 8736329 [details] [diff] [review]: ----------------------------------------------------------------- wfm. ::: python/mozboot/mozboot/android.py @@ +258,5 @@ > print(MOBILE_ANDROID_MOZCONFIG_TEMPLATE % (sdk_path, ndk_path)) > > > +def android_ndk_url(os_name, ver='r11b'): > + # Produce a URL like 'https://dl.google.com/android/repository/android-ndk-r11b-linux-x86_64.zip nit: closing quote. @@ +266,5 @@ > arch = 'x86_64' > else: > arch = 'x86' > > + return '%s-%s-%s-%s.zip' % (base_url, ver, os_name, arch) Oh hey, they went back to regular packaging. \o/
Attachment #8736329 -
Flags: review?(nalexander) → review+
Comment 10•8 years ago
|
||
Comment on attachment 8736337 [details] [diff] [review] part 5 - add libc++ license to about:license when using it on Android Review of attachment 8736337 [details] [diff] [review]: ----------------------------------------------------------------- I'll defer to gerv for the licensing details. ::: toolkit/content/moz.build @@ +16,5 @@ > DEFINES['ANDROID_PACKAGE_NAME'] = CONFIG['ANDROID_PACKAGE_NAME'] > > if CONFIG['MOZ_ANDROID_CXX_STL'] == 'mozstlport': > DEFINES['USE_STLPORT'] = True > +if CONFIG['MOZ_ANDROID_CXX_STL'] == 'libc++': This should be in the cut-over patch that snorp is reviewing, right? In any case, r=me.
Attachment #8736337 -
Flags: review?(nalexander) → review+
Comment 11•8 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #6) > (In reply to Nick Alexander :nalexander from comment #4) > > (In reply to Nathan Froyd [:froydnj] from comment #1) > > > Created attachment 8736017 [details] [diff] [review] > > > part 1 - use C I/O facilities in TestWebGLElementArrayCache instead of C++ > > > ones > > > > > > The libc++ included with the Android NDK does not seem to work correctly > > > with std::cerr; writing to it (or to std::cout, as confirmed by tests) > > > causes the process to hang indefinitely, causing test failures. Using > > > fprintf and stderr, however, seems to work correctly. > > > > froydnj: how can we possibly use this? There's 180 writers in the tree: > > https://dxr.mozilla.org/mozilla-central/search?tree=mozilla- > > central&q=%22std%3A%3Acerr%22&redirect=true. Are you hoping every path > > through std::cerr is exercised in tests, preventing anybody from burning us? > > Those instances fall into the following categories: > > * gmp-plugin-openh264: If people want to log on Android, I'd be a little > surprised. > * gfx/angle/: I think this is Windows-only code, so it doesn't matter. > * media/mtransport/test/: We actually only run a small number of these tests > on Android: > > jsep_track_unittest > sdp_unittests > mediaconduit_unittests (gets built, but doesn't actually run any tests, see > test logs on TH or search for MOZ_WEBRTC_TESTS) > signaling_unittests (same) > > jsep_track_unittest, at least, appears to output to std::cerr just fine, > according to the logs. So I'm not entirely sure what's going on there. > > * media/webrtc/trunk/: I don't think we care about any of these. > * security/nss/: We don't run NSS tests. > * testing/gtest/: Android doesn't run gtests, AFAICT (!). > * toolkit/crashreporter/google-breakpad/: I think these are for tools that > don't run on Android? It just seems to be incredibly fragile. That is, you see bad behaviour that you can't explain; work around the one case in our tests; and then have an argument that it's not actually an issue, with no expectation that it won't be an issue in the future. However, making such calls is life. If you and snorp are comfortable with this, roll on!
(In reply to Nathan Froyd [:froydnj] from comment #1) > Created attachment 8736017 [details] [diff] [review] > part 1 - use C I/O facilities in TestWebGLElementArrayCache instead of C++ > ones > > The libc++ included with the Android NDK does not seem to work correctly > with std::cerr; writing to it (or to std::cout, as confirmed by tests) > causes the process to hang indefinitely, causing test failures. Using > fprintf and stderr, however, seems to work correctly. Weird. Do we know what it's trying to do? I'm not very comfortable with this, as there is no chance anyone would expect this type of failure. I don't want to randomly start getting hangs because someone logs with std::cerr. I didn't see this patch when I r+ed the other one :/
I'm building with libc++ now to debug the cerr/cout hang
Updated•8 years ago
|
Attachment #8736337 -
Flags: review?(gerv) → review+
Comment on attachment 8736019 [details] [diff] [review] part 3 - switch to libc++ when building Fennec Review of attachment 8736019 [details] [diff] [review]: ----------------------------------------------------------------- I looked into the std::cout/cerr hangs and can't make heads or tails of it. I guess I'm ok with working around it as you've done here for now.
Attachment #8736019 -
Flags: review?(snorp) → review+
Assignee | ||
Comment 15•8 years ago
|
||
Recently added code fails when compiled with libc++. This is a temporary workaround, and the same as we used in gfx/skia/skia/include/private/SkTLogic.h.
Attachment #8740724 -
Flags: review?(kinetik)
Assignee | ||
Comment 16•8 years ago
|
||
I compiled things with try and everything looked good except rc1: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3ca0b3d62c83 https://treeherder.mozilla.org/logviewer.html#?job_id=19359370&repo=try snorp, jchen, do either of you have special insight here? The failure doesn't look libc++ related....
Flags: needinfo?(snorp)
Flags: needinfo?(nchen)
Updated•8 years ago
|
Attachment #8740724 -
Flags: review?(kinetik) → review+
Comment 17•8 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #16) > I compiled things with try and everything looked good except rc1: > > https://treeherder.mozilla.org/#/jobs?repo=try&revision=3ca0b3d62c83 > https://treeherder.mozilla.org/logviewer.html#?job_id=19359370&repo=try > > snorp, jchen, do either of you have special insight here? The failure > doesn't look libc++ related.... Looks like some kind of memory corruption? If you can reproduce it locally, I think it'll be a lot easier to diagnose.
Flags: needinfo?(nchen)
Nathan and I both looked into this robocop crash. I couldn't seem to repro locally for some reason. Between this and the weird std::cerr issue (which looked like a bogus vptr to me), we think there might be a linker bug being tickled by libc++. Mike do you have any ideas here about some things we could try to debug some more?
Flags: needinfo?(snorp) → needinfo?(mh+mozilla)
Comment 19•8 years ago
|
||
From the crash: 08:40:56 INFO - 0 libmozglue.so!jemalloc_crash [jemalloc.c:3ca0b3d62c83 : 1625 + 0x2] 08:40:56 INFO - r0 = 0x00000000 r1 = 0x00000000 r2 = 0xe5e5e5e5 r3 = 0x00000000 08:40:56 INFO - r4 = 0x54f001f0 r5 = 0x62688000 r6 = 0x00000000 r7 = 0x00000030 08:40:56 INFO - r8 = 0x54f00044 r9 = 0x54f00040 r10 = 0x00000000 r12 = 0x528d6e84 08:40:56 INFO - fp = 0x000052b5 sp = 0x52b4aec0 lr = 0x528636cd pc = 0x5285eb56 08:40:56 INFO - Found by: given as instruction pointer in context 08:40:56 INFO - 1 libmozglue.so!arena_malloc [jemalloc.c:3ca0b3d62c83 : 4300 + 0x3] 08:40:56 INFO - r4 = 0x54f001f0 r5 = 0x62688000 r6 = 0x00000000 r7 = 0x00000030 08:40:56 INFO - r8 = 0x54f00044 r9 = 0x54f00040 r10 = 0x00000000 fp = 0x000052b5 08:40:56 INFO - sp = 0x52b4aec0 pc = 0x528636cd 08:40:56 INFO - Found by: call frame info 08:40:56 INFO - 2 libmozglue.so!je_malloc [jemalloc.c:3ca0b3d62c83 : 4317 + 0x7] 08:40:56 INFO - r4 = 0x00000029 r5 = 0x00000028 r6 = 0x544770f0 r7 = 0x544770f0 08:40:56 INFO - r8 = 0x00000029 r9 = 0x642ec800 r10 = 0x00000000 fp = 0x000052b5 08:40:56 INFO - sp = 0x52b4aee0 pc = 0x528639c7 08:40:56 INFO - Found by: call frame info jemalloc.c:4317 is: return (arena_malloc(choose_arena(), size, false)); jemalloc.c:4300 is: RELEASE_ASSERT(arena->magic == ARENA_MAGIC); which suggests choose_arena() returns something broken. And choose_arena, on Android, looks like (when removing irrelevant #ifdefed code): if (isthreaded && narenas > 1) { ... } else ret = arenas[0]; and narenas is always 1. So something is overwriting jemalloc's arena struct, which doesn't look really good. You could try something like adding a page-sized dummy field (minus 4 bytes) following arena_s's magic field, and mprotect the first page of the arena, so that whatever is doing the overwrite crashes. Which would then tell you what is doing the overwriting... assuming that still happens.
Flags: needinfo?(mh+mozilla)
Assignee | ||
Comment 20•8 years ago
|
||
I have caught this on my Android device, and GDB tells me arena->magic is ARENA_MAGIC, even though jemalloc insists that it isn't. Which is always good. Maybe some sort of race?
Comment 21•8 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #20) > I have caught this on my Android device, and GDB tells me arena->magic is > ARENA_MAGIC, even though jemalloc insists that it isn't. Which is always > good. Maybe some sort of race? Are you saying you caught the crash but gdb was telling you jemalloc shouldn't have crashed in the first place?
Assignee | ||
Comment 22•8 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #21) > (In reply to Nathan Froyd [:froydnj] from comment #20) > > I have caught this on my Android device, and GDB tells me arena->magic is > > ARENA_MAGIC, even though jemalloc insists that it isn't. Which is always > > good. Maybe some sort of race? > > Are you saying you caught the crash but gdb was telling you jemalloc > shouldn't have crashed in the first place? That's correct.
Assignee | ||
Comment 23•8 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #19) > You could try something like adding a page-sized dummy field (minus 4 bytes) > following arena_s's magic field, and mprotect the first page of the arena, > so that whatever is doing the overwrite crashes. Which would then tell you > what is doing the overwriting... assuming that still happens. I did this, and the crash moved to a different assert. No segfault from the page protection. Examining things in GDB again suggests the crash should not have happened.
Comment 24•8 years ago
|
||
Let's see if this works or not, https://treeherder.mozilla.org/#/jobs?repo=try&revision=e887f439393b operator< for EHTable compares the LHS start PC with the RHS *end* PC. Because the ranges are non-overlapping, this works fine for two distinct EHTables. However, the comparison doesn't work if LHS and RHS refer to the same EHTable; in that case operator< returns true, even though it should return false because the two operands are identical. The operator is used to sort a std::vector using std::sort [1]. I think the libc++ std::sort implementation has a quirk where, if the comparison function has the above bug, sort will sometimes get confused, and start sorting "values" outside of the memory range that it's given. This results in memory corruption and subsequent unpredictable behavior. The fix is simply to compare only the start PCs in EHTable, so that std::sort can work on it correctly. [1] http://mxr.mozilla.org/mozilla-central/source/tools/profiler/core/EHABIStackWalk.cpp?rev=86730d0a8209#485
Assignee | ||
Comment 25•8 years ago
|
||
(In reply to Jim Chen [:jchen] [:darchons] from comment #24) > operator< for EHTable compares the LHS start PC with the RHS *end* PC. > Because the ranges are non-overlapping, this works fine for two distinct > EHTables. However, the comparison doesn't work if LHS and RHS refer to > the same EHTable; in that case operator< returns true, even though it > should return false because the two operands are identical. > > The operator is used to sort a std::vector using std::sort [1]. I think > the libc++ std::sort implementation has a quirk where, if the comparison > function has the above bug, sort will sometimes get confused, and start > sorting "values" outside of the memory range that it's given. This > results in memory corruption and subsequent unpredictable behavior. > > The fix is simply to compare only the start PCs in EHTable, so that > std::sort can work on it correctly. I wonder whether we should just check if &lhs == &rhs and return false in that case before moving on to the start < end check.
Comment 26•8 years ago
|
||
Comment on attachment 8747124 [details] [diff] [review] Correctly compare EHTable when sorting (v1) (In reply to Nathan Froyd [:froydnj] from comment #25) > I wonder whether we should just check if &lhs == &rhs and return false in > that case before moving on to the start < end check. Checking start < start seems simpler. The rc1 try run did pass testANRReporter (it failed later in testDistribution, but that looks unrelated to this specific problem).
Attachment #8747124 -
Flags: review?(nfroyd)
Assignee | ||
Comment 27•8 years ago
|
||
Comment on attachment 8747124 [details] [diff] [review] Correctly compare EHTable when sorting (v1) Review of attachment 8747124 [details] [diff] [review]: ----------------------------------------------------------------- Excellent.
Attachment #8747124 -
Flags: review?(nfroyd) → review+
Comment 28•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/8986d6fa6a5c https://hg.mozilla.org/integration/mozilla-inbound/rev/185f0a2bfb14 https://hg.mozilla.org/integration/mozilla-inbound/rev/c3aad9e8d78d https://hg.mozilla.org/integration/mozilla-inbound/rev/b2dab7186ca8 https://hg.mozilla.org/integration/mozilla-inbound/rev/d72fd0796e7e https://hg.mozilla.org/integration/mozilla-inbound/rev/4e848a9f85c4
Comment 29•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/8986d6fa6a5c https://hg.mozilla.org/mozilla-central/rev/185f0a2bfb14 https://hg.mozilla.org/mozilla-central/rev/c3aad9e8d78d https://hg.mozilla.org/mozilla-central/rev/b2dab7186ca8 https://hg.mozilla.org/mozilla-central/rev/d72fd0796e7e https://hg.mozilla.org/mozilla-central/rev/4e848a9f85c4
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox49:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → Firefox 49
Assignee | ||
Comment 30•8 years ago
|
||
Perfherder graphs registered a notable change in libxul size from these commits: https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=3dd2dae0c4020efec0263c28e2aacbc0a669cd0e&newProject=mozilla-inbound&newRevision=4e848a9f85c4751449f4f072fbb3b2801a8b6d3d&originalSignature=c5af03f2e8b46331d479333c6b66802580f1be8a&newSignature=c5af03f2e8b46331d479333c6b66802580f1be8a&framework=2 or about ~120K in libxul.so size. This was expected; what was not expected was a much larger regression in the APK size: https://treeherder.mozilla.org/perf.html#/alerts?id=1090 I investigated, and this turns out to be the fault of our non-release builds adding --enable-profiling as a configure option. This configure option turns on -funwind-tables for ARM: http://dxr.mozilla.org/mozilla-central/source/build/autoconf/frameptr.m4#12 which in turn causes a bunch of extra symbols to be referenced, presumably from the unwind information added by the option. These extra symbols include the C++ demangler, for pretty error messages from exception handling (which we don't use, but such is life). The important thing is that these extra symbols are referenced by a bunch of different shared objects that we create: libmozglue.so and the plugin-container{,-pie}.so files, each of which grows by 200-300K due to these extra symbols. We are not passing -Wl,--gc-sections to remove unused symbols for the plugin-container binaries, but given that we *are* passing -Wl,--gc-sections for libmozglue and still getting the extra symbols, I think it's safe to say that trying to remove them from plugin-container wouldn't have any effect. The good news is that compiling without --enable-profiling makes these numbers go down significantly; plugin-container no longer includes unecessary symbols, and mozglue gets somewhat smaller as well. I verified this locally (indeed, it took me a little while to figure out that --enable-profiling was the culprit!) and pushed a job to try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b546b1f063df&selectedJob=20416061 without --enable-profiling on top of a mozilla-central revision that contains the libc++ patches. The sizes recorded are: Size of fennec-49.0a1.en-US.android-arm.apk: 37626921 bytes Size of libxul.so: 24752751 bytes Size of omni.ja: 6032727 bytes Compare the numbers from a central revision without the libc++ patches: Size of fennec-49.0a1.en-US.android-arm.apk: 37427851 bytes Size of libxul.so: 24658568 bytes Size of omni.ja: 6033161 bytes There are a lot of other revisions in there, so the numbers aren't strictly comparable, but at least they are much closer to the same ballpark than we were before. I think we might be able to win back the mozglue size regressions if we figure out some way of not linking in the libc++ demangler implementation. mfinkle is the person that I have seen posting the most about APK sizes, so given the above analysis, are you OK with calling this seemingly large change benign?
Flags: needinfo?(mark.finkle)
Comment 31•8 years ago
|
||
Seems reasonable. Given the desire to move to a modern C++ lib, this increases are not out of line. Thankfully have other work happening to further reduce the APK size.
Flags: needinfo?(mark.finkle)
Updated•5 years ago
|
Product: Firefox for Android → Firefox Build System
Target Milestone: Firefox 49 → mozilla49
You need to log in
before you can comment on or make changes to this bug.
Description
•