Closed Bug 1165460 Opened 9 years ago Closed 9 years ago

Upgrade to NDK r10e on builders to avoid x86 linkage issues

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(firefox41 fixed)

RESOLVED FIXED
Tracking Status
firefox41 --- fixed

People

(Reporter: snorp, Assigned: snorp)

References

Details

Attachments

(3 files, 4 obsolete files)

I have encountered a bug where code built using the NDK I've been using locally (r10d) works fine, but fails when using NDK r8e which we have deployed right now (bug 1141693). Maybe it's time to upgrade? One other benefit is that we could use GCC 4.9, which should have better support for modern C++ features.
Ok, it looks like we need this for bug 1141693 on x86.
Blocks: 1141693
Summary: Consider using newer Android NDK (r10d) → Upgrade to NDK r10d on builders to avoid x86 linkage issues
Kim, any idea how quickly we could do this?
Flags: needinfo?(kmoir)
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #2)
> Kim, any idea how quickly we could do this?

snorp, you can do this yourself, since it's in tooltool: https://dxr.mozilla.org/mozilla-central/source/mobile/android/config/tooltool-manifests/android/releng.manifest#4  You will need tooltool privileges; you may need to file to get access bits, like in Bug 1090606.

You can also co-ordinate with mcomella, who's doing similar things to rev the SDK side in Bug 1165422.

But in general this is really easy:

1) fetch old NDK, using tooltool interface;
2) observe structure;
3) update as appropriate;
4) upload to tooltool;
5) update manifest locally to use new SHA512;
6) push to try.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #0)
> I have encountered a bug where code built using the NDK I've been using
> locally (r10d) works fine, but fails when using NDK r8e which we have
> deployed right now (bug 1141693). Maybe it's time to upgrade? One other
> benefit is that we could use GCC 4.9, which should have better support for
> modern C++ features.

glandium knows the most about NDK-version-specific bugs.  He once gave me a list of such tickets but now I can't find it :(
I apologize for the delay in responding, I was at TRIBE last week and then yesterday was a holiday in Canada. It looks like nalexander has provided you with the info you need.
Flags: needinfo?(kmoir)
Assignee: nobody → snorp
r10e is the latest and seems to work here, so we'll use that.
Summary: Upgrade to NDK r10d on builders to avoid x86 linkage issues → Upgrade to NDK r10e on builders to avoid x86 linkage issues
Try run here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ad46c7891fae

The failures are from a MOZ_CRASH() here: https://dxr.mozilla.org/mozilla-central/source/js/src/jsnativestack.cpp?from=jsnativestack.cpp&case=true#154

Somehow, fopen("/proc/self/maps", "r") fails with EFAULT with this NDK/compiler. I have no idea why or how that can happen, but I'm investigating. It only seems to happen in child processes, so in practice I don't think it really affects us too much.
With the addition of a small patch to clean up some warnings-as-errors, these appear to pass try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b77126fd1e12
Attachment #8612317 - Flags: review?(nchen) → review?(coop)
Attachment #8612318 - Flags: review?(nchen) → review?(mh+mozilla)
snorp: there's a little work to update mozboot (|mach bootstrap|) here.  NI to me to look into it.  This might need a follow-up ticket filed.
Flags: needinfo?(nalexander)
Comment on attachment 8612317 [details] [diff] [review]
Build Fennec using NDK r10e

Review of attachment 8612317 [details] [diff] [review]:
-----------------------------------------------------------------

If you could trim the trailing whitespace before landing, that would be swell.
Attachment #8612317 - Flags: review?(coop) → review+
Comment on attachment 8612319 [details] [diff] [review]
Try to work around weirdness reading /proc/self/maps under NDK r10e/GCC 4.9

Review of attachment 8612319 [details] [diff] [review]:
-----------------------------------------------------------------

I noticed that there was one line in /proc/B2G_PID/maps that was longer than 100 characters, so we should probably increase the size of the buffer.

::: js/src/jsnativestack.cpp
@@ +98,5 @@
> +    *stackSize = 0;
> +
> +    FILE* fs = fopen("/proc/self/maps", "r");
> +    if (fs) {
> +        char line[100];

Some of the lines seem to be longer that 100 characters. I wonder if that's related to the problem?

@@ +112,5 @@
> +                break;
> +            }
> +        }
> +        fclose(fs);
> +        return true;

If, for some bizarre reason, the stack address isn't found in /proc/self/maps (which seems about as likely to happen as the open of /proc/self/maps failing), then this function will return true. If that's intended, then this looks fine to me.

@@ +175,1 @@
>          }

So with this new code, does ReadStackSize return after some number of tries without an EFAULT?

I'm going to assume that the answer to that is yes.

In which case, I think that its actually a bug in the kernel. I'm guessing that the the times it fails is when the map is changing while we're reading the map file, and that this is in turn causing the linked list of vm areas to become invalid, which in turn causes the EFAULT.
Attachment #8612319 - Flags: review?(dhylands) → review+
(In reply to Chris Cooper [:coop] from comment #13)
> If you could trim the trailing whitespace before landing, that would be
> swell.

If tooltool could not add them, it would be even better ;)
Comment on attachment 8612317 [details] [diff] [review]
Build Fennec using NDK r10e

Review of attachment 8612317 [details] [diff] [review]:
-----------------------------------------------------------------

::: mobile/android/config/tooltool-manifests/android-x86/releng.manifest
@@ +1,3 @@
>  [
>  {
> +"size": 357533004, 

Our NDK was trimmed to avoid such sizes. Could you do the same? (iirc, tbsaunde did the trimming, maybe he used a script)
Attachment #8612317 - Flags: feedback-
Comment on attachment 8612317 [details] [diff] [review]
Build Fennec using NDK r10e

Review of attachment 8612317 [details] [diff] [review]:
-----------------------------------------------------------------

::: mobile/android/config/tooltool-manifests/android-x86/releng.manifest
@@ +1,3 @@
>  [
>  {
> +"size": 357533004, 

Also note how the x86 and arm ndks were different, because they weren't including the parts for each other.
Attachment #8612318 - Flags: review?(mh+mozilla) → review+
(In reply to Dave Hylands [:dhylands] from comment #14)
> Comment on attachment 8612319 [details] [diff] [review]
> Try to work around weirdness reading /proc/self/maps under NDK r10e/GCC 4.9
> 
> Review of attachment 8612319 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I noticed that there was one line in /proc/B2G_PID/maps that was longer than
> 100 characters, so we should probably increase the size of the buffer.

Ok, I'll do that.

> 
> ::: js/src/jsnativestack.cpp
> @@ +98,5 @@
> > +    *stackSize = 0;
> > +
> > +    FILE* fs = fopen("/proc/self/maps", "r");
> > +    if (fs) {
> > +        char line[100];
> 
> Some of the lines seem to be longer that 100 characters. I wonder if that's
> related to the problem?

Nope, from what I've seen, we can't even open the file.

> 
> @@ +112,5 @@
> > +                break;
> > +            }
> > +        }
> > +        fclose(fs);
> > +        return true;
> 
> If, for some bizarre reason, the stack address isn't found in
> /proc/self/maps (which seems about as likely to happen as the open of
> /proc/self/maps failing), then this function will return true. If that's
> intended, then this looks fine to me.

Yeah that's no bueno, I'll fix.

> 
> @@ +175,1 @@
> >          }
> 
> So with this new code, does ReadStackSize return after some number of tries
> without an EFAULT?
> 
> I'm going to assume that the answer to that is yes.

In local testing (on a pandaboard, even) it succeeds on either the first or second try. On automation, though, it sometimes seems to exceed all 10 tries, which is why I added the pthread_attr_getstack() fallback.

> 
> In which case, I think that its actually a bug in the kernel. I'm guessing
> that the the times it fails is when the map is changing while we're reading
> the map file, and that this is in turn causing the linked list of vm areas
> to become invalid, which in turn causes the EFAULT.

I agree that it most likely seems like a bug in the kernel. But why would a new NDK or GCC 4.9 trigger that I wonder?

Jim Chen had the idea that perhaps the "/proc/self/maps" string had not been loaded by the dynamic linker yet. I thought the string would be copied into kernel space, triggering an appropriate fault, but maybe I'm wrong? I can't seem to reproduce this locally with latest m-c anymore, but it still happens in automation. Very irritating.
Attachment #8612319 - Attachment is obsolete: true
Attachment #8613042 - Flags: review?(mh+mozilla)
Attachment #8613042 - Flags: review?(dhylands)
Attachment #8613042 - Flags: review?(mh+mozilla)
Comment on attachment 8613042 [details] [diff] [review]
Try to work around weirdness reading /proc/self/maps under NDK r10e/GCC 4.9

Review of attachment 8613042 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsnativestack.cpp
@@ +159,5 @@
>          // thread (see bug 846670). So we scan /proc/self/maps to find the
>          // segment which contains the stack.
> +        //
> +        // There appears to be some kind of race here when building with NDK r10e, where
> +        // fopen (and open) fail with EFAULT. Retry until it works or we give up.

Based on:
- what EFAULT means for open,
- how completely normal the machine code and data for this function looks

Here is my theory:
- when this code is called, the address at which "/proc/self/maps" has not been decompressed by our linker.
- if open wasn't involved, what would happen is that reading that string would trigger a segfault, our linker would get that segfault, decompress the necessary data, return, and execution would be restored to reading that string.
- Now, with open()/fopen(), what happens is that the pointer to the string is sent directly to the kernel. In that case, the kernel tries to do the reading, and that doesn't trigger a segfault. Instead, the kernel returns a EFAULT error. Try running this locally:
   #include <fcntl.h>
   #include <stdio.h>
   #include <errno.h>
   int main() {
     int fd = open((char *) 0x12345, O_RDONLY);
     printf("%d %d\n", fd, errno);
     return 0;
   }
Obviously 0x12345 is not going to be a valid address. If the program itself was reading there, it would segfault. But open just returns -1 with errno=14 (EFAULT). That has the same meaning, but doesn't crash.
- Why this worked before? simply because the data was laid out such that the page where the string literal "/proc/self/maps" is was already uncompressed before this code was reached.

I'm actually surprised we've never hit this before. Or maybe we did and didn't figure out the root cause. Or maybe we're not using open (or other system calls) with string literals (or other literal data the kernel would read direction from userspace) that often.

If this theory is right, and it sure seems sound to me, you can work around this issue by reading the string (completely, in case it spans over 2 pages) before calling fopen.

Whether that's the right thing to do is another story, though. At least it's less magic than retrying 10 times.
Attachment #8613042 - Flags: feedback-
A simple way to test that theory would be to do:

char filename[64];
strcpy(filename, "/proc/self/maps");

and then pass filename to open.

If it is some kind of bizarre segfault while the image is still loading (I'm dubious) then the failure will move from the open to the strcpy.

I wonder if using /proc/%d/maps and sprintf'ing getpid into that makes any difference? (/proc/self is a symlink so maybe that is having an impact somehow?)

If it were something strange like the string happens to end right on a page boundary, then you could try to use

"/proc/self/maps\0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" as the filename.

Getting the address of the string when you get the EFAULT might offer a clue.

And just to confirm. The EFAULT is on the open and not on the read?
You don't even need to strcpy. Simply strlen, which will iterate over it.
> I wonder if using /proc/%d/maps and sprintf'ing getpid into that makes any difference? (/proc/self is a symlink so maybe that is having an impact somehow?)

It would make a difference between the string sent to open would then not be in .rodata, but would be in the heap.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #18)
...snip...
> I agree that it most likely seems like a bug in the kernel. But why would a
> new NDK or GCC 4.9 trigger that I wonder?

The new compiler probably does a different code generatation which changes code sizes and causes everything in memory to move around.

If might be that that the /proc/self/maps string is in a bad place (perhaps ending on a page boundary or something), or some other data structure is now in a bad place.

Having said that, I looked at the kernel code for vfs_open

If returns EFAULT if the userspace pointer is larger than TASK_SIZE (which is typically 0xC000000 although it depends on the kernel/user spilt).

It then winds up calling strncpy_from_user which x86 provides an architecture specific version which does longword accesses and falls back to byte at a time accesses.

The code says that it falls back to byte at a time if it hits a page fault, however this would probably not happen very often, so there might be some weird complication.

This is why it would be useful to know the address of the /proc/self/maps string when we encounter one of these EFAULT conditions. It would allow us to see if the string is right near a page boundary or not.

However, this is also one of those cases where adding the code to check might disturb the memory enough that the problem goes away.

> Jim Chen had the idea that perhaps the "/proc/self/maps" string had not been
> loaded by the dynamic linker yet. I thought the string would be copied into
> kernel space, triggering an appropriate fault, but maybe I'm wrong? I can't
> seem to reproduce this locally with latest m-c anymore, but it still happens
> in automation. Very irritating.

IIRC automation runs single-cpu virtual machines?
OK, this passes Try, and pretty much confirms that the on-demand linker theory is correct.
Attachment #8613042 - Attachment is obsolete: true
Attachment #8613042 - Flags: review?(dhylands)
Attachment #8613628 - Flags: review?(mh+mozilla)
Attachment #8613628 - Flags: review?(dhylands)
Comment on attachment 8613628 [details] [diff] [review]
Work around on-demand linker badness with NDK r10e/GCC 4.9

Review of attachment 8613628 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with an explanatory comment.

::: js/src/jsnativestack.cpp
@@ +130,5 @@
> +
> +        const char* path = "/proc/self/maps";
> +        const char* mode = "r";
> +
> +        mozilla::unused << strlen(path);

I think that you need a comment here explaining why you're doing this.
Attachment #8613628 - Flags: review?(dhylands) → review+
Comment on attachment 8613628 [details] [diff] [review]
Work around on-demand linker badness with NDK r10e/GCC 4.9

Review of attachment 8613628 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsnativestack.cpp
@@ +127,5 @@
>          // thread (see bug 846670). So we scan /proc/self/maps to find the
>          // segment which contains the stack.
>          rc = -1;
> +
> +        const char* path = "/proc/self/maps";

const char path[] = ...

@@ +131,5 @@
> +        const char* path = "/proc/self/maps";
> +        const char* mode = "r";
> +
> +        mozilla::unused << strlen(path);
> +        mozilla::unused << strlen(mode);

This one is not needed, as mentioned on irc.
That said, strlen was a bad example :( you're being outsmarted by the compiler, who finds out that what you're doing is calling strlen on a string literal and happily replaces it with the string length.

[Note: we still need to fix the actual problem somehow, so please file a separate bug for that]
Attachment #8613628 - Flags: review?(mh+mozilla) → review-
Alright, I think this one is the best yet. It assigns the literal to a non-const
char[], which means it should be on the stack and not in .rodata
Attachment #8613628 - Attachment is obsolete: true
Attachment #8614315 - Flags: review?(mh+mozilla)
Add volatile keyword
Attachment #8614315 - Attachment is obsolete: true
Attachment #8614315 - Flags: review?(mh+mozilla)
Attachment #8614335 - Flags: review?(mh+mozilla)
Comment on attachment 8614335 [details] [diff] [review]
Work around on-demand linker badness with NDK r10e/GCC 4.9

Review of attachment 8614335 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsnativestack.cpp
@@ +127,5 @@
>          // segment which contains the stack.
>          rc = -1;
> +
> +        // Put the string on the stack, otherwise there is the danger that it
> +        // has not been decompressed by the the on-demand linker. Bug 1165460.

s/the the/the/
Attachment #8614335 - Flags: review?(mh+mozilla) → review+
Yup. Accidental perf improvements are the best.
Blocks: 1194365
Flags: needinfo?(nalexander)
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.