Closed Bug 935831 Opened 6 years ago Closed 6 years ago

segfault on Android 2.3 [@ memcpy | libfreebl3.so (deleted)@0x7549]

Categories

(NSS :: Libraries, defect, P1, critical)

3.15.4
ARM
Android
defect

Tracking

(firefox25 unaffected, firefox26 unaffected, firefox27+ fixed, firefox28+ verified, fennec28+)

VERIFIED FIXED
3.15.4
Tracking Status
firefox25 --- unaffected
firefox26 --- unaffected
firefox27 + fixed
firefox28 + verified
fennec 28+ ---

People

(Reporter: rnewman, Assigned: wtc)

References

Details

(Keywords: crash, regression, reproducible)

Crash Data

Attachments

(3 files, 4 obsolete files)

I wondered why my Fennec build would silently quit as soon as Gecko finished loading. This has been happening for a few days on my local fx-team builds. Did a clobber, still there, so here I am filing. I have reproduced this with the same stack twice. Please move to another component if needed.

Motorola XT910 running Android 2.3.x.

This is with a "Clear Data" Android app -- as if it were just installed.

Had to set a breakpoint on _exit.


Breakpoint 1, 0xafd0b170 in _exit () from /Users/rnewman/moz/jimdb/jimdb-arm/lib/0146C4A81200C003/system/lib/libc.so
(gdb) bt
#0  0xafd0b170 in _exit () from /Users/rnewman/moz/jimdb/jimdb-arm/lib/0146C4A81200C003/system/lib/libc.so
#1  0x4a53f4e2 in nsProfileLock::FatalSignalHandler (signo=11, info=0x46ad6b20, context=0x46ad6ba0) at /Users/rnewman/moz/hg/fx-team/profile/dirserviceprovider/src/nsProfileLock.cpp:195
#2  0x4b4a43b8 in AsmJSFaultHandler (context=0x46ad6ba0, info=0x46ad6b20, signum=11) at /Users/rnewman/moz/hg/fx-team/js/src/jit/AsmJSSignalHandlers.cpp:988
#3  AsmJSFaultHandler (signum=11, info=0x46ad6b20, context=0x46ad6ba0) at /Users/rnewman/moz/hg/fx-team/js/src/jit/AsmJSSignalHandlers.cpp:970
#4  0x8171f1b0 in SEGVHandler::handler (signum=11, info=0x46ad6b20, context=0x46ad6ba0) at /Users/rnewman/moz/hg/fx-team/mozglue/linker/ElfLoader.cpp:1092
#5  0xffff0514 in ?? ()
#6  0xffff0514 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) f 1
#1  0x4a53f4e2 in nsProfileLock::FatalSignalHandler (signo=11, info=0x46ad6b20, context=0x46ad6ba0) at /Users/rnewman/moz/hg/fx-team/profile/dirserviceprovider/src/nsProfileLock.cpp:195
195	    _exit(signo);
(gdb) p *info
$1 = {si_signo = 11, si_errno = 0, si_code = 2, _sifields = {_pad = {1213841408, 1172007248, -2128, 1368241097, 1262555593, 0, 2, 136, 36420, 1185780136, 136, 1270736800, 1268663152, 1268627260, 1368240736, 36371, 36377, 36370, 1185780136,
    1185780136, 1185771064, 1, 1309754492, 140, 1314988128, 1185771480, 1264300675, 1185770240, 0}, _kill = {_pid = 1213841408, _uid = 1172007248}, _timer = {_tid = 1213841408, _overrun = 1172007248, _pad =
    0x46ad6b34 "\260\367\377\377\215Q\311\021AK", _sigval = {sival_int = -2128, sival_ptr = 0xfffff7b0}, _sys_private = 1368241097}, _rt = {_pid = 1213841408, _uid = 1172007248, _sigval = {sival_int = -2128, sival_ptr = 0xfffff7b0}},
    _sigchld = {_pid = 1213841408, _uid = 1172007248, _status = -2128, _utime = 1368241097, _stime = 1262555593}, _sigfault = {_addr = 0x4859c000}, _sigpoll = {_band = 1213841408, _fd = 1172007248}}}
(gdb)



I have a run stopped in gdb if you need me to issue more commands.
AsmJSFaultHandler is just a signal handler like FatalSignalHandler and the ElfLoader SEGVHandler::handler, which are also on the stack.  They will be on the stack for any SIGSEGV caused by a bug.  Do you have any indication that the SIGSEGV is in some way caused by asm.js?
No indication at all, 'cept for presence in the stack and a correlation with Gecko launching (which tentatively implies that it's not a Java-land bug).

My gdb doesn't have `catch signal`, so this is the end of the trail for me...

Any suggestions?
(In reply to Richard Newman [:rnewman] from comment #2)
> Any suggestions?

Not in particular, I guess you could try bisecting to find the bad cset.  I haven't seen anything indicating so far that this is related to JS, so probably this bug belongs in a different component.
Summary: segfault in AsmJSFaultHandler on Android 2.3 → segfault on Android 2.3
Thanks, Luke.
Component: JavaScript Engine → General
Product: Core → Firefox for Android
Version: 28 Branch → Trunk
Crash Signature: libfreebl3.so (deleted)@0x7549 | libfreebl3.so (deleted)@0x3505b | libfreebl3.so (deleted)@0xac6d
tracking-fennec: --- → ?
Duplicate of this bug: 936486
Crash Signature: libfreebl3.so (deleted)@0x7549 | libfreebl3.so (deleted)@0x3505b | libfreebl3.so (deleted)@0xac6d → [@ memcpy | libfreebl3.so (deleted)@0x7549 | libfreebl3.so (deleted)@0x3505b | libfreebl3.so (deleted)@0xac6d]
Aaron: do you have an Atrix to get a regression window here?
Mine is crashing on an XT910, so other devices might be susceptible too.
Good buid: 11.03.2013;
Bad build: 11.04.2013;
Pushlog: hg.mozilla.org/mozilla-central/pushloghtml?fromchange=47c8e9b16918&tochange=b4143e04bea1

Tested on Motorola Atrix MB860 (android 2.3.4)
There should be inbounds available from those dates, can you narrow this down?
Keywords: qawantedregression
Crash Signature: [@ memcpy | libfreebl3.so (deleted)@0x7549 | libfreebl3.so (deleted)@0x3505b | libfreebl3.so (deleted)@0xac6d] → [@ memcpy | libfreebl3.so (deleted)@0x7549 | libfreebl3.so (deleted)@0x3505b | libfreebl3.so (deleted)@0xac6d] [@ libfreebl3.so (deleted)@0x7549 | libfreebl3.so (deleted)@0x3505b | libfreebl3.so (deleted)@0xac6d]
Severity: normal → critical
Keywords: reproducible
Doug, we need to narrow this down to a specific NSS change. Could you please find somebody who could do that?
Flags: needinfo?(brian) → needinfo?(doug.turner)
tracking-fennec: ? → 28+
Summary: segfault on Android 2.3 → segfault on Android 2.3 [@ memcpy | libfreebl3.so (deleted)@0x7549]
Brian, Doug: what particular information are we waiting for here?

Realistically, we have about two weeks left before merge in which to fix or back out that NSS change, and in the mean time we're crashing on startup on a tier 1 platform.
Flags: needinfo?(brian)
Attached patch Patch? (obsolete) — Splinter Review
I have no idea what this change was there for or if it can be backed out. Just found this by backing out random suspicious looking pieces of 35c1e12a0772. Its made me less (no) crashy though.

I have no desire to own this. brian, can you?
Attachment #8334994 - Flags: review?(brian)
Unfortunately, this does not eliminate the crash on my Motorola 2.3 device.
Does disabling on-demand decompression help? i.e. run
> adb shell am start -n org.mozilla.fennec/.App --es env0 MOZ_LINKER_ONDEMAND=0

FWIW, I could not reproduce this on my single-core ARMv7 device running 2.2 or 2.3. It's possible this only happens on dual-core devices (Atrix/XT910) and would indicate a race somewhere. However, dual-core Tegras running 2.2 don't seem to be affected.
To bisect this, I think we can do something like this:

cd ../
mkdir nss
hg clone https://hg.mozilla.org/projects/nss
hg bisect --good 827c64ac482f
hg bisect --bad
...

Every time Mercurial switches to a revision <revision> in that bisect, in another terminal window:

python client.py --repo=../nss/nss update_nss <revision>
touch security/nss/coreconf/coreconf.dep
make build security/build

I will see if I can find an Android 2.3 device in the office tomorrow.
Flags: needinfo?(brian)
Comment on attachment 8334994 [details] [diff] [review]
Patch?

Review of attachment 8334994 [details] [diff] [review]:
-----------------------------------------------------------------

I seriously doubt that this is fixing the crash. This is a patch to the build/installation tool, nsinstall. This would only affect how NSS gets installed/copied into the distribution directory during build time.
Attachment #8334994 - Flags: review?(brian)
(In reply to Jim Chen [:jchen :nchen] from comment #17)
> Does disabling on-demand decompression help?

Not on my Motorola.
Startup of Nightly works for me on my Nexus One (2.3)
(In reply to Brian Smith (:briansmith, was :bsmith; Please NEEDINFO? me if you want a response) from comment #19)
> Comment on attachment 8334994 [details] [diff] [review]
> Patch?
> 
> Review of attachment 8334994 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I seriously doubt that this is fixing the crash. This is a patch to the
> build/installation tool, nsinstall. This would only affect how NSS gets
> installed/copied into the distribution directory during build time.

Hmm. I must have been doing something funky. But now that I can startup, I can fix my other 2.3 bugs and safely ignore this.
(In reply to Brian Smith (:briansmith, was :bsmith; Please NEEDINFO? me if you want a response) from comment #18)
> I will see if I can find an Android 2.3 device in the office tomorrow.

I've got 3 or 4 at my desk if you need one.
i asked rnewman and wes to help find a regression window and they said they have more important things to do.
Flags: needinfo?(doug.turner)
i can't reproduce on a nexus 5.
(In reply to Doug Turner (:dougt) from comment #25)
> i can't reproduce on a nexus 5.

Ah, the fabled Nexus 5 "Gingerbread Edition".
(In reply to Doug Turner (:dougt) from comment #25)
> i can't reproduce on a nexus 5.

Yes, I wouldn't expect the Nexus 5 to reproduce a 2.3 crash. Let me know if you need a 2.3 device.
Flags: needinfo?(doug.turner)
I looked at rnewman's crash and got this stack,

> memcpy (SIGSEGV)
> SHA256_Update
> prng_Hash_df
> prng_reseed
> prng_reseed_test
> RNG_RandomUpdate
> RNG_FileUpdate
> ...

In these functions, the only change I see from the regression range is bug 927230. This patch disables that new code on Android and should tell us whether it's the cause or not.
Alas, does not prevent startup crash for me.
richard, can you back out:
   https://bug927230.bugzilla.mozilla.org/attachment.cgi?id=822066

Can try again?
Flags: needinfo?(doug.turner)
Attached patch Backout. v1Splinter Review
Backout, as suggested by Doug.

Note that this is identical to Jim's suggestion in Comment 28, except that instead of wrapping everything with ANDROID ifdefs, I just deleted the relevant lines.

This fixes the crash, which I guess means that those ifdefs didn't work.
Attachment #8335708 - Flags: review?(brian)
Depends on: 927230
Blocks: 927230
No longer depends on: 927230
Comment on attachment 8335708 [details] [diff] [review]
Backout. v1

Review of attachment 8335708 [details] [diff] [review]:
-----------------------------------------------------------------

Let's check this into mozilla-inbound to test it and verify that it stops the crashes, while we work on a better patch for NSS.
Attachment #8335708 - Flags: review?(brian) → review+
BTW, I will do the checkin, so that I can do the security/patches/README madness.
Assignee: nobody → brian
Whiteboard: [leave open]
Thank you for tracking this down. Can you get the arguments for the functions
in the crash call stack?

Based on the call stack in comment 28, it is one of the these four
SHA256_Update calls in prng_Hash_df (file security/nss/lib/freebl/drbg.c):

136  	SHA256_Update(&ctx, &counter, 1);
137  	SHA256_Update(&ctx, (unsigned char *)&tmp, sizeof tmp);
138  	SHA256_Update(&ctx, input_string_1, input_string_1_len);
139 	if (input_string_2) {
140  	    SHA256_Update(&ctx, input_string_2, input_string_2_len);
141 	}

It is most likely the third or the fourth SHA256_Update call. Please
get their second and third arguments, which have these types:

    const unsigned char *input,
    unsigned int inputLen

Also, please get the two arguments that RNG_FileUpdate passes to
RNG_RandomUpdate (in file security/nss/lib/freebl/unix_rand.c):

982 	    bytes = PR_MIN(sizeof buffer, limit - fileBytes);
983 	    bytes = fread(buffer, 1, bytes, file);
984 	    if (bytes == 0) 
985 		break;
986 	    RNG_RandomUpdate(buffer, bytes);

The arguments have these types:

  const void *data, size_t bytes

I suspect the second argument, |bytes|, has a bad value.

Finally, if you can get the arguments passed to the memcpy call that
crashed, that'll be helpful, too.

I suggest starting with that RNG_RandomUpdate call, then the two
SHA256_Update calls (probably the fourth one), and finally the crashing
memcpy call inside SHA256_Update. You can stop if you see a bad buffer
length argument.

Thank you for your help!
Gotta love Bionic... I went through the minidump stack data and extracted all the relevant arguments,

memcpy(src = 0x48E48FE4, dst = 0x46C10F30, size = 64) (sha512.c:412)
SHA256_Update(ctx = 0x46C10F30, input = 0x46C11198, inputLen = 0xFFFFFFFF) (drbg.c:140)
prng_Hash_df(input_string_2 = 0x46C11198, input_string_2_len = 0xFFFFFFFF) (drbg.c:212)
prng_reseed(additional_input = 0x46C11198, additional_input_len = 0xFFFFFFFF) (drbg.c:524)
RNG_RandomUpdate(data = 0x46C11198, bytes = 0xFFFFFFFF) (unix_rand.c:982)
RNG_FileUpdate

As you can see, RNG_FileUpdate() passed in a |bytes| argument of 0xFFFFFFFF to RNG_RandomUpdate(), which made us try to read 4GB of data. That value came straight from the return value of Bionic's fread(), but why did it return 0xFFFFFFFF? Let's look at the code for Gingerbread's fread() [1],

> 68 #if 1  /* BIONIC: optimize unbuffered reads */
> 69     if (fp->_flags & __SNBF && fp->_ur == 0)

Bad sign when Bionic tries to optimize (aka break) something. Further down,

> 82         if (fp->_flags & __SEOF)
> 83             return (EOF);
> ...
> 88                 errno = EBADF;
> 89                 fp->_flags |= __SERR;
> 90                 return (EOF);
> ...
> 94                 if (__sflush(fp))
> 95                     return (EOF);

Wait... return EOF?! Case closed. Bionic strikes again.

In summary, in unbuffered mode, Bionic's fread() potentially returns EOF (0xFFFFFFFF) at the end of a file or when an error occurs. I couldn't find any docs for fread() that document this behavior, and other people have encountered it as well [2]. It's a problem going all the way to KitKat [3], and we were just fortunate (or unfortunate) that it didn't strike sooner.

In our case, the switch to unbuffered read in RNG_FileUpdate() enabled this code path, and because we didn't check for EOF, we thought we read 4GB of data from the fread() return value, and Bad Things happened. It only happened on certain 2.3 devices presumably because only those devices contain files that resulted in EOF reads or errors.

[1] http://androidxref.com/2.3.6/xref/bionic/libc/stdio/fread.c#48
[2] https://github.com/cesanta/mongoose/issues/9
[3] http://androidxref.com/4.4_r1/xref/bionic/libc/stdio/fread.c#47
Jim: thank you very much for extracting debug info and tracking down the bug!

Before I received your reply, I also inspected the source code of fread() in
Android and suspected that fread() may have returned EOF (-1). With your
confirmation and the extra info, I can come up with a workaround: only use
unbuffered I/O with /dev/urandom.

I checked several specifications of fread() and none of them document the
EOF return value. So the Bionic behavior is nonstandard.
(In reply to Wan-Teh Chang from comment #37)
> Jim: thank you very much for extracting debug info and tracking down the bug!
> 
> Before I received your reply, I also inspected the source code of fread() in
> Android and suspected that fread() may have returned EOF (-1). With your
> confirmation and the extra info, I can come up with a workaround: only use
> unbuffered I/O with /dev/urandom.
> 
> I checked several specifications of fread() and none of them document the
> EOF return value. So the Bionic behavior is nonstandard.

Wan-Teh, I have written a patch that adds a "safe_fread()" to the unix_rand.c, that converts the -1 back into 0, and which changes all the callers in that file to use it. I think that might be a safer path to take because it isn't obvious that unbuffered I/O with /dev/urandom is guaranteed to avoid the buggy behavior. What do you think?
Brian: thank you for your help. You may have wondered why
I didn't change unix_rand.c to use the open/read/close
system calls directly, which is the best way to do
unbuffered file I/O. The reason is that Chromium's Linux
sandbox intercepts fopen() to allow it to open /dev/urandom
inside the sandbox. If I switch to open/read/close, I'll
need to update Chromium's Linux sandbox to intercept open().

Now I came up with another fix: still open /dev/urandom
using fopen(), but extract the Unix fd from FILE using fileno()
and read from the Unix fd using read(). I think this is
the best solution until I figure out how to intercept open()
in Chromium's Linux sandbox.

I will also file an Android bug report about fread() returning
EOF in unbuffered mode.
I filed Android bug https://code.google.com/p/android/issues/detail?id=62890
for the problem that fread() may return EOF in unbuffered I/O mode.

This patch uses fileno() to extract the Unix file descriptor from a FILE,
and call the read() system call.

It uses an 'int' to store the return value of read(), which must be a signed
type. By code inspection, I verified the return value of these two read()
calls won't overflow an 'int'. If you have a more robust solution, please
let me know. Thanks.
Attachment #8338922 - Flags: review?(brian)
Remove spaces at the end of lines, spotted in the code review tool.
Attachment #8338922 - Attachment is obsolete: true
Attachment #8338922 - Flags: review?(brian)
Attachment #8338924 - Flags: review?(brian)
Comment on attachment 8338924 [details] [diff] [review]
Read from the Unix fd directly to bypass stdio buffering, v1.1

Review of attachment 8338924 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/freebl/unix_rand.c
@@ +978,5 @@
> +	/* Read from the underlying file descriptor directly to bypass stdio
> +	 * buffering and avoid reading more bytes than we need from
> +	 * /dev/urandom. Moreover, we read into a buffer of size BUFSIZ, so
> +	 * buffered I/O has no performance advantage. */
> +	fd = fileno(file);

fileno can fail, returning -1. WE should check this here, or comment why it isn't necessary.

Also, the comment should mention that using fread with unbuffered I/O on Android is buggy, so that if/when we find a bug in this, we don't revert it back to the code that breaks android.

@@ +1143,5 @@
>      }
> +    /* Read from the underlying file descriptor directly to bypass stdio
> +     * buffering and avoid reading more bytes than we need from /dev/urandom.
> +     */
> +    fd = fileno(file);

ditto.
Attachment #8338924 - Flags: review?(brian) → review+
Assignee: brian → wtc
Brian: thank you for the review. I added a comment to note the Android fread
bug.

The reason it is not necessary to check for fileno() failure is that it
follows a successful fopen call immediately, and fileno() is a simple
structure member getter. If fileno() returns -1 in that case, that's a
serious bug in the libc implementation.

Patch checked in: https://hg.mozilla.org/projects/nss/rev/44ba62b87bd6
Attachment #8338924 - Attachment is obsolete: true
Attachment #8338981 - Flags: checkin+
Component: General → Libraries
Flags: checkin+
Product: Firefox for Android → NSS
Target Milestone: Firefox 28 → 3.15.4
Version: Trunk → trunk
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Whiteboard: [leave open]
Verified as fixed on build: Nightly 28.0a1 (11.27.2013).
Device: Motorola Atrix MB860 (Android 2.3.4)
Status: RESOLVED → VERIFIED
Attachment #8339496 - Flags: review?(brian) → review+
Version: trunk → 3.15.4
Comment on attachment 8339496 [details] [diff] [review]
Assert that fileno() on a newly opened FILE object doesn't fail

Patch checked in: https://hg.mozilla.org/projects/nss/rev/7554ee6ba4b0
Attachment #8339496 - Flags: checked-in+
Brian, given the NSS update on mozilla-beta happened in https://bugzilla.mozilla.org/show_bug.cgi?id=898431#c61, is there anything else that needs to land to resolve this regression ?
Flags: needinfo?(brian)
(In reply to bhavana bajaj [:bajaj] from comment #47)
> Brian, given the NSS update on mozilla-beta happened in
> https://bugzilla.mozilla.org/show_bug.cgi?id=898431#c61, is there anything
> else that needs to land to resolve this regression ?

Landing the new NSS fixes this issue in Gecko.
Flags: needinfo?(brian)
You need to log in before you can comment on or make changes to this bug.