Closed Bug 832942 Opened 11 years ago Closed 11 years ago

Searches to Google.com over SSL cause OOM error page on ARMv6 builds

Categories

(NSS :: Libraries, defect, P1)

ARM
Android

Tracking

(firefox19- unaffected, firefox20+ fixed, firefox21+ fixed, firefox22+ verified)

VERIFIED FIXED
Tracking Status
firefox19 - unaffected
firefox20 + fixed
firefox21 + fixed
firefox22 + verified

People

(Reporter: mamozrk, Assigned: cviecco)

References

Details

(Keywords: regression, reproducible, Whiteboard: [ARMv6])

Attachments

(6 files, 2 obsolete files)

Searching from the URL bar (ie type words in bar, tap the "Go"/looking-glass icon --> get Google search results) results in a "Problem loading page" error:

Secure Connection Failed
An error occurred during a connection to www.google.com. security library: memory allocation failure. (Error code: sec_error_no_memory)

The URL turned out to be:

https://www.google.com/search?q=kittens&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:unofficial

I'm running Nightly. about: tells me this is version 21.0a1 (2013-01-20)

about:support tells me of the following "Important Modified Preferences"

browser.cache.disk.capacity                204800
browser.cache.disk.smart_size.first_run    false
browser.cache.disk.smart_size.use_old_max  false
browser.cache.disk.smart_size_cached_value 10240
browser.startup.homepage_override.mstone   21.0a1
dom.mozApps.used                           true
dom.w3c_touch_events.expose                true
extensions.lastAppVersion                  21.0a1
network.cookie.prefsMigrated               true
privacy.donottrackheader.enabled           true

I also have AdBlock Plus (v2.2.1) installed. Disabling this doesn't change things.

The phone is a ZTE Blade running Android 4.0.4 (ICS); it has an ARMv6 CPU, running at 600 MHz, and (according to the "About phone" in the settings) 421 MB of memory.

Browsing other https sites doesn't cause problems, neither does removing the "s" from the "https".
Using the stock ICS browser to search also works as expected (and is using https for searching in this way).
If you browse to about:config and search for “tls” and toggle “security.enable_tls” to false does this work for you?

What are the conditions for raising sec_error_no_memory?
Assignee: nobody → nobody
Component: General → Libraries
Product: Firefox for Android → NSS
Version: Firefox 21 → unspecified
(In reply to Aaron Train [:aaronmt] from comment #1)
> If you browse to about:config and search for “tls” and toggle
> “security.enable_tls” to false does this work for you?

Yes, toggling security.enable_tls to false makes this work.  Is leaving this set at false OK for day-to-day browsing?
Not a good idea, I'm curious why you would be getting sec_error_no_memory though, hopefully someone tracking this can detail.
Does the error occur all the time, most of the time, sometimes, or rarely?
(In reply to Brian Smith (:bsmith) from comment #4)
> Does the error occur all the time, most of the time, sometimes, or rarely?

All the time. I've been running Nightly on this phone only since 2013-01-05 (or thereabouts) and I can't recall when I first noticed this. Probably some time last week; but of course I can't actually recall trying a search on the phone before then.


Is there any logging I can turn on?
Flags: needinfo?(bsmith)
I too can reproduce consistently using the armv6 Nightly build on a Samsung Galaxy Fit (GT-5670). Submitting the search using the built-in Google search engine from the addressbar shows the error
I can reproduce this, this is pretty terrible on Firefox for Android; you can't search.
Steps to reproduce

i) Install Nightly/Aurora/Beta
ii) Access the address bar and input a query such as "mozilla" and hit enter or the arrow
Severity: normal → major
Keywords: reproducible
The only GeckoConsole output

E/GeckoConsole( 1447): An error occurred during a connection to www.google.com:443.
E/GeckoConsole( 1447): 
E/GeckoConsole( 1447): security library: memory allocation failure.
E/GeckoConsole( 1447): 
E/GeckoConsole( 1447): (Error code: sec_error_no_memory)
E/GeckoConsole( 1447): An error occurred during a connection to www.google.com:443.
E/GeckoConsole( 1447): 
E/GeckoConsole( 1447): security library: memory allocation failure.
E/GeckoConsole( 1447): 
E/GeckoConsole( 1447): (Error code: sec_error_no_memory)
Summary: searches to Google.com over SSL cause OOM error page → Searches to Google.com over SSL cause OOM error page on ARMv6 devices
Whiteboard: [ARMv6]
Aaron noted in email:

"It's intermittent which suggests that, I just happened to hit it a bunch of times this morning and saw the duplicate which reminded me about the original bug. For example, I just reinstalled 19 and now I'm not hitting it but did earlier; and 20/21/22."

Given that, we'll wontfix for FF19 unless we start getting an influx of negative feedback.

Josh - who on your team is in the best position (and has the time) to work with Mobile QA?
Assignee: nobody → joshmoz
This seems to happen when my device is really low on memory obviously

MemTotal:         428364 kB
MemFree:            4000 kB


Worse on 21 and 22. Fennec is using so much memory on this device that a simple Google search can't be conducted in one tab after startup.
The symptoms are that when TLSv1 is enabled and the user does a search on https://google.com, we get sec_error_out_of_memory. When the user disables TLSv1 (security.enable_tls=false) or when the user users http://google.com instead, then the problem does not reproduce.

There is basically no difference between TLSv1 and SSLv3 in the PSM and NSS layers in terms of memory usage. The main difference between TLSv1 and SSLv3 is that we send TLS extensions with TLSv1 but not SSLv3. In particular, we send the NPN extension. This makes me think that the problem is related to SPDY. Did we do anything that would have increased the memory usage of SPDY recently? e.g. SPDY server push?

Does the problem reproduce when browsing to https://twitter.com? (Patrick: what are some other SPDY-enabled sites to check?)
Flags: needinfo?(bsmith) → needinfo?(mcmanus)
Please try doing the search on https://google.com again with pref network.http.spdy.enabled=false and report back the results.
Flags: needinfo?(aaron.train)
(In reply to Brian Smith (:bsmith) from comment #13)

> SPDY. Did we do anything that would have increased the memory usage of SPDY
> recently? e.g. SPDY server push?
> 

no.. just small bug fixes.

> Does the problem reproduce when browsing to https://twitter.com? (Patrick:
> what are some other SPDY-enabled sites to check?)

https://facebook.com
https://cloudflare.com
https://www.webtide.com/
https://www.optimizely.com/

If this were a more general spdy memory thing it is kinda weird that the error is consistently coming out of nss instead of spread around. I wonder if google.com is taking spdy as a signal to do something more aggressive than usual at the ssl layer (we know they take that approach on the chrome side.)

can we get a stack and is it consistent?
Flags: needinfo?(mcmanus)
With network.http.spdy.enabled=false, I still get the same results (sec_error_no_memory). I am using Nightly (02/27) on my HTC Status (Android 2.3.4).

(In reply to Brian Smith (:bsmith) from comment #13)
> Does the problem reproduce when browsing to https://twitter.com?

Twitter search works for me; both with spdy and without.
Flags: needinfo?(aaron.train)
Assignee: joshmoz → bsmith
(In reply to Patrick McManus [:mcmanus] from comment #15)
> > Does the problem reproduce when browsing to https://twitter.com? (Patrick:
> > what are some other SPDY-enabled sites to check?)

twitter WFM

> https://facebook.com

this redirects to a non-https page

> https://cloudflare.com

this brings the same issue as google

> https://www.webtide.com/
> https://www.optimizely.com/

these two WFM.

All these with the latest Nightly: 22.0a1 (2013-02-27)

> can we get a stack and is it consistent?

I could try for a stack if I was told what to do. Special build? Extra app? Hook up to a PC with ADB?
The mobile team implemented custom fonts which increased the baseline memory usage in Fx21 bug 844669.
androiduser in #mobile mentions that 2013-01-03-09-07 works fine and that 2013-01-03-03-09-46 is bad

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=a812ef63de87&tochange=6955309291ee

Bug 799267?
b2897468bd82 Bug 825453 - Bump ARMv6 mozconfigs to use NDK r8c and GCC 4.6
                          (Kartikaya Gupta, r=blassey,ted)
da2b24b1fc7c Bug 799267 - AuthCertificate Telemetry (measuring first auth,
                          pkix and classic) (cviecco, r=bsmith)
5c8e41454810 Bug 825151 - Bump ARMv7 mozconfigs to use NDK r8c and GCC 4.6
                          (Kartikaya Gupta, r=blassey,ted)

Also possibly related:
https://bugzilla.mozilla.org/show_bug.cgi?id=772144
Keywords: regression
Camilo, can you see if this is a regression caused by your patch?
Assignee: bsmith → cviecco
Mark (reporter)

As I do not have an arm6 device available. Would you mind installing the version located here: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/cviecco@mozilla.com-689d47ffe068/try-android-armv6/ and let me know if this solves the issue
I installed the try build on my GT-5670 but I still get the error.
(In reply to Camilo Viecco (:cviecco) from comment #22)
> Mark (reporter)
> 
> As I do not have an arm6 device available. Would you mind installing the
> version located here:
> http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/cviecco@mozilla.
> com-689d47ffe068/try-android-armv6/ and let me know if this solves the issue

as with John, that build didn't fix the issue at either google or cloudflare.
I got the error again.

Is there any way to provide more information for this bug? Do debug builds spit out more information in logcat?
I'm still able to reproduce this on Firefox for Android 20 Beta2 build 2 on LG Slider (Android 2.3.4)
We are now into our third beta for Firefox 20; we need a fix here.
It seems like it is something to to with the build system. I have not been able to rebuld for Jan 1/2/3 but a rebuild on Jan 31 version (ndk r5c instead of r8c does not show the problem).
Blocks: 825453
I am pretty sure that this related to the build system.

> hg update -d 'Jan 31'
> cat .mozconfig
 # Add the correct paths here:
 #ac_add_options --with-android-ndk="$HOME/android/android-ndk-r8c"
 #ac_add_options --with-android-sdk="$HOME/android/android-sdk-linux/platforms/android-16"
 #ac_add_options --with-android-version=8
 ac_add_options --with-android-ndk="/opt/android-ndk-r5c"
 ac_add_options --with-android-sdk="/opt/android-sdk-linux/platforms/android-16"
 ac_add_options --with-android-version=5


 # android options
 ac_add_options --enable-application=mobile/android
 ac_add_options --target=arm-linux-androideabi
 ac_add_options --with-ccache

 mk_add_options MOZ_OBJDIR=./objdir-droid
 mk_add_options MOZ_MAKE_FLAGS="-j14 -s"

 ac_add_options --with-arch=armv6
export MOZ_PKG_SPECIAL=armv6
ac_add_options --disable-ion

> make -f client.mk build_and_deploy

and it works. With the version 8 ndk I the bug reappears.
If you are able to reproduce the problem, can you try to attach gdb and debug it? We can revert the NDK change if it comes to that but I'd still like to re-enable it eventually and we need to figure out what's going wrong here in order to do that. I don't know where the sec_error_no_memory error comes from and how it propagates so it would be easier for somebody who knows that code to investigate.
I discussed this a bit with cviecco on IRC, and he agreed to look into it a bit further, to see where the sec_error_no_memory error is coming from.

With no knowledge of the relevant code, I assume this error code is a result of a failed malloc and I would hypothesize that some latent bug in our code is being tickled by the NDK change such that we're passing a bad value to the malloc. Usually android kills fennec directly if we're *actually* eating up all the memory, and clearly we still have enough memory to render the error page so I suspect it's more likely that we're just making some sort of abnormal allocation request that's failing.

If we are unable to make much headway on this in the next few days I can revert back to NDK r5c on beta while we continue investigation.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #31)
> Try push with the old NDK to confirm:
> https://tbpl.mozilla.org/?tree=Try&rev=e81dde77bd40

Confirming this works on the HTC Status (NDK R5C)
Should have tried this earlier, you can use the ARMv6 builds on an ARMv7 device too; also able to reproduce on the Nexus 4.
Summary: Searches to Google.com over SSL cause OOM error page on ARMv6 devices → Searches to Google.com over SSL cause OOM error page on ARMv6 builds
definetely a linker problem:

relevant adb locat lines:
v5 (bug not shown)
03-05 18:38:25.422 E/GeckoLibLoad( 9686): Load nss start
03-05 18:38:25.892 E/GeckoLinker( 9686): /data/app/org.mozilla.fennec_cviecco-1.apk!/libnspr4.so: Warning: relocation to NULL @0x0002c6b4
03-05 18:38:25.912 E/GeckoLinker( 9686): /data/app/org.mozilla.fennec_cviecco-1.apk!/libnspr4.so: Warning: relocation to NULL @0x0002c45c for symbol "__cxa_begin_cleanup"
03-05 18:38:25.912 E/GeckoLinker( 9686): /data/app/org.mozilla.fennec_cviecco-1.apk!/libnspr4.so: Warning: relocation to NULL @0x0002c578 for symbol "__cxa_type_match"
03-05 18:38:25.932 E/GeckoLibLoad( 9686): Load nss done


v8 (with problem, notice libnss now appears with a warning).
03-05 18:04:58.922 E/GeckoLibLoad( 9142): Load nss start
03-05 18:04:59.102 E/GeckoLinker( 9142): /data/app/org.mozilla.fennec_cviecco-1.apk!/libnss3.so: Warning: unhandled flags #8 not handled
03-05 18:04:59.132 E/GeckoLinker( 9142): /data/app/org.mozilla.fennec_cviecco-1.apk!/libnssutil3.so: Warning: unhandled flags #8 not handled
03-05 18:04:59.142 E/GeckoLinker( 9142): /data/app/org.mozilla.fennec_cviecco-1.apk!/libplc4.so: Warning: unhandled flags #8 not handled
03-05 18:04:59.152 E/GeckoLinker( 9142): /data/app/org.mozilla.fennec_cviecco-1.apk!/libnspr4.so: Warning: unhandled flags #8 not handled
03-05 18:04:59.172 E/GeckoLinker( 9142): /data/app/org.mozilla.fennec_cviecco-1.apk!/libplds4.so: Warning: unhandled flags #8 not handled
03-05 18:04:59.182 E/GeckoLibLoad( 9142): Load nss done
Assignee: cviecco → bugmail.mozilla
CC'ing some others who might have some NDK5 vs NDK8 linker knowledge
(In reply to Camilo Viecco (:cviecco) from comment #37)
> definetely a linker problem:
> 
> relevant adb locat lines:
> v5 (bug not shown)
> 03-05 18:38:25.422 E/GeckoLibLoad( 9686): Load nss start
> 03-05 18:38:25.892 E/GeckoLinker( 9686):
> /data/app/org.mozilla.fennec_cviecco-1.apk!/libnspr4.so: Warning: relocation
> to NULL @0x0002c6b4
> 03-05 18:38:25.912 E/GeckoLinker( 9686):
> /data/app/org.mozilla.fennec_cviecco-1.apk!/libnspr4.so: Warning: relocation
> to NULL @0x0002c45c for symbol "__cxa_begin_cleanup"
> 03-05 18:38:25.912 E/GeckoLinker( 9686):
> /data/app/org.mozilla.fennec_cviecco-1.apk!/libnspr4.so: Warning: relocation
> to NULL @0x0002c578 for symbol "__cxa_type_match"
> 03-05 18:38:25.932 E/GeckoLibLoad( 9686): Load nss done
> 
> 
> v8 (with problem, notice libnss now appears with a warning).
> 03-05 18:04:58.922 E/GeckoLibLoad( 9142): Load nss start
> 03-05 18:04:59.102 E/GeckoLinker( 9142):
> /data/app/org.mozilla.fennec_cviecco-1.apk!/libnss3.so: Warning: unhandled
> flags #8 not handled
> 03-05 18:04:59.132 E/GeckoLinker( 9142):
> /data/app/org.mozilla.fennec_cviecco-1.apk!/libnssutil3.so: Warning:
> unhandled flags #8 not handled
> 03-05 18:04:59.142 E/GeckoLinker( 9142):
> /data/app/org.mozilla.fennec_cviecco-1.apk!/libplc4.so: Warning: unhandled
> flags #8 not handled
> 03-05 18:04:59.152 E/GeckoLinker( 9142):
> /data/app/org.mozilla.fennec_cviecco-1.apk!/libnspr4.so: Warning: unhandled
> flags #8 not handled
> 03-05 18:04:59.172 E/GeckoLinker( 9142):
> /data/app/org.mozilla.fennec_cviecco-1.apk!/libplds4.so: Warning: unhandled
> flags #8 not handled
> 03-05 18:04:59.182 E/GeckoLibLoad( 9142): Load nss done

None of these should be a problem.
I investigated this some more on an ARMv6 debug build with gdb, and the problem seems to be happening in the call to PORT_SetError(0) in pk11slot.c [1]. This call goes to secport.c:151 [2] which calls PR_SetError(0, 0) [3]. This is where things go wrong, because when stepping through with the debugger the error code coming into this method is 0, but ends up getting set to -8173. I've attached some of my gdb trace, with disassembly. Coming into the method, errorCode is stored in $r0 and copied to $r4. At address 0x4a9d380c after the call to PR_GetCurrentThread, $r4 is still correct (value is 0), but then as soon as it executes the "mov r3, #0" instruction $r4 ends up at 4294959123 (equivalent to -8173). As far as I can tell this makes no sense, particularly since the problem doesn't happen on *every* call to PR_SetError, only some of them.

[1] http://mxr.mozilla.org/mozilla-central/source/security/nss/lib/pk11wrap/pk11slot.c#2059
[2] http://mxr.mozilla.org/mozilla-central/source/security/nss/lib/util/secport.c#151
[3] http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/misc/prerror.c#23
Actually after a few more runs it looks like the $r0 is bad even going into the PR_SetError method. I'm not sure why it showed up differently in the above trace.

The call from secport.c:151 to PR_SetError goes through some sort of thunk layer:

(gdb) x/i $pc
=> 0x4a8a1260 <PORT_SetError_Util+16>:	b	0x4a8964b4
(gdb) si
0x4a8964b4 in ?? () from /mozilla-git/obj-android-armv6/dist/bin/libnssutil3.so
(gdb) x/3i $pc
=> 0x4a8964b4:	add	r12, pc, #0, 12
   0x4a8964b8:	add	r12, r12, #135168	; 0x21000
   0x4a8964bc:	ldr	pc, [r12, #2532]!	; 0x9e4

And $r0 changes from 0 to 4294959123 when the "ldr" instruction in the thunk layer is executed:

(gdb) si
0x4a8964b8 in ?? () from /mozilla-git/obj-android-armv6/dist/bin/libnssutil3.so
(gdb) si
0x4a8964bc in ?? () from /mozilla-git/obj-android-armv6/dist/bin/libnssutil3.so
(gdb) p $r12
$99 = 1250653372
(gdb) x/w $r12+2532
0x4a8b7ea0 <_GLOBAL_OFFSET_TABLE_+56>:	0x4a9d37fc
(gdb) p $r0
$100 = 0
(gdb) si

Breakpoint 7, PR_SetError (code=-8173, osErr=0) at /mozilla-git/nsprpub/pr/src/misc/prerror.c:24
24	{
(gdb) p $r0
$101 = 4294959123
(gdb) x/i $pc
=> 0x4a9d37fc <PR_SetError>:	push	{r3, r4, r5, lr}
After much debugging with glandium, we (well, he) figured out what was going on. GCC is miscompiling the sftk_mkPrivKey function in security/nss/lib/softoken/pkcs11.c file. Specifically, after the call to DER_SetUInteger in that function [1], the "bad" compilation compares the return value to an unused register r10, and then returns the CKR_HOST_MEMORY return value since the comparison fails. Here is a disassembly of the relevant code from NDK r5c:

ead8:   ebffd023    bl  2b6c <NSS_3.4+0x2b6c>
eadc:   e3500000    cmp r0, #0  ; 0x0
eae0:   13a00002    movne   r0, #2  ; 0x2             

and the bad code from NDK r8c:

   0x4b06d0d0 <+604>:   bl  0x4b062ce8
   0x4b06d0d4 <+608>:   mov r3, #0
   0x4b06d0d8 <+612>:   cmp r10, r3
   0x4b06d0dc <+616>:   movne   r0, #2
   0x4b06d0e0 <+620>:   moveq   r0, r3

[1] http://mxr.mozilla.org/mozilla-central/source/security/nss/lib/softoken/pkcs11.c#1895
Attached patch PatchSplinter Review
We came up with this patch which seems to do enough code rearrangement to make the compiler generate different code. Try build in progress at https://tbpl.mozilla.org/?tree=Try&rev=af9109e8baef
Awesome work!
Testing locally a different version of the patch (little less disrupting of the control flow).
push of an alternative patch (works locally)
https://tbpl.mozilla.org/?tree=Try&rev=94d253182343
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #43)
> After much debugging with glandium, we (well, he) figured out what was going

:kats, :glandium, :cviecco - that's pretty sweet stuff.
The trybuild shows no OOM messages when searching Google from the address-bar.
Wan-Teh, Ryan: I am pretty sure you use at least parts of NSS in your Android Chromium builds. If so, this bug may affect you.

I think the next step is to try to reproduce this using NSS as built using NSS's build system, to see if this is a bug in NSS or a bug in Mozilla's overrides of NSS's build system.

We should also see if the patch in bug 838989 resolves the issue or not, because IIRC that patch deals with ARMv6 compilation stuff.

Also, we should see if DER_SetUInteger is perhaps triggering undefined behavior due to some C language spec. violation. In particular, aliasing violations are a high probability.

Failing all of that, we should accept this patch. But, I do not like taking this patch when we have no idea why it works when the original code doesn't, given that it is supposed to be logically equivalent to the current code.
Chrome does not use NSS on Android at all.
Attached patch alt-patchSplinter Review
Also, what was the motivation for upgrading to NDK 8? I am worried that if we have a compiler bug that affects *this* code, the same bug will affect other code as well. Do we have any way of being confident one way or the other?
(In reply to Brian Smith (:bsmith) from comment #49)
> We should also see if the patch in bug 838989 resolves the issue or not,
> because IIRC that patch deals with ARMv6 compilation stuff.

Even if that patch fixes the problem, that patch looks too complicated to uplift to beta (I could be wrong though). I would be more comfortable with landing the hack and uplifting it to beta, and then possibly reverting the hack and landing that patch instead on central, if that resolves the issue.

> Also, we should see if DER_SetUInteger is perhaps triggering undefined
> behavior due to some C language spec. violation. In particular, aliasing
> violations are a high probability.

It looked ok to me, but feel free to take a look.

> 
> Failing all of that, we should accept this patch. But, I do not like taking
> this patch when we have no idea why it works when the original code doesn't,
> given that it is supposed to be logically equivalent to the current code.

FWIW I compiled that file with gcc-4.7 from NDK r8d using the same command line options and environment, and the assembly in the object file for that came out fine:

 240:   ebfffffe    bl  0 <DER_SetUInteger>
            240: R_ARM_CALL DER_SetUInteger
 244:   e3a03000    mov r3, #0
 248:   e1500003    cmp r0, r3
 24c:   13a00002    movne   r0, #2
 250:   01a00003    moveq   r0, r3

So AFAICT this is a bug in gcc-4.6 in the armv6 codegen only, and has been fixed in gcc-4.7.

(In reply to Brian Smith (:bsmith) from comment #52)
> Also, what was the motivation for upgrading to NDK 8?

Various and sundry, mostly smaller libraries and general perf improvements.

> I am worried that if
> we have a compiler bug that affects *this* code, the same bug will affect
> other code as well. Do we have any way of being confident one way or the
> other?

It might affect other code but I'm inclined to assume it doesn't until we find a case that proves otherwise. Alternatively we could track down the fix in the gcc source code, add a diagnostic message to gcc that gets printed out whenever the bug would have been triggered, and then compile fennec with this modified gcc and see how many times it gets hit.
:cviecco, I like your alt-patch better than my original patch, so reassigning this bug back to you. If you're happy with the patch please request review so we can get this in and uplifted as soon as possible.
Assignee: bugmail.mozilla → cviecco
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #53)
> It might affect other code but I'm inclined to assume it doesn't until we
> find a case that proves otherwise. Alternatively we could track down the fix
> in the gcc source code (...)

We should ask gcc developers.
Comment on attachment 721915 [details] [diff] [review]
alt-patch

>+          /* The following ifdef is completelly unnecesary, but needed for
>+           * android if using ndk8c and arm6 as it has a compiler bug
>+           */

It would be nice to update this comment with the compiler version when
we find out more from the GCC developers.
Attachment #721915 - Flags: review+
(In reply to Brian Smith (:bsmith) from comment #49)
>
> Also, we should see if DER_SetUInteger is perhaps triggering undefined
> behavior due to some C language spec. violation. In particular, aliasing
> violations are a high probability.

I reviewed DER_SetUInteger and saw no problem. There is no pointer
aliasing.
A few additional data points:
- It's not limited to armv6. It happens with -march=armv7-a -marm
- The minimal flags I've reproduced it with an already preprocessed and slightly simplified source are -marm -Os.
- It doesn't happen at -O2 (default for plain NSS build, but softoken has ALLOW_OPT_CODE_SIZE=1)
Mike, thank you. Do the r+ patch works around the issue on arm7 too? If so we can probably land it today.
(In reply to Camilo Viecco (:cviecco) from comment #61)
> Mike, thank you. Do the r+ patch works around the issue on arm7 too? If so
> we can probably land it today.

The issue is present on any arm architecture as long as it's not built as thumb. So you could add a !defined(__thumb__)
latest push. Testing on all platforms, updated patch to include gcc bug id.
  https://tbpl.mozilla.org/?tree=Try&rev=b18227edec50
As an update: I built a modified version of the gcc-4.6 toolchain in NDK r8c that includes the GCC patch that fixes the problem (as per [1]), and also added some output for when that code is exercised. I've attached the patch I used (I had to backport the actual patch to the gcc-4.6 code).

Then I built the latest mozilla-central with that modified toolchain and verified that the pkcs11.o file contains the proper assembly (and works fine on my phone). Unfortunately, the gcc code in question gets exercised on pretty much every file, and I don't know how to tell which hits would have tickled the bug.

[1] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56561#c2
needinfo'ing nathan as glandium suggested his experience with GCC might be able to help here. Basically, given the GCC patch in comment 64, I'd like to find out when the buggy codepath in GCC is exercised, so that we can find if other files in the mozilla codebase will be impacted.

For more backstory you can start reading at comment 43 (or comment 0, if you want everything).
Flags: needinfo?(nfroyd)
(In reply to Camilo Viecco (:cviecco) from comment #63)
> latest push. Testing on all platforms, updated patch to include gcc bug id.
>   https://tbpl.mozilla.org/?tree=Try&rev=b18227edec50

I installed the build at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/cviecco@mozilla.com-b18227edec50/try-android-armv6/

and this fixes the issue. \o/
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #65)
> needinfo'ing nathan as glandium suggested his experience with GCC might be
> able to help here. Basically, given the GCC patch in comment 64, I'd like to
> find out when the buggy codepath in GCC is exercised, so that we can find if
> other files in the mozilla codebase will be impacted.

It sounds like you are asking "what's the problematic code pattern so that we can audit several million lines of code to find other cases which might be fixed?"  That sounds crazy, so could you please clarify what you're asking for?  I don't see what the value of looking at other files in m-c is, since we already know this is a compiler bug.
Flags: needinfo?(nfroyd) → needinfo?(bugmail.mozilla)
Our "fix" is to juggle the code around to avoid the compiler bug. If we switched compilers to one without the bug, then yes, I would agree with you that there's no point in looking at other files in m-c. If we stick with the buggy compiler though then it is useful to know how much of our code will be impacted.
Flags: needinfo?(bugmail.mozilla)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #68)
> Our "fix" is to juggle the code around to avoid the compiler bug. If we
> switched compilers to one without the bug, then yes, I would agree with you
> that there's no point in looking at other files in m-c. If we stick with the
> buggy compiler though then it is useful to know how much of our code will be
> impacted.

Ah, thanks for the clarification.  Sticking with the buggy compiler still sounds like crazy talk, though. ;)

I don't see a usefully specific code pattern here; the best I can offer in lieu of sitting down with the compiler is "comparisons to zero" which unhelpfully translates to "everywhere".

One suggestion for guesstimating how much of a fix this provides:

- Build the source tree with extra options '-save-temps -fno-schedule-insns -fno-schedule-insns2' with an unmodified (buggy) and modified (fixed) compiler.  -save-temps will produce .s files next to the .o files.  -fno-schedule-insns* prevents the compiler from rearranging the instruction stream too much.
- Diff .s files at your leisure; the assembly files ought to differ only in places where bugs have been fixed.  Ideally there will be no false positives from instruction rearrangement or differing register allocations.
(In reply to Nathan Froyd (:froydnj) from comment #69)
> - Diff .s files at your leisure; the assembly files ought to differ only in
> places where bugs have been fixed.  Ideally there will be no false positives
> from instruction rearrangement or differing register allocations.

Isn't there chances of label numbering changes?
(In reply to Mike Hommey [:glandium] from comment #70)
> (In reply to Nathan Froyd (:froydnj) from comment #69)
> > - Diff .s files at your leisure; the assembly files ought to differ only in
> > places where bugs have been fixed.  Ideally there will be no false positives
> > from instruction rearrangement or differing register allocations.
> 
> Isn't there chances of label numbering changes?

Sure, it's not a foolproof method.  But usually diffs in label numbers are harmless or they're surrounded by larger chunks of significant code changes.

Using the -y option to diff can be helpful for looking at assembly dumps too.

It's also worth noting that just because you (kats) were seeing those debug messages fire a lot doesn't necessarily mean that differences in assembly were always produced.  combine.c is trying differing ways of sticking instructions together and you were likely seeing backtracking of unsuccessful combinations, rather than combinations that always affected final code generation.
(In reply to Nathan Froyd (:froydnj) from comment #71)
> It's also worth noting that just because you (kats) were seeing those debug
> messages fire a lot doesn't necessarily mean that differences in assembly
> were always produced.  combine.c is trying differing ways of sticking
> instructions together and you were likely seeing backtracking of
> unsuccessful combinations, rather than combinations that always affected
> final code generation.

Yeah, I figured it might be something like that when I was looking through the code. :(

(In reply to Nathan Froyd (:froydnj) from comment #69)
> One suggestion for guesstimating how much of a fix this provides:
> 
> - Build the source tree with extra options '-save-temps -fno-schedule-insns
> -fno-schedule-insns2' with an unmodified (buggy) and modified (fixed)
> compiler.  -save-temps will produce .s files next to the .o files. 
> -fno-schedule-insns* prevents the compiler from rearranging the instruction
> stream too much.
> - Diff .s files at your leisure; the assembly files ought to differ only in
> places where bugs have been fixed.  Ideally there will be no false positives
> from instruction rearrangement or differing register allocations.

I'll try this. Thanks for the suggestion!
latest push has unrelated xpcshell test failures. Running tests locally before asking for moreinfo (uplift to nss).
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #72)
> I'll try this. Thanks for the suggestion!

So I did that, and found that only two files had changed (after removing spurious changes such as different objdir folder). One is pkcs11.s and the other is mpmontg.s. The changes to mpmontg.s look kind of worrisome because a large part of the file is just gone. I'm attaching the before/after.

Also as a sanity check I went through all the .o files generated in the objdir and ensured that I had the .s files that corresponded to each of them (i.e. to make sure I didn't miss any potential differences). Files compiled with the host-compiler showed up on the list, as did files that were generated from .lo scripts or .S files in the source tree. After removing all of those, there were still a few files that I was missing the .s files for and couldn't figure out where they were being generated from, so those files could potentially also be affected. I've included a list of those in the attachment as well.
The compiler-original output for mpmontg is cut in the middle of a function, that seems implausible.
So it is. I didn't notice the .s file was exactly 4096 bytes long either. It must not have gotten rebuilt properly when I ctrl-c'd an earlier build. I rebuilt that file just now and verified that there is no difference with the patched and unpatched compiler. I also figured out where the unknown .o files were coming from (mostly I was just looking for the .s in the wrong directory) and verified those are also unchanged. Looks like it's just pkcs11.c that is affected by this compiler bug.
Looks like we're good to go with that workaround then :)
I sent email to nss-dev to give people a heads-up about this landing in mozilla-beta and other branches as a private patch, and I will do the landing later today if I don't hear anything back from the module owners.
Comment on attachment 721915 [details] [diff] [review]
alt-patch

Review of attachment 721915 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/nss/lib/softoken/pkcs11.c
@@ +1901,5 @@
> +           */
> +#if defined (__arm__) && defined (__GNUC__)
> +	  *crvp = CKR_HOST_MEMORY;
> +#endif
> +          break;

You should check in this patch in the NSS Hg repository as well.

Is this extra break statement necessary to work around the compiler bug?
(In reply to Wan-Teh Chang from comment #79)
> Comment on attachment 721915 [details] [diff] [review]
> alt-patch
> 
> Review of attachment 721915 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: security/nss/lib/softoken/pkcs11.c
> @@ +1901,5 @@
> > +           */
> > +#if defined (__arm__) && defined (__GNUC__)
> > +	  *crvp = CKR_HOST_MEMORY;
> > +#endif
> > +          break;
> 
> You should check in this patch in the NSS Hg repository as well.
> 
> Is this extra break statement necessary to work around the compiler bug?
Yes, I have also tested with the break inside the ifdef and that works well too.
If you land this in NSS hg, you should change the comment to say it's a gcc 4.6 problem when targetting arm (but not thumb).
waiting for upstream nss landing
I moved the break statement into the ifdef block.
I added the gcc version info as Mike suggested and
fixed the indentation in Camilo's patch. Thanks.
Attachment #725057 - Flags: superreview?(cviecco)
Attachment #725057 - Flags: review?(mh+mozilla)
Comment on attachment 725057 [details] [diff] [review]
Patch for the NSS hg repository, by Camilo Viecco

I don't think you need to mention NDK. Especially since this also affects Linux arm distros. Also, as mentioned in comment 62, you may add a !defined(__thumb__)
Attachment #725057 - Flags: review?(mh+mozilla) → review+
Mike: I made the changes you suggested. Is this what you have
in mind?  Thanks.
Attachment #725057 - Attachment is obsolete: true
Attachment #725057 - Flags: superreview?(cviecco)
Attachment #725060 - Flags: superreview?(cviecco)
Attachment #725060 - Flags: review?(mh+mozilla)
Comment on attachment 725060 [details] [diff] [review]
Patch for the NSS hg repository, v2, by Camilo Viecco

Review of attachment 725060 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/softoken/pkcs11.c
@@ +1896,5 @@
>                            NSSLOWKEY_EC_PRIVATE_KEY_VERSION);
> +	if (rv != SECSuccess) {
> +	    crv = CKR_HOST_MEMORY;
> +	    /* The following ifdef is needed for Linux arm distros and
> +	     * Android if using ARMv6 as gcc 4.6 has a bug when targeting

s/if using ARMv6// ; you can even remove "Linux arm distros and Android" if you feel like it, because netbsd and other systems also supports arm, although they might not be using gcc 4.6.
Attachment #725060 - Flags: review?(mh+mozilla) → review+
you might want to add the gcc bug link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56561

also running latest patch on try:
 https://tbpl.mozilla.org/?tree=Try&rev=67f1728e23e0 
(eta 2hrs)
Removed "if using ARMv6" and added the gcc bug report URL.

Patch pushed to the NSS hg repository (NSS 3.15):
http://hg.mozilla.org/projects/nss/rev/a223a1cc6b5f
Attachment #725060 - Attachment is obsolete: true
Attachment #725060 - Flags: superreview?(cviecco)
Attachment #725119 - Flags: checked-in+
(In reply to Wan-Teh Chang from comment #88)
> Created attachment 725119 [details] [diff] [review]
> Patch for the NSS hg repository, v3, by Camilo Viecco
> 
> Removed "if using ARMv6" and added the gcc bug report URL.
> 
> Patch pushed to the NSS hg repository (NSS 3.15):
> http://hg.mozilla.org/projects/nss/rev/a223a1cc6b5f

Given this, can we get an landing to m-c/m-a/m-b? Our second to last FF20 beta is going to build tomorrow, so I think we can move forward with https://bugzilla.mozilla.org/show_bug.cgi?id=832942#c78
Flags: needinfo?(bsmith)
Yes, please use the upstream NSS patch (attachment 725119 [details] [diff] [review]) as the
private patch in mozilla-central.
https://hg.mozilla.org/mozilla-central/rev/dd27b00a412c
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Priority: -- → P1
Target Milestone: --- → 3.15
Verified fixed on m-c as per original steps-to-reproduce

Tested via: Alcatel One-Touch 958 (Android 2.3, ARMv6), HTC Status (Android 2.3, ARMv6), Samsung Galaxy SII (Android 4.0.4, ARMv7 w/ARMv6 APK).
Status: RESOLVED → VERIFIED
Blocks: 921090
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: