Searches to Google.com over SSL cause OOM error page on ARMv6 builds

VERIFIED FIXED in Firefox 20

Status

NSS
Libraries
P1
major
VERIFIED FIXED
5 years ago
4 years ago

People

(Reporter: Mark Tyndall, Assigned: cviecco)

Tracking

({regression, reproducible})

unspecified
3.15
ARM
Android
regression, reproducible
Dependency tree / graph

Firefox Tracking Flags

(firefox19- unaffected, firefox20+ fixed, firefox21+ fixed, firefox22+ verified)

Details

(Whiteboard: [ARMv6])

Attachments

(6 attachments, 2 obsolete attachments)

(Reporter)

Description

5 years ago
Searching from the URL bar (ie type words in bar, tap the "Go"/looking-glass icon --> get Google search results) results in a "Problem loading page" error:

Secure Connection Failed
An error occurred during a connection to www.google.com. security library: memory allocation failure. (Error code: sec_error_no_memory)

The URL turned out to be:

https://www.google.com/search?q=kittens&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:unofficial

I'm running Nightly. about: tells me this is version 21.0a1 (2013-01-20)

about:support tells me of the following "Important Modified Preferences"

browser.cache.disk.capacity                204800
browser.cache.disk.smart_size.first_run    false
browser.cache.disk.smart_size.use_old_max  false
browser.cache.disk.smart_size_cached_value 10240
browser.startup.homepage_override.mstone   21.0a1
dom.mozApps.used                           true
dom.w3c_touch_events.expose                true
extensions.lastAppVersion                  21.0a1
network.cookie.prefsMigrated               true
privacy.donottrackheader.enabled           true

I also have AdBlock Plus (v2.2.1) installed. Disabling this doesn't change things.

The phone is a ZTE Blade running Android 4.0.4 (ICS); it has an ARMv6 CPU, running at 600 MHz, and (according to the "About phone" in the settings) 421 MB of memory.

Browsing other https sites doesn't cause problems, neither does removing the "s" from the "https".
Using the stock ICS browser to search also works as expected (and is using https for searching in this way).
If you browse to about:config and search for “tls” and toggle “security.enable_tls” to false does this work for you?

What are the conditions for raising sec_error_no_memory?
Assignee: nobody → nobody
Component: General → Libraries
Product: Firefox for Android → NSS
Version: Firefox 21 → unspecified
(Reporter)

Comment 2

5 years ago
(In reply to Aaron Train [:aaronmt] from comment #1)
> If you browse to about:config and search for “tls” and toggle
> “security.enable_tls” to false does this work for you?

Yes, toggling security.enable_tls to false makes this work.  Is leaving this set at false OK for day-to-day browsing?
Not a good idea, I'm curious why you would be getting sec_error_no_memory though, hopefully someone tracking this can detail.
Does the error occur all the time, most of the time, sometimes, or rarely?
(Reporter)

Comment 5

5 years ago
(In reply to Brian Smith (:bsmith) from comment #4)
> Does the error occur all the time, most of the time, sometimes, or rarely?

All the time. I've been running Nightly on this phone only since 2013-01-05 (or thereabouts) and I can't recall when I first noticed this. Probably some time last week; but of course I can't actually recall trying a search on the phone before then.


Is there any logging I can turn on?
Flags: needinfo?(bsmith)

Comment 6

5 years ago
I too can reproduce consistently using the armv6 Nightly build on a Samsung Galaxy Fit (GT-5670). Submitting the search using the built-in Google search engine from the addressbar shows the error
I can reproduce this, this is pretty terrible on Firefox for Android; you can't search.
status-firefox19: --- → affected
status-firefox20: --- → affected
status-firefox21: --- → affected
status-firefox22: --- → affected
tracking-firefox20: --- → ?
tracking-firefox21: --- → ?
tracking-firefox22: --- → ?

Updated

5 years ago
Duplicate of this bug: 845797
Steps to reproduce

i) Install Nightly/Aurora/Beta
ii) Access the address bar and input a query such as "mozilla" and hit enter or the arrow
Severity: normal → major
Keywords: reproducible
The only GeckoConsole output

E/GeckoConsole( 1447): An error occurred during a connection to www.google.com:443.
E/GeckoConsole( 1447): 
E/GeckoConsole( 1447): security library: memory allocation failure.
E/GeckoConsole( 1447): 
E/GeckoConsole( 1447): (Error code: sec_error_no_memory)
E/GeckoConsole( 1447): An error occurred during a connection to www.google.com:443.
E/GeckoConsole( 1447): 
E/GeckoConsole( 1447): security library: memory allocation failure.
E/GeckoConsole( 1447): 
E/GeckoConsole( 1447): (Error code: sec_error_no_memory)

Updated

5 years ago
Keywords: regressionwindow-wanted

Updated

5 years ago
Summary: searches to Google.com over SSL cause OOM error page → Searches to Google.com over SSL cause OOM error page on ARMv6 devices
Whiteboard: [ARMv6]

Updated

5 years ago
tracking-firefox19: --- → ?
tracking-firefox20: ? → +
tracking-firefox21: ? → +
tracking-firefox22: ? → +
Aaron noted in email:

"It's intermittent which suggests that, I just happened to hit it a bunch of times this morning and saw the duplicate which reminded me about the original bug. For example, I just reinstalled 19 and now I'm not hitting it but did earlier; and 20/21/22."

Given that, we'll wontfix for FF19 unless we start getting an influx of negative feedback.

Josh - who on your team is in the best position (and has the time) to work with Mobile QA?
Assignee: nobody → joshmoz
status-firefox19: affected → wontfix
tracking-firefox19: ? → -
This seems to happen when my device is really low on memory obviously

MemTotal:         428364 kB
MemFree:            4000 kB


Worse on 21 and 22. Fennec is using so much memory on this device that a simple Google search can't be conducted in one tab after startup.
The symptoms are that when TLSv1 is enabled and the user does a search on https://google.com, we get sec_error_out_of_memory. When the user disables TLSv1 (security.enable_tls=false) or when the user users http://google.com instead, then the problem does not reproduce.

There is basically no difference between TLSv1 and SSLv3 in the PSM and NSS layers in terms of memory usage. The main difference between TLSv1 and SSLv3 is that we send TLS extensions with TLSv1 but not SSLv3. In particular, we send the NPN extension. This makes me think that the problem is related to SPDY. Did we do anything that would have increased the memory usage of SPDY recently? e.g. SPDY server push?

Does the problem reproduce when browsing to https://twitter.com? (Patrick: what are some other SPDY-enabled sites to check?)
Flags: needinfo?(bsmith) → needinfo?(mcmanus)
Please try doing the search on https://google.com again with pref network.http.spdy.enabled=false and report back the results.
Flags: needinfo?(aaron.train)
(In reply to Brian Smith (:bsmith) from comment #13)

> SPDY. Did we do anything that would have increased the memory usage of SPDY
> recently? e.g. SPDY server push?
> 

no.. just small bug fixes.

> Does the problem reproduce when browsing to https://twitter.com? (Patrick:
> what are some other SPDY-enabled sites to check?)

https://facebook.com
https://cloudflare.com
https://www.webtide.com/
https://www.optimizely.com/

If this were a more general spdy memory thing it is kinda weird that the error is consistently coming out of nss instead of spread around. I wonder if google.com is taking spdy as a signal to do something more aggressive than usual at the ssl layer (we know they take that approach on the chrome side.)

can we get a stack and is it consistent?
Flags: needinfo?(mcmanus)
With network.http.spdy.enabled=false, I still get the same results (sec_error_no_memory). I am using Nightly (02/27) on my HTC Status (Android 2.3.4).

(In reply to Brian Smith (:bsmith) from comment #13)
> Does the problem reproduce when browsing to https://twitter.com?

Twitter search works for me; both with spdy and without.
Flags: needinfo?(aaron.train)
Assignee: joshmoz → bsmith
(Reporter)

Comment 17

5 years ago
(In reply to Patrick McManus [:mcmanus] from comment #15)
> > Does the problem reproduce when browsing to https://twitter.com? (Patrick:
> > what are some other SPDY-enabled sites to check?)

twitter WFM

> https://facebook.com

this redirects to a non-https page

> https://cloudflare.com

this brings the same issue as google

> https://www.webtide.com/
> https://www.optimizely.com/

these two WFM.

All these with the latest Nightly: 22.0a1 (2013-02-27)

> can we get a stack and is it consistent?

I could try for a stack if I was told what to do. Special build? Extra app? Hook up to a PC with ADB?
The mobile team implemented custom fonts which increased the baseline memory usage in Fx21 bug 844669.
androiduser in #mobile mentions that 2013-01-03-09-07 works fine and that 2013-01-03-03-09-46 is bad

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=a812ef63de87&tochange=6955309291ee

Bug 799267?
b2897468bd82 Bug 825453 - Bump ARMv6 mozconfigs to use NDK r8c and GCC 4.6
                          (Kartikaya Gupta, r=blassey,ted)
da2b24b1fc7c Bug 799267 - AuthCertificate Telemetry (measuring first auth,
                          pkix and classic) (cviecco, r=bsmith)
5c8e41454810 Bug 825151 - Bump ARMv7 mozconfigs to use NDK r8c and GCC 4.6
                          (Kartikaya Gupta, r=blassey,ted)

Also possibly related:
https://bugzilla.mozilla.org/show_bug.cgi?id=772144
Keywords: regression
Camilo, can you see if this is a regression caused by your patch?
Assignee: bsmith → cviecco
(Assignee)

Comment 22

5 years ago
Mark (reporter)

As I do not have an arm6 device available. Would you mind installing the version located here: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/cviecco@mozilla.com-689d47ffe068/try-android-armv6/ and let me know if this solves the issue

Comment 23

5 years ago
I installed the try build on my GT-5670 but I still get the error.
(Reporter)

Comment 24

5 years ago
(In reply to Camilo Viecco (:cviecco) from comment #22)
> Mark (reporter)
> 
> As I do not have an arm6 device available. Would you mind installing the
> version located here:
> http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/cviecco@mozilla.
> com-689d47ffe068/try-android-armv6/ and let me know if this solves the issue

as with John, that build didn't fix the issue at either google or cloudflare.
(Assignee)

Comment 25

5 years ago
John, Mark thank you

Can you try this one?
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/cviecco@mozilla.com-2945b87a34ac/try-android-armv6/

Comment 26

5 years ago
I got the error again.

Is there any way to provide more information for this bug? Do debug builds spit out more information in logcat?

Comment 27

5 years ago
I'm still able to reproduce this on Firefox for Android 20 Beta2 build 2 on LG Slider (Android 2.3.4)
We are now into our third beta for Firefox 20; we need a fix here.
https://play.google.com/store/apps/details?id=org.mozilla.firefox_beta&reviewId=Z3A6QU9xcFRPR3NLNHh2Z3VrODB4M25vM1VFM0Y1ckh2cGR5M1BCTWtXOTVudFkydXlFVmJPS2NSZTJZOS1NcjdPMy1XdlZsZUdUdFBPY3FJaEZpS2FwZmc
(Assignee)

Comment 30

5 years ago
It seems like it is something to to with the build system. I have not been able to rebuld for Jan 1/2/3 but a rebuild on Jan 31 version (ndk r5c instead of r8c does not show the problem).

Updated

5 years ago
Blocks: 825453
Try push with the old NDK to confirm: https://tbpl.mozilla.org/?tree=Try&rev=e81dde77bd40
(Assignee)

Comment 32

5 years ago
I am pretty sure that this related to the build system.

> hg update -d 'Jan 31'
> cat .mozconfig
 # Add the correct paths here:
 #ac_add_options --with-android-ndk="$HOME/android/android-ndk-r8c"
 #ac_add_options --with-android-sdk="$HOME/android/android-sdk-linux/platforms/android-16"
 #ac_add_options --with-android-version=8
 ac_add_options --with-android-ndk="/opt/android-ndk-r5c"
 ac_add_options --with-android-sdk="/opt/android-sdk-linux/platforms/android-16"
 ac_add_options --with-android-version=5


 # android options
 ac_add_options --enable-application=mobile/android
 ac_add_options --target=arm-linux-androideabi
 ac_add_options --with-ccache

 mk_add_options MOZ_OBJDIR=./objdir-droid
 mk_add_options MOZ_MAKE_FLAGS="-j14 -s"

 ac_add_options --with-arch=armv6
export MOZ_PKG_SPECIAL=armv6
ac_add_options --disable-ion

> make -f client.mk build_and_deploy

and it works. With the version 8 ndk I the bug reappears.
If you are able to reproduce the problem, can you try to attach gdb and debug it? We can revert the NDK change if it comes to that but I'd still like to re-enable it eventually and we need to figure out what's going wrong here in order to do that. I don't know where the sec_error_no_memory error comes from and how it propagates so it would be easier for somebody who knows that code to investigate.
I discussed this a bit with cviecco on IRC, and he agreed to look into it a bit further, to see where the sec_error_no_memory error is coming from.

With no knowledge of the relevant code, I assume this error code is a result of a failed malloc and I would hypothesize that some latent bug in our code is being tickled by the NDK change such that we're passing a bad value to the malloc. Usually android kills fennec directly if we're *actually* eating up all the memory, and clearly we still have enough memory to render the error page so I suspect it's more likely that we're just making some sort of abnormal allocation request that's failing.

If we are unable to make much headway on this in the next few days I can revert back to NDK r5c on beta while we continue investigation.
Blocks: 824169
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #31)
> Try push with the old NDK to confirm:
> https://tbpl.mozilla.org/?tree=Try&rev=e81dde77bd40

Confirming this works on the HTC Status (NDK R5C)
Should have tried this earlier, you can use the ARMv6 builds on an ARMv7 device too; also able to reproduce on the Nexus 4.
Summary: Searches to Google.com over SSL cause OOM error page on ARMv6 devices → Searches to Google.com over SSL cause OOM error page on ARMv6 builds
status-firefox19: wontfix → unaffected
(Assignee)

Comment 37

5 years ago
definetely a linker problem:

relevant adb locat lines:
v5 (bug not shown)
03-05 18:38:25.422 E/GeckoLibLoad( 9686): Load nss start
03-05 18:38:25.892 E/GeckoLinker( 9686): /data/app/org.mozilla.fennec_cviecco-1.apk!/libnspr4.so: Warning: relocation to NULL @0x0002c6b4
03-05 18:38:25.912 E/GeckoLinker( 9686): /data/app/org.mozilla.fennec_cviecco-1.apk!/libnspr4.so: Warning: relocation to NULL @0x0002c45c for symbol "__cxa_begin_cleanup"
03-05 18:38:25.912 E/GeckoLinker( 9686): /data/app/org.mozilla.fennec_cviecco-1.apk!/libnspr4.so: Warning: relocation to NULL @0x0002c578 for symbol "__cxa_type_match"
03-05 18:38:25.932 E/GeckoLibLoad( 9686): Load nss done


v8 (with problem, notice libnss now appears with a warning).
03-05 18:04:58.922 E/GeckoLibLoad( 9142): Load nss start
03-05 18:04:59.102 E/GeckoLinker( 9142): /data/app/org.mozilla.fennec_cviecco-1.apk!/libnss3.so: Warning: unhandled flags #8 not handled
03-05 18:04:59.132 E/GeckoLinker( 9142): /data/app/org.mozilla.fennec_cviecco-1.apk!/libnssutil3.so: Warning: unhandled flags #8 not handled
03-05 18:04:59.142 E/GeckoLinker( 9142): /data/app/org.mozilla.fennec_cviecco-1.apk!/libplc4.so: Warning: unhandled flags #8 not handled
03-05 18:04:59.152 E/GeckoLinker( 9142): /data/app/org.mozilla.fennec_cviecco-1.apk!/libnspr4.so: Warning: unhandled flags #8 not handled
03-05 18:04:59.172 E/GeckoLinker( 9142): /data/app/org.mozilla.fennec_cviecco-1.apk!/libplds4.so: Warning: unhandled flags #8 not handled
03-05 18:04:59.182 E/GeckoLibLoad( 9142): Load nss done
(Assignee)

Updated

5 years ago
Assignee: cviecco → bugmail.mozilla
CC'ing some others who might have some NDK5 vs NDK8 linker knowledge
(In reply to Camilo Viecco (:cviecco) from comment #37)
> definetely a linker problem:
> 
> relevant adb locat lines:
> v5 (bug not shown)
> 03-05 18:38:25.422 E/GeckoLibLoad( 9686): Load nss start
> 03-05 18:38:25.892 E/GeckoLinker( 9686):
> /data/app/org.mozilla.fennec_cviecco-1.apk!/libnspr4.so: Warning: relocation
> to NULL @0x0002c6b4
> 03-05 18:38:25.912 E/GeckoLinker( 9686):
> /data/app/org.mozilla.fennec_cviecco-1.apk!/libnspr4.so: Warning: relocation
> to NULL @0x0002c45c for symbol "__cxa_begin_cleanup"
> 03-05 18:38:25.912 E/GeckoLinker( 9686):
> /data/app/org.mozilla.fennec_cviecco-1.apk!/libnspr4.so: Warning: relocation
> to NULL @0x0002c578 for symbol "__cxa_type_match"
> 03-05 18:38:25.932 E/GeckoLibLoad( 9686): Load nss done
> 
> 
> v8 (with problem, notice libnss now appears with a warning).
> 03-05 18:04:58.922 E/GeckoLibLoad( 9142): Load nss start
> 03-05 18:04:59.102 E/GeckoLinker( 9142):
> /data/app/org.mozilla.fennec_cviecco-1.apk!/libnss3.so: Warning: unhandled
> flags #8 not handled
> 03-05 18:04:59.132 E/GeckoLinker( 9142):
> /data/app/org.mozilla.fennec_cviecco-1.apk!/libnssutil3.so: Warning:
> unhandled flags #8 not handled
> 03-05 18:04:59.142 E/GeckoLinker( 9142):
> /data/app/org.mozilla.fennec_cviecco-1.apk!/libplc4.so: Warning: unhandled
> flags #8 not handled
> 03-05 18:04:59.152 E/GeckoLinker( 9142):
> /data/app/org.mozilla.fennec_cviecco-1.apk!/libnspr4.so: Warning: unhandled
> flags #8 not handled
> 03-05 18:04:59.172 E/GeckoLinker( 9142):
> /data/app/org.mozilla.fennec_cviecco-1.apk!/libplds4.so: Warning: unhandled
> flags #8 not handled
> 03-05 18:04:59.182 E/GeckoLibLoad( 9142): Load nss done

None of these should be a problem.
I investigated this some more on an ARMv6 debug build with gdb, and the problem seems to be happening in the call to PORT_SetError(0) in pk11slot.c [1]. This call goes to secport.c:151 [2] which calls PR_SetError(0, 0) [3]. This is where things go wrong, because when stepping through with the debugger the error code coming into this method is 0, but ends up getting set to -8173. I've attached some of my gdb trace, with disassembly. Coming into the method, errorCode is stored in $r0 and copied to $r4. At address 0x4a9d380c after the call to PR_GetCurrentThread, $r4 is still correct (value is 0), but then as soon as it executes the "mov r3, #0" instruction $r4 ends up at 4294959123 (equivalent to -8173). As far as I can tell this makes no sense, particularly since the problem doesn't happen on *every* call to PR_SetError, only some of them.

[1] http://mxr.mozilla.org/mozilla-central/source/security/nss/lib/pk11wrap/pk11slot.c#2059
[2] http://mxr.mozilla.org/mozilla-central/source/security/nss/lib/util/secport.c#151
[3] http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/misc/prerror.c#23
Created attachment 721676 [details]
GDB trace
Actually after a few more runs it looks like the $r0 is bad even going into the PR_SetError method. I'm not sure why it showed up differently in the above trace.

The call from secport.c:151 to PR_SetError goes through some sort of thunk layer:

(gdb) x/i $pc
=> 0x4a8a1260 <PORT_SetError_Util+16>:	b	0x4a8964b4
(gdb) si
0x4a8964b4 in ?? () from /mozilla-git/obj-android-armv6/dist/bin/libnssutil3.so
(gdb) x/3i $pc
=> 0x4a8964b4:	add	r12, pc, #0, 12
   0x4a8964b8:	add	r12, r12, #135168	; 0x21000
   0x4a8964bc:	ldr	pc, [r12, #2532]!	; 0x9e4

And $r0 changes from 0 to 4294959123 when the "ldr" instruction in the thunk layer is executed:

(gdb) si
0x4a8964b8 in ?? () from /mozilla-git/obj-android-armv6/dist/bin/libnssutil3.so
(gdb) si
0x4a8964bc in ?? () from /mozilla-git/obj-android-armv6/dist/bin/libnssutil3.so
(gdb) p $r12
$99 = 1250653372
(gdb) x/w $r12+2532
0x4a8b7ea0 <_GLOBAL_OFFSET_TABLE_+56>:	0x4a9d37fc
(gdb) p $r0
$100 = 0
(gdb) si

Breakpoint 7, PR_SetError (code=-8173, osErr=0) at /mozilla-git/nsprpub/pr/src/misc/prerror.c:24
24	{
(gdb) p $r0
$101 = 4294959123
(gdb) x/i $pc
=> 0x4a9d37fc <PR_SetError>:	push	{r3, r4, r5, lr}
After much debugging with glandium, we (well, he) figured out what was going on. GCC is miscompiling the sftk_mkPrivKey function in security/nss/lib/softoken/pkcs11.c file. Specifically, after the call to DER_SetUInteger in that function [1], the "bad" compilation compares the return value to an unused register r10, and then returns the CKR_HOST_MEMORY return value since the comparison fails. Here is a disassembly of the relevant code from NDK r5c:

ead8:   ebffd023    bl  2b6c <NSS_3.4+0x2b6c>
eadc:   e3500000    cmp r0, #0  ; 0x0
eae0:   13a00002    movne   r0, #2  ; 0x2             

and the bad code from NDK r8c:

   0x4b06d0d0 <+604>:   bl  0x4b062ce8
   0x4b06d0d4 <+608>:   mov r3, #0
   0x4b06d0d8 <+612>:   cmp r10, r3
   0x4b06d0dc <+616>:   movne   r0, #2
   0x4b06d0e0 <+620>:   moveq   r0, r3

[1] http://mxr.mozilla.org/mozilla-central/source/security/nss/lib/softoken/pkcs11.c#1895
Created attachment 721780 [details] [diff] [review]
Patch

We came up with this patch which seems to do enough code rearrangement to make the compiler generate different code. Try build in progress at https://tbpl.mozilla.org/?tree=Try&rev=af9109e8baef
(Assignee)

Comment 45

5 years ago
Awesome work!
Testing locally a different version of the patch (little less disrupting of the control flow).
(Assignee)

Comment 46

5 years ago
push of an alternative patch (works locally)
https://tbpl.mozilla.org/?tree=Try&rev=94d253182343
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #43)
> After much debugging with glandium, we (well, he) figured out what was going

:kats, :glandium, :cviecco - that's pretty sweet stuff.

Comment 48

5 years ago
The trybuild shows no OOM messages when searching Google from the address-bar.
Wan-Teh, Ryan: I am pretty sure you use at least parts of NSS in your Android Chromium builds. If so, this bug may affect you.

I think the next step is to try to reproduce this using NSS as built using NSS's build system, to see if this is a bug in NSS or a bug in Mozilla's overrides of NSS's build system.

We should also see if the patch in bug 838989 resolves the issue or not, because IIRC that patch deals with ARMv6 compilation stuff.

Also, we should see if DER_SetUInteger is perhaps triggering undefined behavior due to some C language spec. violation. In particular, aliasing violations are a high probability.

Failing all of that, we should accept this patch. But, I do not like taking this patch when we have no idea why it works when the original code doesn't, given that it is supposed to be logically equivalent to the current code.

Comment 50

5 years ago
Chrome does not use NSS on Android at all.
(Assignee)

Comment 51

5 years ago
Created attachment 721915 [details] [diff] [review]
alt-patch
Also, what was the motivation for upgrading to NDK 8? I am worried that if we have a compiler bug that affects *this* code, the same bug will affect other code as well. Do we have any way of being confident one way or the other?
(In reply to Brian Smith (:bsmith) from comment #49)
> We should also see if the patch in bug 838989 resolves the issue or not,
> because IIRC that patch deals with ARMv6 compilation stuff.

Even if that patch fixes the problem, that patch looks too complicated to uplift to beta (I could be wrong though). I would be more comfortable with landing the hack and uplifting it to beta, and then possibly reverting the hack and landing that patch instead on central, if that resolves the issue.

> Also, we should see if DER_SetUInteger is perhaps triggering undefined
> behavior due to some C language spec. violation. In particular, aliasing
> violations are a high probability.

It looked ok to me, but feel free to take a look.

> 
> Failing all of that, we should accept this patch. But, I do not like taking
> this patch when we have no idea why it works when the original code doesn't,
> given that it is supposed to be logically equivalent to the current code.

FWIW I compiled that file with gcc-4.7 from NDK r8d using the same command line options and environment, and the assembly in the object file for that came out fine:

 240:   ebfffffe    bl  0 <DER_SetUInteger>
            240: R_ARM_CALL DER_SetUInteger
 244:   e3a03000    mov r3, #0
 248:   e1500003    cmp r0, r3
 24c:   13a00002    movne   r0, #2
 250:   01a00003    moveq   r0, r3

So AFAICT this is a bug in gcc-4.6 in the armv6 codegen only, and has been fixed in gcc-4.7.

(In reply to Brian Smith (:bsmith) from comment #52)
> Also, what was the motivation for upgrading to NDK 8?

Various and sundry, mostly smaller libraries and general perf improvements.

> I am worried that if
> we have a compiler bug that affects *this* code, the same bug will affect
> other code as well. Do we have any way of being confident one way or the
> other?

It might affect other code but I'm inclined to assume it doesn't until we find a case that proves otherwise. Alternatively we could track down the fix in the gcc source code, add a diagnostic message to gcc that gets printed out whenever the bug would have been triggered, and then compile fennec with this modified gcc and see how many times it gets hit.
:cviecco, I like your alt-patch better than my original patch, so reassigning this bug back to you. If you're happy with the patch please request review so we can get this in and uplifted as soon as possible.
Assignee: bugmail.mozilla → cviecco
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #53)
> It might affect other code but I'm inclined to assume it doesn't until we
> find a case that proves otherwise. Alternatively we could track down the fix
> in the gcc source code (...)

We should ask gcc developers.

Comment 56

5 years ago
Comment on attachment 721915 [details] [diff] [review]
alt-patch

>+          /* The following ifdef is completelly unnecesary, but needed for
>+           * android if using ndk8c and arm6 as it has a compiler bug
>+           */

It would be nice to update this comment with the compiler version when
we find out more from the GCC developers.
Attachment #721915 - Flags: review+

Comment 57

5 years ago
(In reply to Brian Smith (:bsmith) from comment #49)
>
> Also, we should see if DER_SetUInteger is perhaps triggering undefined
> behavior due to some C language spec. violation. In particular, aliasing
> violations are a high probability.

I reviewed DER_SetUInteger and saw no problem. There is no pointer
aliasing.
A few additional data points:
- It's not limited to armv6. It happens with -march=armv7-a -marm
- The minimal flags I've reproduced it with an already preprocessed and slightly simplified source are -marm -Os.
- It doesn't happen at -O2 (default for plain NSS build, but softoken has ALLOW_OPT_CODE_SIZE=1)
http://gcc.gnu.org/ml/gcc/2013-03/msg00051.html
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56561
(Assignee)

Comment 61

5 years ago
Mike, thank you. Do the r+ patch works around the issue on arm7 too? If so we can probably land it today.
(In reply to Camilo Viecco (:cviecco) from comment #61)
> Mike, thank you. Do the r+ patch works around the issue on arm7 too? If so
> we can probably land it today.

The issue is present on any arm architecture as long as it's not built as thumb. So you could add a !defined(__thumb__)
(Assignee)

Comment 63

5 years ago
latest push. Testing on all platforms, updated patch to include gcc bug id.
  https://tbpl.mozilla.org/?tree=Try&rev=b18227edec50
Created attachment 722713 [details] [diff] [review]
GCC 4.6 patch to diagnose if the error happens elsewhere

As an update: I built a modified version of the gcc-4.6 toolchain in NDK r8c that includes the GCC patch that fixes the problem (as per [1]), and also added some output for when that code is exercised. I've attached the patch I used (I had to backport the actual patch to the gcc-4.6 code).

Then I built the latest mozilla-central with that modified toolchain and verified that the pkcs11.o file contains the proper assembly (and works fine on my phone). Unfortunately, the gcc code in question gets exercised on pretty much every file, and I don't know how to tell which hits would have tickled the bug.

[1] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56561#c2
needinfo'ing nathan as glandium suggested his experience with GCC might be able to help here. Basically, given the GCC patch in comment 64, I'd like to find out when the buggy codepath in GCC is exercised, so that we can find if other files in the mozilla codebase will be impacted.

For more backstory you can start reading at comment 43 (or comment 0, if you want everything).
Flags: needinfo?(nfroyd)
Keywords: regressionwindow-wanted
(Reporter)

Comment 66

5 years ago
(In reply to Camilo Viecco (:cviecco) from comment #63)
> latest push. Testing on all platforms, updated patch to include gcc bug id.
>   https://tbpl.mozilla.org/?tree=Try&rev=b18227edec50

I installed the build at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/cviecco@mozilla.com-b18227edec50/try-android-armv6/

and this fixes the issue. \o/
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #65)
> needinfo'ing nathan as glandium suggested his experience with GCC might be
> able to help here. Basically, given the GCC patch in comment 64, I'd like to
> find out when the buggy codepath in GCC is exercised, so that we can find if
> other files in the mozilla codebase will be impacted.

It sounds like you are asking "what's the problematic code pattern so that we can audit several million lines of code to find other cases which might be fixed?"  That sounds crazy, so could you please clarify what you're asking for?  I don't see what the value of looking at other files in m-c is, since we already know this is a compiler bug.
Flags: needinfo?(nfroyd) → needinfo?(bugmail.mozilla)
Our "fix" is to juggle the code around to avoid the compiler bug. If we switched compilers to one without the bug, then yes, I would agree with you that there's no point in looking at other files in m-c. If we stick with the buggy compiler though then it is useful to know how much of our code will be impacted.
Flags: needinfo?(bugmail.mozilla)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #68)
> Our "fix" is to juggle the code around to avoid the compiler bug. If we
> switched compilers to one without the bug, then yes, I would agree with you
> that there's no point in looking at other files in m-c. If we stick with the
> buggy compiler though then it is useful to know how much of our code will be
> impacted.

Ah, thanks for the clarification.  Sticking with the buggy compiler still sounds like crazy talk, though. ;)

I don't see a usefully specific code pattern here; the best I can offer in lieu of sitting down with the compiler is "comparisons to zero" which unhelpfully translates to "everywhere".

One suggestion for guesstimating how much of a fix this provides:

- Build the source tree with extra options '-save-temps -fno-schedule-insns -fno-schedule-insns2' with an unmodified (buggy) and modified (fixed) compiler.  -save-temps will produce .s files next to the .o files.  -fno-schedule-insns* prevents the compiler from rearranging the instruction stream too much.
- Diff .s files at your leisure; the assembly files ought to differ only in places where bugs have been fixed.  Ideally there will be no false positives from instruction rearrangement or differing register allocations.
(In reply to Nathan Froyd (:froydnj) from comment #69)
> - Diff .s files at your leisure; the assembly files ought to differ only in
> places where bugs have been fixed.  Ideally there will be no false positives
> from instruction rearrangement or differing register allocations.

Isn't there chances of label numbering changes?
(In reply to Mike Hommey [:glandium] from comment #70)
> (In reply to Nathan Froyd (:froydnj) from comment #69)
> > - Diff .s files at your leisure; the assembly files ought to differ only in
> > places where bugs have been fixed.  Ideally there will be no false positives
> > from instruction rearrangement or differing register allocations.
> 
> Isn't there chances of label numbering changes?

Sure, it's not a foolproof method.  But usually diffs in label numbers are harmless or they're surrounded by larger chunks of significant code changes.

Using the -y option to diff can be helpful for looking at assembly dumps too.

It's also worth noting that just because you (kats) were seeing those debug messages fire a lot doesn't necessarily mean that differences in assembly were always produced.  combine.c is trying differing ways of sticking instructions together and you were likely seeing backtracking of unsuccessful combinations, rather than combinations that always affected final code generation.
(In reply to Nathan Froyd (:froydnj) from comment #71)
> It's also worth noting that just because you (kats) were seeing those debug
> messages fire a lot doesn't necessarily mean that differences in assembly
> were always produced.  combine.c is trying differing ways of sticking
> instructions together and you were likely seeing backtracking of
> unsuccessful combinations, rather than combinations that always affected
> final code generation.

Yeah, I figured it might be something like that when I was looking through the code. :(

(In reply to Nathan Froyd (:froydnj) from comment #69)
> One suggestion for guesstimating how much of a fix this provides:
> 
> - Build the source tree with extra options '-save-temps -fno-schedule-insns
> -fno-schedule-insns2' with an unmodified (buggy) and modified (fixed)
> compiler.  -save-temps will produce .s files next to the .o files. 
> -fno-schedule-insns* prevents the compiler from rearranging the instruction
> stream too much.
> - Diff .s files at your leisure; the assembly files ought to differ only in
> places where bugs have been fixed.  Ideally there will be no false positives
> from instruction rearrangement or differing register allocations.

I'll try this. Thanks for the suggestion!
(Assignee)

Comment 73

5 years ago
latest push has unrelated xpcshell test failures. Running tests locally before asking for moreinfo (uplift to nss).
Created attachment 723192 [details]
Zip containing affected .s files

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #72)
> I'll try this. Thanks for the suggestion!

So I did that, and found that only two files had changed (after removing spurious changes such as different objdir folder). One is pkcs11.s and the other is mpmontg.s. The changes to mpmontg.s look kind of worrisome because a large part of the file is just gone. I'm attaching the before/after.

Also as a sanity check I went through all the .o files generated in the objdir and ensured that I had the .s files that corresponded to each of them (i.e. to make sure I didn't miss any potential differences). Files compiled with the host-compiler showed up on the list, as did files that were generated from .lo scripts or .S files in the source tree. After removing all of those, there were still a few files that I was missing the .s files for and couldn't figure out where they were being generated from, so those files could potentially also be affected. I've included a list of those in the attachment as well.
The compiler-original output for mpmontg is cut in the middle of a function, that seems implausible.
So it is. I didn't notice the .s file was exactly 4096 bytes long either. It must not have gotten rebuilt properly when I ctrl-c'd an earlier build. I rebuilt that file just now and verified that there is no difference with the patched and unpatched compiler. I also figured out where the unknown .o files were coming from (mostly I was just looking for the .s in the wrong directory) and verified those are also unchanged. Looks like it's just pkcs11.c that is affected by this compiler bug.
Looks like we're good to go with that workaround then :)
I sent email to nss-dev to give people a heads-up about this landing in mozilla-beta and other branches as a private patch, and I will do the landing later today if I don't hear anything back from the module owners.

Comment 79

5 years ago
Comment on attachment 721915 [details] [diff] [review]
alt-patch

Review of attachment 721915 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/nss/lib/softoken/pkcs11.c
@@ +1901,5 @@
> +           */
> +#if defined (__arm__) && defined (__GNUC__)
> +	  *crvp = CKR_HOST_MEMORY;
> +#endif
> +          break;

You should check in this patch in the NSS Hg repository as well.

Is this extra break statement necessary to work around the compiler bug?
(Assignee)

Comment 80

5 years ago
(In reply to Wan-Teh Chang from comment #79)
> Comment on attachment 721915 [details] [diff] [review]
> alt-patch
> 
> Review of attachment 721915 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: security/nss/lib/softoken/pkcs11.c
> @@ +1901,5 @@
> > +           */
> > +#if defined (__arm__) && defined (__GNUC__)
> > +	  *crvp = CKR_HOST_MEMORY;
> > +#endif
> > +          break;
> 
> You should check in this patch in the NSS Hg repository as well.
> 
> Is this extra break statement necessary to work around the compiler bug?
Yes, I have also tested with the break inside the ifdef and that works well too.
If you land this in NSS hg, you should change the comment to say it's a gcc 4.6 problem when targetting arm (but not thumb).
(Assignee)

Comment 82

5 years ago
waiting for upstream nss landing

Comment 83

5 years ago
Created attachment 725057 [details] [diff] [review]
Patch for the NSS hg repository, by Camilo Viecco

I moved the break statement into the ifdef block.
I added the gcc version info as Mike suggested and
fixed the indentation in Camilo's patch. Thanks.
Attachment #725057 - Flags: superreview?(cviecco)
Attachment #725057 - Flags: review?(mh+mozilla)
Comment on attachment 725057 [details] [diff] [review]
Patch for the NSS hg repository, by Camilo Viecco

I don't think you need to mention NDK. Especially since this also affects Linux arm distros. Also, as mentioned in comment 62, you may add a !defined(__thumb__)
Attachment #725057 - Flags: review?(mh+mozilla) → review+

Comment 85

5 years ago
Created attachment 725060 [details] [diff] [review]
Patch for the NSS hg repository, v2, by Camilo Viecco

Mike: I made the changes you suggested. Is this what you have
in mind?  Thanks.
Attachment #725057 - Attachment is obsolete: true
Attachment #725057 - Flags: superreview?(cviecco)
Attachment #725060 - Flags: superreview?(cviecco)
Attachment #725060 - Flags: review?(mh+mozilla)
Comment on attachment 725060 [details] [diff] [review]
Patch for the NSS hg repository, v2, by Camilo Viecco

Review of attachment 725060 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/softoken/pkcs11.c
@@ +1896,5 @@
>                            NSSLOWKEY_EC_PRIVATE_KEY_VERSION);
> +	if (rv != SECSuccess) {
> +	    crv = CKR_HOST_MEMORY;
> +	    /* The following ifdef is needed for Linux arm distros and
> +	     * Android if using ARMv6 as gcc 4.6 has a bug when targeting

s/if using ARMv6// ; you can even remove "Linux arm distros and Android" if you feel like it, because netbsd and other systems also supports arm, although they might not be using gcc 4.6.
Attachment #725060 - Flags: review?(mh+mozilla) → review+
(Assignee)

Comment 87

5 years ago
you might want to add the gcc bug link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56561

also running latest patch on try:
 https://tbpl.mozilla.org/?tree=Try&rev=67f1728e23e0 
(eta 2hrs)

Comment 88

5 years ago
Created attachment 725119 [details] [diff] [review]
Patch for the NSS hg repository, v3, by Camilo Viecco

Removed "if using ARMv6" and added the gcc bug report URL.

Patch pushed to the NSS hg repository (NSS 3.15):
http://hg.mozilla.org/projects/nss/rev/a223a1cc6b5f
Attachment #725060 - Attachment is obsolete: true
Attachment #725060 - Flags: superreview?(cviecco)
Attachment #725119 - Flags: checked-in+
(In reply to Wan-Teh Chang from comment #88)
> Created attachment 725119 [details] [diff] [review]
> Patch for the NSS hg repository, v3, by Camilo Viecco
> 
> Removed "if using ARMv6" and added the gcc bug report URL.
> 
> Patch pushed to the NSS hg repository (NSS 3.15):
> http://hg.mozilla.org/projects/nss/rev/a223a1cc6b5f

Given this, can we get an landing to m-c/m-a/m-b? Our second to last FF20 beta is going to build tomorrow, so I think we can move forward with https://bugzilla.mozilla.org/show_bug.cgi?id=832942#c78
Flags: needinfo?(bsmith)

Comment 90

5 years ago
Yes, please use the upstream NSS patch (attachment 725119 [details] [diff] [review]) as the
private patch in mozilla-central.
https://hg.mozilla.org/integration/mozilla-inbound/rev/dd27b00a412c
https://hg.mozilla.org/releases/mozilla-aurora/rev/b802cd3011a1
https://hg.mozilla.org/releases/mozilla-beta/rev/9fe304cd8db0
status-firefox20: affected → fixed
status-firefox21: affected → fixed
Flags: needinfo?(bsmith)
https://hg.mozilla.org/mozilla-central/rev/dd27b00a412c
Status: NEW → RESOLVED
Last Resolved: 5 years ago
status-firefox22: affected → fixed
Resolution: --- → FIXED

Updated

5 years ago
Priority: -- → P1
Target Milestone: --- → 3.15
Verified fixed on m-c as per original steps-to-reproduce

Tested via: Alcatel One-Touch 958 (Android 2.3, ARMv6), HTC Status (Android 2.3, ARMv6), Samsung Galaxy SII (Android 4.0.4, ARMv7 w/ARMv6 APK).
Status: RESOLVED → VERIFIED
status-firefox22: fixed → verified
Blocks: 921090
You need to log in before you can comment on or make changes to this bug.