Closed Bug 1273048 Opened 4 years ago Closed 4 years ago

x86 Android crashes in mozilla::PseudoElementForStyleContext

Categories

(Core :: Layout, defect, critical)

x86
Android
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
firefox49 --- fixed

People

(Reporter: njn, Assigned: dbaron)

References

Details

(Keywords: crash)

Crash Data

Attachments

(2 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-eb3eccff-dc1a-4034-af8e-8a9732160511.
=============================================================

This crash signature first appeared in Fennec Nightly 20160511030221, and has occurred 5 times since, across three installations.

I don't see any obvious changes in and around this code recently.

One interesting thing is that all 5 crashes have occurred on x86/Android devices.

dbaron, any ideas?
Flags: needinfo?(dbaron)
The crash address for the crashes seems to be build-specific, and not near either any of the registers, nor in any of the mapped modules.

The CPU for *all* of the Android-x86 crashes is:
GenuineIntel family 6 model 55 stepping 8 | 4

(This CPU is the second most common Android-x86 CPU in crash-stats, responsible for 12.66% of crash reports:
https://crash-stats.mozilla.com/search/?product=FennecAndroid&cpu_arch=x86&_facets=cpu_info&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-cpu_info
The distribution on nightly is quite different, where it's 3rd and 13.08%:
https://crash-stats.mozilla.com/search/?product=FennecAndroid&cpu_arch=x86&release_channel=nightly&date=%3E2016-01-01&_facets=cpu_info&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-cpu_info .)
Flags: needinfo?(dbaron)
Bug 1269976 is about pseudo elements and landed on m-c 2016-05-10, fwiw.
My guess is that the compiler somehow removes the sanity check
> if (aPseudoType >= CSSPseudoElementType::Count) {
above the crashing line.

The patch in bug 1269976 moves kPseudoElementFlags from a local const array to a scoped one, which may change where the array lives in memory, and the new location may have less valid space after...
(In reply to Mats Palmgren (:mats) from comment #2)
> Bug 1269976 is about pseudo elements and landed on m-c 2016-05-10, fwiw.

Seems pretty likely.  Given the nightly crash data, I think there's a >75% chance or so that the regression is in the one-day range:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=043082cb7bd8490c60815f67fbd1f33323ad7663&tochange=674a552743785c28c75866969aad513bd8eaf6ae
(especially since the May 9 and May 10 builds were built from the same changeset).
Blocks: 1269976
Summary: Crash in mozilla::PseudoElementForStyleContext → x86 Android crashes in mozilla::PseudoElementForStyleContext
(In reply to David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) from comment #1)
> The CPU for *all* of the Android-x86 crashes is:
> GenuineIntel family 6 model 55 stepping 8 | 4

This is no longer true.



This is also now 17% of x86 Android crashes for 49.0a1:
https://crash-stats.mozilla.com/search/?product=FennecAndroid&version=49.0a1&cpu_arch=x86&date=%3E2016-04-01&_facets=signature
Hmmm, I have no idea how to disasm the xul library in x86 Android build... I heard that libxul.so there is compressed by szip, which I don't find any tool to decompress. And I don't know how to get the proper symbols either...
Per IRC discussion between dbaron and glandium, to decompress the .so file, you need to:
> mach configure --enable-project=mobile/android
> mach build pre-export
> mach build -C . mozglue/linker/host
> $objdir/dist/host/bin/szip -d libxul.so
I debugged bp-1b4efb55-ecc2-42e4-98c7-27a872160601 a bit

In order to objdump the libxul.so from the build, I needed to compile szip with:
unset MOZCONFIG; MOZ_LINKER=1 ./mach configure --with-system-zlib ; ./mach build pre-export ; ./mach build -C . mozglue/linker/host

Then I could determine that the function we want is:

017eb8ad <mozilla::PseudoElementForStyleContext(nsIFrame*, mozilla::CSSPseudoEle
mentType)>:
 17eb8ad:       80 fa 16                cmp    dl,0x16
 17eb8b0:       77 3a                   ja     17eb8ec <mozilla::PseudoElementFo
rStyleContext(nsIFrame*, mozilla::CSSPseudoElementType)+0x3f>
 17eb8b2:       55                      push   ebp
 17eb8b3:       89 e5                   mov    ebp,esp
 17eb8b5:       56                      push   esi
 17eb8b6:       89 c6                   mov    esi,eax
 17eb8b8:       53                      push   ebx
 17eb8b9:       e8 b2 5a b5 fe          call   341370 <__x86.get_pc_thunk.bx>
 17eb8be:       81 c3 6a c6 17 02       add    ebx,0x217c66a
 17eb8c4:       f6 84 93 98 1d 33 ff    test   BYTE PTR [ebx+edx*4-0xcce268],0x4
 17eb8cb:       04 
 17eb8cc:       75 14                   jne    17eb8e2 <mozilla::PseudoElementForStyleContext(nsIFrame*, mozilla::CSSPseudoElementType)+0x35>
 17eb8ce:       83 ec 0c                sub    esp,0xc
 17eb8d1:       52                      push   edx
 17eb8d2:       e8 0b 3b f9 ff          call   177f3e2 <nsCSSPseudoElements::PseudoElementSupportsUserActionState(mozilla::CSSPseudoElementType)>
 17eb8d7:       83 c4 10                add    esp,0x10
 17eb8da:       84 c0                   test   al,al
 17eb8dc:       75 04                   jne    17eb8e2 <mozilla::PseudoElementForStyleContext(nsIFrame*, mozilla::CSSPseudoElementType)+0x35>
 17eb8de:       31 c0                   xor    eax,eax
 17eb8e0:       eb 03                   jmp    17eb8e5 <mozilla::PseudoElementForStyleContext(nsIFrame*, mozilla::CSSPseudoElementType)+0x38>
 17eb8e2:       8b 46 14                mov    eax,DWORD PTR [esi+0x14]
 17eb8e5:       8d 65 f8                lea    esp,[ebp-0x8]
 17eb8e8:       5b                      pop    ebx
 17eb8e9:       5e                      pop    esi
 17eb8ea:       5d                      pop    ebp
 17eb8eb:       c3                      ret    
 17eb8ec:       31 c0                   xor    eax,eax
 17eb8ee:       c3                      ret    
 17eb8ef:       90                      nop


The module_offset in the raw dump in the crash report is 0x17eb8c4,
so our EIP (0x2c6a48c4) means 0x17eb8c4 above.

EBX is 0x2e820f28 and EDX is 0x211fa509.

Note that the instruction at the very start compares only the low bytes
(cmp    dl,0x16) when deciding whether to jump to the end.

It's not clear to me whether aType is supposed to be passed in DL (with the
other 24 bits ignored) or in EDX (consider all 32 bits).  Either way, things
seem wrong.



Looking at the calling function, we have (from the crash report's raw
dump), module_offset: 0x17ff1ac (which should be the return address...
but apparently it's one byte less than the return address!?), and the
calling code there looks like:

 17ff19f:       8a 95 5c ff ff ff       mov    dl,BYTE PTR [ebp-0xa4]
 17ff1a5:       8b 45 0c                mov    eax,DWORD PTR [ebp+0xc]
 17ff1a8:       e8 00 c7 fe ff          call   17eb8ad <mozilla::PseudoElementForStyleContext(nsIFrame*, mozilla::CSSPseudoElementType)>
 17ff1ad:       52                      push   edx


This makes me think it's a miscompilation that the above code is using
a test instruction that uses all of EDX:
 17eb8c4:       f6 84 93 98 1d 33 ff    test   BYTE PTR [ebx+edx*4-0xcce268],0x4

The Android NDK's objdump and my system objdump on Ubuntu 16.04 at least
agree on that instruction.  In AT&T syntax, it's:
 17eb8c4:       f6 84 93 98 1d 33 ff    testb  $0x4,-0xcce268(%ebx,%edx,4)
Looking nsCSSPseudoElements::PseudoElementSupportsUserActionState() which is called a bit later, it seems the function there only uses one byte. I suspect this is what was compiled before I move that function into the header, and thus it didn't crash before bug 1269976:

0177f3e2 <_ZN19nsCSSPseudoElements36PseudoElementSupportsUserActionStateEN7mozilla20CSSPseudoElementTypeE>:
 177f3e2:	e8 c0 6d bc fe       	call   3461a7 <__x86.get_pc_thunk.cx>
 177f3e7:	81 c1 41 8b 1e 02    	add    $0x21e8b41,%ecx
 177f3ed:	55                   	push   %ebp
 177f3ee:	89 e5                	mov    %esp,%ebp
 177f3f0:	0f b6 45 08          	movzbl 0x8(%ebp),%eax
 177f3f4:	5d                   	pop    %ebp
 177f3f5:	f6 84 81 98 1d 33 ff 	testb  $0x8,-0xcce268(%ecx,%eax,4)
 177f3fc:	08 
 177f3fd:	0f 95 c0             	setne  %al
 177f400:	c3                   	ret    
 177f401:	90                   	nop
00:50:42 <jandem> dbaron: that sounds like https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64037, it comes up from time to time in spidermonkey :/
00:50:55 <jandem> i checked the enum there has uint8_t storage
I mean, before I move nsCSSPseudoElements::PseudoElementSupportsStyleAttribute into the header.

So the solution would probably be: move nsCSSPseudoElements::PseudoElementSupportsStyleAttribute back to the source file, so that the compiler wouldn't mis-optimize it when inlining.
Depends on: 1277189
01:58:51 <xidorn> dbaron: I have no idea. if upgrading ndk is non-trivial, I'd suggest we move nsCSSPseudoElements::PseudoElementSupportsStyleAttribute back to the .cpp file, or mark it never inline for x86 android build
02:01:01  → adalucinet and KaiRo joined  ⇐ reed and AdrianSV quit  ↔ mgoodwin_OoO and sankha nipped out  •  mtabara|away → mtabara, fscholz|off → fscholz  
02:10:56 <glandium> dbaron: it's supposed to be fixed in rev 218721 of the 4.9 branch, and NDK r11 has a newer revision than that
02:11:47 <glandium> dbaron: also, we don't have to land a NDK upgrade to validate that it does indeed fix it
02:11:57 <glandium> that can be checked with a try build



Also, we could perhaps use __attribute__((noinline)), #ifdef'd based on the bad GCC versions, based on https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes .  (There's a good bit of __attribute__((noinline)) usage in the tree already; see https://mxr.mozilla.org/mozilla-central/search?string=__%28%28noinline%29%29&tree=mozilla-central
MozReview-Commit-ID: 4VjAra5B6GM
Attachment #8759868 - Flags: review?(nfroyd)
Assignee: nobody → dbaron
Status: NEW → ASSIGNED
MozReview-Commit-ID: IXYYn3mLQBf
Attachment #8759870 - Flags: review?(bugzilla)
I tested locally that the MOZ_GCC_VERSION_AT_LEAST() and
MOZ_GCC_VERSION_AT_MOST() expressions do sensible things when I fiddle
with the numbers to make them more or less than my local gcc version.
(I tested this for all 4 expressions.)

I don't know for sure that this will fix the crashes we're seeing, but
it seems like it should undo the change that triggered it, so I think
it's worth trying.

MozReview-Commit-ID: IXYYn3mLQBf
Attachment #8759871 - Flags: review?(bugzilla)
Attachment #8759868 - Flags: review?(nfroyd) → review+
Attachment #8759870 - Attachment is obsolete: true
Attachment #8759870 - Flags: review?(bugzilla)
Comment on attachment 8759871 [details] [diff] [review]
Add __attribute__((noinline)) to work around compiler bug on Android/x86

Review of attachment 8759871 [details] [diff] [review]:
-----------------------------------------------------------------

Looks fine if it works.

::: layout/style/nsCSSPseudoElements.h
@@ +109,5 @@
> +
> +  // Work around https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64037 ,
> +  // which is a general gcc bug that we seem to have hit only on Android/x86.
> +#if defined(ANDROID) && defined(__i386__) && defined(__GNUC__) && !defined(__clang__)
> +#if (MOZ_GCC_VERSION_AT_LEAST(4,8,0) && MOZ_GCC_VERSION_AT_MOST(4,8,4)) || (MOZ_GCC_VERSION_AT_LEAST(4,9,0) && MOZ_GCC_VERSION_AT_MOST(4,9,2))

This line, as well as the line above, looks too long. Could you break both of them into two lines?
Attachment #8759871 - Flags: review?(bugzilla) → review+
https://hg.mozilla.org/mozilla-central/rev/efd20b79bbc5
https://hg.mozilla.org/mozilla-central/rev/4c3d7660fd37
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
(In reply to Martin Stránský from comment #20)
> This crash is still live in FF 52 ESR/i386, at least on RHEL6/CentOS6:
> 
> https://bugs.centos.org/view.php?id=13253&nbn=1
> https://bugzilla.redhat.com/show_bug.cgi?id=1451055

Same signature doesn't mean it is the same bug, but it could be.

This crash was from a compiler bug, and the workaround was only added for Android because that's the only place we use -Os I suppose. It is possible that the compiler RHEL6 / CentOS6 uses still have this bug, and your build configuration triggers it.
You need to log in before you can comment on or make changes to this bug.