Closed Bug 1083461 Opened 10 years ago Closed 10 years ago

Segmentation fault on Wikipedia search page, if Firefox was compiled with LLVM 3.5.0

Categories

(Core :: CSS Parsing and Computation, defect)

33 Branch
x86_64
Linux
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla36
Tracking Status
firefox36 --- fixed
firefox-esr31 36+ fixed

People

(Reporter: Kaorukun, Assigned: dbaron)

References

()

Details

(Keywords: crash)

Attachments

(6 files, 1 obsolete file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0
Build ID: 20141015214340

Steps to reproduce:

1. Compile Firefox 33.0 from source using LLVM-3.5.0 on a x86_64 Linux platform
2. Visit Wikipedia's fulltext search page : http://en.wikipedia.org/w/index.php?search=&title=Special%3ASearch&fulltext=Search
3. Click anywhere in the page, including the Firefox' navigation toolbar,
or press Alt-Home to try to return to homepage
or hover with the mouse over the Search button and then move the mouse outside of the button

System: Gentoo Linux x86_64, kernel 3.17.0, all packages updated to the latest stable versions.



Actual results:

In step 3, firefox immediately crashes with a segmentation fault

The issue is always reproducible.
The issue is reproducible when using a fresh Firefox profile. I tested by creating a new user account, logged in and started Firefox, i.e. without any add-ons, themes or even any modified settings.
The issue only occurs when Firefox was compiled with llvm-3.5.0. When compiled with llwm-3.4.2, everything works fine. I did not try gcc.
The issue does not occur with the binary firefox package (www-client/firefox-bin-33.0 in Gentoo Portage)
The issue occurs with a different version of Firefox as well - I tried 31.1. When compiled with llvm-3.5.0 it crashes, when compiled with llvm-3.4.2 it works without issues.
In case of v31.1 following message appears on the terminal when crash occurs, but not with v33:
[5046] ###!!! ABORT: Aborting on channel error.: file /var/tmp/portage/www-client/firefox-31.1.1/work/mozilla-esr31/ipc/glue/MessageChannel.cpp, line 1533
[5046] ###!!! ABORT: Aborting on channel error.: file /var/tmp/portage/www-client/firefox-31.1.1/work/mozilla-esr31/ipc/glue/MessageChannel.cpp, line 1533



Expected results:

No segmentation fault or any other issue should occur.
My layman guess is that a tiny animation implemented in Wikipedias search entry field or a similar animation on the "Search" button is causing the issue. The entry field has a blue colored bar when it has focus. Once you click anywhere in the window, the field loses focus, playing brief animation to retract this bar and in that moment the crash occurs. In case when I hover with the mouse over the Search button, the crash occurs only when the mouse cursor leaves the button. Issue with Javascript, JIT?
Severity: normal → critical
Keywords: crash
Thank you for the bug report!

> In step 3, firefox immediately crashes with a segmentation fault

Can you please attach a stack trace to this bug using https://bugzilla.mozilla.org/attachment.cgi?bugid=1083461&action=enter ?
Flags: needinfo?(Kaorukun)
Attached file Stack trace from GDB
Flags: needinfo?(Kaorukun)
Disassembled fragment from nsCSSValue.o which was compiled using LLVM 3.4.2,
showing the assembly code for the function
bool nsCSSValueList::operator==(const nsCSSValueList& aOther) const
Disassembled fragment from nsCSSValue.o which was compiled using LLVM 3.5.0,
showing the assembly code for the function
bool nsCSSValueList::operator==(const nsCSSValueList& aOther) const
I think I figured out what happened here.

The affected code is located in layout/style/nsCSSValue.cpp . The segfault happens because nsCSSValue::operator== receives a null pointer as &aOther reference, as you can see from the attachment 8506025 [details]. I disassembled the code of nsCSSValueList::operator== which calls nsCSSValue::operator== and found that the resulting x86 code looks quite differently when compiled with LLVM 3.5.0. If you go through the asm code in the attachment 8506319 [details] and assume that it was called with this != NULL and aOther == NULL, you can see that the call to nsCSSValue::operator== (from "if (p1->mValue != p2->mValue)") happens without checking the for loop's condition "p1 && p2" first.

So, it seems like a bug in LLVM, I see nothing wrong with Mozilla's code of nsCSSValueList::operator==. I'll submit a bug to LLVM team. This one can be closed I guess.
Thank you for hunting that down!

Marking invalid, but please reopen if it turns out there's an issue on our side after all.
Status: UNCONFIRMED → RESOLVED
Closed: 10 years ago
Resolution: --- → INVALID
Submitted bug at LLVM:
http://llvm.org/bugs/show_bug.cgi?id=21308
Hi, LLVM developer here. This bug is indeed valid, please reopen it. See llvm.org/PR21308#c8 for details why.

We also looked into why the clang warning on "&ref == nullptr" didn't fire, and the answer is that there were too many levels of indirection for our compiler warning to work through. It would slow down compiles too much. (It's about as complicated as the -Wuninitialized checker, which is extremely costly at compile time and we really don't want anything more like that.) The clang static analyzer probably would have caught this, though I haven't checked whether it does.
Nick, thanks for the analysis!
Status: RESOLVED → REOPENED
Component: General → CSS Parsing and Computation
Ever confirmed: true
Flags: needinfo?(dbaron)
Resolution: INVALID → ---
So presumably the issue is here:

3850     case eUnit_Dasharray:
3851     case eUnit_Filter:
3852     case eUnit_Shadow:
3853     case eUnit_BackgroundPosition:
3854       return *mValue.mCSSValueList == *aOther.mValue.mCSSValueList;

in that one of the values has a null mCSSValueList for some reason.
Oh, that's in StyleAnimationValue::operator==.
Indeed.  From the header file:

    eUnit_Dasharray, // nsCSSValueList* (never null)
    eUnit_Filter, // nsCSSValueList* (may be null)
    eUnit_Shadow, // nsCSSValueList* (may be null)
    eUnit_Transform, // nsCSSValueList* (never null)
    eUnit_BackgroundPosition, // nsCSSValueList* (never null)

This is a regression from https://hg.mozilla.org/mozilla-central/rev/ae6c685d5a37 .
Blocks: 569719
Flags: needinfo?(dbaron)
I think all 4 callers converted from |Equal| methods in that patch are probably wrong.
Assignee: nobody → dbaron
Status: REOPENED → ASSIGNED
Does this patch help?

(I haven't actually tested it beyond compiling it, since right now the network connection between my laptop and the machine on which I compile is rather slow.)
Flags: needinfo?(Kaorukun)
Thank you!
I could not successfully apply the patch to the v33.0 source. I guess you're working on a newer code version, and 33.0 is missing a few surrounding lines, so I reapplied the changes manually. It compiled and worked well, no more segfaults on Wikipedia search page.
I'll attach the resulting patch for the v33.0 source, just for reference.
Flags: needinfo?(Kaorukun)
This is a copy of previously submitted patch from David Baron, only converted to be applicable to v33.0 source. I was able to successfully compile and test Firefox v33.0 with it. It eliminates the reported segfault issue.
This is a copy of previously submitted patch from David Baron, only converted to be applicable to v33.0 source. I was able to successfully compile and test Firefox v33.0 with it. It eliminates the reported segfault issue.

I'm terribly sorry, I submitted wrong file by mistake previously
Attachment #8507502 - Attachment is obsolete: true
Comment on attachment 8507471 [details] [diff] [review]
Convert nsCSSValue{,Pair}List::operator== back to a static Equals method so that it can be validly called on null pointers

Maybe add a comment about why the operator== and operator!= are deleted?

r=me
Attachment #8507471 - Flags: review?(bzbarsky) → review+
https://hg.mozilla.org/mozilla-central/rev/983af7fa41f5
Status: ASSIGNED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
Is there a workaround for this? Maybe via blocking something, or an addon? Due to corporate policy I won't be able to upgrade to a newer version of the browser for years, and this seems to happen a couple of times every day on various websites, which is quite annoying. Thanks.
Saurabh, you have a corporate policy that is forcing you to use a non-default compile of Firefox?
It's the esr version 31.3.0. Has the same bug.
Hmm.  Maybe we should backport this to ESR, if we're building that with new enough clang...
Flags: needinfo?(dbaron)
Comment on attachment 8507471 [details] [diff] [review]
Convert nsCSSValue{,Pair}List::operator== back to a static Equals method so that it can be validly called on null pointers

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration:
  Fixes optimization bug with newer compilers that ome uer seem to be hitting.
User impact if declined: 
  crashes on builds compiled with new clang versions (not sure if we do that or if somebody else is compiling them)
Fix Landed on Version: 36
Risk to taking this patch (and alternatives if risky):  low risk
String or UUID changes made by this patch: none

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.
Flags: needinfo?(dbaron)
Attachment #8507471 - Flags: approval-mozilla-esr31?
Attachment #8507471 - Flags: approval-mozilla-esr31? → approval-mozilla-esr31+
Flags: qe-verify-
I have now esr version 31.6.0, and this bug is not fixed there. I am not sure from these comments what esr version is expected to have the fix.
Do you have a stack trace?
Flags: needinfo?(saurabh)
No information in stack trace, there's a bunch of ?? from libxulrunner.so. The error message is:
[8618] ###!!! ABORT: Aborting on channel error.: file /builddir/build/BUILD/firefox-31.6.0/mozilla-esr31/ipc/glue/MessageChannel.cpp, line 1533
[8618] ###!!! ABORT: Aborting on channel error.: file /builddir/build/BUILD/firefox-31.6.0/mozilla-esr31/ipc/glue/MessageChannel.cpp, line 1533
Here's a simple url that triggers the bug: https://www.xcdsystem.com/iie2015 (just visiting the site causes firefox to crash with the above message).
Flags: needinfo?(saurabh)
Could you try either (a) running the stack trace through tools/rb/fix_linux_stack.py or (b) attaching to the process with gdb before the crash (where 4828 is the PID):

$ gdb
(gdb) attach 4828

and then getting a stack trace from gdb once the crash happens:

(gdb) backtrace
Flags: needinfo?(saurabh)
As I said the stack trace shows just a bunch of ?? from libxul.so. I downloaded fix_linux_stack.py from here: https://dxr.mozilla.org/mozilla-central/source/tools/rb/fix_linux_stack.py but I have no idea how to use it. It does not have a help flag or anything.
Flags: needinfo?(saurabh)
Where are you getting this build?  I was under the impression it was a build you had built yourself.
Flags: needinfo?(saurabh)
Could you go to about:crashes, then, click a link to one or more crash reports for the crash you're seeing, and paste the URLs into this bug?
Flags: needinfo?(saurabh)
about:crashes gives a "Problem loading page" error and says "The url is not valid and cannot be loaded".
Flags: needinfo?(saurabh)
(In reply to Saurabh from comment #36)
> From https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/

If that's where you got the build, then it wasn't compiled with LLVM, and it's not this bug.
You need to log in before you can comment on or make changes to this bug.