Closed Bug 1576303 Opened 5 years ago Closed 5 years ago

crash entering into URL bar on ppc64le

Categories

(Core :: JavaScript Engine, task, P1)

Other
Linux
task

Tracking

()

RESOLVED FIXED
mozilla71
Tracking Status
firefox-esr68 --- fixed
firefox70 --- fixed
firefox71 --- fixed

People

(Reporter: spectre, Assigned: spectre)

References

Details

Attachments

(2 files, 12 obsolete files)

47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review

Entering any text into the URL bar causes a crash in current opt (-O3 non-debug) builds on ppc64le from trunk. This seems to be completely unrelated to bug 1512162. Brief backtrace follows, full crash signature provided.

@tcampbell, does any of this look familiar? Any places we should start looking? This smells like some sort of memory barrier issue. Interestingly, -O3 but with debug on works fine.

% hg bisect --bad
The first bad revision is:
changeset:   482748:552b13ce8016
user:        Ted Campbell <tcampbell@mozilla.com>
date:        Mon Jul 15 05:44:14 2019 +0000
summary:     Bug 1565556 - Remove JSScript::hasTryNotes/ScopeNotes/ResumeOffsets. r=jandem
Thread 1 "firefox" received signal SIGSEGV, Segmentation fault.
0x00007ffff02a7040 in js::gc::Cell::storeBuffer (this=0x7fffffffa868)
    at /home/spectre/src/mozilla-central/js/src/gc/Cell.h:281
281	  return chunk()->trailer.storeBuffer;
[...]
#0  0x00007ffff02a7040 in js::gc::Cell::storeBuffer() const
    (this=0x7fffffffa868)
    at /home/spectre/src/mozilla-central/js/src/gc/Cell.h:281
#1  0x00007ffff02a7040 in js::InternalBarrierMethods<JS::Value>::postBarrier(JS::Value*, JS::Value const&, JS::Value const&)
    (next=..., prev=..., vp=0x7fffffffa5f0)
    at /home/spectre/src/mozilla-central/js/src/gc/Barrier.h:352
#2  0x00007ffff02a7040 in js::WriteBarriered<JS::Value>::post(JS::Value const&, JS::Value const&) (next=..., prev=..., this=0x7fffffffa5f0)
    at /home/spectre/src/mozilla-central/js/src/gc/Barrier.h:456
#3  0x00007ffff02a7040 in js::HeapPtr<JS::Value>::HeapPtr(JS::Value const&)
    (v=..., this=0x7fffffffa5f0)
    at /home/spectre/src/mozilla-central/js/src/gc/Barrier.h:595
#4  0x00007ffff02a7040 in js::OrderedHashMap<js::HashableValue, js::HeapPtr<JS::Value>, js::HashableValue::Hasher, js::ZoneAllocPolicy>::Entry::Entry<JS::Handle<JS::Value> >(js::HashableValue const&, JS::Handle<JS::Value>&&)
    (v=..., k=..., this=0x7fffffffa598)
    at /home/spectre/src/mozilla-central/js/src/ds/OrderedHashTable.h:768
#5  0x00007ffff02a7040 in js::OrderedHashMap<js::HashableValue, js::HeapPtr<JS::Value>, js::HashableValue::Hasher, js::ZoneAllocPolicy>::put<JS::Handle<JS::Value> >(js::HashableValue const&, JS::Handle<JS::Value>&&)
    (value=..., key=..., this=0x7fffba215ab0)
    at /home/spectre/src/mozilla-central/js/src/ds/OrderedHashTable.h:811
#6  0x00007ffff02a7040 in js::MapObject::set_impl(JSContext*, JS::CallArgs const&) (cx=0x7fffd9849000, args=...)
    at /home/spectre/src/mozilla-central/js/src/builtin/MapObject.cpp:785
#7  0x00007ffff02a7d18 in JS::CallNonGenericMethod<&js::MapObject::is, &js::MapObject::set_impl>(JSContext*, JS::CallArgs const&)
    (args=..., cx=<optimized out>)
    at /home/spectre/src/mozilla-central/js/src/vm/NativeObject.h:1425
#8  0x00007ffff02a7d18 in js::MapObject::set(JSContext*, unsigned int, JS::Value*) (cx=<optimized out>, argc=<optimized out>, vp=<optimized out>)
    at /home/spectre/src/mozilla-central/js/src/builtin/MapObject.cpp:796
#9  0x00007ffff021d43c in CallJSNative(JSContext*, bool (*)(JSContext*, unsigned int, JS::Value*), JS::CallArgs const&)
    (args=..., native=0x7ffff02a7be0 <js::MapObject::set(JSContext*, unsigned int, JS::Value*)>, cx=0x7fffd9849000)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/CallArgs.h:284
#10 0x00007ffff021d43c in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) (cx=0x7fffd9849000, args=..., construct=<optimized out>) at /home/spectre/src/mozilla-central/js/src/vm/Interpreter.cpp:539
#11 0x00007ffff020f9a0 in js::CallFromStack(JSContext*, JS::CallArgs const&) (args=..., cx=<optimized out>) at /home/spectre/src/mozilla-central/js/src/vm/Interpreter.cpp:3084
#12 0x00007ffff020f9a0 in Interpret(JSContext*, js::RunState&) (cx=0x7fffd9849000, state=...) at /home/spectre/src/mozilla-central/js/src/vm/Interpreter.cpp:3084
Flags: needinfo?(tcampbell)

A simple-minded backout crashes with the same signature.

That's a bit surprising to me. Some quick googling doesn't show any sort of weird size_t/uintptr_t definitions for ppc64le and 64-bit little-endian is our best tested configuration.

Looking at the backtrace, it shows the crash occurs when 'Cell*' is 0x7fffffffa868 which is almost certainly non-sense. The GC allocated Cells must be inside 1MB chunk allocations and then we use pointer tricks to find the chunk trailer from a given Cell *.

Between Frame #4 and #5 the 'this' pointer for the same js::OrderedHashMap changes. Start your investigation there and see if you can determine why the code or the compile managed to screw that up.

Flags: needinfo?(tcampbell)

Oops.. That is OrderedHashMap vs OrderedHashMap::Entry. It still seems worth poking in that area.

Looking at the definition of OrderedHashMap there are definitely some code smells about strict C++ object model issues. There might be some funny business of move-references being abused. That might explain why we try and do GC-related operations on a stack address. Will follow up with GC / C++ knowledgeable folks.

Building at -O2 does seem to mask it, which does suggest there is some corner case or maybe even subtle UB that gets tickled (and that the offending revision probably just unmasked some problem that was already there).

No longer blocks: 1565556

I tried poking around some corners but didn't get much traction since I have to confess I don't understand this code very well. In the short term as a bandaid we could adopt a deoptimization solution temporarily (as we did in bug 1512162, even though this isn't the same root cause), but I wasn't able to get the scope much smaller than gc/Barrier.h. Dan, does this fix the crash for you? Are you able to minimize the scope of the #pragma any further? It seemed on my system I needed to have the headers covered by the #pragma too but I'm loathe to add it to gc/Cell.h also if I can avoid it.

Assignee: nobody → spectre
Status: NEW → ASSIGNED
Flags: needinfo?(dan)

Looking at this a little closer, it doesn't seem like the stack-copy/move-reference thing I thought might be happening is actually happening. The JS::Value (which is a union of double/int/Cell*, etc) seems to think things are Object/String but the address is a stack address. There should not be any cases where a JS::Value points to a stack so this seems like more problems similar to Bug 1512162.

One thing that does seem a bit suspicious in js/src/ds/OrderedHashTable.h is that we really are supposed to be using the return values of placement-new instead of the original buffer pointer. I'd be surprised if this changes the results but it might be interesting.

What version of GCC is being used for the ppc64le builds?

What lines in OrderedHashTable.h do you mean, specifically? As for the compiler,

% gcc --version
gcc (GCC) 9.1.1 20190503 (Red Hat 9.1.1-1)
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I assume you meant something like this. No, it didn't seem to make any difference to the crash or signature.

Attachment #9088617 - Attachment description: Use return value of operator new (didn't work) → Use return value of placement-new (didn't work)

In the absence of a clear explanation why, I will submit the bandaid fix so that at least it can be "unbroken" a la bug 1512162. Maybe it really is a compiler bug but I can't reduce it enough to figure it out.

@tcampbell, please let me know if I should redirect the review.

Flags: needinfo?(tcampbell)

Thanks! (clearing n-i, marking leave-open so I can continue to work on finding a better solution)

Flags: needinfo?(tcampbell)

Sorry about that. I apparently got distracted before hitting the submit problem.

The placement-new patch was what I had in mind. Thanks for ruling it out. I'm not sure what may be upsetting compilers still.

I'd be curious if scoping down to Value.h still solves the problem. In particular, the bitsFromDouble and setDouble methods.

https://bugzilla.mozilla.org/show_bug.cgi?id=1312488 is something we've run into in the past but the Value code has changed a bit this year and is no longer a union.

Thanks, I'll look into that further. It has some commonalities with bug 1512162. I have to figure out how bug 1551313 broke the build first, though.

Pushed by btara@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4559e60f4dde
Temporary compiler workaround for crashes on ppc64le. r=tcampbell

Keywords: checkin-needed

I reduced it down to this block, pretty much the mutators. (I intentionally pulled the tree before the prior workaround landed.) Does anything jump out at you from these functions? It looks like they all manipulate asBits_. Thanks for your help on this since I don't know this code well.

Flags: needinfo?(tcampbell)

Just for laughs, I changed no-inline-functions to no-strict-aliasing and that, also, fixes it. Is there something odd about that big union that it doesn't like?

Attached patch WIP (obsolete) — Splinter Review

@tcampbell: Okay, I'm at the point I'm going to need a little guidance. With no-strict-aliasing I was able to reduce the scope down to toTag(). I proved this was the bad function by moving the scope to another function and watching it crash, then move it back and it didn't. I then looked at the assembly by throwing a trap instruction at the function itself. With strict aliasing on, the code at PC looks like this:

   0x00007ffff01da104 <+276>:	b       0x7ffff01da11c <js::NativeObject::initializeSlotRange(unsigned int, unsigned int)+300>
// end of prior basic block
   0x00007ffff01da108 <+280>:	nop
   0x00007ffff01da10c <+284>:	nop
   0x00007ffff01da110 <+288>:	addi    r29,r29,8
   0x00007ffff01da114 <+292>:	cmpld   r29,r23
   0x00007ffff01da118 <+296>:	bge     0x7ffff01da1fc <js::NativeObject::initializeSlotRange(unsigned int, unsigned int)+524>
   0x00007ffff01da11c <+300>:	addi    r8,r30,1
   0x00007ffff01da120 <+304>:	mr      r24,r30
   0x00007ffff01da124 <+308>:	std     r20,0(r29)
   0x00007ffff01da128 <+312>:	clrldi  r30,r8,32
=> 0x00007ffff01da12c <+316>:	trap
   0x00007ffff01da130 <+320>:	ld      r9,0(r29)
   0x00007ffff01da134 <+324>:	rldicl  r10,r9,17,47

Without strict aliasing (i.e., no-strict-aliasing), it looks like this -- which is the entirety of the function:

=> 0x00007ffff01b3110 <+0>:	trap
   0x00007ffff01b3114 <+4>:	ld      r3,0(r3)
   0x00007ffff01b3118 <+8>:	rldicl  r3,r3,17,47
   0x00007ffff01b311c <+12>:	blr

You wouldn't have to know PowerPC assembly to see that the generated code (for toTag at least) is pretty much the same: it's taking asBits_ and rotating it. The only difference is that with strict aliasing gcc seems to assume it knows enough to inline the function, and that's why no-inline-functions also (in a wider scope) fixes the issue. (Btw, I don't understand why it's storing r20 to 0(r29) and then immediately reading that value; that does suggest a compiler bug when it could simply have done mr r9,r20. The trap wouldn't have caused it to spill because other instructions didn't spill their own registers.)

As you can see from the WIP above I tried a whole bunch of ways of rewriting the function and either got a compiler error or it still crashed. However, slapping __attribute__((noinline,noclone)) as a function decorator also solves the problem without having to use any #pragmas. That's the current working solution.

What I need to know: is there another way to rewrite this? Or do you think the attribute solution is the best option? I'll prep a patch either way to backout the temporary bandaid and put in this better idea, whichever it happens to be.

Attachment #9087964 - Attachment is obsolete: true
Attachment #9088617 - Attachment is obsolete: true
Attachment #9089603 - Attachment is obsolete: true
Attachment #9089784 - Flags: feedback?(tcampbell)

You mention no-strict-aliasing, but SpiderMonkey should be compiled with -fno-strict-aliasing by default. Did you turn that off? Or why isn't that sufficient to stop this?

No, I didn't turn it off. When I run gmake in obj.../js/src, I see it in the command line options to gcc but the pragma is clearly changing the generated code even with it already "off." I don't have an explanation for this; should I just write this off as a weird side effect of messing with the compiler?

Again, the idea would be pretty much just

  private:
+#if (__GNUC__ && __linux__ && __PPC64__ && _LITTLE_ENDIAN)
+__attribute__((noinline,noclone))
+#endif
   JSValueTag toTag() const {

Would that be acceptable? Like I say, the generated code doesn't quite make sense in the inlined case, so maybe it's a compiler bug after all.

Thanks for continuing to investigate.

I'd agree that the attribute is preferable to the pragma (particularly on the barrier file). The strict-aliasing thing is certainly weird and as Jan says, firefox generally is far from supporting -fstrict-aliasing still and our options should turn it off.

There is still some C++ badness in Value.h and I will attach a patch that turns JS::Value from a union to a struct that should be a lot more defined by the letter of the language. Can you test this patch and see if it helps your case? I would need to do some slight cleanup before landing, but the general approach should be fine for all platforms.

Flags: needinfo?(tcampbell)

Does this patch help at all? It removes the most egregious type-punning code and tries to be closer to pedantic c++.

Attachment #9090267 - Flags: feedback?(spectre)

That looks like it would avoid a lot of the weirdness. I'm out of town on business until Friday but I will try it first thing when I get back.

I get an immediate crash during startup with the JS::Value patch applied (and Cameron's workaround removed)

#0  0x00007fffb3898d48 in __libc_signal_restore_set (set=0x7ffff6311c38) at ../sysdeps/unix/sysv/linux/internal-signals.h:84
#1  0x00007fffb3898d48 in raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:48
#2  0x00007fffa80f56fc in nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*) (signo=<optimized out>, info=<optimized out>, context=0x7ffff6311f20)
    at /mnt/dan/firefox.git/toolkit/profile/nsProfileLock.cpp:169
#3  0x00007fffa8ab80fc in js::UnixExceptionHandler(int, siginfo_t*, void*) (signum=<optimized out>, info=0x7ffff6312c98, context=0x7ffff6311f20)
    at /mnt/dan/firefox.git/js/src/ds/MemoryProtectionExceptionHandler.cpp:273
#4  0x00007fffb39004e8 in <signal handler called> () at arch/powerpc/kernel/vdso64/sigtramp.S
#5  0x00007fffa03b5b58 in JS::Value::setObject(JSObject&) (this=<optimized out>, obj=...) at /mnt/dan/firefox.git/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:450
#6  0x00007fffa857b810 in JS::Value::setObjectOrNull(JSObject*) (arg=<optimized out>, this=<optimized out>)
    at /mnt/dan/firefox.git/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:501
#7  0x00007fffa857b810 in JS::ObjectOrNullValue (obj=<optimized out>) at /mnt/dan/firefox.git/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:1045
#8  0x00007fffa857b810 in js::EnvironmentObject::initEnclosingEnvironment(JSObject*) (this=0x3027bd22e040, enclosing=<optimized out>)
    at /mnt/dan/firefox.git/js/src/vm/EnvironmentObject.h:277
#9  0x00007fffa85589c4 in js::LexicalEnvironmentObject::createTemplateObject(JSContext*, JS::Handle<js::Shape*>, JS::Handle<JSObject*>, js::gc::InitialHeap) (cx=<optimized out>, shape=..., 
    shape@entry=..., enclosing=..., heap=heap@entry=js::gc::TenuredHeap) at /mnt/dan/firefox.git/js/src/vm/EnvironmentObject.cpp:928
#10 0x00007fffa855aafc in js::LexicalEnvironmentObject::createGlobal(JSContext*, JS::Handle<js::GlobalObject*>) (cx=<optimized out>, global=...)
    at /mnt/dan/firefox.git/obj-powerpc64le-unknown-linux-gnu/dist/include/js/RootingAPI.h:1213
#11 0x00007fffa85af230 in js::GlobalObject::createInternal(JSContext*, JSClass const*) (cx=0x7fff9d249000, clasp=<optimized out>)
    at /mnt/dan/firefox.git/obj-powerpc64le-unknown-linux-gnu/dist/include/js/RootingAPI.h:1213
#12 0x00007fffa8844a70 in JSRuntime::createSelfHostingGlobal(JSContext*) (cx=0x7fff9d249000) at /mnt/dan/firefox.git/js/src/vm/SelfHosting.cpp:2664
#13 0x00007fffa8867e58 in JSRuntime::initSelfHosting(JSContext*) (this=<optimized out>, cx=<optimized out>) at /mnt/dan/firefox.git/js/src/vm/SelfHosting.cpp:2775
#14 0x00007fffa8b1d578 in JS::InitSelfHostedCode(JSContext*) (cx=0x7fff9d249000) at /mnt/dan/firefox.git/js/src/jsapi.cpp:434
#15 0x00007fffa1b88280 in nsXPConnect::InitStatics() () at /mnt/dan/firefox.git/js/xpconnect/src/nsXPConnect.cpp:141
#16 0x00007fffa1b20470 in xpcModuleCtor() () at /mnt/dan/firefox.git/js/xpconnect/src/XPCModule.cpp:11
#17 0x00007fffa6bd9fdc in nsLayoutModuleInitialize() () at /mnt/dan/firefox.git/layout/build/nsLayoutModule.cpp:107
#18 0x00007fffa033db64 in nsComponentManagerImpl::Init() (this=0x7fff9eac0300) at /mnt/dan/firefox.git/xpcom/components/nsComponentManager.cpp:493
#19 0x00007fffa03d9cbc in NS_InitXPCOM(nsIServiceManager**, nsIFile*, nsIDirectoryServiceProvider*)
    (aResult=0x7fffb0050a80, aBinDirectory=<optimized out>, aAppFileLocationProvider=<optimized out>) at /mnt/dan/firefox.git/xpcom/build/XPCOMInit.cpp:444
#20 0x00007fffa8110318 in ScopedXPCOMStartup::Initialize() (this=0x7fffb0050a80) at /mnt/dan/firefox.git/obj-powerpc64le-unknown-linux-gnu/dist/include/nsCOMPtr.h:839
#21 0x00007fffa8122cec in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) (this=0x7ffff6313848, 
    this@entry=0x7ffff6313868, argc=argc@entry=4, argv=argv@entry=0x7ffff6314ed8, aConfig=...) at /mnt/dan/firefox.git/obj-powerpc64le-unknown-linux-gnu/dist/include/mozilla/UniquePtr.h:308
#22 0x00007fffa81235e0 in XRE_main(int, char**, mozilla::BootstrapConfig const&) (argc=<optimized out>, argv=0x7ffff6314ed8, aConfig=...)
    at /mnt/dan/firefox.git/toolkit/xre/nsAppRunner.cpp:4800
#23 0x00007fffa81336a8 in mozilla::BootstrapImpl::XRE_main(int, char**, mozilla::BootstrapConfig const&) (this=<optimized out>, argc=<optimized out>, argv=<optimized out>, aConfig=...)
    at /mnt/dan/firefox.git/toolkit/xre/Bootstrap.cpp:45
#24 0x000000012bbf7bf4 in do_main(int, char**, char**) (argc=<optimized out>, argv=0x7ffff6314ed8, envp=<optimized out>) at /mnt/dan/firefox.git/browser/app/nsBrowserApp.cpp:218
#25 0x000000012bbf7d24 in main(int, char**, char**) (argc=<optimized out>, argv=0x7ffff6314ed8, envp=0x7ffff6314f00) at /mnt/dan/firefox.git/browser/app/nsBrowserApp.cpp:300
Flags: needinfo?(dan)

(It isn't at all obvious in Bugzilla, but the Signature is for things that correspond to https://crash-stats.mozilla.org/ )

Crash Signature: #0 0x00007ffff02a7040 in js::gc::Cell::storeBuffer() const (this=0x7fffffffa868) at /home/spectre/src/mozilla-central/js/src/gc/Cell.h:281 #1 0x00007ffff02a7040 in js::InternalBarrierMethods<JS::Value>::postBarrier(JS::Value*, JS::Value const…

Thanks for testing that out Dan!

Can either of you let me know the build steps you used, or see if your same build process / distro / compiler also reproduces on x64. The types of failures we are seeing don't seem very ppc specific and I wonder if this is a cross-platform UB issue that GCC is biting us on.

This is on Fedora 30 for ppc64le, gcc-9.2.1-1.fc30.ppc64le and the .mozconfig used for ./mach build on today's trunk is

export CC=/usr/bin/gcc
export CXX=/usr/bin/g++
export PATH=$HOME/.cargo/bin:$PATH

mk_add_options MOZ_MAKE_FLAGS="-j32"
ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-Og"
ac_add_options --enable-debug
ac_add_options --disable-release
ac_add_options --disable-tests
ac_add_options --enable-linker=bfd

I'm heading back to my home base and should be able to analyse this a little further this evening.

Approximately the same as Dan's, F30, gcc 9.2.1-1. My optimize flags are -Og -mcpu=power9 for a debug build but that shouldn't affect this.

I see the startup crash, but it's actually an assertion: Assertion failure: &toObject() == &obj, at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:450. The backtrace is the same as Dan's, namely,

#0  0x00007fffe3e35f0c in JS::Value::setObject(JSObject&)
    (this=<optimized out>, obj=...)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:450
#1  0x00007fffebfe99a8 in JS::Value::setObjectOrNull(JSObject*)
    (arg=<optimized out>, this=<optimized out>)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:501
#2  0x00007fffebfe99a8 in JS::ObjectOrNullValue (obj=<optimized out>)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:1045
#3  0x00007fffebfe99a8 in js::EnvironmentObject::initEnclosingEnvironment(JSObject*) (this=0x21e38f52e040, enclosing=<optimized out>)
    at /home/spectre/src/mozilla-central/js/src/vm/EnvironmentObject.h:277
#4  0x00007fffebfc4e00 in js::LexicalEnvironmentObject::createTemplateObject(JSContext*, JS::Handle<js::Shape*>, JS::Handle<JSObject*>, js::gc::InitialHeap)
    (cx=<optimized out>, shape=..., 
    shape@entry=..., enclosing=..., heap=heap@entry=js::gc::TenuredHeap)
    at /home/spectre/src/mozilla-central/js/src/vm/EnvironmentObject.cpp:928

This actually occurs with or without my prior workaround.

Attachment #9090267 - Flags: feedback?(spectre)

I read the instruction wrong, so I'm redoing the analysis. In the function prologue r4 which is obj gets moved to r30. (r3 is this.)

   0x00007fffe3e35c20 <+0>:	addis   r2,r12,4020
   0x00007fffe3e35c24 <+4>:	addi    r2,r2,3296
   0x00007fffe3e35c28 <+8>:	mflr    r0
   0x00007fffe3e35c2c <+12>:	std     r0,16(r1)
   0x00007fffe3e35c30 <+16>:	std     r30,-16(r1)
   0x00007fffe3e35c34 <+20>:	std     r31,-8(r1)
   0x00007fffe3e35c38 <+24>:	stdu    r1,-48(r1)
   0x00007fffe3e35c3c <+28>:	mr      r31,r1
=> 0x00007fffe3e35c40 <+32>:	mr      r30,r4

The shift and store section looks like this (there are trap instructions between the three lines in setObject), essentially the entire body of setObjectNoCheck.

=> 0x00007fffe3e35d04 <+228>:	trap
   0x00007fffe3e35d08 <+232>:	li      r9,-1
   0x00007fffe3e35d0c <+236>:	mr      r10,r30
   0x00007fffe3e35d10 <+240>:	rldimi  r10,r9,49,0
   0x00007fffe3e35d14 <+244>:	std     r10,0(r3)
   0x00007fffe3e35d18 <+248>:	trap
   0x00007fffe3e35d1c <+252>:	bl      0x7fffe3e35a8c <JS::Value::toObject() const+8>

r9 gets loaded with all one bits. The object is loaded into r10 as a mask target. It shifts the all-ones left 49 bit positions and then inserts it into r10 with another mask of all one-bits and logical-ORs it with the result. It then stores it into asBits_. The assertion follows the second trap.

(gdb) set $pc+=4
(gdb) si
0x00007fffe3e35d0c	457	    asBits_ = bitsFromTagAndPayload(JSVAL_TAG_OBJECT, PayloadType(obj));
(gdb) i reg r9
r9             0xffffffffffffffff  18446744073709551615
(gdb) si
0x00007fffe3e35d10	457	    asBits_ = bitsFromTagAndPayload(JSVAL_TAG_OBJECT, PayloadType(obj));
(gdb) i reg r10 r30
r10            0x2f9b43a2d060      52343901180000
r30            0x2f9b43a2d060      52343901180000
(gdb) si
0x00007fffe3e35d14 in JS::Value::setObjectNoCheck (obj=0x2f9b43a2d060, 
    this=0x7fffffffc5c0)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:457
457	    asBits_ = bitsFromTagAndPayload(JSVAL_TAG_OBJECT, PayloadType(obj));
(gdb) i reg r10
r10            0xfffe2f9b43a2d060  18446233467657310304
(gdb) si
JS::Value::setObject (this=0x7fffffffc5c0, obj=...)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:451
451	__asm__("trap\n");
(gdb) x/16dx $r3
0x7fffffffc5c0:	0x43a2d060	0xfffe2f9b	0x98542100	0x2358c2e9
[...]
0x00007fffe3e35d20 in JS::Value::setObject (this=0xfffe2f9b43a32f9c, obj=...)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:452
452	    MOZ_ASSERT(&toObject() == &obj);
(gdb) disas $pc, $pc+0x30
Dump of assembler code from 0x7fffe3e35d20 to 0x7fffe3e35d50:
=> 0x00007fffe3e35d20 <JS::Value::setObject(JSObject&)+256>:	nop
   0x00007fffe3e35d24 <JS::Value::setObject(JSObject&)+260>:	cmpd    r30,r3
   0x00007fffe3e35d28 <JS::Value::setObject(JSObject&)+264>:	bne     0x7fffe3e35d8c <JS::Value::setObject(JSObject&)+364>
   0x00007fffe3e35d2c <JS::Value::setObject(JSObject&)+268>:	addi    r1,r31,48
   0x00007fffe3e35d30 <JS::Value::setObject(JSObject&)+272>:	ld      r0,16(r1)
   0x00007fffe3e35d34 <JS::Value::setObject(JSObject&)+276>:	mtlr    r0
   0x00007fffe3e35d38 <JS::Value::setObject(JSObject&)+280>:	ld      r30,-16(r1)
   0x00007fffe3e35d3c <JS::Value::setObject(JSObject&)+284>:	ld      r31,-8(r1)
   0x00007fffe3e35d40 <JS::Value::setObject(JSObject&)+288>:	blr
   0x00007fffe3e35d44 <JS::Value::setObject(JSObject&)+292>:	li      r9,0
   0x00007fffe3e35d48 <JS::Value::setObject(JSObject&)+296>:	b       0x7fffe3e35cfc <JS::Value::setObject(JSObject&)+220>
   0x00007fffe3e35d4c <JS::Value::setObject(JSObject&)+300>:	li      r5,448
End of assembler dump.
(gdb) i reg r3
r3             0xfffe2f9b43a32f9c  18446233467657334684
(gdb) i reg r30
r30            0x2f9b43a2d060      52343901180000

r3 is the return value from toObject() but it still looks like a Value, not an object pointer, and the comparison fails. It doesn't even look like the right value.

@tcampbell, I guess I'm not understanding what toPointer() does. Is that an actual object pointer? It's returning what looks like a Value complete with tag intact. gdb session:

r30            0x7f3b82d060        546459275360
(gdb) i reg r10
r10            0xfffe007f3b82d060  18446181670215405664
(gdb) cont
Continuing.

Thread 1 "firefox" hit Breakpoint 1, 0x00007fffe3e35a8c in JS::Value::toObject
    (this=0x7fffffffc5c0)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:728
728	  JSObject& toObject() const {
(gdb) i reg r3
r3             0x7fffffffc5c0      140737488340416
(gdb) x/16dx $r3
0x7fffffffc5c0:	0x3b82d060	0xfffe007f	0xf74b2900	0x73b631b9 <<< asBits_ in the first two words, OK.
0x7fffffffc5d0:	0x3b82e040	0x0000007f	0xffffc5e0	0x00007fff
0x7fffffffc5e0:	0xffffc670	0x00007fff	0x00000000	0x00000000
0x7fffffffc5f0:	0xebfc5d28	0x00007fff	0xffffc600	0x00007fff
(gdb) disp/i $pc
1: x/i $pc
=> 0x7fffe3e35a8c <JS::Value::toObject() const+8>:	std     r31,-8(r1)
(gdb) si
0x00007fffe3e35a90	728	  JSObject& toObject() const {
1: x/i $pc
=> 0x7fffe3e35a90 <JS::Value::toObject() const+12>:	stdu    r1,-48(r1)
(gdb) 
0x00007fffe3e35a94	728	  JSObject& toObject() const {
1: x/i $pc
=> 0x7fffe3e35a94 <JS::Value::toObject() const+16>:	mr      r31,r1
(gdb) 
JS::Value::isObject (this=0x7fffffffc5c0)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:729
729	    MOZ_ASSERT(isObject());
1: x/i $pc
=> 0x7fffe3e35a98 <JS::Value::toObject() const+20>:	ld      r9,0(r3)
(gdb) 
0x00007fffe3e35a9c	729	    MOZ_ASSERT(isObject());
1: x/i $pc
=> 0x7fffe3e35a9c <JS::Value::toObject() const+24>:	rldicl  r10,r9,17,47
(gdb) i reg r9
r9             0xfffe007f3b82d060  18446181670215405664
(gdb) si
0x00007fffe3e35aa0	729	    MOZ_ASSERT(isObject());
1: x/i $pc
=> 0x7fffe3e35aa0 <JS::Value::toObject() const+28>:	li      r8,-1
(gdb) 
0x00007fffe3e35aa4	729	    MOZ_ASSERT(isObject());
1: x/i $pc
=> 0x7fffe3e35aa4 <JS::Value::toObject() const+32>:	rlwinm  r8,r8,0,15,29
(gdb) 
0x00007fffe3e35aa8	729	    MOZ_ASSERT(isObject());
1: x/i $pc
=> 0x7fffe3e35aa8 <JS::Value::toObject() const+36>:	cmpld   r10,r8
(gdb) 
0x00007fffe3e35aac	729	    MOZ_ASSERT(isObject());
1: x/i $pc
=> 0x7fffe3e35aac <JS::Value::toObject() const+40>:	
    bgt     0x7fffe3e35af4 <JS::Value::toObject() const+112>
(gdb) 
JS::Value::toObject (this=0x7fffffffc5c0)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:616
616	    return asBits_ >= JSVAL_SHIFTED_TAG_OBJECT;
1: x/i $pc
=> 0x7fffe3e35ab0 <JS::Value::toObject() const+44>:	lis     r10,-3
(gdb) 
0x00007fffe3e35ab4	616	    return asBits_ >= JSVAL_SHIFTED_TAG_OBJECT;
1: x/i $pc
=> 0x7fffe3e35ab4 <JS::Value::toObject() const+48>:	ori     r10,r10,65535
(gdb) 
0x00007fffe3e35ab8	616	    return asBits_ >= JSVAL_SHIFTED_TAG_OBJECT;
1: x/i $pc
=> 0x7fffe3e35ab8 <JS::Value::toObject() const+52>:	rldicr  r10,r10,32,31
(gdb) 
0x00007fffe3e35abc	616	    return asBits_ >= JSVAL_SHIFTED_TAG_OBJECT;
1: x/i $pc
=> 0x7fffe3e35abc <JS::Value::toObject() const+56>:	oris    r10,r10,65535
(gdb) 
0x00007fffe3e35ac0	616	    return asBits_ >= JSVAL_SHIFTED_TAG_OBJECT;
1: x/i $pc
=> 0x7fffe3e35ac0 <JS::Value::toObject() const+60>:	ori     r10,r10,65535
(gdb) 
0x00007fffe3e35ac4	616	    return asBits_ >= JSVAL_SHIFTED_TAG_OBJECT;
1: x/i $pc
=> 0x7fffe3e35ac4 <JS::Value::toObject() const+64>:	cmpld   r9,r10
(gdb) 
0x00007fffe3e35ac8	616	    return asBits_ >= JSVAL_SHIFTED_TAG_OBJECT;
1: x/i $pc
=> 0x7fffe3e35ac8 <JS::Value::toObject() const+68>:	
    ble     0x7fffe3e35b3c <JS::Value::toObject() const+184>
(gdb) 
731	    MOZ_ASSERT((asBits_ & detail::ValueGCThingPayloadMask) != 0);
1: x/i $pc
=> 0x7fffe3e35acc <JS::Value::toObject() const+72>:	ld      r3,0(r3)
(gdb) si
0x00007fffe3e35ad0	731	    MOZ_ASSERT((asBits_ & detail::ValueGCThingPayloadMask) != 0);
1: x/i $pc
=> 0x7fffe3e35ad0 <JS::Value::toObject() const+76>:	clrldi. r9,r3,17
(gdb) i reg r3
r3             0xfffe007f3b82d060  18446181670215405664
(gdb) si
0x00007fffe3e35ad4	731	    MOZ_ASSERT((asBits_ & detail::ValueGCThingPayloadMask) != 0);
1: x/i $pc
=> 0x7fffe3e35ad4 <JS::Value::toObject() const+80>:	
    beq     0x7fffe3e35b84 <JS::Value::toObject() const+256>
(gdb) si
732	    MOZ_ASSERT((asBits_ & 0x7) == 0);
1: x/i $pc
=> 0x7fffe3e35ad8 <JS::Value::toObject() const+84>:	andi.   r9,r3,7
(gdb) 
0x00007fffe3e35adc	732	    MOZ_ASSERT((asBits_ & 0x7) == 0);
1: x/i $pc
=> 0x7fffe3e35adc <JS::Value::toObject() const+88>:	
    bne     0x7fffe3e35bcc <JS::Value::toObject() const+328>
(gdb) 
JS::Value::toPointer<JSObject> (tag=JSVAL_TAG_OBJECT, this=<optimized out>)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:734
734	    return *toPointer<JSObject>(JSVAL_TAG_OBJECT);
which is really return reinterpret_cast<T*>(uintptr_t(asBits_ ^ tag));
1: x/i $pc
=> 0x7fffe3e35ae0 <JS::Value::toObject() const+92>:	xoris   r3,r3,1
(gdb) i reg r3
r3             0xfffe007f3b82d060  18446181670215405664
(gdb) si
0x00007fffe3e35ae4 in JS::Value::toObject (this=<optimized out>)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:734
734	    return *toPointer<JSObject>(JSVAL_TAG_OBJECT);
1: x/i $pc
=> 0x7fffe3e35ae4 <JS::Value::toObject() const+96>:	xori    r3,r3,65532
(gdb) i reg r3
r3             0xfffe007f3b83d060  18446181670215471200
(gdb) si
0x00007fffe3e35ae8	734	    return *toPointer<JSObject>(JSVAL_TAG_OBJECT);
1: x/i $pc
=> 0x7fffe3e35ae8 <JS::Value::toObject() const+100>:	addi    r1,r31,48
(gdb) i reg r3
r3             0xfffe007f3b832f9c  18446181670215430044
(gdb) 

In return reinterpret_cast<T*>(uintptr_t(asBits_ ^ tag));, is that XOR supposed to mask off the tag? This is only doing it on the lower half.

I figured it out. toPointer should be

  template <typename T>
  T* toPointer(JSValueTag tag) const {
#if defined(JS_NUNBOX32)
    return reinterpret_cast<T*>(uintptr_t(asBits_));
#elif defined(JS_PUNBOX64)
    // Note: the 'Spectre mitigations' comment at the top of this class
    // explains why we use XOR here and in other to* methods.
    uint64_t shiftedTag = (uint64_t)tag << JSVAL_TAG_SHIFT;
    return reinterpret_cast<T*>(uintptr_t(asBits_ ^ shiftedTag));
#endif
  }

With this in, though, now I get our old friend assertion when typing in the URL bar: Assertion failure: isDouble(), at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:426

I'll work on this some more.

Attachment #9089784 - Attachment is obsolete: true
Attachment #9089784 - Flags: feedback?(tcampbell)
Attached patch WIP - working for -O3 and -Og (obsolete) — Splinter Review

This is mostly informational, but this snapshot (overlaid on tip) makes a working browser with both -O3 and -Og. In fact, I was able to even remove bug 1512162's workaround as well.

Besides the fix above, this also sets JS_NONCANONICAL_HARDWARE_NAN. The assertion didn't like the sign bit being set on the NaN generated by XPConnect, which was otherwise perfectly valid. This may have been the whole problem, so I'm going to make a test build with just that change and see if that fixes things too.

Yes, simply setting JS_NONCANONICAL_HARDWARE_NAN seems sufficient, at least right now (tested on both -Og and -O3). It just seems wacky such a "minor" thing would manifest in this manner. Since this would only affect ppc64le, this should be safe for beta, too. I will post a backout of the workaround and the "easy" fix.

I do like the simpler Value.h though, it gets rid of a lot of hard to follow stuff.

Nice work with investigations. I'll try and give a bit more context for this Value encoding stuff and look over this again on Monday.

Good catch on the tag-vs-shiftTag bug. I must have grabbed the wrong grabbed the wrong version of patch. Oops.

The JS::Value type is a uint64_t (on all platforms) that can encode double, int, JSObject*, JSString*, etc. On 64-bit platforms we take advantage of the fact that in practice for platforms we support that addresses have the high-bits set to zero. We can then use these bits to tag which type of value is actually stored. The toPointer method tries to clear these encoded tag bits to get back a usable pointer. Due to spectre mitigations (as well as some other nice failsafe benefits) we use XOR to toggle the expected bits instead of using AND masking.

The JS_NONCANONICAL_HARDWARE_NAN is needed on platforms that do not generate the same bit-level encoding of NaN as X86/ARM. (IEEE-754 allows any non-zero mantissa to mean NaN, but spidermonkey expects only a limited number). There is currently a startup check for the must obvious mismatches in [1]. SpiderMonkey should be able to handle NaN with sign bit set or cleared since that does come up on X86/ARM inside the JITs.

If the JS_NONCANONICAL_HARDWARE_NAN fix is sufficient, that would be great. I also see in [2] that another project was running into NaN issues running some webassembly related thing on POWER8.

[1] https://searchfox.org/mozilla-central/rev/4bec213c1a534329bd6fd97f00c0fcfbd60e26ec/js/src/vm/Initialization.cpp#75-84
[2] https://github.com/WebAssembly/wabt/pull/1046

Good to hear. I'll take the simpler Value.h into another bug and let it ride the trains. Enabling JS_NONCANONICAL_HARDWARE_NAN for just ppc64le seems like a reasonable thing to still get into beta.

Yes, those in [2] are the hex representations of the NaNs I saw in both bug 1512162 and this one: 0xfff80000 (and 0xfff8000000000000) and, originally, 0xffc00000 (and 0xffc0000000000000). However, I don't get a failure in CheckCanonicalNaN. Should we be asserting? Are those NaNs that would bug out?

In the meantime, the beta-suitable patch is coming up. If that sticks, I'll probably do a backout for bug 1512162 also after.

Correction: the NaN was fff9800000000000. That is greater than JSVAL_SHIFTED_TAG_MAX_DOUBLE, so that's why it asserts. Still, it seems like we should have failed CheckCanonicalNaN.

[Background] In IEEE-754 Floating point, the fraction for a NaN is 52-bits long. If this is all 0, then we have +/- Infinity and if non-zero we are a NaN. Many architectures use the most-significant bit of this fraction to indicate if this is a signalling or non-signally NaN (with polarity varying between architectures). For ARM/X86, the canonical NaN generated for operations such as Inf - Inf has a fraction that is 8_0000_0000_0000. The SpiderMonkey Value representation is designed around this being 'normal'. For platforms such as SPARC where the NaN is different and we don't have JIT support, the JS_NONCANONICAL_HARDWARE_NAN check is sufficient to normalize NaNs in the places that C++ would care.

The value FFF9_8000_0000_0000 is actually the representation of UndefinedValue and much more likely to be a compile-time bug than the hardware generating it. That is the value that the default Value() constexpr constructor.

This is reminding me a lot of Bug 1312488. We no longer have the same work around that we used there, but we could try reintroducing it by changing the implementation of 'setDouble' to have |*this = Value::fromDouble(d);|.

That makes more sense. However, with this patch,

   void setDouble(double d) {
-    asBits_ = bitsFromDouble(d);
-    MOZ_ASSERT(isDouble());
+    *this = Value::fromDouble(d);
+    if (!isDouble()) {
+      fprintf(stderr, "TRAPPED: %lx %f\n", asBits_, d);
+      __asm__("trap\n");
+    }
   }

(which I assume is what you mean), I see TRAPPED: fff9800000000000 -nan, so unless you meant somewhere else that doesn't seem to be what's gone wrong. Is there a way to trap an uninitialized Value(), or is that allowed/supported?

Some debugging.

[Switching to thread 1 (Thread 0x7ffff7ff4f50 (LWP 2118))]
#0  JS::Value::setDouble (d=<optimized out>, this=<optimized out>)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:460
460	      __asm__("trap\n");
(gdb) bt 5
#0  0x00007fffe55c8e2c in JS::Value::setDouble(double)
    (d=<optimized out>, this=<optimized out>)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:460
#1  0x00007fffe55c8e2c in JS::Value::setNumber(double)
    (this=<optimized out>, d=<optimized out>)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/Value.h:528
#2  0x00007fffe55bf288 in js::MutableWrappedPtrOperations<JS::Value, JS::MutableHandle<JS::Value> >::setNumber(double) (d=<optimized out>, this=0x7fffffff9fe0)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/RootingAPI.h:669
#3  0x00007fffe55bf288 in XPCConvert::NativeData2JS(JSContext*, JS::MutableHandle<JS::Value>, void const*, nsXPTType const&, nsID const*, unsigned int, nsresult*)
    (cx=<optimized out>, d=..., s=0x7fffffffa218, type=..., iid=<optimized out>, arrlen=<optimized out>, pErr=0x7fffffffa05c)
    at /home/spectre/src/mozilla-central/js/xpconnect/src/XPCConvert.cpp:115
#4  0x00007fffe562ae1c in CallMethodHelper::GatherAndConvertResults()
    (this=0x7fffffffa188)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/RootingAPI.h:1239
(More stack frames follow...)
(gdb) up 2
#2  0x00007fffe55bf288 in js::MutableWrappedPtrOperations<JS::Value, JS::MutableHandle<JS::Value> >::setNumber (d=<optimized out>, this=0x7fffffff9fe0)
    at /home/spectre/src/mozilla-central/obj-powerpc64le-unknown-linux-gnu/dist/include/js/RootingAPI.h:669
669	  DECLARE_NONPOINTER_MUTABLE_ACCESSOR_METHODS(*ptr);
(gdb) p d
$1 = <optimized out>
(gdb) up 1
#3  XPCConvert::NativeData2JS (cx=<optimized out>, d=..., s=0x7fffffffa218, 
    type=..., iid=<optimized out>, arrlen=<optimized out>, pErr=0x7fffffffa05c)
    at /home/spectre/src/mozilla-central/js/xpconnect/src/XPCConvert.cpp:115
115	      d.setNumber(*static_cast<const float*>(s));
(gdb) p d
$2 = {<js::MutableHandleBase<JS::Value, JS::MutableHandle<JS::Value> >> = {<js::MutableWrappedPtrOperations<JS::Value, JS::MutableHandle<JS::Value> >> = {<js::WrappedPtrOperations<JS::Value, JS::MutableHandle<JS::Value> >> = {<No data fields>}, <No data fields>}, <No data fields>}, ptr = 0x7fffffffa0f0}
(gdb) x/16lx s
0x7fffffffa218:	0xffcc0000	0x00000000	0x00000000	0x00000000
0x7fffffffa228:	0x01000048	0x00000000	0xffffa260	0x00007fff
0x7fffffffa238:	0xe3e655f0	0x00007fff	0xe3e65620	0x00007fff
0x7fffffffa248:	0xf3976900	0x00007fff	0xffffa280	0x00007fff

It looks like the float started as 0xffcc0000 (I assume that is indeed a valid NaN) but somehow ends up as undefined on the other end.

Some more debugging. I put a debugging fprintf into setNumber (as called by XPConnect). This is pretty spammy but I do see, in normal operation,

setNumber: 4090853333333333 1057.300000
setNumber: 7ff8000000000000 nan
setNumber: 4090c4cccccccccd 1073.200000
setNumber: 7ff0000000000000 inf
setNumber: 7ff0000000000000 inf

which suggests that those are generated correctly. But, at the point of the assertion,

setNumber: fff9800000000000 -nan
TRAPPED: fff9800000000000 -nan

So, I guess it still is some sort of miscompilation, since the value shouldn't change.

Going back to my debugging for bug 1512162, I get up to this point where the Interpreter calls, ultimately, nsPrefBranch::GetFloatPrefWithDefault(const char* aPrefName, float aDefaultValue, uint8_t aArgc, float* aRetVal), based on ["autoFill.stddevMultiplier", [0.0, "getFloatPref"]] in browser/components/urlbar/UrlbarPrefs.jsm.

(gdb) f 10
#10 0x00007fffebd95f24 in Interpret (cx=<optimized out>, 
    cx@entry=0x7fffd1d49000, state=...)
    at /home/spectre/src/mozilla-central/js/src/vm/Interpreter.cpp:3084
3084	          if (!CallFromStack(cx, args)) {
(gdb) p state
$16 = (js::RunState &) @0x7fffffffabf8: {kind_ = js::RunState::Invoke, 
  script_ = {<js::RootedBase<JSScript*, JS::Rooted<JSScript*> >> = {<js::MutableWrappedPtrOperations<JSScript*, JS::Rooted<JSScript*> >> = {<js::WrappedPtrOperations<JSScript*, JS::Rooted<JSScript*> >> = {<No data fields>}, <No data fields>}, <No data fields>}, stack = 0x7fffd1d49030, prev = 0x7fffffffb2e0, 
    ptr = 0x29260e29c4c0}}
(gdb) x/32lx state.script_.stack
0x7fffd1d49030:	0xffffa8f0	0x00007fff	0xffffa8d8	0x00007fff
0x7fffd1d49040:	0xffffa8a8	0x00007fff	0x00000000	0x00000000
0x7fffd1d49050:	0x00000000	0x00000000	0x00000000	0x00000000
0x7fffd1d49060:	0xffffa3f0	0x00007fff	0xffffa7e8	0x00007fff
0x7fffd1d49070:	0xffffa170	0x00007fff	0xffffad80	0x00007fff
0x7fffd1d49080:	0x00000000	0x00000000	0x00000000	0x00000000
0x7fffd1d49090:	0xd0bba800	0x00007fff	0xd1d4c000	0x00007fff
0x7fffd1d490a0:	0xffdfe541	0x00007fff	0xffe00d41	0x00007fff
(gdb) x/4lx 0x00007fffffffa8f0
0x7fffffffa8f0:	0xd1d49030	0x00007fff	0xffffa7b8	0x00007fff
(gdb) x/4lx 0x00007fffffffa8d8
0x7fffffffa8d8:	0xd1d49038	0x00007fff	0xffffaf10	0x00007fff
(gdb) p args
$17 = {<JS::detail::CallArgsBase<JS::detail::IncludeUsedRval>> = {
    argv_ = 0x7fffa972f550, argc_ = 2, constructing_ = false, 
    ignoresReturnValue_ = false, wantUsedRval_ = {
      usedRval_ = false}}, <No data fields>}
(gdb) x/32lx args.argv_
0x7fffa972f550:	0x57be3960	0xfffb0f7d	0x00000000	0xfff88000
0x7fffa972f560:	0x00000000	0xfff98000	0x00000000	0xfff90000
0x7fffa972f570:	0xcdcdcdcd	0xcdcdcdcd	0xcdcdcdcd	0xcdcdcdcd
0x7fffa972f580:	0xcdcdcdcd	0xcdcdcdcd	0xcdcdcdcd	0xcdcdcdcd
0x7fffa972f590:	0xcdcdcdcd	0xcdcdcdcd	0xcdcdcdcd	0xcdcdcdcd
0x7fffa972f5a0:	0xcdcdcdcd	0xcdcdcdcd	0xcdcdcdcd	0xcdcdcdcd
0x7fffa972f5b0:	0xcdcdcdcd	0xcdcdcdcd	0xcdcdcdcd	0xcdcdcdcd
0x7fffa972f5c0:	0xcdcdcdcd	0xcdcdcdcd	0xcdcdcdcd	0xcdcdcdcd

This is already wrong because the default value should be 0.0, not a NaN. But what should the call args look like? I see 0xfff8800000000000 which I assume is probably corresponding to aArgc. The 0xfff9800000000000 value is already present, so I'm assuming the call args are on the stack in LIFO order. By the time XPConnect is invoked, the wrong value is already being passed to it.

The default value for a js::Value that hasn't been filled in yet is 0xfff9800000000000 / UndefinedValue so it isn't surprising to see it in a few places in CallArgs which includes things like the 'this' pointer. That said, this may have been miscompiled earlier and we are just pushing it around.

I've managed to kick off a build in QEMU and if it finishes before tomorrow, I should be able to poke around in GDB myself and save some round-trips. Using gcc-9 on my normal x86_64 build didn't seem to turn up anything obvious.

One possible thing to try if you can is to look at [1] and replace that with a mozilla::BitwiseCast<uint64_t> or just a bare memcpy. This code is definitely in violation of C++ aliasing rules, and is unique to ppc64 targets. It was also last looked at more than a decade ago so it is entirely possible as modern GCC got more clever that this slipped though.

[1] https://searchfox.org/mozilla-central/source/xpcom/reflect/xptcall/md/unix/xptcinvoke_ppc64_linux.cpp#70

@tcampbell, if access to a real ppc hw would help, let me know.

I don't think that would cause this because it would only be triggered if the number of float arguments overflowed (there's only one here and tons of volatile float registers available), but just to be sure, I tried

         if (!s->IsIndirect() && s->type == nsXPTType::T_DOUBLE) {
             if (i < FPR_COUNT)
                 fpregs[i]    = s->val.d;
             else
-                *(double *)d = s->val.d;
+               memcpy(reinterpret_cast<char*>(d), &(s->val.d), sizeof(s->val.d));

and got the same result. I agree it should be corrected and I will do this in a followup but I don't think that's the bug here.

Good point about number of floats. I think next step is I'll take Dan up on his offer (-> email) and poke around later this week.

@tcampbell, you should have an email with details, if not, please check spam folder.

Priority: -- → P1

Using the server that Dan provided access to, I am able to reproduce the problem.

This is not a miscompilation and is actually a decade on XPCOM bug. The gist of the problem is that when passing floating point arguments we should use the first-available floating pointer register and not the i-th floating point register.

I'll put a patch up this week. I was able to revert the patches here and on Bug 1512162 and successfully type into URL and load reddit.

Wow! Thanks for doing that!

That might be ABI-specific, though. On 32-bit and I think 64BE, it should still be i-th, so maybe this should be specific to 64LE. I should have caught that while looking at the code, though.

Ted, that's an awesome news.

Cameron, you seem to have more familiarity with this stuff so maybe you can work out the patches here? I realize I'm not familiar with the different ABI changes here.

Files in mozilla-central/xpcom/reflect/xptcall/md need to be tweaked. The X64 platform is an example to use: https://searchfox.org/mozilla-central/source/xpcom/reflect/xptcall/md/unix/xptcinvoke_x86_64_unix.cpp

Flags: needinfo?(spectre)

Yes, I can do that. What was the change you made, though, just so we're looking at the same block of code?

Flags: needinfo?(spectre)
Attached patch WIP - Fix XPCOM glue for ppc64 (obsolete) — Splinter Review

This fixes the issue for me by tweaking the XPCOM glue code. It should probably better follow the x86_64 pattern for simplicity. One area I was very unclear about was the 'if (i > = 7)' line.

The other area of uncertainty for me is what the ABI is for ppc, and ppc64be. The original ppc glue has similar looking code.

The plain _ppc_ ones are 32-bit. We should leave those alone, because I know they're correct (I use them on TenFourFox). It looks like your change didn't touch those, which is fine.

Unfortunately, the _ppc64_ ones are used for both BE and LE. I need to look at the ABI spec again to make sure I get this right.

It looks like from comparing v1 and v2 ABIs that register selection is the same, so the change should work for both (most BE systems are v1). I'm writing up the patch right now and will test it. What's amazing is that this issue escaped notice for so long.

See also: https://searchfox.org/mozilla-central/source/xpcom/reflect/xptcall/md/unix/xptcstubs_ppc64_linux.cpp#135

My patch isn't quite right. It appears by ABI that a float arg advances gp index, but not vice-versa. Worth looking at what GCC does for some examples. The failing code that started this crash was the signature of [1].

[1] https://searchfox.org/mozilla-central/rev/7ed8e2d3d1d7a1464ba42763a33fd2e60efcaedc/modules/libpref/Preferences.cpp#2167

Yep, I've already accounted for that. I'm trying to work around the profiler compile issues, which is slowing down my rebuild.

Attachment #9091280 - Attachment is obsolete: true
Attached file xptcinvoke_ppc64_linux.cpp (obsolete) —

I think it works. With all the hackarounds backed out and the profiler compile fixes in, I no longer get assertions in debug, and the build works fine in opt. Before I unwind all the fixes in the tree, is this what you meant by "following the x86_64 pattern"? I tried to harmonize them as much as possible without requiring additional changes in the assembly stubs. If this looks fine, I'll turn it into a Phabricator patch.

Attachment #9088619 - Attachment is obsolete: true
Attachment #9090267 - Attachment is obsolete: true
Attachment #9091275 - Attachment is obsolete: true
Attachment #9093125 - Attachment is obsolete: true
Attachment #9093193 - Flags: feedback?(tcampbell)
Comment on attachment 9093193 [details] xptcinvoke_ppc64_linux.cpp Excellent! Nit: Add a space or two to indent the URLs to it is a little easier to read.
Attachment #9093193 - Flags: feedback?(tcampbell) → feedback+

I wonder if we should use the opportunity to normalize the xptcstubs_ppc64_linux.cpp to better match the x86_64 version as well. The PrepareAndDispatch can be almost identical except for [1] where we would also need to increment nr_gpr.

[1] https://searchfox.org/mozilla-central/rev/7ed8e2d3d1d7a1464ba42763a33fd2e60efcaedc/xpcom/reflect/xptcall/md/unix/xptcstubs_x86_64_linux.cpp#80

Oh, wow, that's wrong too. I'll work on that today.

Sorry, I linked to the x86_64 as an example. The ppc64 code actually looks correct thanks to Bug 1498938. My proposal is simply to make it similar to X86_64 so it is more likely to be maintained correctly.

Oh, I see it incrementing the GPR count at the end. Yes, that's certainly less clear.

I'm reworking the original one. There's a bug in it that's causing weird font sizes that goes away when I revert it.

Attachment #9093193 - Attachment is obsolete: true
Attached file xptcinvoke_ppc64_linux.cpp (obsolete) —

Fixed. I had a typo in the float section that I missed.

Attachment #9093511 - Flags: feedback?(tcampbell)
Attached file xptcstubs_ppc64_linux.cpp (obsolete) —

And here is the stubs one. Does this look good to you as well? I also made a tiny change to the assembly (not here, but I will put in the final patch) to hoist the mtlr a little earlier, which since this gets called a lot, should help smooth things a bit.

Assuming so, do you want all the changes together, including the backouts for bug 1512162 and the temporary fix here, or should I make those separate Phabricator submissions?

Attachment #9093512 - Flags: feedback?(tcampbell)

(I'll fix the spacing on the URLs, I just noticed I forgot to do that in the original file.)

Comment on attachment 9093511 [details] xptcinvoke_ppc64_linux.cpp Excellent! Nit: Remove the deprecated 'invoke_count_words' code. We seem to inline it in xptcinvoke_asm_ppc64_linux.S these days. Nit: nr_gpr++; // ABI requires skipping gprs even for floating-point arguments.
Attachment #9093511 - Flags: feedback?(tcampbell) → feedback+
Attachment #9093512 - Flags: feedback?(tcampbell) → feedback+

My advice for final phabricator patches is to make Part 1 to fix up the xptc* files and fix the bug, and then add part 2 to revert the work-arounds. It will make history easier to follow in 5 years :)

(We should also add froydnj as a blocking reviewer on phabricator)

I forgot to put the comment in, but I think it's addressed by the example, isn't it?

For the record, no crashes observed with the last 2 patches from Cameron applied on trunk.

@msirringhaus, you're probably in the best position to verify on big-endian ppc64 based on your comment in bug 1511604 (I'd ask @awilfox but that account appears to be disabled). Please note the above patches for reference and advise right away if a follow up patch is required for BE. I don't have any BE ppc64 systems running Linux or *BSD, but I have preserved the semantics, so it should "just work."

We (SUSE) actually only do ppc64le builds. Our big endian platform is s390x, which is unaffected by this changeset, AFAIK.
In openSUSE I can at least build for ppc64, but I don't have a test-machine available either.
I'll ask around to see if I can find something.

I did try out your patches for ppc64le on 68esr though, which seems to works fine.

OK, thanks. Let me know if you find a ppc64 build you can play with. Correct, s390x is unaffected by this specific problem. Thanks for also testing on 68ESR.

In the meantime, I'll ask for check-in and then we can start moving it up to beta and ESR once it sticks on central.

Keywords: checkin-needed

Pushed by tcampbell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4887ef7af824
Fix and tune xptcall for ppc64le, and harmonize with x86_64's. r=tcampbell,froydnj
https://hg.mozilla.org/integration/autoland/rev/2d6979f57d6a
part 2 - remove now unnecessary workarounds for ppc64le in js/src/gc and js/xpconnect/src. r=tcampbell

Keywords: checkin-needed
Attachment #9093511 - Attachment is obsolete: true
Attachment #9093512 - Attachment is obsolete: true

Comment on attachment 9093808 [details]
Bug 1576303 - Fix and tune xptcall for ppc64le, and harmonize with x86_64's. r?tcampbell,froydnj

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: This PPC64-only patch fixes the a bug that causes browser to never start.
  • User impact if declined: PPC64(LE) builds on Fedora/SUSE/?? will not be usable.
  • Fix Landed on Version: 71
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Changes are only in ppc64 specific files. These are not-part-of-the-build for any other platform.
  • String or UUID changes made by this patch:
Attachment #9093808 - Flags: approval-mozilla-esr68?

Comment on attachment 9093809 [details]
Bug 1576303 part 2 - remove now unnecessary workarounds for ppc64le in js/src/gc and js/xpconnect/src. r?tcampbell,froydnj

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: In Bug 1512162 and Bug 1576303, we uplifted workarounds to ESR68 for PPC64 builds. This patch reverts both of them in favour of "Bug 1576303 - Fix and tune xptcall for ppc64le, and harmonize with x86_64's" which is being requested to uplift. These changes where only #ifdefs that apply to PPC64 configuration. It is easy to remove them so we should remove these workarounds from ESR68 and remain aligned with m-c.
  • User impact if declined: Unnecessary work-around remain in ESR68 that don't exist anywhere else. This patch isn't strictly needed to fix any bug though.
  • Fix Landed on Version: 71
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): #ifdef for tier-3 platform only. We are reverting other uplifts to esr68 in favour of the original code.
  • String or UUID changes made by this patch:
Attachment #9093809 - Flags: approval-mozilla-esr68?

I'm going to ask for this on beta too, since the Barrier.h workaround is there.

Comment on attachment 9093808 [details]
Bug 1576303 - Fix and tune xptcall for ppc64le, and harmonize with x86_64's. r?tcampbell,froydnj

Beta/Release Uplift Approval Request

  • User impact if declined: Poorer performance and possibly crashes on ppc64le depending on optimization level. The issue also potentially affects BE ppc64 though we haven't seen any clear reports yet.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Strictly, unbelievably, completely provably NPOTB: the xpconnect pieces are strictly for ppc64, and the other parts just back out ppc64le specific workarounds that were never part of any Tier-1 build.
  • String changes made/needed: none
Attachment #9093808 - Flags: approval-mozilla-beta?
Attachment #9093809 - Flags: approval-mozilla-beta?

Comment on attachment 9093808 [details]
Bug 1576303 - Fix and tune xptcall for ppc64le, and harmonize with x86_64's. r?tcampbell,froydnj

Removes a workaround, NPOTB. Let's uplift for beta 9.

Attachment #9093808 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9093809 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Can we close this bug now?

Flags: needinfo?(spectre)

Comment on attachment 9093808 [details]
Bug 1576303 - Fix and tune xptcall for ppc64le, and harmonize with x86_64's. r?tcampbell,froydnj

Approved for 68.2esr also.

Attachment #9093808 - Flags: approval-mozilla-esr68? → approval-mozilla-esr68+
Attachment #9093809 - Flags: approval-mozilla-esr68? → approval-mozilla-esr68+
Target Milestone: --- → mozilla71

I tried the mozilla-esr68 branch on a PPC64LE machine using the STR from this bug and the browser no longer crashes. Closing as fixed.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(spectre)
Resolution: --- → FIXED

Yes, sorry, I forgot to remove the keyword.

Depends on: 1690152
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: