Closed Bug 1339190 Opened 7 years ago Closed 7 years ago

32 bit shell build startup crash

Categories

(Core :: JavaScript Engine: JIT, defect, P1)

x86
Linux
defect

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox54 --- fixed
firefox55 --- fixed

People

(Reporter: decoder, Assigned: sstangl)

References

Details

Attachments

(1 file, 2 obsolete files)

Right now I'm seeing a startup crash on our 32-bit shell builds:

Thread 1 "js" received signal SIGSEGV, Segmentation fault.
0x00000000 in ?? ()
(gdb) bt
#0  0x00000000 in ?? ()
#1  0x08a6f471 in loadiJIT_Funcs () at js/src/vtune/jitprofiling.c:263
#2  0x08a6f58d in loadiJIT_Funcs () at js/src/vtune/jitprofiling.c:170
#3  iJIT_IsProfilingActive () at js/src/vtune/jitprofiling.c:166
#4  0x08a89d5c in js::vtune::IsProfilingActive () at js/src/vtune/VTuneWrapper.h:29
#5  js::vtune::MarkStub (code=0xed45e038, name=0x8ab2e38 "ProfilerExitFrameStub") at js/src/vtune/VTuneWrapper.cpp:34
#6  0x0854fadf in js::jit::JitRuntime::generateProfilerExitFrameTailStub (this=0xed037800, cx=0xf7945800) at js/src/jit/x86/Trampoline-x86.cpp:1365
#7  0x08300b50 in js::jit::JitRuntime::initialize (this=0xed037800, cx=0xf7945800, lock=...) at js/src/jit/Ion.cpp:230
#8  0x08572d63 in JSRuntime::createJitRuntime (this=0xf7941000, cx=0xf7945800) at js/src/jscompartment.cpp:192
#9  0x089e493c in JSRuntime::getJitRuntime (cx=0xf7945800, this=<optimized out>) at js/src/vm/Runtime.h:640
#10 JS::Zone::createJitZone (this=0xf7947000, cx=0xf7945800) at js/src/gc/Zone.cpp:306
#11 0x0857303c in JS::Zone::getJitZone (cx=0xf7945800, this=0xf7947000) at js/src/gc/Zone.h:277
#12 JSCompartment::ensureJitCompartmentExists (this=0xf7948000, cx=0xf7945800) at js/src/jscompartment.cpp:209
#13 0x08215414 in CanEnterBaselineJIT (cx=cx@entry=0xf7945800, script=..., script@entry=..., osrFrame=osrFrame@entry=0x0) at js/src/jit/BaselineJIT.cpp:335
#14 0x082155e4 in js::jit::CanEnterBaselineMethod (cx=0xf7945800, state=...) at js/src/jit/BaselineJIT.cpp:407
#15 0x0816732c in js::RunScript (cx=0xf7945800, state=...) at js/src/vm/Interpreter.cpp:392
#16 0x081697a4 in js::ExecuteKernel (cx=0xf7945800, script=..., envChainArg=..., newTargetValue=..., evalInFrame=..., result=0xffffd018) at js/src/vm/Interpreter.cpp:687
#17 0x08169b89 in js::Execute (cx=0xf7945800, script=..., envChainArg=..., rval=0xffffd018) at js/src/vm/Interpreter.cpp:720
#18 0x08542aab in Evaluate (cx=cx@entry=0xf7945800, scopeKind=scopeKind@entry=js::ScopeKind::Global, env=..., optionsArg=..., srcBuf=..., rval=...) at js/src/jsapi.cpp:4551
#19 0x085430ef in JS::Evaluate (cx=0xf7945800, options=..., 
    bytes=0xf7985000 "var std_Symbol = Symbol;\nvar std_WeakMap = WeakMap;\nvar std_StopIteration = StopIteration;\nfunction List() {\n    this.length = 0;\n}\nMakeConstructible(List, {__proto__: null});\nfunction Record() {\n    "..., length=299190, rval=...)
    at js/src/jsapi.cpp:4603
#20 0x087d5879 in JSRuntime::initSelfHosting (this=0xf7941000, cx=0xf7945800) at js/src/vm/SelfHosting.cpp:2922
#21 0x0852a4ff in JS::InitSelfHostedCode (cx=0xf7945800) at js/src/jsapi.cpp:627
#22 0x08076240 in main (argc=1, argv=0xffffd564, envp=0xffffd56c) at js/src/shell/js.cpp:8131


Compiler is gcc (Ubuntu 5.3.1-14ubuntu2) 5.3.1 20160413.

Build flags are --enable-posix-nspr-emulation --enable-valgrind --enable-gczeal --disable-tests --enable-stdcxx-compat --enable-debug --without-intl-api --enable-optimize --target=i686-pc-linux-gnu
Flags: needinfo?(jdemooij)
Likely from the vtune integration, bug 1332466.
Flags: needinfo?(jdemooij) → needinfo?(sstangl)
I can't reproduce locally, with/without VTune.

The crashing line is a call to dlopen(). I don't see what could be calling nullptr. Is there any chance that you could poke around in GDB and see what exactly is null here?
I can reproduce, with the following configure line, on ubuntu 16.04 LTS:

CC="gcc -m32 -march=pentium-m -msse -msse2 -mfpmath=sse" CXX="g++ -m32 -march=pentium-m -msse -msse2 -mfpmath=sse" AR="ar" /code/mozilla-inbound/js/src/configure  --enable-ccache --enable-debug --disable-optimize --without-intl-api --target=i686-pc-linux --host=i686-pc-linux

Does this help?

I've tried stepping in gdb, using the stepi from the dlopen call, but then a lot of unanotated assembly shows up, and there's no real way to find what the issue is (gone through ~100 instructions). At some point PC is set to 0, that's all I can say. Is there some library dependency for vtune support?

Another thing I could do, if that would help, is to set up a virtual machine on the cloudz and get your SSH public key so you can log in and try all the things you want there, by yourself. Would that help?
It reproduces on Gary's machine. The crash is within dlopen(), when you ask it to dlopen() a library that does not exist.

It only reproduces on Ubuntu. A short program that just calls dlopen() with a library that isn't found doesn't crash, so something must be different about the context in which it's being called here.

dlopen() is spec'd to return NULL in this case.

Single-stepping through the code (which is different only due to processor variation between the two machines), the Ubuntu flavor fails at the end of _dl_runtime_resolve_avx+344, where it calls |bnd jmpq *%r11|. The Fedora flavor has an %r11 of "0x7ffff79b5f70", corresponding correctly to dlopen@@GLIBC_2.2.5, so it can return, while Ubuntu's %r11 is NULL.

I'll look into what's wrong with the Ubuntu version, but it looks like this is just a problem with Ubuntu's version of dlopen(). So Ubuntu users will probably need some workaround, because this can't be fixed on our end.
Fedora's working glibc is 2.24-4.fc25 (2016-12-23).
Ubuntu's broken glibc is 2.23-0ubuntu3 (2016-02-19).
Investigating further, it seems that _dl_runtime_resolve_avx calls _dl_fixup ~50 times, which eventually returns an incorrect address that causes re-entry into _dl_runtime_resolve_avx, at which point _dl_fixup returns NULL but that address is used as a jump target.

Gary just confirmed that Nightly doesn't crash, even on Ubuntu, which includes the same code. So for better or worse, this only manifests in shell builds.
Given that crashes appear to only manifest in the Ubuntu shell version and not in the browser, this patch changes --enable-profiling to no longer imply --enable-vtune. Then --enable-vtune is added to the mozconfigs of the platforms on which it was included by detection.

For local builds, this disables VTune integration unless --enable-vtune is passed.

For nightlies, there should be no change.
Flags: needinfo?(sstangl)
Attachment #8843411 - Flags: review?(mh+mozilla)
Comment on attachment 8843411 [details] [diff] [review]
0001-Bug-1339190-Make-enable-profiling-not-imply-enable-v.patch

Review of attachment 8843411 [details] [diff] [review]:
-----------------------------------------------------------------

We should have differences between local builds and automation builds, not more.

With that being said, I reproduced the bug locally, and it happens because dlopen *itself* is NULL, which in turn, is because dlopen and friends are defined as weak, and when linking the js binary with -ldl on Ubuntu, where -Wl,--as-needed is default, results in libdl not being loaded.

Why did you need all these weak attributes in ittnotify_config.h? that seems very wrong.
Attachment #8843411 - Flags: review?(mh+mozilla) → review-
(In reply to Mike Hommey [:glandium] from comment #9)
> We should have differences between local builds and automation builds, not
> more.

Missing word: We should have *less* (...)
(In reply to Mike Hommey [:glandium] from comment #9)
> With that being said, I reproduced the bug locally, and it happens because
> dlopen *itself* is NULL, which in turn, is because dlopen and friends are
> defined as weak, and when linking the js binary with -ldl on Ubuntu, where
> -Wl,--as-needed is default, results in libdl not being loaded.

That wasn't the case on Gary's machine. The call to dlopen() occurred, and could be stepped through, but when it returned to the PLT it attempted to re-enter dlopen() and crashed. But that's helpful, thanks.

> Why did you need all these weak attributes in ittnotify_config.h? that seems
> very wrong.

That file comes from Intel.
Gary confirmed that removing the "weak" attribute fixes the crash on his machine. (Although he wasn't able to reproduce the crash with current mozilla-central, and had to verify on an older revision.)

Thanks for your help, Mike!
Attachment #8843411 - Attachment is obsolete: true
Attachment #8845209 - Flags: review?(bbouvier)
If the header comes from intel, you risk this change being reverted when it's updated next...
Historically, I'm the one that pulls the changes.

I can add a note into js/src/vtune/README, which contains instructions for updating and doesn't get clobbered. Probably a good idea.
Comment on attachment 8845209 [details] [diff] [review]
0001-Remove-__attribute__-weak-from-VTune-headers.patch

Review of attachment 8845209 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, it fixes the issue locally (that had expand to 64 bits builds too on my machine).

However, this introduces a gazillion warnings at compile-time, like the following:

/code/mozilla-inbound/js/src/vtune/ittnotify_static.c:1123:29: warning: the address of ‘dlsym’ will always evaluate as ‘true’ [-Waddress]
                 if (DL_SYMBOLS && (groups != __itt_group_none || lib_name != NULL))
                             ^
/code/mozilla-inbound/js/src/vtune/ittnotify_static.c:1123:38: warning: the address of ‘dlclose’ will always evaluate as ‘true’ [-Waddress]
                 if (DL_SYMBOLS && (groups != __itt_group_none || lib_name != NULL))
                                      ^
/code/mozilla-inbound/js/src/vtune/ittnotify_static.c:1210:14: warning: the address of ‘pthread_mutex_init’ will always evaluate as ‘true’ [-Waddress]
         if (PTHREAD_SYMBOLS) __itt_mutex_unlock(&_N_(_ittapi_global).mutex);
              ^
/code/mozilla-inbound/js/src/vtune/ittnotify_static.c:1210:33: warning: the address of ‘pthread_mutex_lock’ will always evaluate as ‘true’ [-Waddress]
         if (PTHREAD_SYMBOLS) __itt_mutex_unlock(&_N_(_ittapi_global).mutex);
                                 ^
/code/mozilla-inbound/js/src/vtune/ittnotify_static.c:1210:55: warning: the address of ‘pthread_mutex_unlock’ will always evaluate as ‘true’ [-Waddress]
         if (PTHREAD_SYMBOLS) __itt_mutex_unlock(&_N_(_ittapi_global).mutex);
                                                       ^
/code/mozilla-inbound/js/src/vtune/ittnotify_static.c:1210:79: warning: the address of ‘pthread_mutex_destroy’ will always evaluate as ‘true’ [-Waddress]
/code/mozilla-inbound/js/src/vtune/ittnotify_static.c:1210:104: warning: the address of ‘pthread_mutexattr_init’ will always evaluate as ‘true’ [-Waddress]
/code/mozilla-inbound/js/src/vtune/ittnotify_static.c:1210:130: warning: the address of ‘pthread_mutexattr_settype’ will always evaluate as ‘true’ [-Waddress]
/code/mozilla-inbound/js/src/vtune/ittnotify_static.c:1210:159: warning: the address of ‘pthread_mutexattr_destroy’ will always evaluate as ‘true’ [-Waddress]
/code/mozilla-inbound/js/src/vtune/ittnotify_static.c:1210:188: warning: the address of ‘pthread_self’ will always evaluate as ‘true’ [-Waddress]

I'll defer to your judgement whether it's okay to have them or not (since it's 3rd party code), but this might cause breakages on treeherder. Thanks for the patch though, and adding a note to the README describing how to update and not clobber this change sounds like a good idea too.
Attachment #8845209 - Flags: review?(bbouvier) → review+
DL_SYMBOLS and PTHREAD_SYMBOLS test that all the weak symbols are defined. Since they are not weak anymore, they are always defined, and the compiler is complaining about the fact that the conditions are always true as a consequence. I think changing the #defines to expand to 1 would remove the warnings.
(In reply to Mike Hommey [:glandium] from comment #16)
> I think changing the #defines to expand to 1 would remove the warnings.

That patch is slowly running through Tryserver as we speak.
Tree's closed, so uploading here to use checkin-needed. Carrying r+ from Comment 15.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c0da6b4da0b551e136edd28814c3f5a42db38a51

Will need to be backported to aurora (no change).
Attachment #8845209 - Attachment is obsolete: true
Assignee: nobody → sstangl
Keywords: checkin-needed
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5a03382283ae
Fix Ubuntu shell startup crash. r=bbouvier
Keywords: checkin-needed
Priority: -- → P1
Blocks: 1332466
Comment on attachment 8845677 [details] [diff] [review]
0001-Bug-1339190-Fix-Ubuntu-shell-startup-crash.-r-bbouvi.patch

Approval Request Comment
[Feature/Bug causing the regression]: 1332466
[User impact if declined]: Ubuntu JS shell startup crashes. Never reproduced in the browser, but theoretically the browser can be affected also depending on properties of the executable. It would manifest as a startup crash.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: Yes.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: It just removes a bogus function attribute tag.
[String changes made/needed]: None.
Attachment #8845677 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/5a03382283ae
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Comment on attachment 8845677 [details] [diff] [review]
0001-Bug-1339190-Fix-Ubuntu-shell-startup-crash.-r-bbouvi.patch

fix a linux shell startup crash in aurora54
Attachment #8845677 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: