Closed
Bug 1399746
Opened 7 years ago
Closed 6 years ago
On Mac OSX, NSPR _PR_Fini gets called too early, other cleanup code calls NSPR functions afterwards, which fails | Assertion failure: 0 == rv, at mozilla/nsprpub/pr/src/pthreads/ptthread.c:292
Categories
(NSPR :: NSPR, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
4.19
People
(Reporter: jorgk-bmo, Assigned: glandium)
References
Details
(Whiteboard: [Thunderbird-testfailure: XZ Mac only])
Attachments
(2 files, 5 obsolete files)
3.14 KB,
patch
|
Details | Diff | Splinter Review | |
1.16 KB,
patch
|
KaiE
:
review+
|
Details | Diff | Splinter Review |
Xpcshell: TEST-UNEXPECTED-FAIL | dom/push/test/xpcshell/test_permissions.js | xpcshell return code: -11 PID 6447 | Assertion failure: 0 == rv, at /builds/slave/tb-c-cen-m64-d-000000000000000/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:292 and many more. Mozmill: Assertion failure: 0 == rv, at /builds/slave/tb-c-cen-m64-d-000000000000000/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:292 [log…] PROCESS-CRASH | content-policy | application crashed [@ mozalloc_abort(char const*)]
Reporter | ||
Comment 1•7 years ago
|
||
Those crashed started here Wed Sep 13, 2017, 15:16:44 but only on Mac. Wed Sep 13, 2017, 20:08:24 only on Mac, Wed Sep 13, 2017, 23:45:36 on Mac and Windows. Strange. Going by the first failure: M-C last good: 9873269746359dbcccbd5b44f0e82c0225 M-C first bad: 1888ec2f277f6bb26271b8808e08914a21 https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=9873269746359dbcccbd5b44f0e82c0225&tochange=1888ec2f277f6bb26271b8808e08914a21 Looks like I have to reproduce it locally to figure out what's going on.
Summary: Various Xpshell and Mozmill failures in debug mode on 2017-09-14: PID 6709 | Assertion failure: 0 == rv, at mozilla/nsprpub/pr/src/pthreads/ptthread.c:292 → Various Xpshell and Mozmill failures in debug mode on 2017-09-13: PID 6709 | Assertion failure: 0 == rv, at mozilla/nsprpub/pr/src/pthreads/ptthread.c:292
Reporter | ||
Comment 2•7 years ago
|
||
The failures are Mac only, nothing I can reproduce locally here. The Windows debug X is green, if it's not, it's cause by something else, the Windows debug Z is green or orange due to bug 1377692 or something else. The Windows debug Z orange comes and goes.
Whiteboard: [Thunderbird-testfailure: XZ Mac only]
Comment 3•7 years ago
|
||
FWIW I suspect from https://hg.mozilla.org/mozilla-central/rev/a104181ac18b. That adds a PR_GetCurrentThread call that can happen very early at startup. Not sure why should pthreads fail though.
Comment 4•7 years ago
|
||
Failing pt_AttachThread() is called from PR_GetCurrentThread(). A call to PR_GetCurrentThread() was added in Bug 1393710 during the regression range.
Reporter | ||
Comment 5•7 years ago
|
||
James and Nathan, do you feel responsible ;-)
Flags: needinfo?(nfroyd)
Flags: needinfo?(jacheng)
Comment 6•7 years ago
|
||
(In reply to Stefan Sitter from comment #4) > Failing pt_AttachThread() is called from PR_GetCurrentThread(). A call to > PR_GetCurrentThread() was added in Bug 1393710 during the regression range. No. :) Looks like the only way pt_AttachThread() can fail is if NSPR isn't initialized. Perhaps adding a PR_Initialize call somewhere in the Thunderbird startup sequence would be appropriate. Would also be good to try and figure out how Thunderbird startup is so different from Firefox in this respect.
Flags: needinfo?(nfroyd)
Flags: needinfo?(jacheng)
Reporter | ||
Comment 7•7 years ago
|
||
Our mainline at https://dxr.mozilla.org/comm-central/source/mail/app/nsMailApp.cpp it almost identical to https://dxr.mozilla.org/mozilla-central/source/browser/app/nsBrowserApp.cpp So were in the startup sequence is the call to PR_Initialize in FF? pt_AttachThread() asserts after calling |pthread_setspecific(pt_book.key, thred);| and there is a |if (!_pr_initialized) return NULL;| before, so if not initialised, it would have returned early. Can you please clarify your analysis. Also, why only problems on Mac?
Flags: needinfo?(nfroyd)
Comment 8•7 years ago
|
||
(In reply to Jorg K (GMT+2) from comment #7) > Our mainline at > https://dxr.mozilla.org/comm-central/source/mail/app/nsMailApp.cpp > it almost identical to > https://dxr.mozilla.org/mozilla-central/source/browser/app/nsBrowserApp.cpp > > So were in the startup sequence is the call to PR_Initialize in FF? I don't think we ever call it. > pt_AttachThread() asserts after calling |pthread_setspecific(pt_book.key, > thred);| and there is a |if (!_pr_initialized) return NULL;| before, so if > not initialised, it would have returned early. > > Can you please clarify your analysis. Also, why only problems on Mac? Oh, I see, we're asserting inside pt_AttachThread(). I don't know why _PR_InitThreads wouldn't have gotten called, then...unless somebody is racing the _PR_InitThreads call, which seems like a possibility. I don't know why it's only a problem on Mac.
Flags: needinfo?(nfroyd)
Reporter | ||
Comment 9•7 years ago
|
||
I'm not sure what makes you think a missing call to _PR_InitThreads is to blame here. I can't see where any of this stuff is initialised and I also can't see where pthread_setspecific() is defined and why it would fail. I don't even know what NSPR is.
Comment 10•7 years ago
|
||
(In reply to Jorg K (GMT+2) from comment #9) > I'm not sure what makes you think a missing call to _PR_InitThreads is to > blame here. I can't see where any of this stuff is initialised and I also > can't see where pthread_setspecific() is defined and why it would fail. I > don't even know what NSPR is. OK, so, this is the line that is failing: rv = pthread_setspecific(pt_book.key, thred); PR_ASSERT(0 == rv); So the pthread_setspecific call is failing. When does pthread_setspecific return errors? Consulting: http://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_setspecific.html we see: The pthread_setspecific() function will fail if: [ENOMEM] Insufficient memory exists to associate the value with the key. The pthread_setspecific() function may fail if: [EINVAL] The key value is invalid. No errors are returned from pthread_getspecific(). My guess is that we have sufficient memory, so we're probably hitting the EINVAL case. The Linux man page is a little more helpful about EINVAL: pthread_key_delete and pthread_setspecific return the following error code on error: EINVAL key is not a valid, allocated TSD key so if we're getting EINVAL, that means we haven't called allocated the key, and we can only allocate keys via pthread_key_create. OK, so where does pthread_key_create get called? We must only call it in one place, and indeed, we call it in _PR_InitThreads: http://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptthread.c#992 I don't know why it gets hidden behind _PT_PTHREAD_KEY_CREATE, but that's NSPR for you. If the pthread_key_create call had failed, we would have asserted there, so we must not have made that call yet. How could we have not made that call? _PR_InitThreads gets called from _PR_InitStuff: http://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/misc/prinit.c#180 and _PR_InitStuff gets called from PR_Init and _PR_ImplicitInitialization (which is called by many NSPR functions if PR_Init has not been called): http://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/misc/prinit.c#266 http://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/misc/prinit.c#219 so NSPR didn't get initialized? That's really weird. Wait, no, it *did* get initialized because in pt_AttachThread: http://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptthread.c#269 we would have returned early if NSPR hadn't been initialized. So, NSPR has been initialized, but we haven't called _PR_InitThreads? That's really weird. How could that possibly happen? Oh, here's a problem: http://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/misc/prinit.c#141 _PR_InitStuff will return if it thinks it's already been called. But it sets _pr_initialized *before* it's actually called all the requisite initialization functions. That's not good; it means we can have a scenario like the following: Thread 1: Call into _PR_InitStuff via some path Thread 1: Set _pr_initialized = PR_TRUE ...before Thread 1 calls _PR_InitThread from _PR_InitStuff... Thread 2: Call pt_AttachThread Thread 2: Check _pr_initialized, find that it's true Thread 2: Try to use pt_book.key, find that it hasn't been initialized, assert I don't know that this is exactly what's happening--for one, I don't know why there would be two threads running around during Thunderbird startup like this. But something like this must be going wrong in Thunderbird startup...but I don't know why Thunderbird startup would be any different than Firefox startup in this respect. Figuring out what thread NS_DebugBreak is getting called on would be a good start; the logs that comment 0 came from aren't linked in the bug, so I can't see those.
Reporter | ||
Comment 11•7 years ago
|
||
Logs can be found on the C-C treeherder, for example here: https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=f8fea55e9bd53d7f068455c89376c04e3a82559e&selectedJob=131000946 https://archive.mozilla.org/pub/thunderbird/tinderbox-builds/comm-central-macosx64-debug/1505383930/comm-central_yosemite_r7-debug_test-xpcshell-bm106-tests1-macosx-build6.txt.gz (Clicking that link has hung my FF twice, so be warned)
Reporter | ||
Comment 12•7 years ago
|
||
I managed to open the 27MB 320282-line log in Chrome but there's nothing interesting: 05:34:23 INFO - PID 6943 | Assertion failure: 0 == rv, at /builds/slave/tb-c-cen-m64-d-000000000000000/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:292 05:34:23 INFO - PID 6943 | Redirecting call to abort() to mozalloc_abort 05:34:23 INFO - PID 6943 | Hit MOZ_CRASH() at /builds/slave/tb-c-cen-m64-d-000000000000000/build/mozilla/memory/mozalloc/mozalloc_abort.cpp:33
Comment 13•7 years ago
|
||
Huh, you're not getting crash stacks on those errors. That's unfortunate. Oh, this seems to be happening during shutdown. Maybe we're trying to get the current thread after we've already shut down NSPR, or something? That would be a lot more believable than the racy scenario outlined above. Perhaps the code in NS_DebugBreak should be updated to not try and get the current thread after XPCOM itself has shutdown, or something like that.
Comment 15•7 years ago
|
||
(In reply to Jorg K (GMT+2) from comment #14) > So who could do that? You could.
Flags: needinfo?(nfroyd)
Reporter | ||
Comment 16•7 years ago
|
||
I wonder what made you say that. You may know that I'm the only person on the Thunderbird team making sure that Thunderbird remains buildable. I'm a jack-of-all-trades (master of none), a mix of secretary, sheriff and developer. I know absolutely nothing about NSPR (which appears to be some sort of Mozilla sub-module). So I don't think I have the right skill set and I also don't have a Mac where that appears to be failing. In general, Thunderbird uses the Mozilla "toolbox" and that toolbox just went bust a bit. So while of course we're sitting on a pile of legacy code that breaks all the time and we address it, see those leaking atoms in January 2017), here I don't see the urge to go and fix the toolbox, à la (quote) "Perhaps the code in NS_DebugBreak should be updated to not try and get the current thread after XPCOM itself has shutdown, or something like that." I'm still remembering your comment in bug 1334771 comment #72 (quote): === I know it's frustrating to have all these messages explode on some random merge. The messages are telling you that there is a problem, and that there's an opportunity to make TB better. === Can you convince me that this is an opportunity to make TB better? Then I will address the issue.
Summary: Various Xpshell and Mozmill failures in debug mode on 2017-09-13: PID 6709 | Assertion failure: 0 == rv, at mozilla/nsprpub/pr/src/pthreads/ptthread.c:292 → Various Xpshell and Mozmill failures in debug mode on Mac on 2017-09-13: PID 6709 | Assertion failure: 0 == rv, at mozilla/nsprpub/pr/src/pthreads/ptthread.c:292
Comment hidden (Intermittent Failures Robot) |
Reporter | ||
Comment 18•7 years ago
|
||
Kai, Wan-Teh and Mike: For a while we couldn't build Thunderbird on Mac, but now that it's working again we see the crashes in debug builds: https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=48a4c3f584cc13bef1fc1c405c7edb312dc7995e&selectedJob=135549592 https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=48a4c3f584cc13bef1fc1c405c7edb312dc7995e&selectedJob=135549591 Nathan suggested that I debug it but it's really beyond me. The error started with landing of bug 1393710 https://hg.mozilla.org/mozilla-central/rev/a104181ac18b (see comment #3). Could you please advise.
Flags: needinfo?(wtc)
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(kaie)
Comment 19•7 years ago
|
||
Jorg: Let's check in this patch to get the information that will validate our hypothesis. This patch expands the PR_ASSERT(0 == rv) macro, and also prints the values of 'rv' and pt_book.keyCreated. We expect rv to be EINVAL. If pt_book.keyCreated is 0 (PR_FALSE), then this validates our hypothesis that NSPR has been shut down by _PR_Fini, which does not set _pr_initialized to PR_FALSE (that line is commented out). I haven't checked in a patch recently, so I will need someone else to push this patch for me. Kai, could you do this for me?
Reporter | ||
Comment 20•7 years ago
|
||
Hi, thanks for your help here. I'll push this to our C-C try on a Mac debug build where the error occurs, OK? Here it is: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=dfa9ce9fe1de66800c8fdfc4eb6f48cd2d26e142 That applies your patch to the m-c (mozilla) directory for a Thunderbird build and you can click on the X and Z later to see the logs.
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(kaie)
Reporter | ||
Comment 21•7 years ago
|
||
I had to do this again since your patch didn't apply, it was missing the nsprpub/. Here we go again: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=adec0e2d6bfe2ba70b2dbe5171d9fb22df8f414f
Reporter | ||
Comment 22•7 years ago
|
||
As you can see here https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=adec0e2d6bfe2ba70b2dbe5171d9fb22df8f414f&selectedJob=135808586 your debug says: Assertion failure: 0 == rv (22), pt_book.keyCreated=0 22 seems to be EINVAL. So what's next?
Comment 23•7 years ago
|
||
Hi Jorg: Yes, 22 is EINVAL. The next step is to track down how NSPR is shut down and what code calls PR_GetCurrentThread() after NSPR has been shut down. NSPR can be shut down in two ways. 1. Via an explicit PR_Cleanup() call. PR_Cleanup() resets _pr_initialized to false. 2. By the "fini" function of the NSPR dynamic shared library (or the dynamic shared library that contains NSPR). In this case, the _PR_Fini function in nsprpub/pr/src/pthreads/ptthread.c is called. _PR_Fini does not modify _pr_initialized. Since _pr_initialized is still true, we know NSPR is shut down by the second method. So some code calls PR_GetCurrentThread() after the NSPR dynamic shared library (or the dynamic shared library that contains NSPR) has been unloaded.
Reporter | ||
Comment 24•7 years ago
|
||
(In reply to Wan-Teh Chang from comment #23) > Yes, 22 is EINVAL. The next step is to track down how NSPR is > shut down and what code calls PR_GetCurrentThread() after NSPR > has been shut down. So how do you suggest this is done? I'm the Thunderbird maintainer and know nothing about NSPR. All I know is that this problem started occurring after this was landed, apparently for debugging purposes: https://hg.mozilla.org/mozilla-central/rev/a104181ac18b#l1.32 As noted above, we only see the problem on Mac. > Since _pr_initialized is still true, we know NSPR is shut down > by the second method. So some code calls PR_GetCurrentThread() > after the NSPR dynamic shared library (or the dynamic shared > library that contains NSPR) has been unloaded. I guess something along the lines of comment #13 needs to be done. Once NSPR is shut down, PR_GetCurrentThread() can't be called any more.
Comment 25•7 years ago
|
||
Hi Jorg: I am very sympathetic. The sad reality is that no single person involved has the setup, expertise, and motivation to debug this. So we just need to do as much as we can. 1. We should get a stack trace of the code that calls NS_DebugBreak by changing the PR_Assert in my patch to a crash. Please replace the following line in my patch: + PR_Assert(msg, __FILE__, __LINE__); with: + fprintf(stderr, "Assertion failure: %s, at %s:%d\n", msg, __FILE__, __LINE__); + fflush(stderr); + { int *p = NULL; *p = 0; } 2. We can add log messages to verify that PR_Cleanup() is not called but _PR_Fini() is called. Both of these functions are defined in nsprpub/pr/src/pthreads/ptthread.c. Please add: fprintf(stderr, "_PR_Fini called\n"); fflush(stderr); to the beginning of _PR_Fini() and add a null pointer dereference: { int *p = NULL; *p = 0; } to the beginning of PR_Cleanup().
Reporter | ||
Comment 26•7 years ago
|
||
OK, try run with these changes: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=f324af5881352700c28f261aa5cdb6ee107957f1 So the idea is to crash pt_AttachThread() to get a stack dump and expect NS_DebugBreak() somewhere on the stack?
Comment 27•7 years ago
|
||
Yes, we want to get a stack trace (by forcing a crash) that shows what code is calling NS_DebugBreak(), which triggers the PR_ASSERT failure in pt_AttachThread().
Reporter | ||
Comment 28•7 years ago
|
||
I see the connection now: NS_DebugBreak() calls PR_GetCurrentThread() and that calls pt_AttachThread(). Seems unfortunate that PR_GetCurrentThread() does some initialisation stuff if called during shutdown. But let's see what the call stack is.
Reporter | ||
Comment 29•7 years ago
|
||
Hmm, the result is not very conclusive: https://archive.mozilla.org/pub/thunderbird/try-builds/mozilla@jorgk.com-f324af5881352700c28f261aa5cdb6ee107957f1/try-comm-central-macosx64-debug/try-comm-central_yosemite_r7-debug_test-xpcshell-bm106-tests1-macosx-build3.txt.gz (opening that hangs my Firefox, so be careful). I see: 20:57:10 INFO - exiting test 20:57:10 INFO - PID 6462 | JavaScript strict warning: resource://gre/modules/IndexedDBHelper.jsm, line 170: ReferenceError: reference to undefined property "result" 20:57:10 INFO - "CONSOLE_MESSAGE: (warn) [JavaScript Warning: "ReferenceError: reference to undefined property "result"" {file: "resource://gre/modules/IndexedDBHelper.jsm" line: 170}]" 20:57:10 INFO - PID 6462 | [6462, Main Thread] WARNING: OOPDeinit() without successful OOPInit(): file /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/toolkit/crashreporter/nsExceptionHandler.cpp, line 3562 20:57:10 INFO - PID 6462 | _PR_Fini called 20:57:10 INFO - PID 6462 | nsStringStats 20:57:10 INFO - PID 6462 | => mAllocCount: 9146 20:57:10 INFO - PID 6462 | => mReallocCount: 140 20:57:10 INFO - PID 6462 | => mFreeCount: 9146 20:57:10 INFO - PID 6462 | => mShareCount: 6841 20:57:10 INFO - PID 6462 | => mAdoptCount: 404 20:57:10 INFO - PID 6462 | => mAdoptFreeCount: 404 20:57:10 INFO - PID 6462 | => Process ID: 6462, Thread ID: 140735196943104 20:57:10 INFO - PID 6462 | Assertion failure: 0 == rv (22), pt_book.keyCreated=0, at /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:299 20:57:10 INFO - <<<<<<< 20:57:10 INFO - TEST-START | dom/push/test/xpcshell/test_clearAll_successful.js but no stack. Others look like 20:57:14 INFO - exiting test 20:57:14 INFO - PID 6464 | [6464, Main Thread] WARNING: unable to post continuation event: file /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/xpcom/io/nsStreamUtils.cpp, line 485 20:57:14 INFO - PID 6464 | [6464, Main Thread] WARNING: some msg dbs left open: '!m_dbCache.Length()', file /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mailnews/db/msgdb/src/nsMsgDatabase.cpp, line 84 20:57:14 INFO - PID 6464 | db left open /private/var/folders/n0/6tpw2p6j3vx8fzyb0wrmlcwc00000w/T/xpc-profile-AEQFF0/ImapMail/localhost/gabba0.msf 20:57:14 INFO - PID 6464 | db left open /private/var/folders/n0/6tpw2p6j3vx8fzyb0wrmlcwc00000w/T/xpc-profile-AEQFF0/ImapMail/localhost/Trash.msf 20:57:14 INFO - PID 6464 | db left open /private/var/folders/n0/6tpw2p6j3vx8fzyb0wrmlcwc00000w/T/xpc-profile-AEQFF0/ImapMail/localhost/Trash.msf 20:57:14 INFO - PID 6464 | [6464, Main Thread] WARNING: OOPDeinit() without successful OOPInit(): file /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/toolkit/crashreporter/nsExceptionHandler.cpp, line 3562 20:57:14 INFO - PID 6464 | _PR_Fini called 20:57:14 INFO - PID 6464 | nsStringStats 20:57:14 INFO - PID 6464 | => mAllocCount: 13771 20:57:14 INFO - PID 6464 | => mReallocCount: 473 20:57:14 INFO - PID 6464 | => mFreeCount: 13760 -- LEAKED 11 !!! 20:57:14 INFO - PID 6464 | => mShareCount: 10692 20:57:14 INFO - PID 6464 | => mAdoptCount: 3108 20:57:14 INFO - PID 6464 | => mAdoptFreeCount: 3107 -- LEAKED 1 !!! 20:57:14 INFO - PID 6464 | => Process ID: 6464, Thread ID: 140735196943104 20:57:14 INFO - PID 6464 | Assertion failure: 0 == rv (22), pt_book.keyCreated=0, at /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:299 20:57:14 INFO - <<<<<<< So it looks like it's exiting a test and wants to report some error that occurred during the test. Please advise further steps.
Reporter | ||
Comment 30•7 years ago
|
||
Marco and Andrew, should that JS error be fixed? [JavaScript Warning: "ReferenceError: reference to undefined property "result"" {file: "resource://gre/modules/IndexedDBHelper.jsm" line: 170}]
Flags: needinfo?(mak77)
Flags: needinfo?(bugmail)
Reporter | ||
Comment 31•7 years ago
|
||
The Mozmill log isn't more conclusive: https://archive.mozilla.org/pub/thunderbird/try-builds/mozilla@jorgk.com-f324af5881352700c28f261aa5cdb6ee107957f1/try-comm-central-macosx64-debug/try-comm-central_yosemite_r7-debug_test-mozmill-bm106-tests1-macosx-build2.txt.gz 20:10:00 INFO - _PR_Fini called 20:10:00 INFO - nsStringStats 20:10:00 INFO - => mAllocCount: 117 20:10:00 INFO - => mReallocCount: 2 20:10:00 INFO - => mFreeCount: 117 20:10:00 INFO - => mShareCount: 269 20:10:00 INFO - => mAdoptCount: 0 20:10:00 INFO - => mAdoptFreeCount: 0 20:10:00 INFO - => Process ID: 722, Thread ID: 140735235392256 20:10:00 INFO - Assertion failure: 0 == rv (22), pt_book.keyCreated=0, at /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:299 Instead of a home-made crash dereferencing null, should we just do a MOZ_CRASH() or similar?
Reporter | ||
Comment 32•7 years ago
|
||
With MOZ_CRASH(): https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=fa55617dcd93679de8c8c36633f6470cbe707ba1
Comment 33•7 years ago
|
||
I don't know much about indexedDB, forwarding to Jan.
Flags: needinfo?(mak77) → needinfo?(jvarga)
Reporter | ||
Comment 34•7 years ago
|
||
(In reply to Jorg K (GMT+2) from comment #32) > With MOZ_CRASH(): That didn't build: Undefined symbols for architecture x86_64: "_MOZ_CRASH", referenced from: _PR_GetCurrentThread in ptthread.o _PR_Cleanup in ptthread.o ld: symbol(s) not found for architecture x86_64
Comment 36•7 years ago
|
||
Jorg: If we can't get a stack trace by dereferencing a null pointer, we can find out if Mac OS X has a function that prints the current stack trace. Also, could you find out if Thunderbird has an NSPR dynamic shared library? The file name would be libnspr4.dylib. Firefox right now puts the NSPR code into some other dynamic shared library. If Thunderbird puts the NSPR code into some other dynamic shared library, then there may be multiple "destructor" functions in that dynamic shared library. If so, we need to make sure _PR_Fini is called after any other "destructor" function that may call an NSPR function. This probably can be accomplished by passing NSPR's object files last to the linker.
Flags: needinfo?(wtc)
Reporter | ||
Comment 37•7 years ago
|
||
There isn't a single reference to libnspr4 in comm-central, so I guess TB does whatever FF does: https://dxr.mozilla.org/comm-central/search?q=libnspr4&redirect=false The Mac "backtrace" is described here: https://developer.apple.com/legacy/library/documentation/Darwin/Reference/ManPages/man3/backtrace.3.html https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=1eff36626eb55c6e8f3dbe04c5093facdf9770f2
Reporter | ||
Comment 38•7 years ago
|
||
Here is something that looks like a stack: https://archive.mozilla.org/pub/thunderbird/try-builds/mozilla@jorgk.com-1eff36626eb55c6e8f3dbe04c5093facdf9770f2/try-comm-central-macosx64-debug/try-comm-central_yosemite_r7-debug_test-mozmill-bm106-tests1-macosx-build3.txt.gz 17:48:25 INFO - _PR_Fini called 17:48:25 INFO - nsStringStats 17:48:25 INFO - => mAllocCount: 117 17:48:25 INFO - => mReallocCount: 2 17:48:25 INFO - => mFreeCount: 117 17:48:25 INFO - => mShareCount: 269 17:48:25 INFO - => mAdoptCount: 0 17:48:25 INFO - => mAdoptFreeCount: 0 17:48:25 INFO - => Process ID: 718, Thread ID: 140735298851584 17:48:25 INFO - Assertion failure: 0 == rv (22), pt_book.keyCreated=0, at /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:304 17:48:25 INFO - 0 libnss3.dylib 0x0000000105ea8477 PR_GetCurrentThread + 183 17:48:25 INFO - 1 XUL 0x0000000106532215 _ZNK7mozilla14VolatileBuffer26NonHeapSizeOfExcludingThisEv + 5257989 17:48:25 INFO - 2 XUL 0x000000010656bbe9 _ZNK7mozilla14VolatileBuffer26NonHeapSizeOfExcludingThisEv + 5493977 17:48:25 INFO - 3 libsystem_c.dylib 0x00007fff9686a72d __cxa_finalize_ranges + 322 17:48:25 INFO - 4 libsystem_c.dylib 0x00007fff9686aa30 exit + 55 17:48:25 INFO - 5 plugin-container 0x0000000105d5ce8b start + 59 17:48:25 INFO - 6 ??? 0x000000000000000c 0x0 + 12 https://archive.mozilla.org/pub/thunderbird/try-builds/mozilla@jorgk.com-1eff36626eb55c6e8f3dbe04c5093facdf9770f2/try-comm-central-macosx64-debug/try-comm-central_yosemite_r7-debug_test-xpcshell-bm106-tests1-macosx-build4.txt.gz has the same thing.
Flags: needinfo?(wtc)
Updated•7 years ago
|
OS: Unspecified → Mac OS X
Comment 39•7 years ago
|
||
(In reply to Jorg K (GMT+2) from comment #37) > There isn't a single reference to libnspr4 in comm-central, so I guess TB > does whatever FF does: The question is, does the list of binary files that are part of the thunderbird application, contain a file with that name?
Comment 40•7 years ago
|
||
Hi Jorg, The stack trace in your comment 38 is very useful. Could you also print the stack trace when _PR_Fini is called (i.e., after we print "_PR_Fini called\n" to stderr)? It is possible that all the NSPR and NSS code is put in a single dynamic shared library, libnss3.dylib. The stack trace when _PR_Fini is called should tell us the name of the dynamic shared library that contains the NSPR code. Is libsystem_c.dylib a system library of Mac OS X? Is "XUL" in the stack trace a dynamic shared library? We should use a C++ name demangler to recover the C++ symbol of _ZNK7mozilla14VolatileBuffer26NonHeapSizeOfExcludingThisEv.
Flags: needinfo?(wtc)
Reporter | ||
Comment 41•7 years ago
|
||
(In reply to Kai Engert (:kaie:) from comment #39) > (In reply to Jorg K (GMT+2) from comment #37) > > There isn't a single reference to libnspr4 in comm-central, so I guess TB > > does whatever FF does: > The question is, does the list of binary files that are part of the > thunderbird application, contain a file with that name? I really can't answer that. If libnspr4 doesn't show up in comm-central https://dxr.mozilla.org/comm-central/search?q=libnspr4&redirect=false we do whatever Firefox does. I also don't have a Mac and very little Mac knowledge. Perhaps one could download the TB app for Mac and see what's in in. (In reply to Wan-Teh Chang from comment #40) > The stack trace in your comment 38 is very useful. Could you > also print the stack trace when _PR_Fini is called (i.e., > after we print "_PR_Fini called\n" to stderr)? Coming up. > Is libsystem_c.dylib a system library of Mac OS X? As I said above, I have basically no Mac knowledge, but it appears that way: https://www.google.com.au/search?source=hp&q=libsystem_c.dylib&oq=libsystem_c.dylib
Comment 42•7 years ago
|
||
(In reply to Wan-Teh Chang from comment #40) > Is "XUL" in the stack trace a dynamic shared library? We should > use a C++ name demangler to recover the C++ symbol of > _ZNK7mozilla14VolatileBuffer26NonHeapSizeOfExcludingThisEv. https://dxr.mozilla.org/mozilla-central/source/memory/volatile/VolatileBufferFallback.cpp?q=%2Bfunction%3A%22mozilla%3A%3AVolatileBuffer%3A%3ANonHeapSizeOfExcludingThis%28%29+const%22&redirect_type=single#86
Comment 43•7 years ago
|
||
Whoops, betrayed by DXR's linux bias. https://dxr.mozilla.org/mozilla-central/rev/20d57b9c4183973af4af5e078dff2aec0b74f928/memory/volatile/VolatileBufferOSX.cpp#118 is the OSX implementation of that method.
Comment 44•7 years ago
|
||
(In reply to Kai Engert (:kaie:) from comment #39) > The question is, does the list of binary files that are part of the > thunderbird application, contain a file with that name? I don't have a Mac, but I was able to download thunderbird-58.0a1.en-US.mac.dmg, convert it using dmg2img, and mount it on Linux. find |egrep -i "nspr|nss" ./Daily.app/Contents/MacOS/libnss3.dylib ./Daily.app/Contents/MacOS/libnssckbi.dylib ./Daily.app/Contents/MacOS/libnssdbm3.dylib find . -name \*.dylib ./Daily.app/Contents/MacOS/libfreebl3.dylib ./Daily.app/Contents/MacOS/libldap60.dylib ./Daily.app/Contents/MacOS/libldif60.dylib ./Daily.app/Contents/MacOS/liblgpllibs.dylib ./Daily.app/Contents/MacOS/libmozglue.dylib ./Daily.app/Contents/MacOS/libnss3.dylib ./Daily.app/Contents/MacOS/libnssckbi.dylib ./Daily.app/Contents/MacOS/libnssdbm3.dylib ./Daily.app/Contents/MacOS/libplugin_child_interpose.dylib ./Daily.app/Contents/MacOS/libprldap60.dylib ./Daily.app/Contents/MacOS/libsoftokn3.dylib Apparently NSPR is contained inside libnss3.dylib
Comment 45•7 years ago
|
||
memory/volatile/VolatileBuffer.h contains: NS_INLINE_DECL_THREADSAFE_REFCOUNTING(VolatileBuffer) which probably causes a thread check on buffer destruction.
Comment 46•7 years ago
|
||
If libnss3.dylib containing nspr is unloaded, but a C symbol contained in libnss3.dylib is called after its unloaded, then either the automatic unloading of libraries is broken on OSX, or something deliberately unloads it at the wrong time.
Reporter | ||
Comment 47•7 years ago
|
||
Thanks for the action here. I've got a new try going and paste the results here when it's done: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=419c83c822f32c1a51426843917d386d38c53718
Comment 48•7 years ago
|
||
Jorg, I wonder if it still works to build NSPR in its own shared library (dylib) files. Could you please start a try build, with this as the only patch, and see if the build works, and you're still crashing?
Reporter | ||
Comment 49•7 years ago
|
||
Sorry, I messed up the "stack patch", here another one: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=0cf1a0022ad7e22bfc80c1cc078fe416a0cae731 Kai, I'll do your try next.
Reporter | ||
Comment 50•7 years ago
|
||
Here's the "no fold" try: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=3c1da749e3cf3ea7733645d7c118c3b6c32521ff
Reporter | ||
Comment 51•7 years ago
|
||
OK, here the stack result, I'll post the "no fold" when it's ready: https://archive.mozilla.org/pub/thunderbird/try-builds/mozilla@jorgk.com-0cf1a0022ad7e22bfc80c1cc078fe416a0cae731/try-comm-central-macosx64-debug/try-comm-central_yosemite_r7-debug_test-xpcshell-bm106-tests1-macosx-build5.txt.gz 14:24:01 INFO - PID 5626 | _PR_Fini called 14:24:01 INFO - PID 5626 | 0 libnss3.dylib 0x0000000100ab7624 PR_Sleep + 804 14:24:01 INFO - PID 5626 | 1 ??? 0x00007fff634d1fd0 0x0 + 140734859386832 14:24:01 INFO - PID 5626 | 2 ??? 0x00007fff634c12e2 0x0 + 140734859317986 14:24:01 INFO - PID 5626 | 3 libsystem_c.dylib 0x00007fff892b672d __cxa_finalize_ranges + 322 14:24:01 INFO - PID 5626 | 4 libsystem_c.dylib 0x00007fff892b6a30 exit + 55 14:24:01 INFO - PID 5626 | 5 xpcshell 0x000000010096bd9b start + 59 14:24:01 INFO - PID 5626 | 6 ??? 0x000000000000001f 0x0 + 31 14:24:01 INFO - PID 5626 | nsStringStats 14:24:01 INFO - PID 5626 | => mAllocCount: 27168 14:24:01 INFO - PID 5626 | => mReallocCount: 1140 14:24:01 INFO - PID 5626 | => mFreeCount: 27168 14:24:01 INFO - PID 5626 | => mShareCount: 19572 14:24:01 INFO - PID 5626 | => mAdoptCount: 626 14:24:01 INFO - PID 5626 | => mAdoptFreeCount: 626 14:24:01 INFO - PID 5626 | => Process ID: 5626, Thread ID: 140735093183232 14:24:01 INFO - PID 5626 | Assertion failure: 0 == rv (22), pt_book.keyCreated=0, at /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:304 14:24:01 INFO - PID 5626 | 0 libnss3.dylib 0x0000000100ab6d07 PR_GetCurrentThread + 183 14:24:01 INFO - PID 5626 | 1 XUL 0x0000000101140115 _ZNK7mozilla14VolatileBuffer26NonHeapSizeOfExcludingThisEv + 5258133 14:24:01 INFO - PID 5626 | 2 XUL 0x0000000101179ae9 _ZNK7mozilla14VolatileBuffer26NonHeapSizeOfExcludingThisEv + 5494121 14:24:01 INFO - PID 5626 | 3 libsystem_c.dylib 0x00007fff892b672d __cxa_finalize_ranges + 322 14:24:01 INFO - PID 5626 | 4 libsystem_c.dylib 0x00007fff892b6a30 exit + 55 14:24:01 INFO - PID 5626 | 5 xpcshell 0x000000010096bd9b start + 59 14:24:01 INFO - PID 5626 | 6 ??? 0x000000000000001f 0x0 + 31 14:24:01 INFO - PID 5626 | [5626, Main Thread] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/xpcom/base/nsTraceRefcnt.cpp, line 171 14:24:01 INFO - <<<<<<<
Reporter | ||
Comment 52•7 years ago
|
||
The "no fold" patch caused a package error: https://archive.mozilla.org/pub/thunderbird/try-builds/mozilla@jorgk.com-3c1da749e3cf3ea7733645d7c118c3b6c32521ff/try-comm-central-macosx64-debug/try-comm-central-macosx64-debug-bm87-try1-build15.txt.gz Error: /builds/slave/tb-try-c-cen-m64-d-00000000000/build/objdir-tb/mail/installer/package-manifest:77: Missing file(s): DailyDebug.app/Contents/Resources/libmozsqlite3.dylib Traceback (most recent call last): File "/builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/toolkit/mozapps/installer/packager.py", line 333, in <module> main() File "/builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/toolkit/mozapps/installer/packager.py", line 302, in main copier.add(mozpath.join(respath, 'removed-files'), removals) File "/tools/python/lib/python2.7/contextlib.py", line 24, in __exit__ self.gen.next() File "/builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/python/mozbuild/mozpack/errors.py", line 131, in accumulate raise AccumulatedErrors() mozpack.errors.AccumulatedErrors make[3]: *** [stage-package] Error 1
Comment 53•7 years ago
|
||
Comment 51 shows two stacks. __cxa_finalize_ranges calls the cleanup functions that are registered to be called at exit time. We see that the NSPR cleanup function _PR_Fini is called first, and afterwards another cleanup function gets called, which belongs to some XPCOM code, and which tries to use NSPR's thread-local-storage variables, which have already been destroyed. This means we have an ordering problem. (Can we assume that both calls are made from the main thread?) How many automatic cleanup functions does Mozilla use? IIUC on Mac we use the __attribute__ ((destructor)) mechanism to register the _PR_Fini unload function. Searching through the rest of Firefox code, I found two additional functions: security/nss/lib/freebl/stubs.c 728:void __attribute((destructor)) FREEBL_unload() intl/icu/source/common/ucln_imp.h 169:static void ucln_destructor() __attribute__((destructor)) ; Are there any occurrences of __attribute__((destructor)) in the Thunderbird specific code? The NSS function seems safe, it might only call dlclose(). The ucln function cleanup function is more complex, it loops through various other cleanup things. Either we must (a) find a way to guarantee that _PR_Fini is the last of all registered cleanup functions to be called (might not be possible), or (b) change ucln_destructor to not use the same cleanup mechanism as NSPR (probably difficult) or (c) Change GetCurrentThread() to be failsafe after having gone through _PR_Fini. or (d) change _PR_Fini to do nothing (dirty). Is it reasonable to hope that the "threadsafe reference counting" is the only NSPR service that these additional cleanup functions might call? If yes, we could try (c), and test if it's sufficient. Or could that other cleanup code call additional, more complex functions, like NSPR memory destruction functions? If yes, then we need either (a), (b), (d).
Comment 54•7 years ago
|
||
As a test to confirm that the race with the intl/icu code is indeed the culprit, you could test to disable that cleanup code. In ucln_imp.h, you could change both definitions of UCLN_CLEAN_ME_UP to do nothing.
Comment 55•7 years ago
|
||
(c) isn't trivial either. Changing GetCurrentThread to simply return NULL would result in the ownership check (comment 45) to fail and nsAutoOwningThread::AssertCurrentThreadOwnsMe would trigger a forced crash. Unless anyone has good ideas how to do (a) or (b), we'll have to use (d). _PR_Fini calls pthread_key_delete(). The man page says "Any attempt to use key following the call to pthread_key_delete() results in undefined behavior." But that's what's happening currently, PR_GetCurrentThread() calls _PT_PTHREAD_GETSPECIFIC() with the deleted key. Since we apparently cannot guarantee that the key might be used after _PR_Fini, we probably shouldn't delete it. And we probably should keep the thread identifiers alive, to ensure that late same-thread-checks will continue to work.
Comment 56•7 years ago
|
||
This is a patch for the dirty approach (d). Wan-Teh, do you think we should use this approach, or do you have ideas for (a) and (b), or other ideas?
Attachment #8918987 -
Attachment is obsolete: true
Comment 57•7 years ago
|
||
Kai: Thanks a lot for investigating this. Given Thunderbird's current status, I think it is fine to apply your disable-pr-fini.patch as a Thunderbird-local patch. I think __cxa_finalize_ranges may also call C++ global object destructors and atexit handlers, so it is not enough to just look for __attribute__((destructor)) in the source tree. It is strange that Mac OS X unloads libnss3.dylib (which contains the NSPR code) before it unloads code that uses NSPR. I would think that the operating system should know the dependencies between dynamic shared libraries and unload them in an appropriate order.
Comment 58•7 years ago
|
||
(In reply to Kai Engert (:kaie:) from comment #53) > > Either we must > (a) find a way to guarantee that _PR_Fini is the last of all registered > cleanup functions to be called (might not be possible), > or > (b) change ucln_destructor to not use the same cleanup mechanism as NSPR > (probably difficult) > or > (c) Change GetCurrentThread() to be failsafe after having gone through > _PR_Fini. > or > (d) change _PR_Fini to do nothing (dirty). Thanks for enumerating the possible solutions. We should not consider (c). (b) is speculation because we still haven't identified the function that calls PR_GetCurrentThread(). (a) is the proper solution, but it is not practical because no single person has the setup, expertise, and motivation to do it. (We need to at least identify the function that calls PR_GetCurrentThread() after _PR_Fini() has been called.) So I am afraid that we will need to do (d).
Comment 59•7 years ago
|
||
(In reply to Wan-Teh Chang from comment #57) > Given Thunderbird's current status, I think it is fine to > apply your disable-pr-fini.patch as a Thunderbird-local > patch. I'm not happy with the idea of adding local NSPR patches to applications. If we cannot think of a better fix, I'd prefer to add it to NSPR in general, potentially with #idef for Mac, only.
Updated•7 years ago
|
Summary: Various Xpshell and Mozmill failures in debug mode on Mac on 2017-09-13: PID 6709 | Assertion failure: 0 == rv, at mozilla/nsprpub/pr/src/pthreads/ptthread.c:292 → On Mac OSX, NSPR _PR_Fini gets called too early, other cleanup code calls NSPR functions afterwards, which fails | Assertion failure: 0 == rv, at mozilla/nsprpub/pr/src/pthreads/ptthread.c:292
Updated•7 years ago
|
Assignee: wtc → kaie
Component: General → NSPR
Product: Thunderbird → NSPR
Version: unspecified → other
Comment 60•7 years ago
|
||
Attachment #8919230 -
Attachment is obsolete: true
Attachment #8919369 -
Flags: review?(wtc)
Comment 61•7 years ago
|
||
Comment on attachment 8919369 [details] [diff] [review] 1399746-no-fini-on-osx.patch Jorg, could you please test this patch, to confirm it fixes the issue?
Attachment #8919369 -
Flags: feedback?(jorgk)
Comment 62•7 years ago
|
||
Comment on attachment 8919369 [details] [diff] [review] 1399746-no-fini-on-osx.patch Review of attachment 8919369 [details] [diff] [review]: ----------------------------------------------------------------- ::: pr/src/pthreads/ptthread.c @@ +1067,5 @@ > + /* We disable the cleanup code on Mac OSX, see bug 1399746. > + * The .dylib containing NSPR can get unloaded, and _PR_Fini called, > + * and other code calling NSPR functions can get executed afterwards. > + */ > +#ifndef DARWIN Kai: We don't need to do this for all applications; apparently Firefox has no problem with _PR_Fini() on Mac OS X. _PR_Fini() was added to support applications that load and unload. the NSPR shared library using dlopen() and dlclose(). Those applications may load and unload the NSPR shared library repeatedly. So the unloading of the NSPR shared library must free the pthread specific data key, otherwise the process will run out of pthread specific data keys eventually. I can add an NSPR test program to demonstrate this use case. This is why I recommend applying this patch to only Thunderbird. It is a pragmatic solution. Perhaps I should spend a weekend building Thunderbird on Mac OS X and tracking this down.
Comment 63•7 years ago
|
||
This patch adds a new test program, dlltest_unix.c, that loads and unloads libnspr4.so repeatedly. It demonstrates that use case that caused us to add _PR_Fini().
Comment 64•7 years ago
|
||
Wan-Teh, Thunderbird uses a shared codebase with Firefox. Thunderbird doesn't use its own copy of NSPR. Your suggestion would require someone to change the build system of Thunderbird to dynamically apply a patch on top of NSPR each time after they checkout the updated Firefox/Gecko base code. I don't recommend that, and I really don't want us to get into a habit of allowing applications to patch NSPR. They should never have to. And on some operating systems like Linux, where the NSPR library is shipped only once for the whole operating systems, applications aren't even able to use their own modifications.
Comment 65•7 years ago
|
||
What is supposed to happen, after the runtime environment unloaded a shared library? What happens to the global variables that reside inside that library? Apparently the nspr library was unloaded. But then other code called into NSPR, and the access to global variables like _pr_initialized was still possible? Does that mean the OSX runtime is completly broken? Shouldn't the attempt to access _pr_initialized after unload have caused some sort of memory access failure? Or does it mean that the values of static variables inside dynamic libraries will be preserved across reloads? Is that behavior different across operating systems? I suggest that on platforms, where the contents of static variables from unloaded variables may still be alive after an unload, we should disable the cleanup of the thread storage key allocation. If this is the case for Mac OSX, then it isn't incorrect to always skip _PR_Fini on Mac, because other applications might run into the same problem, even if we haven't yet seen it happen in other applications.
Reporter | ||
Comment 66•7 years ago
|
||
(In reply to Kai Engert (:kaie:) from comment #64) > Thunderbird uses a shared codebase with Firefox. Thunderbird doesn't use its > own copy of NSPR. Your suggestion would require someone to change the build > system of Thunderbird to dynamically apply a patch on top of NSPR each time > after they checkout the updated Firefox/Gecko base code. You can do conditional compiles with #ifdef MOZ_THUNDERBIRD (for examples search DXR for MOZ_THUNDERBIRD). I have read the arguments against this approach. I'll ship off a try with Kai's patch shortly.
Reporter | ||
Comment 67•7 years ago
|
||
Try run for "no fini" patch: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=5b4e8b4efefd0c4d9246a8fa9a19f463695677cf I'll report results tomorrow.
Comment 68•7 years ago
|
||
(In reply to Jorg K (GMT+2) from comment #30) > Marco and Andrew, should that JS error be fixed? > [JavaScript Warning: "ReferenceError: reference to undefined property > "result"" {file: "resource://gre/modules/IndexedDBHelper.jsm" line: 170}] IDBTransaction doesn't have a result property, I guess |txn.result| should be replaced with |event.target.result|
Flags: needinfo?(jvarga)
Reporter | ||
Comment 69•7 years ago
|
||
(In reply to Jan Varga [:janv] from comment #68) > IDBTransaction doesn't have a result property, I guess |txn.result| should > be replaced with |event.target.result| Filed bug 1409641 for that.
Flags: needinfo?(bugmail)
Reporter | ||
Comment 70•7 years ago
|
||
Comment on attachment 8919369 [details] [diff] [review] 1399746-no-fini-on-osx.patch Thanks, that fixes the test failures we were seeing, see comment #67 for link to try run. The X and Z are red for other reasons, the failure reported here has gone away.
Attachment #8919369 -
Flags: feedback?(jorgk) → feedback+
Comment 71•7 years ago
|
||
(In reply to Jorg K (GMT+2) from comment #66) > You can do conditional compiles with #ifdef MOZ_THUNDERBIRD (for examples > search DXR for MOZ_THUNDERBIRD). I think general libray code, like NSPR, shouldn't contain application specific defines.
Comment 72•7 years ago
|
||
Comment on attachment 8916753 [details] [diff] [review] Expand PR_ASSERT(0 == rv) in pt_AttachThread to debug Bug 1399746 marking this debugging patch as obsolete
Attachment #8916753 -
Attachment is obsolete: true
Attachment #8916753 -
Flags: review?(kaie)
Comment 73•7 years ago
|
||
(In reply to Wan-Teh Chang from comment #63) > This patch adds a new test program, dlltest_unix.c, that loads and > unloads libnspr4.so repeatedly. It demonstrates that use case that > caused us to add _PR_Fini(). Unfortuantely I don't have a 64-bit Mac and cannot test on recent OSX versions. Is _PR_Fini called for each unload on Mac, and do static variables like _pr_initialized still have the previous state after reload? Then the reload is broken anyway. Without a recent Mac, I cannot confirm that. But if NSPR code can be executed while the NSPR dylib is unloaded, we apparently are dealing with undefined states anyway, so I wouldn't expect that reloading works correctly on OSX.
Comment 74•7 years ago
|
||
Comment on attachment 8919369 [details] [diff] [review] 1399746-no-fini-on-osx.patch If we want this fixed in NSPR for Thunderbird 58, this must get r+'ed and landed into NSPR by the end of this week.
Attachment #8919369 -
Flags: review?(mh+mozilla)
Comment 75•7 years ago
|
||
It's my understand that although Thunderbird nightly tracks trunk, releases are tied to Firefox ESR releases, so the next major Thunderbird release is 59, not 58, which may take the pressure off slightly.
Reporter | ||
Comment 76•7 years ago
|
||
It would be nice to fix this asap, since it's busting our all debug tests on Mac. It's already affecting TB 58 Daily and TB 57 Beta since bug 1393710 which caused this failure landed on mozilla57.
Assignee | ||
Comment 77•7 years ago
|
||
Comment on attachment 8919369 [details] [diff] [review] 1399746-no-fini-on-osx.patch Review of attachment 8919369 [details] [diff] [review]: ----------------------------------------------------------------- This doesn't seem right. It shouldn't be possible for NSPR destructors to run before XUL ones have run. Or there's a serious bug in whatever version of OSX this is crashing on. Is this running on the same version as the one on Firefox automation? If not that could explain why Firefox is not affected. The stack traces in comment 51 unfortunately lack symbols for the most interesting frames. And even those that are printed out are dubious: __cxa_finalize_ranges + 322 means 322 bytes after __cxa_finalize_ranges. 322 bytes is rather large for a function, and that could actually be a totally different one, just that we don't have the symbols for them. It however doesn't matter too much for that one, because it's the same root for both calls, and the caller is exit, so we know it's probably related to running destructors. It's even worse in XUL, where the symbol being pointed at is _ZNK7mozilla14VolatileBuffer26NonHeapSizeOfExcludingThisEv + 5258133. 5MB after a symbol is certainly not that symbol. Then there are the unknown addresses like 0x00007fff634d1fd0 0x0 + 140734859386832. That would be the most interesting thing to figure out: *exactly* how _PR_Fini is being called. > How many automatic cleanup functions does Mozilla use? Many. __attribute__((destructor)) is not the only way to run destructors. C++ globals with destructors run some too.
Attachment #8919369 -
Flags: review?(mh+mozilla) → review-
Reporter | ||
Updated•7 years ago
|
Severity: normal → critical
Comment 78•7 years ago
|
||
I don't have a 64bit mac. Is there anyone with a 64bit mac, who has the skills and motivation to fix this bug within the next few weeks, in time for Firefox/Thunderbird 59? Any suggestions?
Comment 79•7 years ago
|
||
There's only 64bit mac. The Mac in the NSS CI are therefore 64bit.
Comment 80•7 years ago
|
||
Franziskus, we need a developer that physically controls a 64bit Mac for interactive debugging.
Reporter | ||
Comment 81•7 years ago
|
||
You can get a "loaner" (loan machine). I don't know the details.
Comment 82•7 years ago
|
||
(In reply to Jorg K (GMT+1) [currently bustage-fix only, no NI? or r?] from comment #81) > You can get a "loaner" (loan machine). I don't know the details. Who might know?
Comment 83•7 years ago
|
||
https://wiki.mozilla.org/ReleaseEngineering/How_To/Request_a_loaner (I have no experience with it)
Comment 84•7 years ago
|
||
That page says "To access all loaners, you require access to the Mozilla VPN" Does that mean this access is retricted to Mozilla employees?
Comment 85•7 years ago
|
||
So, I tried applying https://hg.mozilla.org/try/rev/38166cc000b9e7db208482e00e395f118876eaff (which is the patch from https://bugzilla.mozilla.org/show_bug.cgi?id=1399746#c19). Oddly, the results with the debugging patch are much better (although it still does trigger the error once: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=320e35814a22ddcca81595d3d69a08c4a738e282 (recent comm-central build for comparison: https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=28f105287badd523cc9e6cba0e33bc74b5f03dda&filter-searchStr=osx%20debug )
Comment 86•7 years ago
|
||
(In reply to Kai Engert (:kaie:) from comment #84) > That page says > "To access all loaners, you require access to the Mozilla VPN" > > Does that mean this access is retricted to Mozilla employees? I recall at least last winter someone from Thunderbird, (aleth or ewong?) got a loaner for investigations. I don't know if it changed, or the documentation is wrong.
Comment 87•7 years ago
|
||
It looks like this was fixed by Bug 1420407: https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=1654a1adc838380b2262f3981986d61cf9ee4a80&filter-searchStr=osx is green.
Reporter | ||
Comment 88•7 years ago
|
||
Indeed, the latest C-C push at https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=1654a1adc838380b2262f3981986d61cf9ee4a80 does not show those "Assertion failure: 0 == rv" failures any more.
Target Milestone: --- → 4.18
Reporter | ||
Updated•7 years ago
|
Attachment #8919369 -
Flags: review?(wtc)
Comment 89•7 years ago
|
||
That's great news, but very confusing, because that latest change to NSPR should only have affected the Windows platform.
Comment 90•7 years ago
|
||
I assumed based on https://bugzilla.mozilla.org/show_bug.cgi?id=1420407#c0 that this was expected. Was https://hg.mozilla.org/integration/mozilla-inbound/rev/798b87f30668#l16.10 not supposed to be included?
Flags: needinfo?(kaie)
Comment 91•7 years ago
|
||
That was added accidentally, it was still in my local tree when I added the other commit.
Flags: needinfo?(kaie)
Comment 92•7 years ago
|
||
So, I'm sorry that this code was accidentally checked in. I should have paid better attention. On the other hand, it's good to know that this workaround really helps to work around the bug, and apparently, because nobody else has reported any regressions yet, that it isn't causing any problems elsewhere. I suggest that we continue to look out for potential side effects. If nobody has the ability to work on a more correct fix for Mac, we could consider to keep this workaround until someone is able to.
Comment 93•7 years ago
|
||
Mike, you had given r- to the patch. It was landed accidentally, but seems to help. I'm willing to follow proper process and backout, and reactivate the bug. Alternatively, we could keep the patch, even though it isn't ideal. Mike, please let me know your preference. Thanks
Flags: needinfo?(mh+mozilla)
Comment 95•7 years ago
|
||
backed out: https://hg.mozilla.org/projects/nspr/rev/8d2ea9b5cc6d I'll land a NSPR beta 3 with bug 1420407 into m-i, so you'll probably see the bug showing up again. Reopening.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 99•6 years ago
|
||
So, with bug 1439973 providing a reliable way to reproduce this with Firefox, I was able to reproduce and attribute this the OSX dynamic linker. TL;DR: don't use __attribute__((destructor)) on OSX. Use atexit() or C++ class destructors on global instances instead. An smaller way to reproduce the issue: $ cat <<EOF > test.cc #include <stdio.h> #include <stdlib.h> #include <dlfcn.h> void my_atexit() { printf(__FILE__ " atexit\n"); } class Foo { public: ~Foo() { printf(__FILE__ " ~Foo\n"); } }; Foo f; __attribute__((destructor)) void destructor() { printf(__FILE__ " destructor\n"); } __attribute__((constructor)) void constructor() { atexit(my_atexit); printf(__FILE__ " constructor\n"); } int main() { void* handle = dlopen("libbar.dylib", RTLD_GLOBAL); dlclose(handle); return 0; } EOF $ cp test.cc foo.cc $ cp test.cc bar.cc $ clang++ -shared -o libfoo.dylib foo.cc $ clang++ -shared -o libbar.dylib bar.cc libfoo.dylib $ clang++ -o test test.cc So with this the test program dlopens libbar, which depends on libfoo. Running the program produces: $ ./test test.cc constructor foo.cc constructor bar.cc constructor bar.cc destructor foo.cc destructor bar.cc ~Foo bar.cc atexit foo.cc ~Foo foo.cc atexit test.cc destructor test.cc ~Foo test.cc atexit Constructors run in the expected order: test, libfoo, then libbar. Then we dlclose libbar and things start being funny: destructors run in the "right" order ... if they were the only thing to run. But atexit and C++ destructors for libbar run *after* the libfoo destructor. In our case, libbar is XUL and libfoo is nspr. Variant: $ clang++ -o test test.cc libbar.dylib This time libbar is linked to test directly, so dlopen/dlclose have no effect. That's essentially how the plugin-container is linked, contrary to the firefox binary. $ ./test foo.cc constructor bar.cc constructor test.cc constructor test.cc destructor bar.cc destructor foo.cc destructor test.cc ~Foo test.cc atexit bar.cc ~Foo bar.cc atexit foo.cc ~Foo foo.cc atexit
Assignee | ||
Updated•6 years ago
|
Assignee: kaie → mh+mozilla
Assignee | ||
Comment 100•6 years ago
|
||
atexit on its own is not enough, though, atexit must be called before anything in XUL calls atexit/cxa_atexit. I did a try build with an atexit(_PR_Fini) in _PR_InitThreads, and that didn't work out because _PR_Fini would still be called before some XUL destructors.
Assignee | ||
Comment 102•6 years ago
|
||
Theoretically, we could use atexit on all platforms (even HPUX, AIX and Solaris support it). The problem is that while I do know that Solaris actually does the right thing with it, I don't feel like checking whether all those OSes supported here do. So I'd rather assume the current code works fine for those other platforms, and make the fix OSX-only.
Attachment #8919369 -
Attachment is obsolete: true
Attachment #8952948 -
Flags: review?(ted)
Assignee | ||
Comment 103•6 years ago
|
||
Attachment #8952948 -
Attachment is obsolete: true
Attachment #8952948 -
Flags: review?(ted)
Attachment #8952949 -
Flags: review?(ted)
Comment 104•6 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=7fd126b89e382a4936b780efffaa159aaaa5b1b1 should be a build of comm-central with that patch applied.
Comment 105•6 years ago
|
||
Comment on attachment 8952949 [details] [diff] [review] Use atexit() to register _PR_Fini on OSX Mike, thanks a lot for tracking this down. If we want this to be part of Firefox/Thunderbird 60, we should hurry up to get this landed. We're currently considering next monday as the dev freeze date for NSPR 4.19 and NSS 3.36, which are the versions that moz 60 will use.
Attachment #8952949 -
Flags: review+
Assignee | ||
Comment 106•6 years ago
|
||
https://hg.mozilla.org/projects/nspr/rev/875fee4c031efeb1af4f48db37669b96a749ac1d
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED
Updated•6 years ago
|
Target Milestone: 4.18 → 4.19
Comment 107•6 years ago
|
||
Thanks for the fix here. In order to unblock bug 1353461 we'll need to either: 1. Wait for nspr 4.19 to merge to central 2. Cherry-pick this fix into central 3. Disable the reftest manifest that triggers this Mike, do you know roughly when we might merge nspr 4.19 in? Is there any reason not to cherry-pick this in the meantime?
Flags: needinfo?(mh+mozilla)
Comment 108•6 years ago
|
||
I'll uplift a new nspr beta snapshot to m-c now.
Flags: needinfo?(mh+mozilla)
Comment 109•6 years ago
|
||
Thanks!
Updated•6 years ago
|
Attachment #8952949 -
Flags: review?(ted)
Comment hidden (Intermittent Failures Robot) |
You need to log in
before you can comment on or make changes to this bug.
Description
•