Closed Bug 1399746 Opened 7 years ago Closed 6 years ago

On Mac OSX, NSPR _PR_Fini gets called too early, other cleanup code calls NSPR functions afterwards, which fails | Assertion failure: 0 == rv, at mozilla/nsprpub/pr/src/pthreads/ptthread.c:292

Categories

(NSPR :: NSPR, defect)

Unspecified
macOS
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jorgk-bmo, Assigned: glandium)

References

Details

(Whiteboard: [Thunderbird-testfailure: XZ Mac only])

Attachments

(2 files, 5 obsolete files)

Xpcshell:
TEST-UNEXPECTED-FAIL | dom/push/test/xpcshell/test_permissions.js | xpcshell return code: -11
PID 6447 | Assertion failure: 0 == rv, at /builds/slave/tb-c-cen-m64-d-000000000000000/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:292
and many more.

Mozmill:
Assertion failure: 0 == rv, at /builds/slave/tb-c-cen-m64-d-000000000000000/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:292 [log…]
PROCESS-CRASH | content-policy | application crashed [@ mozalloc_abort(char const*)]
Those crashed started here
Wed Sep 13, 2017, 15:16:44 but only on Mac.
Wed Sep 13, 2017, 20:08:24 only on Mac,
Wed Sep 13, 2017, 23:45:36 on Mac and Windows. Strange.

Going by the first failure:
M-C last good: 9873269746359dbcccbd5b44f0e82c0225
M-C first bad: 1888ec2f277f6bb26271b8808e08914a21

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=9873269746359dbcccbd5b44f0e82c0225&tochange=1888ec2f277f6bb26271b8808e08914a21

Looks like I have to reproduce it locally to figure out what's going on.
Summary: Various Xpshell and Mozmill failures in debug mode on 2017-09-14: PID 6709 | Assertion failure: 0 == rv, at mozilla/nsprpub/pr/src/pthreads/ptthread.c:292 → Various Xpshell and Mozmill failures in debug mode on 2017-09-13: PID 6709 | Assertion failure: 0 == rv, at mozilla/nsprpub/pr/src/pthreads/ptthread.c:292
The failures are Mac only, nothing I can reproduce locally here. The Windows debug X is green, if it's not, it's cause by something else, the Windows debug Z is green or orange due to bug 1377692 or something else. The Windows debug Z orange comes and goes.
Whiteboard: [Thunderbird-testfailure: XZ Mac only]
FWIW I suspect from https://hg.mozilla.org/mozilla-central/rev/a104181ac18b.

That adds a PR_GetCurrentThread call that can happen very early at startup. Not sure why should pthreads fail though.
Failing pt_AttachThread() is called from PR_GetCurrentThread(). A call to PR_GetCurrentThread() was added in Bug 1393710 during the regression range.
James and Nathan, do you feel responsible ;-)
Flags: needinfo?(nfroyd)
Flags: needinfo?(jacheng)
(In reply to Stefan Sitter from comment #4)
> Failing pt_AttachThread() is called from PR_GetCurrentThread(). A call to
> PR_GetCurrentThread() was added in Bug 1393710 during the regression range.

No. :)

Looks like the only way pt_AttachThread() can fail is if NSPR isn't initialized.  Perhaps adding a PR_Initialize call somewhere in the Thunderbird startup sequence would be appropriate.  Would also be good to try and figure out how Thunderbird startup is so different from Firefox in this respect.
Flags: needinfo?(nfroyd)
Flags: needinfo?(jacheng)
Our mainline at
https://dxr.mozilla.org/comm-central/source/mail/app/nsMailApp.cpp
it almost identical to
https://dxr.mozilla.org/mozilla-central/source/browser/app/nsBrowserApp.cpp

So were in the startup sequence is the call to PR_Initialize in FF?

pt_AttachThread() asserts after calling |pthread_setspecific(pt_book.key, thred);| and there is a |if (!_pr_initialized) return NULL;| before, so if not initialised, it would have returned early.

Can you please clarify your analysis. Also, why only problems on Mac?
Flags: needinfo?(nfroyd)
(In reply to Jorg K (GMT+2) from comment #7)
> Our mainline at
> https://dxr.mozilla.org/comm-central/source/mail/app/nsMailApp.cpp
> it almost identical to
> https://dxr.mozilla.org/mozilla-central/source/browser/app/nsBrowserApp.cpp
> 
> So were in the startup sequence is the call to PR_Initialize in FF?

I don't think we ever call it.

> pt_AttachThread() asserts after calling |pthread_setspecific(pt_book.key,
> thred);| and there is a |if (!_pr_initialized) return NULL;| before, so if
> not initialised, it would have returned early.
> 
> Can you please clarify your analysis. Also, why only problems on Mac?

Oh, I see, we're asserting inside pt_AttachThread().  I don't know why _PR_InitThreads wouldn't have gotten called, then...unless somebody is racing the _PR_InitThreads call, which seems like a possibility.

I don't know why it's only a problem on Mac.
Flags: needinfo?(nfroyd)
I'm not sure what makes you think a missing call to _PR_InitThreads is to blame here. I can't see where any of this stuff is initialised and I also can't see where pthread_setspecific() is defined and why it would fail. I don't even know what NSPR is.
(In reply to Jorg K (GMT+2) from comment #9)
> I'm not sure what makes you think a missing call to _PR_InitThreads is to
> blame here. I can't see where any of this stuff is initialised and I also
> can't see where pthread_setspecific() is defined and why it would fail. I
> don't even know what NSPR is.

OK, so, this is the line that is failing:

        rv = pthread_setspecific(pt_book.key, thred);
        PR_ASSERT(0 == rv);

So the pthread_setspecific call is failing.  When does pthread_setspecific return errors?  Consulting:

http://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_setspecific.html

we see:

     The pthread_setspecific() function will fail if:

    [ENOMEM]
        Insufficient memory exists to associate the value with the key. 

    The pthread_setspecific() function may fail if:

    [EINVAL]
        The key value is invalid. 

    No errors are returned from pthread_getspecific().

My guess is that we have sufficient memory, so we're probably hitting the EINVAL case.  The Linux man page is a little more helpful about EINVAL:

       pthread_key_delete and pthread_setspecific return the following error  code  on
       error:

              EINVAL key is not a valid, allocated TSD key

so if we're getting EINVAL, that means we haven't called allocated the key, and we can only allocate keys via pthread_key_create.

OK, so where does pthread_key_create get called?  We must only call it in one place, and indeed, we call it in _PR_InitThreads:

http://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptthread.c#992

I don't know why it gets hidden behind _PT_PTHREAD_KEY_CREATE, but that's NSPR for you.  If the pthread_key_create call had failed, we would have asserted there, so we must not have made that call yet.

How could we have not made that call?  _PR_InitThreads gets called from _PR_InitStuff:

http://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/misc/prinit.c#180

and _PR_InitStuff gets called from PR_Init and _PR_ImplicitInitialization (which is called by many NSPR functions if PR_Init has not been called):

http://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/misc/prinit.c#266
http://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/misc/prinit.c#219

so NSPR didn't get initialized?  That's really weird.  Wait, no, it *did* get initialized because in pt_AttachThread:

http://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptthread.c#269

we would have returned early if NSPR hadn't been initialized.  So, NSPR has been initialized, but we haven't called _PR_InitThreads?  That's really weird.  How could that possibly happen?  Oh, here's a problem:

http://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/misc/prinit.c#141

_PR_InitStuff will return if it thinks it's already been called.  But it sets _pr_initialized *before* it's actually called all the requisite initialization functions.  That's not good; it means we can have a scenario like the following:

Thread 1: Call into _PR_InitStuff via some path
Thread 1: Set _pr_initialized = PR_TRUE
...before Thread 1 calls _PR_InitThread from _PR_InitStuff...
Thread 2: Call pt_AttachThread
Thread 2: Check _pr_initialized, find that it's true
Thread 2: Try to use pt_book.key, find that it hasn't been initialized, assert

I don't know that this is exactly what's happening--for one, I don't know why there would be two threads running around during Thunderbird startup like this.  But something like this must be going wrong in Thunderbird startup...but I don't know why Thunderbird startup would be any different than Firefox startup in this respect.  Figuring out what thread NS_DebugBreak is getting called on would be a good start; the logs that comment 0 came from aren't linked in the bug, so I can't see those.
I managed to open the 27MB 320282-line log in Chrome but there's nothing interesting:
05:34:23     INFO -  PID 6943 | Assertion failure: 0 == rv, at /builds/slave/tb-c-cen-m64-d-000000000000000/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:292
05:34:23     INFO -  PID 6943 | Redirecting call to abort() to mozalloc_abort
05:34:23     INFO -  PID 6943 | Hit MOZ_CRASH() at /builds/slave/tb-c-cen-m64-d-000000000000000/build/mozilla/memory/mozalloc/mozalloc_abort.cpp:33
Huh, you're not getting crash stacks on those errors.  That's unfortunate.

Oh, this seems to be happening during shutdown.  Maybe we're trying to get the current thread after we've already shut down NSPR, or something?  That would be a lot more believable than the racy scenario outlined above.  Perhaps the code in NS_DebugBreak should be updated to not try and get the current thread after XPCOM itself has shutdown, or something like that.
So who could do that?
Flags: needinfo?(nfroyd)
(In reply to Jorg K (GMT+2) from comment #14)
> So who could do that?

You could.
Flags: needinfo?(nfroyd)
I wonder what made you say that. You may know that I'm the only person on the Thunderbird team making sure that Thunderbird remains buildable. I'm a jack-of-all-trades (master of none), a mix of secretary, sheriff and developer. I know absolutely nothing about NSPR (which appears to be some sort of Mozilla sub-module). So I don't think I have the right skill set and I also don't have a Mac where that appears to be failing.

In general, Thunderbird uses the Mozilla "toolbox" and that toolbox just went bust a bit. So while of course we're sitting on a pile of legacy code that breaks all the time and we address it, see those leaking atoms in January 2017), here I don't see the urge to go and fix the toolbox, à la (quote) "Perhaps the code in NS_DebugBreak should be updated to not try and get the current thread after XPCOM itself has shutdown, or something like that."

I'm still remembering your comment in bug 1334771 comment #72 (quote):
===
I know it's frustrating to have all these messages explode on some random merge.  The messages are telling you that there is a problem, and that there's an opportunity to make TB better.
===

Can you convince me that this is an opportunity to make TB better? Then I will address the issue.
Summary: Various Xpshell and Mozmill failures in debug mode on 2017-09-13: PID 6709 | Assertion failure: 0 == rv, at mozilla/nsprpub/pr/src/pthreads/ptthread.c:292 → Various Xpshell and Mozmill failures in debug mode on Mac on 2017-09-13: PID 6709 | Assertion failure: 0 == rv, at mozilla/nsprpub/pr/src/pthreads/ptthread.c:292
Kai, Wan-Teh and Mike: For a while we couldn't build Thunderbird on Mac, but now that it's working again we see the crashes in debug builds:
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=48a4c3f584cc13bef1fc1c405c7edb312dc7995e&selectedJob=135549592
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=48a4c3f584cc13bef1fc1c405c7edb312dc7995e&selectedJob=135549591

Nathan suggested that I debug it but it's really beyond me. The error started with landing of bug 1393710
https://hg.mozilla.org/mozilla-central/rev/a104181ac18b (see comment #3).

Could you please advise.
Flags: needinfo?(wtc)
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(kaie)
Jorg: Let's check in this patch to get the information that
will validate our hypothesis.

This patch expands the PR_ASSERT(0 == rv) macro, and also
prints the values of 'rv' and pt_book.keyCreated. We expect
rv to be EINVAL. If pt_book.keyCreated is 0 (PR_FALSE), then
this validates our hypothesis that NSPR has been shut down
by _PR_Fini, which does not set  _pr_initialized to PR_FALSE
(that line is commented out).

I haven't checked in a patch recently, so I will need someone
else to push this patch for me. Kai, could you do this for me?
Assignee: nobody → wtc
Flags: needinfo?(wtc)
Attachment #8916753 - Flags: review?(kaie)
Hi, thanks for your help here. I'll push this to our C-C try on a Mac debug build where the error occurs, OK?

Here it is:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=dfa9ce9fe1de66800c8fdfc4eb6f48cd2d26e142

That applies your patch to the m-c (mozilla) directory for a Thunderbird build and you can click on the X and Z later to see the logs.
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(kaie)
I had to do this again since your patch didn't apply, it was missing the nsprpub/. Here we go again:

https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=adec0e2d6bfe2ba70b2dbe5171d9fb22df8f414f
As you can see here
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=adec0e2d6bfe2ba70b2dbe5171d9fb22df8f414f&selectedJob=135808586
your debug says: Assertion failure: 0 == rv (22), pt_book.keyCreated=0
22 seems to be EINVAL.

So what's next?
Hi Jorg:

Yes, 22 is EINVAL. The next step is to track down how NSPR is
shut down and what code calls PR_GetCurrentThread() after NSPR
has been shut down.

NSPR can be shut down in two ways.

1. Via an explicit PR_Cleanup() call. PR_Cleanup() resets
_pr_initialized to false.

2. By the "fini" function of the NSPR dynamic shared library
(or the dynamic shared library that contains NSPR). In this
case, the _PR_Fini function in nsprpub/pr/src/pthreads/ptthread.c
is called. _PR_Fini does not modify _pr_initialized.

Since _pr_initialized is still true, we know NSPR is shut down
by the second method. So some code calls PR_GetCurrentThread()
after the NSPR dynamic shared library (or the dynamic shared
library that contains NSPR) has been unloaded.
(In reply to Wan-Teh Chang from comment #23)
> Yes, 22 is EINVAL. The next step is to track down how NSPR is
> shut down and what code calls PR_GetCurrentThread() after NSPR
> has been shut down.
So how do you suggest this is done? I'm the Thunderbird maintainer and know nothing about NSPR. All I know is that this problem started occurring after this was landed, apparently for debugging purposes:
https://hg.mozilla.org/mozilla-central/rev/a104181ac18b#l1.32
As noted above, we only see the problem on Mac.

> Since _pr_initialized is still true, we know NSPR is shut down
> by the second method. So some code calls PR_GetCurrentThread()
> after the NSPR dynamic shared library (or the dynamic shared
> library that contains NSPR) has been unloaded.
I guess something along the lines of comment #13 needs to be done. Once NSPR is shut down, PR_GetCurrentThread() can't be called any more.
Hi Jorg:

I am very sympathetic. The sad reality is that no single person involved
has the setup, expertise, and motivation to debug this. So we just need
to do as much as we can.

1. We should get a stack trace of the code that calls NS_DebugBreak by
changing the PR_Assert in my patch to a crash. Please replace the
following line in my patch:

+            PR_Assert(msg, __FILE__, __LINE__);

with:

+            fprintf(stderr, "Assertion failure: %s, at %s:%d\n", msg, __FILE__, __LINE__);
+            fflush(stderr);
+            { int *p = NULL; *p = 0; }

2. We can add log messages to verify that PR_Cleanup() is not called
but _PR_Fini() is called. Both of these functions are defined in
nsprpub/pr/src/pthreads/ptthread.c. Please add:

    fprintf(stderr, "_PR_Fini called\n");
    fflush(stderr);

to the beginning of _PR_Fini() and add a null pointer dereference:

    { int *p = NULL; *p = 0; }

to the beginning of PR_Cleanup().
OK, try run with these changes:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=f324af5881352700c28f261aa5cdb6ee107957f1

So the idea is to crash pt_AttachThread() to get a stack dump and expect NS_DebugBreak() somewhere on the stack?
Yes, we want to get a stack trace (by forcing a crash) that shows what code
is calling NS_DebugBreak(), which triggers the PR_ASSERT failure in
pt_AttachThread().
I see the connection now: NS_DebugBreak() calls PR_GetCurrentThread() and that calls pt_AttachThread().

Seems unfortunate that PR_GetCurrentThread() does some initialisation stuff if called during shutdown. But let's see what the call stack is.
Hmm, the result is not very conclusive:
https://archive.mozilla.org/pub/thunderbird/try-builds/mozilla@jorgk.com-f324af5881352700c28f261aa5cdb6ee107957f1/try-comm-central-macosx64-debug/try-comm-central_yosemite_r7-debug_test-xpcshell-bm106-tests1-macosx-build3.txt.gz
(opening that hangs my Firefox, so be careful). I see:

20:57:10     INFO -  exiting test
20:57:10     INFO -  PID 6462 | JavaScript strict warning: resource://gre/modules/IndexedDBHelper.jsm, line 170: ReferenceError: reference to undefined property "result"
20:57:10     INFO -  "CONSOLE_MESSAGE: (warn) [JavaScript Warning: "ReferenceError: reference to undefined property "result"" {file: "resource://gre/modules/IndexedDBHelper.jsm" line: 170}]"
20:57:10     INFO -  PID 6462 | [6462, Main Thread] WARNING: OOPDeinit() without successful OOPInit(): file /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/toolkit/crashreporter/nsExceptionHandler.cpp, line 3562
20:57:10     INFO -  PID 6462 | _PR_Fini called
20:57:10     INFO -  PID 6462 | nsStringStats
20:57:10     INFO -  PID 6462 |  => mAllocCount:           9146
20:57:10     INFO -  PID 6462 |  => mReallocCount:          140
20:57:10     INFO -  PID 6462 |  => mFreeCount:            9146
20:57:10     INFO -  PID 6462 |  => mShareCount:           6841
20:57:10     INFO -  PID 6462 |  => mAdoptCount:            404
20:57:10     INFO -  PID 6462 |  => mAdoptFreeCount:        404
20:57:10     INFO -  PID 6462 |  => Process ID: 6462, Thread ID: 140735196943104
20:57:10     INFO -  PID 6462 | Assertion failure: 0 == rv (22), pt_book.keyCreated=0, at /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:299
20:57:10     INFO -  <<<<<<<
20:57:10     INFO -  TEST-START | dom/push/test/xpcshell/test_clearAll_successful.js

but no stack. Others look like

20:57:14     INFO -  exiting test
20:57:14     INFO -  PID 6464 | [6464, Main Thread] WARNING: unable to post continuation event: file /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/xpcom/io/nsStreamUtils.cpp, line 485
20:57:14     INFO -  PID 6464 | [6464, Main Thread] WARNING: some msg dbs left open: '!m_dbCache.Length()', file /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mailnews/db/msgdb/src/nsMsgDatabase.cpp, line 84
20:57:14     INFO -  PID 6464 | db left open /private/var/folders/n0/6tpw2p6j3vx8fzyb0wrmlcwc00000w/T/xpc-profile-AEQFF0/ImapMail/localhost/gabba0.msf
20:57:14     INFO -  PID 6464 | db left open /private/var/folders/n0/6tpw2p6j3vx8fzyb0wrmlcwc00000w/T/xpc-profile-AEQFF0/ImapMail/localhost/Trash.msf
20:57:14     INFO -  PID 6464 | db left open /private/var/folders/n0/6tpw2p6j3vx8fzyb0wrmlcwc00000w/T/xpc-profile-AEQFF0/ImapMail/localhost/Trash.msf
20:57:14     INFO -  PID 6464 | [6464, Main Thread] WARNING: OOPDeinit() without successful OOPInit(): file /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/toolkit/crashreporter/nsExceptionHandler.cpp, line 3562
20:57:14     INFO -  PID 6464 | _PR_Fini called
20:57:14     INFO -  PID 6464 | nsStringStats
20:57:14     INFO -  PID 6464 |  => mAllocCount:          13771
20:57:14     INFO -  PID 6464 |  => mReallocCount:          473
20:57:14     INFO -  PID 6464 |  => mFreeCount:           13760  --  LEAKED 11 !!!
20:57:14     INFO -  PID 6464 |  => mShareCount:          10692
20:57:14     INFO -  PID 6464 |  => mAdoptCount:           3108
20:57:14     INFO -  PID 6464 |  => mAdoptFreeCount:       3107  --  LEAKED 1 !!!
20:57:14     INFO -  PID 6464 |  => Process ID: 6464, Thread ID: 140735196943104
20:57:14     INFO -  PID 6464 | Assertion failure: 0 == rv (22), pt_book.keyCreated=0, at /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:299
20:57:14     INFO -  <<<<<<<

So it looks like it's exiting a test and wants to report some error that occurred during the test.

Please advise further steps.
Marco and Andrew, should that JS error be fixed?
[JavaScript Warning: "ReferenceError: reference to undefined property "result"" {file: "resource://gre/modules/IndexedDBHelper.jsm" line: 170}]
Flags: needinfo?(mak77)
Flags: needinfo?(bugmail)
The Mozmill log isn't more conclusive:
https://archive.mozilla.org/pub/thunderbird/try-builds/mozilla@jorgk.com-f324af5881352700c28f261aa5cdb6ee107957f1/try-comm-central-macosx64-debug/try-comm-central_yosemite_r7-debug_test-mozmill-bm106-tests1-macosx-build2.txt.gz

20:10:00     INFO -  _PR_Fini called
20:10:00     INFO -  nsStringStats
20:10:00     INFO -   => mAllocCount:            117
20:10:00     INFO -   => mReallocCount:            2
20:10:00     INFO -   => mFreeCount:             117
20:10:00     INFO -   => mShareCount:            269
20:10:00     INFO -   => mAdoptCount:              0
20:10:00     INFO -   => mAdoptFreeCount:          0
20:10:00     INFO -   => Process ID: 722, Thread ID: 140735235392256
20:10:00     INFO -  Assertion failure: 0 == rv (22), pt_book.keyCreated=0, at /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:299

Instead of a home-made crash dereferencing null, should we just do a MOZ_CRASH() or similar?
I don't know much about indexedDB, forwarding to Jan.
Flags: needinfo?(mak77) → needinfo?(jvarga)
(In reply to Jorg K (GMT+2) from comment #32)
> With MOZ_CRASH():
That didn't build:
Undefined symbols for architecture x86_64:
  "_MOZ_CRASH", referenced from:
      _PR_GetCurrentThread in ptthread.o
      _PR_Cleanup in ptthread.o
ld: symbol(s) not found for architecture x86_64
Wan-Teh, what would be the next step?
Flags: needinfo?(wtc)
Jorg: If we can't get a stack trace by dereferencing a null pointer,
we can find out if Mac OS X has a function that prints the current
stack trace.

Also, could you find out if Thunderbird has an NSPR dynamic shared
library? The file name would be libnspr4.dylib. Firefox right now
puts the NSPR code into some other dynamic shared library.

If Thunderbird puts the NSPR code into some other dynamic shared
library, then there may be multiple "destructor" functions in that
dynamic shared library. If so, we need to make sure _PR_Fini is
called after any other "destructor" function that may call an NSPR
function. This probably can be accomplished by passing NSPR's object
files last to the linker.
Flags: needinfo?(wtc)
Here is something that looks like a stack:
https://archive.mozilla.org/pub/thunderbird/try-builds/mozilla@jorgk.com-1eff36626eb55c6e8f3dbe04c5093facdf9770f2/try-comm-central-macosx64-debug/try-comm-central_yosemite_r7-debug_test-mozmill-bm106-tests1-macosx-build3.txt.gz

17:48:25     INFO -  _PR_Fini called
17:48:25     INFO -  nsStringStats
17:48:25     INFO -   => mAllocCount:            117
17:48:25     INFO -   => mReallocCount:            2
17:48:25     INFO -   => mFreeCount:             117
17:48:25     INFO -   => mShareCount:            269
17:48:25     INFO -   => mAdoptCount:              0
17:48:25     INFO -   => mAdoptFreeCount:          0
17:48:25     INFO -   => Process ID: 718, Thread ID: 140735298851584
17:48:25     INFO -  Assertion failure: 0 == rv (22), pt_book.keyCreated=0, at /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:304
17:48:25     INFO -  0   libnss3.dylib                       0x0000000105ea8477 PR_GetCurrentThread + 183
17:48:25     INFO -  1   XUL                                 0x0000000106532215 _ZNK7mozilla14VolatileBuffer26NonHeapSizeOfExcludingThisEv + 5257989
17:48:25     INFO -  2   XUL                                 0x000000010656bbe9 _ZNK7mozilla14VolatileBuffer26NonHeapSizeOfExcludingThisEv + 5493977
17:48:25     INFO -  3   libsystem_c.dylib                   0x00007fff9686a72d __cxa_finalize_ranges + 322
17:48:25     INFO -  4   libsystem_c.dylib                   0x00007fff9686aa30 exit + 55
17:48:25     INFO -  5   plugin-container                    0x0000000105d5ce8b start + 59
17:48:25     INFO -  6   ???                                 0x000000000000000c 0x0 + 12

https://archive.mozilla.org/pub/thunderbird/try-builds/mozilla@jorgk.com-1eff36626eb55c6e8f3dbe04c5093facdf9770f2/try-comm-central-macosx64-debug/try-comm-central_yosemite_r7-debug_test-xpcshell-bm106-tests1-macosx-build4.txt.gz
has the same thing.
Flags: needinfo?(wtc)
OS: Unspecified → Mac OS X
(In reply to Jorg K (GMT+2) from comment #37)
> There isn't a single reference to libnspr4 in comm-central, so I guess TB
> does whatever FF does:

The question is, does the list of binary files that are part of the thunderbird application, contain a file with that name?
Hi Jorg,

The stack trace in your comment 38 is very useful. Could you
also print the stack trace when _PR_Fini is called (i.e.,
after we print "_PR_Fini called\n" to stderr)?

It is possible that all the NSPR and NSS code is put in a
single dynamic shared library, libnss3.dylib. The stack trace
when _PR_Fini is called should tell us the name of the dynamic
shared library that contains the NSPR code.

Is libsystem_c.dylib a system library of Mac OS X?

Is "XUL" in the stack trace a dynamic shared library? We should
use a C++ name demangler to recover the C++ symbol of
_ZNK7mozilla14VolatileBuffer26NonHeapSizeOfExcludingThisEv.
Flags: needinfo?(wtc)
(In reply to Kai Engert (:kaie:) from comment #39)
> (In reply to Jorg K (GMT+2) from comment #37)
> > There isn't a single reference to libnspr4 in comm-central, so I guess TB
> > does whatever FF does:
> The question is, does the list of binary files that are part of the
> thunderbird application, contain a file with that name?
I really can't answer that. If libnspr4 doesn't show up in comm-central
https://dxr.mozilla.org/comm-central/search?q=libnspr4&redirect=false
we do whatever Firefox does. I also don't have a Mac and very little Mac knowledge. Perhaps one could download the TB app for Mac and see what's in in.

(In reply to Wan-Teh Chang from comment #40)
> The stack trace in your comment 38 is very useful. Could you
> also print the stack trace when _PR_Fini is called (i.e.,
> after we print "_PR_Fini called\n" to stderr)?
Coming up.

> Is libsystem_c.dylib a system library of Mac OS X?
As I said above, I have basically no Mac knowledge, but it appears that way:
https://www.google.com.au/search?source=hp&q=libsystem_c.dylib&oq=libsystem_c.dylib
(In reply to Wan-Teh Chang from comment #40)
> Is "XUL" in the stack trace a dynamic shared library? We should
> use a C++ name demangler to recover the C++ symbol of
> _ZNK7mozilla14VolatileBuffer26NonHeapSizeOfExcludingThisEv.

https://dxr.mozilla.org/mozilla-central/source/memory/volatile/VolatileBufferFallback.cpp?q=%2Bfunction%3A%22mozilla%3A%3AVolatileBuffer%3A%3ANonHeapSizeOfExcludingThis%28%29+const%22&redirect_type=single#86
(In reply to Kai Engert (:kaie:) from comment #39)
> The question is, does the list of binary files that are part of the
> thunderbird application, contain a file with that name?

I don't have a Mac, but I was able to download thunderbird-58.0a1.en-US.mac.dmg, convert it using dmg2img, and mount it on Linux.

find |egrep -i "nspr|nss"
./Daily.app/Contents/MacOS/libnss3.dylib
./Daily.app/Contents/MacOS/libnssckbi.dylib
./Daily.app/Contents/MacOS/libnssdbm3.dylib

find . -name \*.dylib
./Daily.app/Contents/MacOS/libfreebl3.dylib
./Daily.app/Contents/MacOS/libldap60.dylib
./Daily.app/Contents/MacOS/libldif60.dylib
./Daily.app/Contents/MacOS/liblgpllibs.dylib
./Daily.app/Contents/MacOS/libmozglue.dylib
./Daily.app/Contents/MacOS/libnss3.dylib
./Daily.app/Contents/MacOS/libnssckbi.dylib
./Daily.app/Contents/MacOS/libnssdbm3.dylib
./Daily.app/Contents/MacOS/libplugin_child_interpose.dylib
./Daily.app/Contents/MacOS/libprldap60.dylib
./Daily.app/Contents/MacOS/libsoftokn3.dylib

Apparently NSPR is contained inside libnss3.dylib
memory/volatile/VolatileBuffer.h contains:
  NS_INLINE_DECL_THREADSAFE_REFCOUNTING(VolatileBuffer)
which probably causes a thread check on buffer destruction.
If libnss3.dylib containing nspr is unloaded, but a C symbol contained in libnss3.dylib is called after its unloaded, then either the automatic unloading of libraries is broken on OSX, or something deliberately unloads it at the wrong time.
Thanks for the action here. I've got a new try going and paste the results here when it's done:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=419c83c822f32c1a51426843917d386d38c53718
Attached patch no-fold.patch (obsolete) — Splinter Review
Jorg, I wonder if it still works to build NSPR in its own shared library (dylib) files.

Could you please start a try build, with this as the only patch, and see if the build works, and you're still crashing?
Sorry, I messed up the "stack patch", here another one:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=0cf1a0022ad7e22bfc80c1cc078fe416a0cae731

Kai, I'll do your try next.
OK, here the stack result, I'll post the "no fold" when it's ready:
https://archive.mozilla.org/pub/thunderbird/try-builds/mozilla@jorgk.com-0cf1a0022ad7e22bfc80c1cc078fe416a0cae731/try-comm-central-macosx64-debug/try-comm-central_yosemite_r7-debug_test-xpcshell-bm106-tests1-macosx-build5.txt.gz

14:24:01     INFO -  PID 5626 | _PR_Fini called
14:24:01     INFO -  PID 5626 | 0   libnss3.dylib                       0x0000000100ab7624 PR_Sleep + 804
14:24:01     INFO -  PID 5626 | 1   ???                                 0x00007fff634d1fd0 0x0 + 140734859386832
14:24:01     INFO -  PID 5626 | 2   ???                                 0x00007fff634c12e2 0x0 + 140734859317986
14:24:01     INFO -  PID 5626 | 3   libsystem_c.dylib                   0x00007fff892b672d __cxa_finalize_ranges + 322
14:24:01     INFO -  PID 5626 | 4   libsystem_c.dylib                   0x00007fff892b6a30 exit + 55
14:24:01     INFO -  PID 5626 | 5   xpcshell                            0x000000010096bd9b start + 59
14:24:01     INFO -  PID 5626 | 6   ???                                 0x000000000000001f 0x0 + 31
14:24:01     INFO -  PID 5626 | nsStringStats
14:24:01     INFO -  PID 5626 |  => mAllocCount:          27168
14:24:01     INFO -  PID 5626 |  => mReallocCount:         1140
14:24:01     INFO -  PID 5626 |  => mFreeCount:           27168
14:24:01     INFO -  PID 5626 |  => mShareCount:          19572
14:24:01     INFO -  PID 5626 |  => mAdoptCount:            626
14:24:01     INFO -  PID 5626 |  => mAdoptFreeCount:        626
14:24:01     INFO -  PID 5626 |  => Process ID: 5626, Thread ID: 140735093183232
14:24:01     INFO -  PID 5626 | Assertion failure: 0 == rv (22), pt_book.keyCreated=0, at /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:304
14:24:01     INFO -  PID 5626 | 0   libnss3.dylib                       0x0000000100ab6d07 PR_GetCurrentThread + 183
14:24:01     INFO -  PID 5626 | 1   XUL                                 0x0000000101140115 _ZNK7mozilla14VolatileBuffer26NonHeapSizeOfExcludingThisEv + 5258133
14:24:01     INFO -  PID 5626 | 2   XUL                                 0x0000000101179ae9 _ZNK7mozilla14VolatileBuffer26NonHeapSizeOfExcludingThisEv + 5494121
14:24:01     INFO -  PID 5626 | 3   libsystem_c.dylib                   0x00007fff892b672d __cxa_finalize_ranges + 322
14:24:01     INFO -  PID 5626 | 4   libsystem_c.dylib                   0x00007fff892b6a30 exit + 55
14:24:01     INFO -  PID 5626 | 5   xpcshell                            0x000000010096bd9b start + 59
14:24:01     INFO -  PID 5626 | 6   ???                                 0x000000000000001f 0x0 + 31
14:24:01     INFO -  PID 5626 | [5626, Main Thread] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/xpcom/base/nsTraceRefcnt.cpp, line 171
14:24:01     INFO -  <<<<<<<
The "no fold" patch caused a package error:

https://archive.mozilla.org/pub/thunderbird/try-builds/mozilla@jorgk.com-3c1da749e3cf3ea7733645d7c118c3b6c32521ff/try-comm-central-macosx64-debug/try-comm-central-macosx64-debug-bm87-try1-build15.txt.gz

Error: /builds/slave/tb-try-c-cen-m64-d-00000000000/build/objdir-tb/mail/installer/package-manifest:77: Missing file(s): DailyDebug.app/Contents/Resources/libmozsqlite3.dylib
Traceback (most recent call last):
  File "/builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/toolkit/mozapps/installer/packager.py", line 333, in <module>
    main()
  File "/builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/toolkit/mozapps/installer/packager.py", line 302, in main
    copier.add(mozpath.join(respath, 'removed-files'), removals)
  File "/tools/python/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/builds/slave/tb-try-c-cen-m64-d-00000000000/build/mozilla/python/mozbuild/mozpack/errors.py", line 131, in accumulate
    raise AccumulatedErrors()
mozpack.errors.AccumulatedErrors
make[3]: *** [stage-package] Error 1
Comment 51 shows two stacks. __cxa_finalize_ranges calls the cleanup functions that are registered to be called at exit time.

We see that the NSPR cleanup function _PR_Fini is called first, and afterwards another cleanup function gets called, which belongs to some XPCOM code, and which tries to use NSPR's thread-local-storage variables, which have already been destroyed.

This means we have an ordering problem.

(Can we assume that both calls are made from the main thread?)


How many automatic cleanup functions does Mozilla use?

IIUC on Mac we use the __attribute__ ((destructor)) mechanism to register the _PR_Fini unload function.

Searching through the rest of Firefox code, I found two additional functions:

security/nss/lib/freebl/stubs.c
728:void __attribute((destructor)) FREEBL_unload()

intl/icu/source/common/ucln_imp.h
169:static void ucln_destructor()   __attribute__((destructor)) ;

Are there any occurrences of __attribute__((destructor)) in the Thunderbird specific code?


The NSS function seems safe, it might only call dlclose().

The ucln function cleanup function is more complex, it loops through various other cleanup things.


Either we must
(a) find a way to guarantee that _PR_Fini is the last of all registered cleanup functions to be called (might not be possible),
or
(b) change ucln_destructor to not use the same cleanup mechanism as NSPR (probably difficult)
or
(c) Change GetCurrentThread() to be failsafe after having gone through _PR_Fini.
or
(d) change _PR_Fini to do nothing (dirty).


Is it reasonable to hope that the "threadsafe reference counting" is the only NSPR service that these additional cleanup functions might call? If yes, we could try (c), and test if it's sufficient.

Or could that other cleanup code call additional, more complex functions, like NSPR memory destruction functions? If yes, then we need either (a), (b), (d).
As a test to confirm that the race with the intl/icu code is indeed the culprit, you could test to disable that cleanup code.
In ucln_imp.h, you could change both definitions of UCLN_CLEAN_ME_UP to do nothing.
(c) isn't trivial either. Changing GetCurrentThread to simply return NULL would result in the ownership check (comment 45) to fail and nsAutoOwningThread::AssertCurrentThreadOwnsMe would trigger a forced crash.

Unless anyone has good ideas how to do (a) or (b), we'll have to use (d).

_PR_Fini calls pthread_key_delete(). The man page says
  "Any attempt to use key following the call to pthread_key_delete() results in undefined behavior."

But that's what's happening currently, PR_GetCurrentThread() calls _PT_PTHREAD_GETSPECIFIC() with the deleted key.

Since we apparently cannot guarantee that the key might be used after _PR_Fini, we probably shouldn't delete it. And we probably should keep the thread identifiers alive, to ensure that late same-thread-checks will continue to work.
Attached patch disable-pr-fini.patch (obsolete) — Splinter Review
This is a patch for the dirty approach (d).

Wan-Teh, do you think we should use this approach, or do you have ideas for (a) and (b), or other ideas?
Attachment #8918987 - Attachment is obsolete: true
Kai: Thanks a lot for investigating this.

Given Thunderbird's current status, I think it is fine to
apply your disable-pr-fini.patch as a Thunderbird-local
patch.

I think __cxa_finalize_ranges may also call C++ global
object destructors and atexit handlers, so it is not enough
to just look for __attribute__((destructor)) in the source
tree.

It is strange that Mac OS X unloads libnss3.dylib (which
contains the NSPR code) before it unloads code that uses
NSPR. I would think that the operating system should know
the dependencies between dynamic shared libraries and
unload them in an appropriate order.
(In reply to Kai Engert (:kaie:) from comment #53)
>
> Either we must
> (a) find a way to guarantee that _PR_Fini is the last of all registered
> cleanup functions to be called (might not be possible),
> or
> (b) change ucln_destructor to not use the same cleanup mechanism as NSPR
> (probably difficult)
> or
> (c) Change GetCurrentThread() to be failsafe after having gone through
> _PR_Fini.
> or
> (d) change _PR_Fini to do nothing (dirty).

Thanks for enumerating the possible solutions. We should not consider (c).
(b) is speculation because we still haven't identified the function that
calls PR_GetCurrentThread().

(a) is the proper solution, but it is not practical because no single
person has the setup, expertise, and motivation to do it. (We need to
at least identify the function that calls PR_GetCurrentThread() after
_PR_Fini() has been called.) So I am afraid that we will need to do (d).
(In reply to Wan-Teh Chang from comment #57)
> Given Thunderbird's current status, I think it is fine to
> apply your disable-pr-fini.patch as a Thunderbird-local
> patch.

I'm not happy with the idea of adding local NSPR patches to applications.

If we cannot think of a better fix, I'd prefer to add it to NSPR in general,
potentially with #idef for Mac, only.
Summary: Various Xpshell and Mozmill failures in debug mode on Mac on 2017-09-13: PID 6709 | Assertion failure: 0 == rv, at mozilla/nsprpub/pr/src/pthreads/ptthread.c:292 → On Mac OSX, NSPR _PR_Fini gets called too early, other cleanup code calls NSPR functions afterwards, which fails | Assertion failure: 0 == rv, at mozilla/nsprpub/pr/src/pthreads/ptthread.c:292
Assignee: wtc → kaie
Component: General → NSPR
Product: Thunderbird → NSPR
Version: unspecified → other
Attached patch 1399746-no-fini-on-osx.patch (obsolete) — Splinter Review
Attachment #8919230 - Attachment is obsolete: true
Attachment #8919369 - Flags: review?(wtc)
Comment on attachment 8919369 [details] [diff] [review]
1399746-no-fini-on-osx.patch

Jorg, could you please test this patch, to confirm it fixes the issue?
Attachment #8919369 - Flags: feedback?(jorgk)
Comment on attachment 8919369 [details] [diff] [review]
1399746-no-fini-on-osx.patch

Review of attachment 8919369 [details] [diff] [review]:
-----------------------------------------------------------------

::: pr/src/pthreads/ptthread.c
@@ +1067,5 @@
> +    /* We disable the cleanup code on Mac OSX, see bug 1399746.
> +     * The .dylib containing NSPR can get unloaded, and _PR_Fini called,
> +     * and other code calling NSPR functions can get executed afterwards.
> +    */
> +#ifndef DARWIN

Kai: We don't need to do this for all applications; apparently
Firefox has no problem with _PR_Fini() on Mac OS X.

_PR_Fini() was added to support applications that load and unload.
the NSPR shared library using dlopen() and dlclose(). Those
applications may load and unload the NSPR shared library repeatedly.
So the unloading of the NSPR shared library must free the
pthread specific data key, otherwise the process will run out of
pthread specific data keys eventually. I can add an NSPR test
program to demonstrate this use case.

This is why I recommend applying this patch to only Thunderbird.
It is a pragmatic solution.

Perhaps I should spend a weekend building Thunderbird on Mac OS X
and tracking this down.
This patch adds a new test program, dlltest_unix.c, that loads and
unloads libnspr4.so repeatedly. It demonstrates that use case that
caused us to add _PR_Fini().
Wan-Teh,

Thunderbird uses a shared codebase with Firefox. Thunderbird doesn't use its own copy of NSPR. Your suggestion would require someone to change the build system of Thunderbird to dynamically apply a patch on top of NSPR each time after they checkout the updated Firefox/Gecko base code.

I don't recommend that, and I really don't want us to get into a habit of allowing applications to patch NSPR. They should never have to. And on some operating systems like Linux, where the NSPR library is shipped only once for the whole operating systems, applications aren't even able to use their own modifications.
What is supposed to happen, after the runtime environment unloaded a shared library? What happens to the global variables that reside inside that library?

Apparently the nspr library was unloaded. But then other code called into NSPR, and the access to global variables like _pr_initialized was still possible?

Does that mean the OSX runtime is completly broken? Shouldn't the attempt to access _pr_initialized after unload have caused some sort of memory access failure?

Or does it mean that the values of static variables inside dynamic libraries will be preserved across reloads?

Is that behavior different across operating systems?

I suggest that on platforms, where the contents of static variables from unloaded variables may still be alive after an unload, we should disable the cleanup of the thread storage key allocation.

If this is the case for Mac OSX, then it isn't incorrect to always skip _PR_Fini on Mac, because other applications might run into the same problem, even if we haven't yet seen it happen in other applications.
(In reply to Kai Engert (:kaie:) from comment #64)
> Thunderbird uses a shared codebase with Firefox. Thunderbird doesn't use its
> own copy of NSPR. Your suggestion would require someone to change the build
> system of Thunderbird to dynamically apply a patch on top of NSPR each time
> after they checkout the updated Firefox/Gecko base code.
You can do conditional compiles with #ifdef MOZ_THUNDERBIRD (for examples search DXR for MOZ_THUNDERBIRD). I have read the arguments against this approach. I'll ship off a try with Kai's patch shortly.
(In reply to Jorg K (GMT+2) from comment #30)
> Marco and Andrew, should that JS error be fixed?
> [JavaScript Warning: "ReferenceError: reference to undefined property
> "result"" {file: "resource://gre/modules/IndexedDBHelper.jsm" line: 170}]

IDBTransaction doesn't have a result property, I guess |txn.result| should be replaced with |event.target.result|
Flags: needinfo?(jvarga)
(In reply to Jan Varga [:janv] from comment #68)
> IDBTransaction doesn't have a result property, I guess |txn.result| should
> be replaced with |event.target.result|
Filed bug 1409641 for that.
Flags: needinfo?(bugmail)
Comment on attachment 8919369 [details] [diff] [review]
1399746-no-fini-on-osx.patch

Thanks, that fixes the test failures we were seeing, see comment #67 for link to try run. The X and Z are red for other reasons, the failure reported here has gone away.
Attachment #8919369 - Flags: feedback?(jorgk) → feedback+
(In reply to Jorg K (GMT+2) from comment #66)
> You can do conditional compiles with #ifdef MOZ_THUNDERBIRD (for examples
> search DXR for MOZ_THUNDERBIRD).

I think general libray code, like NSPR, shouldn't contain application specific defines.
Comment on attachment 8916753 [details] [diff] [review]
Expand PR_ASSERT(0 == rv) in pt_AttachThread to debug Bug 1399746

marking this debugging patch as obsolete
Attachment #8916753 - Attachment is obsolete: true
Attachment #8916753 - Flags: review?(kaie)
(In reply to Wan-Teh Chang from comment #63)
> This patch adds a new test program, dlltest_unix.c, that loads and
> unloads libnspr4.so repeatedly. It demonstrates that use case that
> caused us to add _PR_Fini().

Unfortuantely I don't have a 64-bit Mac and cannot test on recent OSX versions.

Is _PR_Fini called for each unload on Mac, and do static variables like _pr_initialized still have the previous state after reload? Then the reload is broken anyway. Without a recent Mac, I cannot confirm that. But if NSPR code can be executed while the NSPR dylib is unloaded, we apparently are dealing with undefined states anyway, so I wouldn't expect that reloading works correctly on OSX.
Comment on attachment 8919369 [details] [diff] [review]
1399746-no-fini-on-osx.patch

If we want this fixed in NSPR for Thunderbird 58, this must get r+'ed and landed into NSPR by the end of this week.
Attachment #8919369 - Flags: review?(mh+mozilla)
It's my understand that although Thunderbird nightly tracks trunk, releases are tied to Firefox ESR releases, so the next major Thunderbird release is 59, not 58, which may take the pressure off slightly.
It would be nice to fix this asap, since it's busting our all debug tests on Mac. It's already affecting TB 58 Daily and TB 57 Beta since bug 1393710 which caused this failure landed on mozilla57.
Comment on attachment 8919369 [details] [diff] [review]
1399746-no-fini-on-osx.patch

Review of attachment 8919369 [details] [diff] [review]:
-----------------------------------------------------------------

This doesn't seem right. It shouldn't be possible for NSPR destructors to run before XUL ones have run. Or there's a serious bug in whatever version of OSX this is crashing on. Is this running on the same version as the one on Firefox automation? If not that could explain why Firefox is not affected.

The stack traces in comment 51 unfortunately lack symbols for the most interesting frames. And even those that are printed out are dubious: __cxa_finalize_ranges + 322 means 322 bytes after __cxa_finalize_ranges. 322 bytes is rather large for a function, and that could actually be a totally different one, just that we don't have the symbols for them. It however doesn't matter too much for that one, because it's the same root for both calls, and the caller is exit, so we know it's probably related to running destructors. It's even worse in XUL, where the symbol being pointed at is _ZNK7mozilla14VolatileBuffer26NonHeapSizeOfExcludingThisEv + 5258133. 5MB after a symbol is certainly not that symbol. Then there are the unknown addresses like 0x00007fff634d1fd0 0x0 + 140734859386832. That would be the most interesting thing to figure out: *exactly* how _PR_Fini is being called.

> How many automatic cleanup functions does Mozilla use?

Many. __attribute__((destructor)) is not the only way to run destructors. C++ globals with destructors run some too.
Attachment #8919369 - Flags: review?(mh+mozilla) → review-
Severity: normal → critical
I don't have a 64bit mac.

Is there anyone with a 64bit mac, who has the skills and motivation to fix this bug within the next few weeks, in time for Firefox/Thunderbird 59?

Any suggestions?
There's only 64bit mac. The Mac in the NSS CI are therefore 64bit.
Franziskus, we need a developer that physically controls a 64bit Mac for interactive debugging.
You can get a "loaner" (loan machine). I don't know the details.
(In reply to Jorg K (GMT+1) [currently bustage-fix only, no NI? or r?] from comment #81)
> You can get a "loaner" (loan machine). I don't know the details.

Who might know?
That page says
  "To access all loaners, you require access to the Mozilla VPN"

Does that mean this access is retricted to Mozilla employees?
(In reply to Kai Engert (:kaie:) from comment #84)
> That page says
>   "To access all loaners, you require access to the Mozilla VPN"
> 
> Does that mean this access is retricted to Mozilla employees?

I recall at least last winter someone from Thunderbird, (aleth or ewong?) got a loaner for investigations. I don't know if it changed, or the documentation is wrong.
It looks like this was fixed by Bug 1420407: https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=1654a1adc838380b2262f3981986d61cf9ee4a80&filter-searchStr=osx is green.
Status: NEW → RESOLVED
Closed: 7 years ago
Depends on: 1420407
Resolution: --- → FIXED
Indeed, the latest C-C push at
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=1654a1adc838380b2262f3981986d61cf9ee4a80
does not show those "Assertion failure: 0 == rv" failures any more.
Target Milestone: --- → 4.18
Attachment #8919369 - Flags: review?(wtc)
That's great news, but very confusing, because that latest change to NSPR should only have affected the Windows platform.
I assumed based on https://bugzilla.mozilla.org/show_bug.cgi?id=1420407#c0 that this was expected. Was https://hg.mozilla.org/integration/mozilla-inbound/rev/798b87f30668#l16.10 not supposed to be included?
Flags: needinfo?(kaie)
That was added accidentally, it was still in my local tree when I added the other commit.
Flags: needinfo?(kaie)
So, I'm sorry that this code was accidentally checked in. I should have paid better attention.

On the other hand, it's good to know that this workaround really helps to work around the bug, and apparently, because nobody else has reported any regressions yet, that it isn't causing any problems elsewhere. I suggest that we continue to look out for potential side effects.

If nobody has the ability to work on a more correct fix for Mac, we could consider to keep this workaround until someone is able to.
Mike, you had given r- to the patch. It was landed accidentally, but seems to help.

I'm willing to follow proper process and backout, and reactivate the bug.

Alternatively, we could keep the patch, even though it isn't ideal.

Mike, please let me know your preference.
Thanks
Flags: needinfo?(mh+mozilla)
I'd rather you backed out.
Flags: needinfo?(mh+mozilla)
backed out:
https://hg.mozilla.org/projects/nspr/rev/8d2ea9b5cc6d

I'll land a NSPR beta 3 with bug 1420407 into m-i, so you'll probably see the bug showing up again.

Reopening.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
No longer depends on: 1420407
See Also: → 1439973
So, with bug 1439973 providing a reliable way to reproduce this with Firefox, I was able to reproduce and attribute this the OSX dynamic linker.

TL;DR: don't use __attribute__((destructor)) on OSX. Use atexit() or C++ class destructors on global instances instead.

An smaller way to reproduce the issue:
$ cat <<EOF > test.cc
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>

void my_atexit() {
  printf(__FILE__ " atexit\n");
}

class Foo {
public:
  ~Foo() {
    printf(__FILE__ " ~Foo\n");
  }
};

Foo f;

__attribute__((destructor))
void destructor() {
  printf(__FILE__ " destructor\n");
}

__attribute__((constructor))
void constructor() {
  atexit(my_atexit);
  printf(__FILE__ " constructor\n");
}

int main() {
    void* handle = dlopen("libbar.dylib", RTLD_GLOBAL);
    dlclose(handle);
    return 0;
}
EOF

$ cp test.cc foo.cc
$ cp test.cc bar.cc
$ clang++ -shared -o libfoo.dylib foo.cc
$ clang++ -shared -o libbar.dylib bar.cc libfoo.dylib
$ clang++ -o test test.cc

So with this the test program dlopens libbar, which depends on libfoo. Running the program produces:

$ ./test
test.cc constructor
foo.cc constructor
bar.cc constructor
bar.cc destructor
foo.cc destructor
bar.cc ~Foo
bar.cc atexit
foo.cc ~Foo
foo.cc atexit
test.cc destructor
test.cc ~Foo
test.cc atexit

Constructors run in the expected order: test, libfoo, then libbar. Then we dlclose libbar and things start being funny: destructors run in the "right" order ... if they were the only thing to run. But atexit and C++ destructors for libbar run *after* the libfoo destructor. In our case, libbar is XUL and libfoo is nspr.

Variant:
$ clang++ -o test test.cc libbar.dylib 

This time libbar is linked to test directly, so dlopen/dlclose have no effect. That's essentially how the plugin-container is linked, contrary to the firefox binary.

$ ./test
foo.cc constructor
bar.cc constructor
test.cc constructor
test.cc destructor
bar.cc destructor
foo.cc destructor
test.cc ~Foo
test.cc atexit
bar.cc ~Foo
bar.cc atexit
foo.cc ~Foo
foo.cc atexit
Assignee: kaie → mh+mozilla
atexit on its own is not enough, though, atexit must be called before anything in XUL calls atexit/cxa_atexit. I did a try build with an atexit(_PR_Fini) in _PR_InitThreads, and that didn't work out because _PR_Fini would still be called before some XUL destructors.
Theoretically, we could use atexit on all platforms (even HPUX, AIX and Solaris support it). The problem is that while I do know that Solaris actually does the right thing with it, I don't feel like checking whether all those OSes supported here do. So I'd rather assume the current code works fine for those other platforms, and make the fix OSX-only.
Attachment #8919369 - Attachment is obsolete: true
Attachment #8952948 - Flags: review?(ted)
Attachment #8952948 - Attachment is obsolete: true
Attachment #8952948 - Flags: review?(ted)
Attachment #8952949 - Flags: review?(ted)
Comment on attachment 8952949 [details] [diff] [review]
Use atexit() to register _PR_Fini on OSX

Mike, thanks a lot for tracking this down. If we want this to be part of Firefox/Thunderbird 60, we should hurry up to get this landed. We're currently considering next monday as the dev freeze date for NSPR 4.19 and NSS 3.36, which are the versions that moz 60 will use.
Attachment #8952949 - Flags: review+
https://hg.mozilla.org/projects/nspr/rev/875fee4c031efeb1af4f48db37669b96a749ac1d
Status: REOPENED → RESOLVED
Closed: 7 years ago6 years ago
Resolution: --- → FIXED
Target Milestone: 4.18 → 4.19
Thanks for the fix here. In order to unblock bug 1353461 we'll need to either:

1. Wait for nspr 4.19 to merge to central
2. Cherry-pick this fix into central
3. Disable the reftest manifest that triggers this

Mike, do you know roughly when we might merge nspr 4.19 in? Is there any reason not to cherry-pick this in the meantime?
Flags: needinfo?(mh+mozilla)
I'll uplift a new nspr beta snapshot to m-c now.
Flags: needinfo?(mh+mozilla)
Blocks: 1432480
See Also: → 1498449
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: