Closed Bug 1121269 Opened 9 years ago Closed 9 years ago

[nuwa] __wrap_pthread_key_create causes deadlock in jemalloc3

Categories

(Firefox OS Graveyard :: General, defect)

All
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(blocking-b2g:2.1+, firefox36 wontfix, firefox37 wontfix, firefox38 fixed, b2g-v2.1 fixed, b2g-v2.1S fixed, b2g-v2.2 fixed, b2g-master fixed)

RESOLVED FIXED
2.2 S6 (20feb)
blocking-b2g 2.1+
Tracking Status
firefox36 --- wontfix
firefox37 --- wontfix
firefox38 --- fixed
b2g-v2.1 --- fixed
b2g-v2.1S --- fixed
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: erahm, Assigned: glandium)

References

Details

Attachments

(3 files, 6 obsolete files)

1.10 KB, patch
sinker
: review+
Details | Diff | Splinter Review
1.62 KB, patch
Waldo
: review+
Details | Diff | Splinter Review
7.37 KB, patch
sinker
: review+
Details | Diff | Splinter Review
See bug 762449, comment 24 for details. On B2G emulators we're seeing a deadlock in where we initialize jemalloc, that in turn calls __wrap_pthread_key_create, which then tries to use a std::map that then tries to initialize jemalloc again.
This is the same solution we used in mozglue/build/BionicGlue.cpp.
Attachment #8548550 - Flags: review?(mh+mozilla)
Comment on attachment 8548550 [details] [diff] [review]
Use static buffer for first allocation in __wrap_pthread_key_create

Review of attachment 8548550 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/build/Nuwa.cpp
@@ +225,5 @@
> + * static buffer for small sizes, and force the initial vector capacity to
> + * a size enough to store one atfork function table.
> + */
> +template <typename T>
> +struct SpecialAllocator: public std::allocator<T>

Since this is the exact same class as in BionicGlue.cpp, please move it in a header. And put that header somewhere under memory/, like memory/build/.

@@ +259,5 @@
> + * NB: stlport seems to require an allocator for the internal node type rather
> + *     than just a std::pair<K,V>.
> + */
> +typedef std::map<pthread_key_t, void (*)(void *), std::less<pthread_key_t>,
> +        SpecialAllocator<std::priv::_Rb_tree_node<std::pair<const int, void (*)(void*)> > > > TLSKeySet;

This sucks, but let's keep it that way for the moment. Please file a followup to clean this up some way or another. I don't like to rely on private types.
Attachment #8548550 - Flags: review?(mh+mozilla) → feedback+
Attached patch An alternative fix (obsolete) — Splinter Review
This is another way to fix the problem without resorting to a special allocator. We can skip the bookkeeping part in __wrap_pthread_key_create() for non-Nuwa processes. This should fix the problem during jemalloc initialization when the process is not recognize itself as Nuwa.
Attachment #8548728 - Flags: feedback?(tlee)
See Also: → 1026864
(In reply to Cervantes Yu from comment #4)
> Created attachment 8548728 [details] [diff] [review]
> An alternative fix
> 
> This is another way to fix the problem without resorting to a special
> allocator. We can skip the bookkeeping part in __wrap_pthread_key_create()
> for non-Nuwa processes. This should fix the problem during jemalloc
> initialization when the process is not recognize itself as Nuwa.

This changes what is stored in sTLSKeys, which changes what is saved and restored by SaveTLSInfo and RestoreTLSInfo. This may well be what was intended all along, but this is definitely a change in the current behavior. Note that pthread_key_delete only affects sTLSKeys in Nuwa processes.

So if in fact all pthread_keys need to be saved, Nuwa or not (and aiui, they should be), pthread_key_delete is currently wrong.
Comment on attachment 8548728 [details] [diff] [review]
An alternative fix

Review of attachment 8548728 [details] [diff] [review]:
-----------------------------------------------------------------

In fact, the number of these keys are small.  A linear search at a static array is even faster than hash.  With a reasonable assertion on the maximum size, I would suggest to use a simple array XD
Attachment #8548728 - Flags: feedback?(tlee)
(In reply to Mike Hommey [:glandium] from comment #5)
> This changes what is stored in sTLSKeys, which changes what is saved and
> restored by SaveTLSInfo and RestoreTLSInfo. This may well be what was
> intended all along, but this is definitely a change in the current behavior.
> Note that pthread_key_delete only affects sTLSKeys in Nuwa processes.
> 
> So if in fact all pthread_keys need to be saved, Nuwa or not (and aiui, they
> should be), pthread_key_delete is currently wrong.

Oh! you are right. The change may make us lose jemalloc's TSD for the recreated thread in the threads that are cloned from Nuwa. Sounds pretty undesirable.
Comment on attachment 8548728 [details] [diff] [review]
An alternative fix

Obsolete the patch that is incorrect.
Attachment #8548728 - Attachment is obsolete: true
(In reply to Thinker Li [:sinker] from comment #6)
> Comment on attachment 8548728 [details] [diff] [review]
> An alternative fix
> 
> Review of attachment 8548728 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> In fact, the number of these keys are small.  A linear search at a static
> array is even faster than hash.  With a reasonable assertion on the maximum
> size, I would suggest to use a simple array XD

I'm not sure we should set a limit on the number of threads a process can have.
Assignee: erahm → mh+mozilla
Attachment #8548550 - Attachment is obsolete: true
Attachment #8549203 - Flags: review?(tlee)
Here is a version that, more than compile, links.
Attachment #8549203 - Attachment is obsolete: true
Attachment #8549203 - Flags: review?(tlee)
Attachment #8549204 - Flags: review?(tlee)
Using AlignedStorage
Attachment #8549204 - Attachment is obsolete: true
Attachment #8549204 - Flags: review?(tlee)
Attachment #8549208 - Flags: review?(tlee)
Comment on attachment 8549208 [details] [diff] [review]
Use a LinkedList to store TLS keys for Nuwa, and allocate the first element in a static buffer

Review of attachment 8549208 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/build/Nuwa.cpp
@@ +227,5 @@
> +  : std::pair<pthread_key_t, void (*)(void*)>(aKey, aDestructor)
> +  {}
> +
> +  static void* operator new(size_t size) {
> +    if (!sUsed)

Not attaching a new patch for this, but this should be if (sUsed) instead of if (!sUsed). Fixed locally.
And avoid leaking the linked list elements at shutdown (and crash on debug builds because the list is not empty). Note that AllThreadListType was leaking its elements. (clear() doesn't delete them)
Attachment #8549208 - Attachment is obsolete: true
Attachment #8549208 - Flags: review?(tlee)
Attachment #8549219 - Flags: review?(tlee)
And this addresses comment 5.
Attachment #8549220 - Flags: review?(tlee)
Comment on attachment 8549219 [details] [diff] [review]
Use a LinkedList to store TLS keys for Nuwa, and allocate the first element in a static buffer

Waldo, how would you feel about AutoCleanLinkedList from this patch being added to LinkedList.h?
Attachment #8549219 - Flags: feedback?(jwalden+bmo)
Blocks: 1121740
(In reply to Mike Hommey [:glandium] from comment #15)
> Created attachment 8549220 [details] [diff] [review]
> Remove TLS keys from bookkeeping in non-Nuwa processes too
> 
> And this addresses comment 5.

Can we store the keys in static allocated array? The maximum number of key per process is bounded by PTHREAD_KEYS_MAX, the value is 1024 on my desktop linux machine.
(In reply to Patrick Wang (Chih-Kai Wang) (:kk1fff) from comment #17)
> (In reply to Mike Hommey [:glandium] from comment #15)
> > Created attachment 8549220 [details] [diff] [review]
> > Remove TLS keys from bookkeeping in non-Nuwa processes too
> > 
> > And this addresses comment 5.
> 
> Can we store the keys in static allocated array? The maximum number of key
> per process is bounded by PTHREAD_KEYS_MAX, the value is 1024 on my desktop
> linux machine.

Do we really need to allocate 8k on arm and 16k on arm64 for something that probably only uses a few items?
Attachment #8549220 - Flags: review?(tlee) → review+
Comment on attachment 8549219 [details] [diff] [review]
Use a LinkedList to store TLS keys for Nuwa, and allocate the first element in a static buffer

Review of attachment 8549219 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/build/Nuwa.cpp
@@ +247,5 @@
> +  static AlignedStorage2<TLSKey> sFirstElement;
> +};
> +
> +bool TLSKey::sUsed = false;
> +AlignedStorage2<TLSKey> TLSKey::sFirstElement;

I think this is a very fragile solution.  It is very likely we need |sSecondElement| in future, and we need to spend a bunch of time to figure it out again.  So, please make sure that the process would crash at first moment once it is happened.
Attachment #8549219 - Flags: review?(tlee) → review-
(In reply to Thinker Li [:sinker] from comment #19)
> Comment on attachment 8549219 [details] [diff] [review]
> Use a LinkedList to store TLS keys for Nuwa, and allocate the first element
> in a static buffer
> 
> Review of attachment 8549219 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: mozglue/build/Nuwa.cpp
> @@ +247,5 @@
> > +  static AlignedStorage2<TLSKey> sFirstElement;
> > +};
> > +
> > +bool TLSKey::sUsed = false;
> > +AlignedStorage2<TLSKey> TLSKey::sFirstElement;
> 
> I think this is a very fragile solution.  It is very likely we need
> |sSecondElement| in future, and we need to spend a bunch of time to figure
> it out again.  So, please make sure that the process would crash at first
> moment once it is happened.

The only way a second element would be needed in the future is if jemalloc changes to require more than one pthread_key during its initialization, which would be detected by a new dead lock happening after a jemalloc upgrade. What would crashing add?
Flags: needinfo?(tlee)
I mean to add some code to check if we need more than one tls key before it is ready, and crash/or abort the process intently if it is happened.  So, next time, we will know it's root cause clearly.
Flags: needinfo?(tlee)
Comment on attachment 8549219 [details] [diff] [review]
Use a LinkedList to store TLS keys for Nuwa, and allocate the first element in a static buffer

Review of attachment 8549219 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/build/Nuwa.cpp
@@ +250,5 @@
> +bool TLSKey::sUsed = false;
> +AlignedStorage2<TLSKey> TLSKey::sFirstElement;
> +
> +template <typename T>
> +struct AutoCleanLinkedList : public LinkedList<T>

Yeah, this seems reasonable to add to LinkedList.h.  Needs a documentation comment, tho, something like:

"""
A LinkedList that will remove (and delete) each element still within itself on destruction.  Note that because each element is deleted, elements must have been allocated using |new|.
"""

@@ +256,5 @@
> +  ~AutoCleanLinkedList() {
> +    for (T* it = LinkedList<T>::getFirst(); it != nullptr;) {
> +      T* tmp = it;
> +      it = it->getNext();
> +      delete tmp;

So as I look at this, I'm a bit dubious about it.  You're not removing any elements from the linked list.  So when the base class destructor's called, won't this assert?  Maybe this is only being used, currently, in cases where it's always emptied before destruction?

The fix for this is, I think:

  while (T* element = this->popFirst()) {
    delete tmp;
  }
Attachment #8549219 - Flags: feedback?(jwalden+bmo) → feedback+
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #22)
> So as I look at this, I'm a bit dubious about it.  You're not removing any
> elements from the linked list.  So when the base class destructor's called,
> won't this assert?  Maybe this is only being used, currently, in cases where
> it's always emptied before destruction?

deleting a LinkedListElement always removes it from the list:

154   ~LinkedListElement()
155   {
156     if (!mIsSentinel && isInList()) {
157       remove();
158     }
159   }
Turns out the popFirst version makes the compiler generate better code.
Attachment #8558857 - Flags: review?(jwalden+bmo)
Depends on: 1129244
Attachment #8558857 - Flags: review?(jwalden+bmo) → review+
No longer depends on: 1129244
Comment on attachment 8558967 [details] [diff] [review]
Use a LinkedList to store TLS keys for Nuwa, and allocate the first element in a static buffer

Review of attachment 8558967 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/build/Nuwa.cpp
@@ +329,5 @@
>  /**
>   * This mutex protects the access to sTLSKeys, which keeps track of existing
>   * TLS Keys.
>   */
> +static pthread_mutex_t sTLSKeyLock = PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP;

Since _NP suffix means non-portable, we should avoid it.  Use PTHREAD_ERRORCHECK_MUTEX_INITIALIZER instead of '_NP' one.

@@ +755,5 @@
>  __wrap_pthread_key_create(pthread_key_t *key, void (*destructor)(void*)) {
>    int rv = REAL(pthread_key_create)(key, destructor);
>    if (rv != 0) {
>      return rv;
>    }

static pid_t lockingthread = 0;
if (lockingthread == gettid()) {
  // nested called, bug 1121269 is happened again!
  abort();
}

@@ +756,5 @@
>    int rv = REAL(pthread_key_create)(key, destructor);
>    if (rv != 0) {
>      return rv;
>    }
> +  MOZ_RELEASE_ASSERT(REAL(pthread_mutex_lock)(&sTLSKeyLock) == 0);

lockingthread = gettid();

@@ +757,5 @@
>    if (rv != 0) {
>      return rv;
>    }
> +  MOZ_RELEASE_ASSERT(REAL(pthread_mutex_lock)(&sTLSKeyLock) == 0);
> +  sTLSKeys.insertBack(new TLSKey(*key, destructor));

lockingthread = 0;
Attachment #8558967 - Flags: review?(tlee) → review-
Comment on attachment 8558967 [details] [diff] [review]
Use a LinkedList to store TLS keys for Nuwa, and allocate the first element in a static buffer

(In reply to Thinker Li [:sinker] from comment #26)
> Comment on attachment 8558967 [details] [diff] [review]
> Use a LinkedList to store TLS keys for Nuwa, and allocate the first element
> in a static buffer
> 
> Review of attachment 8558967 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: mozglue/build/Nuwa.cpp
> @@ +329,5 @@
> >  /**
> >   * This mutex protects the access to sTLSKeys, which keeps track of existing
> >   * TLS Keys.
> >   */
> > +static pthread_mutex_t sTLSKeyLock = PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP;
> 
> Since _NP suffix means non-portable, we should avoid it.  Use
> PTHREAD_ERRORCHECK_MUTEX_INITIALIZER instead of '_NP' one.

$ git grep PTHREAD_ERRORCHECK_MUTEX_INITIALIZER
benchmarks/pthread_benchmark.cpp:  pthread_mutex_t mutex = PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP;
libc/include/pthread.h:#define PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP {__PTHREAD_ERRORCHECK_MUTEX_INIT_VALUE __RESERVED_INITIALIZER}
libc/include/pthread.h:#define PTHREAD_ERRORCHECK_MUTEX_INITIALIZER PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP

There is no non-NP definition in bionic.

> @@ +755,5 @@
> >  __wrap_pthread_key_create(pthread_key_t *key, void (*destructor)(void*)) {
> >    int rv = REAL(pthread_key_create)(key, destructor);
> >    if (rv != 0) {
> >      return rv;
> >    }
> 
> static pid_t lockingthread = 0;
> if (lockingthread == gettid()) {
>   // nested called, bug 1121269 is happened again!
>   abort();
> }

That's what the MOZ_RELEASE_ASSERTs do (pthread_mutex_lock returns an error in that case).
Attachment #8558967 - Flags: review- → review?(tlee)
Comment on attachment 8558967 [details] [diff] [review]
Use a LinkedList to store TLS keys for Nuwa, and allocate the first element in a static buffer

Review of attachment 8558967 [details] [diff] [review]:
-----------------------------------------------------------------

Mike, thanks for your explanation.  I got your point about CHECKERROR.
Attachment #8558967 - Flags: review?(tlee) → review+
(In reply to Mike Hommey [:glandium] from comment #27)
> $ git grep PTHREAD_ERRORCHECK_MUTEX_INITIALIZER
> benchmarks/pthread_benchmark.cpp:  pthread_mutex_t mutex =
> PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP;
> libc/include/pthread.h:#define PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP
> {__PTHREAD_ERRORCHECK_MUTEX_INIT_VALUE __RESERVED_INITIALIZER}
> libc/include/pthread.h:#define PTHREAD_ERRORCHECK_MUTEX_INITIALIZER
> PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP

sigh google... until commit 212e0e38248860b151b28877225629a988d95b58 in platform/bionic, it was PTHREAD_ERRORCHECK_MUTEX_INITIALIZER. I also missed that there actually is a PTHREAD_ERRORCHECK_MUTEX_INITIALIZER define, but under a "TODO: remove this namespace pollution".

Anyways, needed a fixup:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0351fbdeb37c
[Blocking Requested - why for this release]: Bug 1119157, which is 2.1+, depends on this bug.
blocking-b2g: --- → 2.1?
blocking-b2g: 2.1? → 2.1+
Both pushes look good (relative to the current state of those branches). Please nominate the patches here for b2g34 and b2g37 approval when you get a chance, Mike :)
Flags: needinfo?(mh+mozilla)
Target Milestone: --- → 2.2 S6 (20feb)
AIUI, only https://hg.mozilla.org/try/rev/1c561ed8c1da is needed for b2g34 and b2g37, am I right?
Flags: needinfo?(mh+mozilla) → needinfo?(cyu)
(In reply to Mike Hommey [:glandium] from comment #35)
> AIUI, only https://hg.mozilla.org/try/rev/1c561ed8c1da is needed for b2g34
> and b2g37, am I right?

Yes, but it depends on the other 2 patches (in the context). You are suggesting uplifting this revision only?
Flags: needinfo?(cyu)
(In reply to Cervantes Yu from comment #36)
> (In reply to Mike Hommey [:glandium] from comment #35)
> > AIUI, only https://hg.mozilla.org/try/rev/1c561ed8c1da is needed for b2g34
> > and b2g37, am I right?
> 
> Yes, but it depends on the other 2 patches (in the context).

AFAIR, besides the context mismatch, the patch is independent from the others in this bug.

> You are suggesting uplifting this revision only?

Essentially, yes.
Comment on attachment 8549220 [details] [diff] [review]
Remove TLS keys from bookkeeping in non-Nuwa processes too

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 771765
User impact if declined: bug 1119157 is dependent on this patch.
Testing completed: automatic tests on try.
Risk to taking this patch (and alternatives if risky): low. This patch doesn't contain observable behavior changes.
Attachment #8549220 - Flags: approval-mozilla-b2g37?
Attachment #8549220 - Flags: approval-mozilla-b2g34?
Attachment #8549220 - Flags: approval-mozilla-b2g37?
Attachment #8549220 - Flags: approval-mozilla-b2g37+
Attachment #8549220 - Flags: approval-mozilla-b2g34?
Attachment #8549220 - Flags: approval-mozilla-b2g34+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: