Closed Bug 1038854 Opened 5 years ago Closed 5 years ago

App does not always start

Categories

(Core :: DOM: Content Processes, defect, P1, blocker)

32 Branch
ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED FIXED
mozilla34
blocking-b2g 2.0+
Tracking Status
firefox32 --- wontfix
firefox33 --- wontfix
firefox34 --- fixed
b2g-v1.2 --- unaffected
b2g-v1.3 --- fixed
b2g-v1.3T --- fixed
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: sotaro, Assigned: sinker)

References

Details

(Whiteboard: [caf priority: p2][CR 698221])

Attachments

(7 files, 10 obsolete files)

473 bytes, text/plain
Details
14.22 KB, image/png
Details
2.09 MB, text/plain
Details
26.87 KB, text/plain
Details
5.88 KB, patch
ting
: review+
Details | Diff | Splinter Review
3.11 KB, patch
sinker
: review+
Details | Diff | Splinter Review
6.73 KB, patch
Details | Diff | Splinter Review
Attached file test.py
On a 273MB-RAM Flame, there seems to be a case where an application does not start. I ran into this problem while investigating Bug 1037360. This problem seems independent of Bug 1037360.

STR
[1] Run the attached script on a v2.0 Flame with 273MB RAM.

Expected result
- The app is started and killed repeatedly by the script.

Actual result
- The app fails to start up; only its icon is shown on the screen.

Prerequisite
- A v2.0 Flame with 273MB RAM.
- It seems better to apply the fix for Bug 1037360.
Depends on: 1037360
Nominate to "b2g-v2.0+".
blocking-b2g: --- → 2.0?
This problem might be related to Bug 1036670.
See Also: → 1036670
Attached file logcat
Logcat output up to the point the problem happened.
The b2g-ps output was strange: there were two preallocated processes.

----------------------------
sotaro@eideticker-tor-3:/mnt/sdb/sotaro/b2g_v20_flame/B2G$ adb shell b2g-ps
APPLICATION    SEC USER     PID   PPID  VSIZE  RSS     WCHAN    PC         NAME
b2g              0 root      279   1     190668 47740 ffffffff b6ea367c S /system/b2g/b2g
(Nuwa)           0 root      854   279   55004  3080  ffffffff b6eba67c S /system/b2g/plugin-container
Homescreen       2 u0_a936   936   279   108280 19676 ffffffff b6e5867c S /system/b2g/plugin-container
(Preallocated a  0 u0_a11234 11234 854   59164  8584  ffffffff b6eba848 S /system/b2g/plugin-container
(Preallocated a  2 u0_a11247 11247 854   61156  15672 ffffffff b6eba67c S /system/b2g/plugin-container
Hi Sotaro... it looks like you are working on this? If so, could you please assign it to yourself?
Passing this on to you, Sotaro, for now. Feel free to re-assign as needed.
(Just a note from CAF's triage they are seeing this on their side as well, Inder is going to add more details here)
Assignee: nobody → sotaro.ikeda.g
blocking-b2g: 2.0? → 2.0+
I am un-assigning myself. I just reported the bug; I am not working on it. I have other bugs, and there is no evidence that this is a graphics bug.
Assignee: sotaro.ikeda.g → nobody
(In reply to Sotaro Ikeda [:sotaro] from comment #3)
> This problem might be related to Bug 1036670.

NI on ting here to help investigate given it might be related to 1036670
Flags: needinfo?(tchou)
Component: General → Performance
In bug 1036670 the homescreen hasn't been entered yet (so no app could be launched), which should be different from this one.
Flags: needinfo?(tchou)
Sotaro, could you please show me how to run the test script on Flame? I couldn't make it work:

ting@sweet:~/w/fx/os/flame$ ~/w/pyenv/bin/python test.py
Opening Phone
Traceback (most recent call last):
  File "test.py", line 15, in <module>
    app = a.launch("phone")
  File "/home/ting/w/fx/os/gaia/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 49, in launch
    assert result, "Failed to launch app with name '%s'" % name
AssertionError: Failed to launch app with name 'phone'
NI for comment 11.
Flags: needinfo?(sotaro.ikeda.g)
Passing this on ting, as he is helping investigate this based on our offline discussion.
Assignee: nobody → tchou
Thanks to :askeing's help, I found that marionette didn't work because I had a previously installed gaia-ui-tests package that did not match the client version.
Flags: needinfo?(sotaro.ikeda.g)
The test script has iterated over 200 times but hasn't reproduced it. I am not sure whether this relates to bug 1033618. I am using:

  gecko: db6779a27019c0d1ec63fc501afdf250639fd468 (v32.0a2)
  gaia:  404edc9230fb925a31845622a99de84fd35d1c7b

Sotaro, what's your reproduction rate, and which commit hashes are you on?
Flags: needinfo?(sotaro.ikeda.g)
(In reply to Ting-Yu Chou [:ting] from comment #15)
> The test script iterates over 200 times, but hasn't reproduced it. I am not
> sure does this relate to bug 1033618. I am using:
> 
>   gecko: db6779a27019c0d1ec63fc501afdf250639fd468 (v32.0a2)
>   gaia:  404edc9230fb925a31845622a99de84fd35d1c7b
> 
> Sotaro, what's your reproduce rate and commit hash?

I ran the script for several hours.

 gecko: 323e7d8e8884be8da8341c0e2687e81962a774f4
 gaia: 2c6c413ed729d465c52d6c2d5d458e2eee79e956
Flags: needinfo?(sotaro.ikeda.g)
In my case, the problem happened consistently, though it takes some time before it occurs.
Severity: normal → blocker
Keywords: perf
Priority: -- → P1
Whiteboard: [c=memory p= s= u=2.0]
Status: NEW → ASSIGNED
Just reproduced it; there were 2 preallocated processes, as in comment 5. One is blocked in HelperThreadState().lock() from js::CancelOffThreadIonCompile(), and one is waiting in epoll in nsAppShell::ProcessNextNativeEvent().

Will check why the former one blocks there.
There are 2 HelperThreads (Analysis Helper), both waiting on the condition |producerWakeup| associated with GlobalHelperThreadState's mutex. The mutex should be unlocked, but its value is 2, which blocks the AutoLockHelperThreadState in CancelOffThreadIonCompile().

I am checking how this could happen.
Keywords: footprint
Whiteboard: [c=memory p= s= u=2.0] → [c=memory p= s= u=2.0][MemShrink]
So this is a Nuwa issue.

When one frozen HelperThread has been resumed, the second one about to be resumed may get EBUSY from pthread_mutex_trylock() in __wrap_pthread_cond_wait(). It then raises the |reacquireMutex| flag and lets the main thread call pthread_mutex_lock() on the mutex. But there is no corresponding unlock call, and the mutex is not recursive, so a later call to pthread_mutex_lock() on the same mutex will block, whether it comes from the main thread or any other thread.

I am not sure why the mutex needs to be locked by the main thread; per the spec of pthread_cond_wait() and pthread_cond_timedwait(), the valid state can only be maintained if the recreated thread owns the lock.

I am verifying my patch now and will post it later.
Component: Performance → DOM: Content Processes
Product: Firefox OS → Core
Version: unspecified → 32 Branch
(In reply to Ting-Yu Chou [:ting] from comment #20)
> I am not sure why the mutex needs to be locked by main thread, per spec of
> pthread_cond_wait() and pthread_cond_timedwait(), the valid state can only
> be maintained if the recreated thread owns the lock.

Cervantes just told me the reason: it is for the case where thread A owns mutex M, and thread B is recreated, is not interrupted by a context switch, and proceeds into pthread_cond_wait(), which unlocks M. In that case the state of M as seen by thread A would be incorrect, so thread B asks the main thread to lock it instead.

The implementation also assumes there is no undefined behavior when a thread locks/unlocks a mutex it does not own.

I need to investigate this further for the patch.
Whiteboard: [c=memory p= s= u=2.0][MemShrink] → [c=memory p= s= u=2.0][MemShrink] [CR 698221]
If this is a bug in Nuwa then it's not a MemShrink issue per se.
Whiteboard: [c=memory p= s= u=2.0][MemShrink] [CR 698221] → [c=memory p= s= u=2.0][CR 698221]
Whiteboard: [c=memory p= s= u=2.0][CR 698221] → [caf priority: p2][c=memory p= s= u=2.0][CR 698221]
After double-checking, the RECREATE_GATE_VIP() in __wrap_pthread_cond_wait() should prevent a recreated thread from locking any mutex before all the threads are recreated. So comment 20 is not entirely correct:

> When one freezed HelperThread has been resumed, the second one going to be
> resumed may get EBUSY from pthread_mutex_trylock() in
> __wrap_pthread_cond_wait().

It is not the case that the first recreated HelperThread locks the mutex and causes EBUSY for the second one.

The first recreated HelperThread actually enters REAL(pthread_cond_wait) and blocks there, so the mutex value should be decreased to 0. But later, when the second HelperThread is being recreated, pthread_mutex_trylock() returns EBUSY, which it shouldn't. I thought some other thread might be locking the mutex, but I didn't see that in the logs.

I added this code to pthread.c:

diff --git a/libc/bionic/pthread.c b/libc/bionic/pthread.c
index e30fa9d..f18d2f9 100644
--- a/libc/bionic/pthread.c
+++ b/libc/bionic/pthread.c
@@ -1122,8 +1122,11 @@ int __pthread_cond_timedwait_relative(pthread_cond_t *con
 {
     int  status;
     int  oldvalue = cond->value;
+    int  oldmvalue = mutex->value;
 
     pthread_mutex_unlock(mutex);
+    if (mutex->value >= oldmvalue)
+        raise(SIGTRAP);
     status = __futex_wait_ex(&cond->value, COND_IS_SHARED(cond), oldvalue, relt
     pthread_mutex_lock(mutex);

and I can hit the SIGTRAP; will check this further.
This is a 2.0 blocker but I don't see much happening in the last 5 days.  Can I please get an update?
I am sorry, but I still haven't figured out how it behaves incorrectly; I am still working on it.
Flags: needinfo?(tchou)
a. The mutex value of helperLock is not synchronized between the main thread, HelperThread-A, and HelperThread-B, which causes HelperThread-B to fail pthread_mutex_trylock(). This probably relates to the memory barrier usage in the pthread implementation.
b. The main thread will block when locking reacquireMutex if it is owned by another thread, which may cause a deadlock. This is a different issue; I will file a bug later.

When a) happens, there are 6 possible sequences for the main thread (M) recreating the 2 HelperThreads (A, B) from __wrap_pthread_cond_wait(), where +/- means mutex lock/unlock:

  1. A+ A- B- M+,
  2. A+ A- M+ B-,
  3. A+ B- A- M+,
  4. A+ B- M+ A-,
  5. A+ M+ A- B-,
  6. A+ M+ B- A-

For cases 1 and 3, the mutex will end up locked uncontended after M+, which causes this issue. To fix it, we can either:

  1. make sure the mutex value is synchronized so that a) does not happen, or
  2. make sure pthread_mutex_lock() is called before __real_pthread_cond_wait(), which goes on to call pthread_mutex_unlock()

I have tried adding a dsb barrier before returning from _normal_unlock(), but that doesn't work for solution 1. So far I have two ideas for solution 2:

  replace pthread_mutex_trylock() by pthread_mutex_lock() and get rid of reacquireMutex, or
  delay __real_pthread_cond_wait() until main thread owns reacquireMutex
(In reply to Ting-Yu Chou [:ting] from comment #26)
> doesn't work for solution 1. And by now, I have two ideas for solution 2:
> 
>   replace pthread_mutex_trylock() by pthread_mutex_lock() and get rid of
> reacquireMutex, or
>   delay __real_pthread_cond_wait() until main thread owns reacquireMutex

Talked to Thinker and Cervantes; actually neither is OK, as both cause b) of comment 26.

Cervantes came up with a better solution: recreate HelperThread-B only after HelperThread-A has entered __real_pthread_cond_wait(). I am verifying it now.
Does this bug still repro when you test it on a Flame set to 319 MB of memory?
Flags: needinfo?(tchou)
Yes, unfortunately :(
Flags: needinfo?(tchou)
Corrections:

(In reply to Ting-Yu Chou [:ting] from comment #26)
> a. The mutex value of helperLock is not synchronized between main thread,
> HelperThread-A, and HelperThread-B, which causes HelperThread-B failing
> pthread_mutex_trylock(). This probably relates to memory barrier usage of
> pthread implementation.

Actually HelperThread-B fails pthread_mutex_trylock() because it is recreated while HelperThread-A hasn't yet entered the wrapped pthread condition wait, i.e. while HelperThread-A still owns the mutex. Note that a context switch may happen after HelperThread-A passes RECREATE_CONTINUE() and before __real_pthread_cond_wait().

> b. Main thread will be blocked for locking reacquireMutex if it is owned by
> another thread, and may cause deadlock. This is a different issue, will file
> a bug later.

I was wrong; the deadlock does not exist with the current implementation. It happens only if I replace trylock with lock, so no bug needs to be filed.
Comment on attachment 8464458 [details] [diff] [review]
patch

The patch guarantees that if a thread is being recreated from the pthread condition wait wrapper, it will enter the wrapped function and unlock the mutex before the main thread creates the next one, so there won't be sequences like:

  A+ A- B- M+,
  A+ B- A- M+

Unlocking and then locking an unlocked mutex makes it locked uncontended, which is not what we want.

A+ = HelperThread-A calls pthread_mutex_lock(m)
A- = HelperThread-A calls pthread_mutex_unlock(m)
M+ = Main thread calls pthread_mutex_lock(m)
Attachment #8464458 - Flags: review?(khuey)
Retitling, since this is unrelated to the memory configuration of the device.
Summary: App does not always start on 273MB flame → App does not always start
Comment on attachment 8464458 [details] [diff] [review]
patch

Review of attachment 8464458 [details] [diff] [review]:
-----------------------------------------------------------------

This is subtle.  Excellent catch.

::: mozglue/build/Nuwa.cpp
@@ +139,5 @@
>    void *recrArg;
>  
>    TLSInfoList tlsInfo;
>  
> +  bool isCondMutexBalanced;

I think condMutexNeedsBalancing might be a more accurate name for this.

@@ +140,5 @@
>  
>    TLSInfoList tlsInfo;
>  
> +  bool isCondMutexBalanced;
> +  pthread_mutex_t *condMutex;

Reverse the order of these and place the following comment above them (based on your comment in the last hunk).

We must ensure that the recreated thread has entered pthread_cond_wait() or similar functions before proceeding to recreate the next one. Otherwise, if the next thread depends on the same mutex, it may be used in an incorrect state.  To do this, the main thread must unconditionally acquire the mutex.  The mutex is unconditionally released when the recreated thread enters pthread_cond_wait().  The recreated thread may have locked the mutex itself (if the pthread_mutex_trylock succeeded) or another thread may have already held the lock.  If the recreated thread did lock the mutex we must balance that with another unlock on the main thread, which is signaled by isCondMutexBalanced/condMutexNeedsBalancing/whatever you name it.

@@ +1416,5 @@
>        RECREATE_WAIT();
> +      if (tinfo->condMutex) {
> +        // Ensure the recreating thread has entered wrapped pthread condition
> +        // wait function before recreating next one. Otherwise if next thread
> +        // depends on the same mutex, it may be used with an incorrect state.

And then here just write

Synchronize with the recreated thread in pthread_cond_wait().
Attachment #8464458 - Flags: review?(khuey) → review+
Attached patch patchSplinter Review
Carry r+ from comment 34.

(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #34)
> Comment on attachment 8464458 [details] [diff] [review]
> patch
> 
> Review of attachment 8464458 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This is subtle.  Excellent catch.

Yes, it's neat; credit goes to Cervantes.

>
> ::: mozglue/build/Nuwa.cpp
> @@ +139,5 @@
> >    void *recrArg;
> >  
> >    TLSInfoList tlsInfo;
> >  
> > +  bool isCondMutexBalanced;
> 
> I think condMutexNeedsBalancing might be a more accurate name for this.

Changed.

>
> @@ +140,5 @@
> >  
> >    TLSInfoList tlsInfo;
> >  
> > +  bool isCondMutexBalanced;
> > +  pthread_mutex_t *condMutex;
> 
> Reverse the order of these and place the following comment above them (based
> on your comment in the last hunk).
> 
> We must ensure that the recreated thread has entered pthread_cond_wait() or
> similar functions before proceeding to recreate the next one. Otherwise, if
> the next thread depends on the same mutex, it may be used in an incorrect
> state.  To do this, the main thread must unconditionally acquire the mutex. 
> The mutex is unconditionally released when the recreated thread enters
> pthread_cond_wait().  The recreated thread may have locked the mutex itself
> (if the pthread_mutex_trylock succeeded) or another thread may have already
> held the lock.  If the recreated thread did lock the mutex we must balance
> that with another unlock on the main thread, which is signaled by
> isCondMutexBalanced/condMutexNeedsBalancing/whatever you name it.

Done.

> 
> @@ +1416,5 @@
> >        RECREATE_WAIT();
> > +      if (tinfo->condMutex) {
> > +        // Ensure the recreating thread has entered wrapped pthread condition
> > +        // wait function before recreating next one. Otherwise if next thread
> > +        // depends on the same mutex, it may be used with an incorrect state.
> 
> And then here just write
> 
> Synchronize with the recreated thread in pthread_cond_wait().

Done.
Attachment #8464458 - Attachment is obsolete: true
Attachment #8465155 - Flags: review+
(In reply to Ting-Yu Chou [:ting] from comment #35)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #34)
> > This is subtle.  Excellent catch.
> Yes, it's neat, credit goes to Cervantes.

I misread the catch as patch, thanks :)
Out of curiosity do you know which lock this was happening to?  That would be useful information for deciding whether we need to backport this to 1.4/1.3t/etc.
https://hg.mozilla.org/integration/mozilla-inbound/rev/f73cd738c1fe
Whiteboard: [caf priority: p2][c=memory p= s= u=2.0][CR 698221] → [caf priority: p2][CR 698221]
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #37)
> Out of curiosity do you know which lock this was happening to?  That would
> be useful information for deciding whether we need to backport this to
> 1.4/1.3t/etc.

From comment 19:

> There are 2 HelperThread (Analysis Helper), both are waiting on the
> condition |producerWakeup| associated to GlobalHelperThreadState's mutex.
> The mutex should be unlocked, but its value is 2 and blocks the
> AutoLockHelperThreadState of CancelOffThreadIonCompile().

GlobalHelperThreadState's mutex is within PRLock helperLock.
The problem happens in worker threads, and we create worker threads based on the number of CPU cores. I think this will happen on previous branches that run on dual-core CPUs, so the answer is: yes, we need to backport this patch.
https://hg.mozilla.org/mozilla-central/rev/f73cd738c1fe
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
Backed out per IRC.

https://hg.mozilla.org/releases/mozilla-b2g32_v2_0/rev/2631725af366
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla34 → ---
Depends on: 1046997
I suspect the problem here is that if pthread_cond_wait returns before the main thread attempts to lock the mutex in question the recreated thread will end up in RECREATE_GATE_VIP holding the mutex and the main thread will be blocked attempting to lock the mutex.  I should have caught that during review.
What if the main thread sets condMutex to null after successfully acquiring the lock?  Then in the RECREATE_GATE_VIP branch on the recreated thread we can check whether condMutex is null.  If it is *not*, we unlock the mutex in question to allow the main thread to acquire it, and we set condMutexNeedsBalancing to false (not in this order!) unconditionally, so the main thread locks but does not unlock the mutex.
Sorry about causing trouble; comment 45 should be the case, but please allow me some time to think about comment 46.
Flags: needinfo?(tchou)
Don't worry about it.  I've done worse :P
Status: REOPENED → ASSIGNED
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #45)
> I suspect the problem here is that if pthread_cond_wait returns before the
> main thread attempts to lock the mutex in question the recreated thread will
> end up in RECREATE_GATE_VIP holding the mutex and the main thread will be
> blocked attempting to lock the mutex.  I should have caught that during
> review.

It sounds like we tried to serialize some actions between the recreated thread and the main thread but introduced another bug that also needs serialization. I can't think of any condition under which pthread_cond_wait() could return before the main thread attempts to lock the mutex; after all, the main thread is the one recreating threads, and the recreated threads are all gated, waiting for the main thread's command to open the gate. It's probably pthread_cond_timedwait(), where the forked process returned from it immediately without really waiting.
Yes, it's probably pthread_cond_timedwait().  But waiting on a condvar is allowed to wake up spuriously.  See http://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_cond_wait.html.  This is why the correct usage of a condvar is always:

Thread 1

while (!signalled) {
  condvar.Wait();
}

Thread 2

signalled = true;
condvar.Notify();

So I think whatever solution we use for pthread_cond_timedwait() we should use for all of the functions.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #46)
> What if the main thread sets condMutex to null after successfully acquiring
> the lock.  Then in the RECREATE_GATE_VIP branch on the recreated thread we
> can check to see if condMutex is null.  If it is *not*, we unlock the mutex
> in question to allow the main thread to acquire it and we set
> condMutexNeedsBalancing to false (not in this order!) unconditionally so the
> main thread does locks but does not unlock the mutex.

I think this can't fix the problem. The key to this deadlock is that once pthread_cond_wait() wakes up too quickly, the main thread has no chance to unblock from its attempt to acquire condMutex, so the deadlock still happens. Moreover, we already break the ground rule of pthread mutexes by locking/unlocking in different threads; I don't want to introduce yet another violation to fix this.

How about this:
the main thread uses pthread_mutex_trylock() with condMutex and checks for the case in comment #45. If it happens, we break from the loop and continue to run.
Attached patch Alternative fix (obsolete) — Splinter Review
This is an alternative to patch parts 1 and 2, provided by Thinker: we just treat freeze point 2 as if it were freeze point 1 for pthread_cond_wait() and similar functions. Because we don't let __real_pthread_cond_wait() unlock the mutex, we won't mess up the state of the mutex.
Comment on attachment 8466139 [details] [diff] [review]
Alternative fix

This doesn't work; it deadlocks in the recreated process.
Attachment #8466139 - Attachment is obsolete: true
Attachment #8466132 - Flags: review?(khuey)
Thinker, your approach of moving RECREATE_GATE_VIP() up before the real pthread_cond_wait() doesn't work, because the main thread will deadlock acquiring the mutex associated with the condvar. This is why we originally needed to distinguish freeze points 1 and 2 and introduce tinfo->reacquireMutex: the mutex will be in different states.

If we can't follow the route of moving RECREATE_GATE_VIP() up and pretending it were freeze point 1, we need to take the forward fix, although it makes things more complicated.
Flags: needinfo?(tlee)
I am wondering whether it is possible to have the main thread lock the condMutex before the recreated thread is running. That way we wouldn't need trylock() in the recreated thread, and the main thread would not need to balance the mutex afterward.
Comment on attachment 8466139 [details] [diff] [review]
Alternative fix

Review of attachment 8466139 [details] [diff] [review]:
-----------------------------------------------------------------

You need to roll back this side effect.

::: mozglue/build/Nuwa.cpp
@@ +1030,1 @@
>    }

if (freezeCountChg) {
    freezePoint1 = true;
}

@@ +1030,2 @@
>    }
>    rv = REAL(pthread_cond_wait)(cond, mtx);

if (freezeCountChg) {
    freezePoint1 = false;
}
(In reply to Ting-Yu Chou [:ting] from comment #57)
> I am thinking is it possible to let main thread lock the condMutex before
> recreated thread is running, this way we don't need trylock() in recreated
> thread, also main thread does not need to balance the mutex afterward.

You mean explicit checkpoints.  I don't encourage using them unless there is no better solution.
Flags: needinfo?(tlee)
(In reply to Thinker Li [:sinker] from comment #58)
> Comment on attachment 8466139 [details] [diff] [review]
> Alternative fix
> 
> Review of attachment 8466139 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> You need to rollback this side-effect.
Sorry!  I missed something.
Attached patch fix-pthread_cond_wait.diff (obsolete) — Splinter Review
With this patch, we could even remove reacquireMutex.
Attached patch fix-pthread_cond_wait.diff (obsolete) — Splinter Review
Attachment #8466255 - Attachment is obsolete: true
Attached patch fix-pthread_cond_wait.diff (obsolete) — Splinter Review
Attachment #8466261 - Attachment is obsolete: true
Attached patch Proposed fixSplinter Review
This is my proposal.  I think this is nicer than the trylock/yield loop in Cervantes's patch.  And I have no clue how Thinker's patch is supposed to work.
Attachment #8466296 - Flags: feedback?(tlee)
Attachment #8466296 - Flags: feedback?(cyu)
Comment on attachment 8466263 [details] [diff] [review]
fix-pthread_cond_wait.diff

Review of attachment 8466263 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/build/Nuwa.cpp
@@ +1025,5 @@
> +
> +      rv = REAL(pthread_cond_wait)(cond, &_mtx);
> +      // Acquire the mutex that should be acquired by the above line but
> +      // not.
> +      REAL(pthread_mutex_lock)(mtx);

The idea is that the call to pthread_cond_wait(), before being frozen, had already released |mtx|, so we don't need to release it again.  But pthread_cond_wait() requires a locked mutex to be passed in, so this patch gives pthread_cond_wait() another locked mutex, |_mtx|: pthread_cond_wait() releases |_mtx| on entering and reacquires |_mtx| on returning.

For the callers of the wrapper, |mtx| should have been reacquired by the time pthread_cond_wait() returns.  That is why we lock |mtx| here.

With this model, we don't need to play the reacquire trick any more.
Comment on attachment 8466296 [details] [diff] [review]
Proposed fix

Review of attachment 8466296 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/build/Nuwa.cpp
@@ +1456,1 @@
>          if (tinfo->condMutexNeedsBalancing) {

Since modifying condMutex and testing condMutexNeedsBalancing are not atomic, there is a race.
Attachment #8466296 - Flags: feedback?(tlee)
(In reply to Thinker Li [:sinker] from comment #66)
> Comment on attachment 8466296 [details] [diff] [review]
> Proposed fix
> 
> Review of attachment 8466296 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: mozglue/build/Nuwa.cpp
> @@ +1456,1 @@
> >          if (tinfo->condMutexNeedsBalancing) {
> 
> Since modifying condMutex and testing condMutexNeedsBalancing are not
> atomic, there is racing.

After passing RECREATE_CONTINUE/RECREATE_WAIT, condMutexNeedsBalancing is never read or written without holding condMutex.  Similarly, after RECREATE_CONTINUE/RECREATE_WAIT we do *read* condMutex without any synchronization on the main thread, but the main thread is also the only thread that writes to condMutex, which it only does with condMutex held.  And the other thread that reads condMutex, the recreated thread, must hold the lock to do so.  So I don't see how this races.
(In reply to Thinker Li [:sinker] from comment #65)
> Comment on attachment 8466263 [details] [diff] [review]
> fix-pthread_cond_wait.diff
> 
> Review of attachment 8466263 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: mozglue/build/Nuwa.cpp
> @@ +1025,5 @@
> > +
> > +      rv = REAL(pthread_cond_wait)(cond, &_mtx);
> > +      // Acquire the mutex that should be acquired by the above line but
> > +      // not.
> > +      REAL(pthread_mutex_lock)(mtx);
> 
> The idea is the calling, before being frozen, on pthread_cond_wait() had
> already released |mtx|.  So, we don't need to release it again.  But,
> pthread_cond_wait() request for passing a locked mutex.  So, this patch
> gives pthread_cond_wait() another locked mutex, |_mtx|.  So,
> pthread_cond_wait() would release |_mtx| for entering, and acquire |_mtx|
> back for returning.
> 
> For the callers of the wrapper, |mtx| should be acquired back for returning
> from pthread_cond_wait().  That is why to lock |mtx| here.
> 
> With this model, we don't need to play the reacquire trick any more.

Ok, I understand what you are saying.  There's no need for _mtx; you can pass nullptr to pthread_cond_wait.  That would make it much easier to see what's going on.

I haven't completely convinced myself it is correct, but it looks good.  It's 3 AM here though so I'll think about it some more tomorrow.
Comment on attachment 8466132 [details] [diff] [review]
Forward fix to the patch

I think we want either Thinker's patch or my patch.
Attachment #8466132 - Flags: review?(khuey) → review-
Comment on attachment 8466296 [details] [diff] [review]
Proposed fix

Review of attachment 8466296 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/build/Nuwa.cpp
@@ +1133,5 @@
>    }
>    rv = REAL(__pthread_cond_timedwait)(cond, mtx, abstime, clock);
>    if (recreated && mtx) {
> +    if (tinfo->condMutex) {
> +      tinfo->condMutexNeedsBalancing = false;

Here 1.1) reads |condMutex| and 1.2) writes |condMutexNeedsBalancing|.  It is possible for the thread to be preempted between the read and the write.

In RecreateThreads(), it 2.1) writes |condMutex|, then 2.2) reads |condMutexNeedsBalancing|.

The following case would cause problems.

  ... -> 1.1 -> 2.1 -> 2.2 -> 1.2 -> ...

pthread_mutex_unlock() would be called twice on the mutex given by |mtx|.

Do I miss any point?
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #67)
> And the other thread that reads condMutex, the recreated thread, must hold
> the lock to do so.  So I don't see how this races.
Ok!  I get the point; you are right.  pthread_cond_timedwait() needs to contend for the mutex with the main thread.
(In reply to Thinker Li [:sinker] from comment #70)
> Comment on attachment 8466296 [details] [diff] [review]
> Proposed fix
> 
> Review of attachment 8466296 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: mozglue/build/Nuwa.cpp
> @@ +1133,5 @@
> >    }
> >    rv = REAL(__pthread_cond_timedwait)(cond, mtx, abstime, clock);
> >    if (recreated && mtx) {
> > +    if (tinfo->condMutex) {
> > +      tinfo->condMutexNeedsBalancing = false;
> 
> Here 1.1) reads |condMutex| and 1.2) writes |condMutexNeedsBalancing|.  It
> is possible to preempt the thread between reading and writing.
> 
> In RecreateThreads(), it 2.1) writes |condMutex|, then 2.2) reads
> |condMutexNeedsBalancing|.
> 
> The following case would cause problems.
> 
>   ... -> 1.1 -> 2.1 -> 2.2 -> 1.2 -> ...
> 
> pthread_mutex_unlock() would be called twice on the mutex given by |mtx|.
> 
> Do I miss any point?

You're missing that both 1.1 and 1.2 happen after pthread_cond_wait returns, so the lock is held, and both 2.1 and 2.2 happen after calling pthread_mutex_lock(tinfo->condMutex), so the lock is held.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #68)
> Ok, I understand what you are saying.  There's no need for _mtx; you can
> pass nullptr to pthread_cond_wait.
I am not sure we can pass a nullptr here.  At first I thought so too, but after checking the man page I am not sure we can, since it says nothing about passing a nullptr.
(In reply to Thinker Li [:sinker] from comment #73)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #68)
> > Ok, I understand what you are saying.  There's no need for _mtx; you can
> > pass nullptr to pthread_cond_wait.
> I am not sure we can pass a nullptr here.  At the beginning I thought so
> too, but after checking the man page I am not sure we can do that, since
> it says nothing about passing a nullptr.

The man page also states that using more than one mutex with the condvar results in undefined behavior (the trick to reacquire condMutex is also undefined behavior though).
Comment on attachment 8466296 [details] [diff] [review]
Proposed fix

Review of attachment 8466296 [details] [diff] [review]:
-----------------------------------------------------------------

I think this patch should work, since it always passes information between threads with condMutex locked.
Attachment #8466296 - Flags: feedback?(cyu) → feedback+
So after thinking about it some more I am pretty convinced that your patch works.  The only potential problem I see is that we will no longer correctly handle the case where the thread had passed freeze point 1 but had not yet released the mutex and blocked in pthread_cond_wait when fork() was called (because we will try to acquire the mutex twice).  But we have similar issues elsewhere (e.g. if pthread_mutex_lock has returned and acquired the lock but we have not reached freeze point 2) so I'm not concerned about it.  I do wonder if this is the only case where we lock/unlock mutexes from the "wrong" thread.  If so, I guess this doesn't change the amount of undefined behavior we rely on.

As far as I am concerned you own this code, so you can decide if you want to use your approach or Ting's patch plus mine.  If you want to use yours, you need to do at least the following:

1. If pthread_cond_wait requires a mutex, lock and unlock _mtx around the added pthread_cond_wait call.  We should also file a bug on removing our null-checking for mtx.
2. If pthread_cond_wait does not require a mutex, remove _mtx.
3. Remove the existing infrastructure to support reacquireMutex and the comments about it.
4. Implement this in all of the pthread_cond_wait functions.

I would r+ a patch with that (and good comments explaining what we're doing).
Flags: needinfo?(tlee)
Whatever we end up doing here we're going to want to backport all the way to 1.3.
Reassign to Thinker as he is creating another patch, which is different from attachment 8466263 [details] [diff] [review].
Assignee: tchou → tlee
Attachment #8466967 - Attachment is obsolete: true
I am thinking about a simple alternative: all the hassle with trylock and reacquiring the mutex comes from the fact that pthread_cond_wait() releases the lock before entering the sleeping state, and that the mutex associated with the condvar could later be acquired by another thread.

What if we just maintain the assertion that pthread_cond_wait() should be called with the locked mutex if the thread was frozen in pthread_cond_wait()? As long as thread recreation is never blocked, all threads that are depending on the mutex will be resynced using the mutex. The effect will be just like a small hiccup in thread scheduling.
Flags: needinfo?(tlee)
(In reply to Cervantes Yu from comment #82)
> Created attachment 8467012 [details]
> The patch to resync before pthread_cond_wait()

OK, the problem with it is that we have no control over which thread acquires the mutex first. If another thread acquires the mutex and pthread_cond_signal()s the thread that should pthread_cond_wait() but hasn't yet, we'll lose the signal. For this to work, we need to further wrap pthread_cond_signal() and pthread_cond_broadcast(), which seems to further complicate the design.
Attachment #8467012 - Attachment is obsolete: true
Attachment #8467008 - Attachment is obsolete: true
Comment on attachment 8467083 [details] [diff] [review]
Make mutex acquiring being managed and in ordered, v3

This proposal seems less hacky and simpler in logic, because it avoids the problem caused by reacquiring the mutex in a different thread.  I added an explanation of the proposal just before the class |mutex_lock_request| in the patch.
Attachment #8467083 - Flags: feedback?(khuey)
Attachment #8467083 - Flags: feedback?(cyu)
Comment on attachment 8467083 [details] [diff] [review]
Make mutex acquiring being managed and in ordered, v3

Review of attachment 8467083 [details] [diff] [review]:
-----------------------------------------------------------------

This is too risky to take for 2.0.  The patch has obvious syntax errors, which tells me it's totally untested.  If you want to make these changes let's do it in a bug that is not a blocker.
Attachment #8467083 - Flags: feedback?(khuey) → feedback-
Comment on attachment 8466296 [details] [diff] [review]
Proposed fix

Let's do the simple thing here.
Attachment #8466296 - Flags: review?(tlee)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #87)
> Comment on attachment 8466296 [details] [diff] [review]
> Proposed fix
> 
> Let's do the simple thing here.

Could you please rebase your patch for v2.0?
This is a combination of Ting-Yu's patch and my patch, rebased for 2.0.
Flags: needinfo?(khuey)
Comment on attachment 8467083 [details] [diff] [review]
Make mutex acquiring being managed and in ordered, v3

The approach is too risky for 2.0 branch.

Since the patch provides a new way to gate the threads that are recreated, it'd be better to revisit the original gating macros, including the VIP and non-VIP versions. Maybe we can remove the gating macros and just use the newer functions. But we need to do that in a separate bug.
Attachment #8467083 - Flags: feedback?(cyu) → feedback-
Comment on attachment 8466296 [details] [diff] [review]
Proposed fix

Review of attachment 8466296 [details] [diff] [review]:
-----------------------------------------------------------------

The comments added here provide a good clarification of this patch.
Attachment #8466296 - Flags: review?(tlee) → review+
We should let this bake on trunk for a couple days before uplifting.
https://hg.mozilla.org/mozilla-central/rev/a1c2de94aef4
Status: ASSIGNED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
Whiteboard: [caf priority: p2][CR 698221] → [caf priority: p2][CR 698221][NO_UPLIFT]
Nobody has yelled at me for blowing up the phone, so let's uplift.
Whiteboard: [caf priority: p2][CR 698221][NO_UPLIFT] → [caf priority: p2][CR 698221]
Comment on attachment 8467382 [details] [diff] [review]
Proposed patch for 2.0

NOTE: This flag is now for security issues only. Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Nuwa
User impact if declined: We might fail to start apps if we lose a race condition.
Testing completed: Baked on m-c for a day or two, I've been using it locally on my phone for 5 days with no issues.  About to go to 2.0 too.
Risk to taking this patch (and alternatives if risky): Medium risk, only because this is scary code.  I'm pretty confident in this patch.  The alternative is to ignore this and potentially deadlock when launching apps.
String or UUID changes made by this patch: None

We should take this on 1.4 and 1.3 for anyone who wants to do builds off of those branches to pick up this fix for a regression introduced in 1.3.
Attachment #8467382 - Flags: approval-mozilla-b2g30?
Attachment #8467382 - Flags: approval-mozilla-b2g28?
SPRD probably wants this for 1.4 and 1.3t.
Attachment #8467382 - Flags: approval-mozilla-b2g30?
Attachment #8467382 - Flags: approval-mozilla-b2g30+
Attachment #8467382 - Flags: approval-mozilla-b2g28?
Attachment #8467382 - Flags: approval-mozilla-b2g28+
You need to log in before you can comment on or make changes to this bug.