Closed
Bug 1119157
Opened 10 years ago
Closed 10 years ago
pthread_attr_getstack is sometimes incorrect on Firefox OS device worker threads, possibly due to nuwa TLS issue. Leads to intermittent LoadScript failure in workers "InternalError: too much recursion"
Categories: Core :: IPC, defect
People: Reporter: asuth, Assigned: cyu
Attachments
(3 files, 5 obsolete files)
- gdb backtrace from periodic sync with "break js_ReportOverRecursed" using self-made v2.2 trunk build (4.83 KB, text/plain)
- patch (8.38 KB)
- patch (9.71 KB; cyu: review+, bajaj: approval-mozilla-b2g34+, bajaj: approval-mozilla-b2g37+)
The FxOS email app is periodically experiencing worker thread startup problems that suggest some type of worker runtime initialization race and/or some kind of horrible nuwa problem. Since workers had some changes recently with the PBackground stuff, I'm sticking this in workers initially, but it's possible it might need to be turfed to nuwa or the JS engine or something at some point.
We are tracking the two reported variants of this problem on bug 1118711 and bug 1115039, although we'll probably end up just using bug 1118711.
== Short story ==
- js::frontend::Parser<js::frontend::FullParseHandler>::statement is calling JS_CHECK_RECURSION on a worker and is getting it wrong and exploding
- The stack for that thread per 'sp' and /proc/ID/smaps is:
b2501000-b2600000 rw-p 00000000 00:00 0 [stack:23042]
- Our sp at the time of a gdb breakpoint on js_ReportOverRecursed getting hit is 0xb25fe518 which is less than 7k into the stack. The stack seems coherent.
- The context->runtime_->nativeStackBase is 0xb3380000 and *ALL* of the 3 stack limits in *context->perThreadData and the manually computed (runtime_ + RuntimeMainThreadOffset) limit are 0xb3300001.
- The 0xb3380000 seems to correspond to this entry in /proc/ID/smaps:
b3360000-b3380000 rw-p 00000000 00:00 0 [anon:nuwa-thread-stack]
Size: 128 kB
Rss: 4 kB
Pss: 4 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 4 kB
Referenced: 4 kB
Anonymous: 4 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
- The thread limit value of 0xb3300001 is very suspicious on many levels; it seems to correspond to this smaps entry:
b3219000-b3317000 r--p 00000000 00:0b 5979 /dev/binder
== Interesting Stuff ==
- In https://bugzilla.mozilla.org/show_bug.cgi?id=1115039#c10 it was reported that this problem only happens on nightly user builds but not tinderbox or engineering builds. It's not clear what the sample size was there, so take things with a grain of salt.
- On bug 1118711 the same "too much recursion" problem seems to happen reliably for the marionette test automation, but I have no proof it's the worker exploding there due to a lack of logging output.
- On bug 1115039 this was reproduced by using mozAlarms and periodic sync. I wouldn't read too much into the mozAlarms thing on its own. The two largest factors that could explain the repro are that:
- We don't load most of the UI when started up headless by mozAlarms, so we will be spinning the worker up significantly earlier.
- Our periodic sync mechanism can be set via secret debug menu to have periodic sync run every 20 seconds. It is obviously a lot easier to reproduce something when it's running that often. (And when it fires, periodic sync breaks, so it's the last interesting thing from the email app that will happen in the logcat.) See https://wiki.mozilla.org/Gaia/Email/SecretDebugMode#Getting_to_the_Secret_Debug_Menu if you want super fast periodic sync.
Note that my gdb strategy was to use b2g-ps to locate the pre-allocated process, attach to it, set a breakpoint on js_ReportOverRecursed, and then continue. Email would then launch into that process and either succeed or have the bad thing happen. If it succeeded I would have to repeat the process. It might be smarter to put something inside js_ReportOverRecursed that causes b2g to dump core instead. Note that I also used the 60-second sync interval instead of 20 seconds so I had ample time to manually repeat things over and over.
Reporter
Comment 1•10 years ago
Er, and some additional potentially relevant notes:
- The reason this might be intermittent might just be because of randomized address space stuff. Specifically, the way the limit check works is just a straight up comparison between the stack pointer and the limit pointer (it's all in jsfriendapi.h and jspubtd.h). If the limit pointer is consistent gibberish that comes from NuWa, that could explain some stuff.
- We find out about this error at all through our window.onerror function we added.
- The worker is loading the "worker-bootstrap.js" that is built as part of the email app's build process. It's not tiny, but that shouldn't matter. srcBuf has a length_ of 993757.
- I short-changed the worker's actual stack 'smaps' entry earlier; here it is in full, also indicating our stack is fine and barely used:
b2501000-b2600000 rw-p 00000000 00:00 0 [stack:23042]
Size: 1020 kB
Rss: 8 kB
Pss: 8 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 8 kB
Referenced: 8 kB
Anonymous: 8 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
Reporter
Comment 2•10 years ago
Quick logcat instrumentation suggests this is not an obvious nuwa problem. CreateJSContextForWorker is calling JS_SetNativeStackQuota with the expected 512k stack limit. It is being called by WorkerThreadPrimaryRunnable::Run, which then calls DoRunLoop; WorkerPrivate::DoRunLoop processes the event, and that's when the sadness happens. I'll try to look further into the derived limit mismatch and all that.
Comment 3•10 years ago
Andrew, any progress on this? It sometimes causes failures in the gaia-ui tests when trying to launch the email app (bug 1118711).
Flags: needinfo?(bugmail)
Reporter
Comment 4•10 years ago
No progress, but some more progress soon! I put this on the back-burner for a bit while dealing with v2.1/v2.2 email blockers and code reviews, hoping that this problem would disappear in the natural course of platform cleanups. (There have been a lot of recent changes to workers, intermittent test failures being fixed, etc.) I've got some time again, and I also got a separate ping from :gerard-majax today who confirmed that he was also seeing this on a non-flame JB device, so I'll try digging into this again over the next few days. Of course, if fancy DOM worker people want to take over, I will not stop them!
Flags: needinfo?(bugmail)
Comment 5•10 years ago
This sounds like JS or NUWA... I don't think this is a worker issue.
Reporter
Updated•10 years ago
Component: DOM: Workers → JavaScript Engine
Reporter
Comment 6•10 years ago
I tracked this down further to pthread_attr_getstack lying to us. Looking at the bionic source code, it appears that the critical logic here should be the bionic TLS (disambiguating: thread-local storage). The smaps indicates that again we're pointed at "[anon:nuwa-thread-stack]" and it looks like nuwa does tons of messing with the TLS infrastructure, so I'm going to blame nuwa now.
This is occurring for me on my flame-kk v18d device, but :gerard-majax also indicated it happened to him on an Xperia ZR using JB.
=== log from bad run (lies!) ===
02-09 07:46:29.727 I/Werker ( 4989): GetNativeStackBaseImpl pthread_attr_getstack: our stack: 0xb27ffa3c got base: 0xb31e0000 got size: 131072
02-09 07:46:29.727 I/Werker ( 4989): Working initializing stack with size 524288
02-09 07:46:29.737 I/Werker ( 4989): SetNativeStackQuotaAndLimit-: runtime: b2833000 native base: b3200000 stackSize 524288 kind 0 stored at 0xb283308c
02-09 07:46:29.737 I/Werker ( 4989): SetNativeStackQuotaAndLimit-: runtime: b2833000 native base: b3200000 stackSize 524288 kind 1 stored at 0xb2833090
02-09 07:46:29.737 I/Werker ( 4989): SetNativeStackQuotaAndLimit-: runtime: b2833000 native base: b3200000 stackSize 524288 kind 2 stored at 0xb2833094
02-09 07:46:29.777 I/Werker ( 4989): SADNESS my stack: 0xb27fe51c check value 0xb3180001 at addr 0xb283308c
02-09 07:46:29.777 I/Werker ( 4989): SADNESS my stack: 0xb27fe51c check value 0xb3180001 at addr 0xb2833090
02-09 07:46:29.777 I/Werker ( 4989): SADNESS my stack: 0xb27fe51c check value 0xb3180001 at addr 0xb2833094
=== log from good run (no lies.) ===
02-09 07:46:07.363 I/Werker ( 4685): GetNativeStackBaseImpl pthread_attr_getstack: our stack: 0xb286fa3c got base: 0xb2770000 got size: 1048576
02-09 07:46:07.363 I/Werker ( 4685): Working initializing stack with size 524288
02-09 07:46:07.363 I/Werker ( 4685): SetNativeStackQuotaAndLimit-: runtime: b29ec000 native base: b2870000 stackSize 524288 kind 0 stored at 0xb29ec08c
02-09 07:46:07.363 I/Werker ( 4685): SetNativeStackQuotaAndLimit-: runtime: b29ec000 native base: b2870000 stackSize 524288 kind 1 stored at 0xb29ec090
02-09 07:46:07.363 I/Werker ( 4685): SetNativeStackQuotaAndLimit-: runtime: b29ec000 native base: b2870000 stackSize 524288 kind 2 stored at 0xb29ec094
=== the smaps entry for the memory range in question from the bad run ===
b31df000-b31e0000 ---p 00000000 00:00 0 [anon:nuwa-thread-stack]
Size: 4 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
Name: [anon:nuwa-thread-stack]
b31e0000-b3200000 rw-p 00000000 00:00 0 [anon:nuwa-thread-stack]
Size: 128 kB
Rss: 4 kB
Pss: 4 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 4 kB
Referenced: 4 kB
Anonymous: 4 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
Component: JavaScript Engine → IPC
Depends on: 846670
Summary: Intermittent LoadScript failure in workers due to apparently bugged stack check manifesting in "InternalError: too much recursion" error. → pthread_attr_getstack is sometimes incorrect on Firefox OS device worker threads, possibly due to nuwa TLS issue. Leads to intermittent LoadScript failure in workers "InternalError: too much recursion"
Reporter
Comment 7•10 years ago
Requesting blocking because it's bad if the probability of a worker actually starting up correctly is less than 100%. Also, scary thread-local-storage issues are scary. I presume the core operations of the phone (emergency calls) are safe, since all the really important stuff happens in the main process, which I assume doesn't get nuwa-fied.
blocking-b2g: --- → 2.2?
Reporter
Comment 8•10 years ago
Here's the logging I used. My repro method this time was to keep doing "stop b2g" "start b2g" with email set to a 20-second mozAlarms-based periodic sync to save me tapping on the device, although you can also tap on the device. (Note that email's periodic sync on trunk and then v2.2 will very shortly be based on the request-sync API, which limits sync to every 100 seconds, so you'll probably need to tap manually or want to keep your version of email prior to the landing of bug 1098289.)
Reporter
Comment 9•10 years ago
[Blocking Requested - why for this release]: Actually, per https://bugzilla.mozilla.org/show_bug.cgi?id=1126204#c4 this affects v2.1 too. (v2.0 is believed not to be affected.)
blocking-b2g: 2.2? → 2.1?
Comment 10•10 years ago
Running into this for gallery: https://bugzilla.mozilla.org/show_bug.cgi?id=1127117#c4
Comment 11•10 years ago
(In reply to Andrew Sutherland [:asuth] from comment #0)
> b2501000-b2600000 rw-p 00000000 00:00 0 [stack:23042]
FYI, this may not be entirely what it looks like. The way the [stack:…] annotation works is that the kernel iterates the thread group and, if it finds a thread whose last known SP is inside the VM area, reports that area as that thread's stack.
In particular, this doesn't prevent a "stack" from being merged with an adjacent VM area, if that would otherwise happen. There are some comments in SystemMemoryReporter.cpp about this.
> b3360000-b3380000 rw-p 00000000 00:00 0 [anon:nuwa-thread-stack]
This is due to an Android kernel patch which allows setting labels on anonymous memory; one of the useful effects is that they won't be merged with an adjacent anonymous VM area with a different (or no) label.
But... if this is being reported as [anon:…] instead of [stack:…] then that means there isn't a live thread's stack pointer there, I think?
Reporter
Comment 12•10 years ago
Super interesting stuff!
(In reply to Jed Davis [:jld] from comment #11)
> But... if this is being reported as [anon:…] instead of [stack:…] then that
> means there isn't a live thread's stack pointer there, I think?
This does seem to be the case. I just skimmed the kernel source from my build, and show_map_vma in task_mmu.c is indeed calling vm_is_stack which just walks all the tasks and looks for a thread with an active stack pointer in that range. It gets "[stack]"/"[stack:%d]" in that case. Failing that, it calls seq_print_vma_name which does the "[anon:" stuff. I suppose this means the real stack may also be annotated with the same annotation at https://dxr.mozilla.org/mozilla-central/source/mozglue/build/Nuwa.cpp#507, it's just not getting used.
And I'm pretty confident that the smaps retrieval was done with the process suspended with gdb attached at the js_ReportOverRecursed breakpoint so it's not like I captured smaps way too late. Which is not to say that another thread couldn't have been racing the creation of the worker thread and its JS engine initialization, just that it was already dead by the time the breakpoint was hit. Specifically "info threads" in comment 0 was indicating every other thread was being boring:
Id Target Id Frame
20 Thread 22953.23038 __futex_syscall3 () at bionic/libc/arch-arm/bionic/futex_arm.S:39
19 Thread 22953.23037 __futex_syscall3 () at bionic/libc/arch-arm/bionic/futex_arm.S:39
18 Thread 22953.23036 __futex_syscall3 () at bionic/libc/arch-arm/bionic/futex_arm.S:39
17 Thread 22953.22971 __ioctl () at bionic/libc/arch-arm/syscalls/__ioctl.S:9
16 Thread 22953.22970 __ioctl () at bionic/libc/arch-arm/syscalls/__ioctl.S:9
15 Thread 22953.22967 __futex_syscall3 () at bionic/libc/arch-arm/bionic/futex_arm.S:39
14 Thread 22953.22966 __futex_syscall3 () at bionic/libc/arch-arm/bionic/futex_arm.S:39
13 Thread 22953.22965 __futex_syscall3 () at bionic/libc/arch-arm/bionic/futex_arm.S:39
12 Thread 22953.22964 poll () at bionic/libc/arch-arm/syscalls/poll.S:9
11 Thread 22953.22963 __futex_syscall3 () at bionic/libc/arch-arm/bionic/futex_arm.S:39
10 Thread 22953.22962 poll () at bionic/libc/arch-arm/syscalls/poll.S:9
9 Thread 22953.22961 __futex_syscall3 () at bionic/libc/arch-arm/bionic/futex_arm.S:39
8 Thread 22953.22960 __futex_syscall3 () at bionic/libc/arch-arm/bionic/futex_arm.S:39
7 Thread 22953.22959 __futex_syscall3 () at bionic/libc/arch-arm/bionic/futex_arm.S:39
6 Thread 22953.22958 __futex_syscall3 () at bionic/libc/arch-arm/bionic/futex_arm.S:39
5 Thread 22953.22957 __futex_syscall3 () at bionic/libc/arch-arm/bionic/futex_arm.S:39
4 Thread 22953.22956 __futex_syscall3 () at bionic/libc/arch-arm/bionic/futex_arm.S:39
3 Thread 22953.22955 epoll_wait () at bionic/libc/arch-arm/syscalls/epoll_wait.S:10
* 2 Thread 22953.23042 js_ReportOverRecursed (cx=0xb30c2870) at ../../../gecko/js/src/jscntxt.cpp:433
1 Thread 22953.22953 epoll_wait () at bionic/libc/arch-arm/syscalls/epoll_wait.S:10
Comment 13•10 years ago
Also, Nuwa does some… interesting things here. If sAllThreads loses sync with the actual state somehow — especially if pid reuse and/or pthread_t reuse is involved — it looks plausible that strange things could happen, like pthread_self() in a post-forking thread returning the pthread_t for a pre-forking thread that has exited.
Comment 14•10 years ago
I found that Nuwa may reserve much less memory than the JSContext requires.
Nuwa reserves 128K for each thread, but the JSContext expects 512K (4 * 128 * 1024).
see:
- https://dxr.mozilla.org/mozilla-central/source/mozglue/build/Nuwa.cpp#111
- https://dxr.mozilla.org/mozilla-central/source/dom/workers/RuntimeService.cpp#95
Comment 15•10 years ago
Wow! And note that comment: it says "half the C stack size", which is defined here: https://dxr.mozilla.org/mozilla-central/source/dom/workers/WorkerThread.cpp#29
Reporter
Comment 16•10 years ago
(In reply to Thinker Li [:sinker] from comment #14)
> I found it could be Nuwa reserves much less memory than required by
> jscontext.
> Nuwa reserves 128K for each thread, but jscontext expects there is 512k (4 *
> 128 * 1024).
I think there's some confusion here. The actual worker stack is fine, it's a full megabyte in size and we're barely using any of it at the time we freak out. The problem is that pthread_attr_getstack is lying/wrong and so the stack-check is checking a stack pointer from stack A against the threshold from stack B. Since stack A in its entirety is numerically below stack B in the address space, there is no way for the stack check to succeed.
Specifically, in this line:
02-09 07:46:29.727 I/Werker ( 4989): GetNativeStackBaseImpl pthread_attr_getstack: our stack: 0xb27ffa3c got base: 0xb31e0000 got size: 131072
The value 0xb27ffa3c is a memory location on our stack (we declared a variable on the stack and took its address with &). But pthread_attr_getstack is returning a stackBase of 0xb31e0000 and a size of 0x20000. Because we're on a (JS_STACK_GROWTH_DIRECTION < 0) platform, js::GetNativeStackBaseImpl will add the two together, getting us 0xb3200000.
So the wrong stack pthread_attr_getstack is telling us about is:
(guard page, NuWa adds an extra page size for this)
b31df000-b31e0000 ---p 00000000 00:00 0 [anon:nuwa-thread-stack]
Size: 4 kB
Rss: 0 kB
(actual stack)
b31e0000-b3200000 rw-p 00000000 00:00 0 [anon:nuwa-thread-stack]
Size: 128 kB
Rss: 4 kB
Pss: 4 kB
And the correct stack it's not telling us about is:
(guard page, no one adds the guard size on to the stack size so it gets eaten from the stack)
b2700000-b2701000 ---p 00000000 00:00 0
Size: 4 kB
(actual stack)
b2701000-b2800000 rw-p 00000000 00:00 0 [stack:5061]
Size: 1020 kB
Rss: 12 kB
Pss: 12 kB
Reporter
Comment 17•10 years ago
Further elaboration on comments 14 and 15:
- The worker thread in this case is freshly created, and it appears that the real pthread_create is called. (This is why I had initially ruled out nuwa in comment 2 as an obvious cause.)
- Yes, NuWa seems to totally ignore the requested stack size of any threads created by NuWa and this seems like a serious bug, just not this bug. Specifically:
- __wrap_pthread_create completely ignores the "attr" argument which is the only place the caller indicates the desired stack size to pthread_create and friends.
- the thread NuWa invokes the passed-in start_routine on is the one it obviously creates with the NUWA_STACK_SIZE stack. There doesn't seem to be any complicated magic parallel stacks or anything. The only other real thread created is the cleaner thread which is definitely just doing cleanup stuff. The cleaner thread gets the default stack size though, which is ~1MiB on flame, which is ironic/amusing.
Reporter
Comment 18•10 years ago
(In reply to Jed Davis [:jld] from comment #13)
> Also, Nuwa does some… interesting things here. If sAllThreads loses sync
> with the actual state somehow — especially if pid reuse and/or pthread_t
> reuse is involved — it looks plausible that strange things could happen,
> like pthread_self() in a post-forking thread returning the pthread_t for a
> pre-forking thread that has exited.
Yeah, by inspection, I think we're looking at pthread_t reuse triggering this symptom when GetCurThreadInfo() tries to find pthread_self() for us (because !defined(HAVE_THREAD_TLS_KEYWORD)). Under the hood, the bionic pthreads runtime is just using calloc() to allocate pthread_t's, so especially if gecko is using a different allocator like jemalloc, it seems like reuse is potentially quite likely.
And the underlying problem is that thread_info_cleanup is never invoked in the child/forked/non-nuwa processes. The code uses pthread_cleanup_push(thread_cleanup), but that just links the function call into a linked list cleanup_stack on the origThreadID pthread_t. The actual running thread that terminates is the recreatedThreadID whose cleanup_stack linked list is empty. So there is never any cleanup.
This explains why we/I see the (non-guard) "anon:nuwa-thread-stack" entries in the smaps file at all. If things were working right, they should all be "[stack:NNNNN]" or no longer present because they were unmapped.
There are two main things that need to happen to fix this:
1) The existing thread_cleanup mechanism is unsound and needs to be addressed. It spins up a thread that does pthread_kill(*thread) to figure out when the thread dies so it knows it's safe to munmap the stack. But pthread_t is a block of memory that gets freed either by thread_join or by pthread_exit when the thread is closing down. So this, with high probability, relies on a use-after-free if it's going to work correctly.
The correct thing for NuWa to do here is to wrap pthread_join for nuwa-managed thread so it can run its own cleanup logic when pthread_join returns. Similarly, pthread_detach needs to be converted into a call to pthread_join.
2) Even once we stop doing use-after-free, by the time our pthread_join call returns, the pthread_t structure has been freed, which means some other thread could have already allocated a new thread resulting in a pthread_t collision again. So we need to mark the thread_info_t structure with a "dying" flag that will cause GetCurThreadInfo to ignore the entry. And we need to do this at a time after user code has finished running. We could use pthread_cleanup_push, but...
In terms of using pthread_cleanup_push in the post-fork recreated thread, the manpage for pthread_cleanup_push says that longjmp does undefined things if you call pthread_cleanup_push after filling a buffer with setjmp. So technically the reconstructed thread needs to just directly call thread_cleanup itself. Practically speaking for our bionic implementation, it's okay for us to do it because it really is just linking into a linked list.
...and it's also not clear that we really need the pthread_cleanup_* stuff anyways. It only matters if code on the thread would unexpectedly call pthread_exit, or if someone else would call pthread_cancel to kill the thread. I don't think either of these things happen in our codebase. We can certainly just directly trigger our cleanup code in _thread_create_startup on control-flow return rather than involving pthread_cleanup_push.
Assignee
Comment 19•10 years ago
I find this bug pretty reproducible on my nexus 5. When js_ReportOverRecursed() is called, its thread info content:
(gdb) p *$1
$4 = {
... (omitted)
origNativeThreadID = 820,
recreatedNativeThreadID = 1441,
nativeThreadName = "ImageDecoder #2"
}
which shows that the thread thinks it is the "ImageDecoder #2" thread. Thread with tid=820 in the Nuwa process was recreated as thread with tid=1441, but tid 1441 is no longer present in the email app. We need to check why the thread_info_t instance is not cleaned up.
Assignee: nobody → cyu
Assignee
Comment 20•10 years ago
(In reply to Cervantes Yu from comment #19)
> which shows that the thread thinks it the "ImageDecoder #2" thread. Thread
> with tid=820 in the Nuwa process is recreated as thread with tid=1441, but
> tid 1441 is no longer present in the email app. We need to check why the
> thread_info_t instance is not cleaned up.
thread_info_t is not cleaned up because we just don't clean it up in the forked processes. I agree with comment #18 that it might be a better solution to clean up thread_info_t in the pthread_join() wrapper to make the life cycle of thread_info_t as consistent as pthread_t's. We will also need to sync up the functions that destroy pthread_t with the functions that create it to avoid possible races.
Assignee
Comment 21•10 years ago
The strategy to fix the bug:
* We detach thread_info_t for threads recreated from Nuwa when the thread is about to exit (in detach_thread_info()) and put it in another list that is invisible to CUR_THREAD_INFO, which is used by the wrapper functions.
* Joinable threads must be joined, so we clean up the thread_info_t structure when __wrap_pthread_join() is called.
* Detached threads will be cleaned up when we are about to create new threads:
** Before we fork() a new process.
** Before we call REAL(pthread_create()).
* Note that we still use a loop to check whether the target thread is still alive. We do this because the bionic pthread implementation before Android L implements pthread_join() with pthread_cond_wait(). The joined thread can still touch the stack after it wakes up the joining thread, and catastrophe might result if the joining thread frees the thread stack. We send signal 0 to the joined thread to check whether it's alive. To prevent the race that might result in reuse of pthread_t, we send the signal using tgkill() instead of pthread_kill().
TODO:
* The timing of cleaning up thread_info_t for detached threads is still not good enough. The thread_info_t can stay in the sExitingThreads list for a long time without being cleaned up when the process doesn't create threads. Then we might hit a bug where EnsureThreadExited checks against a live thread that has reused the tid. I think we still need to create a detached thread to clean up an exiting detached thread. And if this happens in the Nuwa process, we need to temporarily block the fork request until we have finished reaping the exiting thread.
Attachment #8565363 - Flags: feedback?(khuey)
Attachment #8565363 - Flags: feedback?(bugmail)
Reporter
Comment 22•10 years ago
Comment on attachment 8565363 [details] [diff] [review]
Clean up thread info correctly (WIP)
Review of attachment 8565363 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks for aggressively pursuing a fix for this! Assuming your TODO plans for the detached thread cleanup are like what I mentioned in my review comment, I think we're on the road to having this safely fixed! Hooray! :)
::: mozglue/build/Nuwa.cpp
@@ +625,5 @@
> + pthread_mutex_unlock(&sThreadCountLock);
> +
> + for (thread_info_t *tinfo = exitingThreads.getFirst();
> + tinfo;
> + tinfo = tinfo->getNext()) {
(mooted by later comments, hopefully:) The use of a loop and getNext here seems to be suggesting that sExitingThreads.popFirst does something different than it actually does. As I read the implementation of popFirst, it returns the first element in the linked list after removing it from sExitingThreads. So you're just getting the first element, not the entire list. See https://dxr.mozilla.org/mozilla-central/source/mfbt/LinkedList.h#343
@@ +627,5 @@
> + for (thread_info_t *tinfo = exitingThreads.getFirst();
> + tinfo;
> + tinfo = tinfo->getNext()) {
> + int detachState = 0;
> + if (pthread_getattr_np(tinfo->recreatedThreadID, &tinfo->threadAttr)) {
(mooted by later comments, hopefully:) The use of recreatedThreadID here does not seem safe. Your TODO comment in comment 21 references this, but I would suggest as a style guideline including a scary comment like "// TODO: this is still unsafe and may result in a use-after-free!!!" even for in-progress patches.
@@ +632,5 @@
> + // We fail to get thread attribute. Just skip this thread.
> + continue;
> + }
> +
> + if (pthread_attr_getdetachstate(&tinfo->threadAttr, &detachState) ||
This is also definitely too late to be processing PTHREAD_CREATE_DETACHED.
It seems like the pseudo-code we want for all cleanup handlers is:
1) Call detach_thread_info (you have this already)
2) If the thread is/was originally PTHREAD_CREATE_DETACHED (which we would convert to PTHREAD_CREATE_JOINABLE during init) or pthread_detach has since been called on the thread, then spawn a new thread that calls pthread_join on the thread and then runs EnsureThreadExited(tinfo) and thread_info_cleanup(tinfo).
This of course requires adding code to the pthread_create wrapper to filter the attribute and to pthread_detach to add the attribute to indicate the caller meant to detach from the thread.
And then you can remove this method and the calls to it.
@@ +704,5 @@
> if (STACK_SENTINEL(reserved) != STACK_SENTINEL_VALUE(reserved)) {
> abort(); // Did not reserve enough stack space.
> }
>
> + tinfo = CUR_THREAD_INFO;
I know this is existing code, but "arg" already has the correct value for tinfo. It seems like it would make more sense to just cast "arg" like thread_recreate_startup does.
Attachment #8565363 - Flags: feedback?(bugmail) → feedback+
Comment 23•10 years ago
Andrew, you're approved for 2.1+ blocking. Could you please include risks/areas to test for QA as well as a 2.1 version of the patch?
blocking-b2g: 2.1? → 2.1+
Comment 24•10 years ago
Given that it's CNY I'm not planning to look at this until next week, since cyu@ won't be around to address comments until then.
Reporter
Comment 25•10 years ago
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #23)
> Andrew, you're approved for 2.1+ blocking. Could you please include
> risks/areas to test for QA as well as a 2.1 version of the patch?
The risk to any patch would be a regression in NuWa. Since all device and emulator tests use NuWa to prime the pre-allocated processes, the main things to look for would be the pre-allocated processes constantly dying and being restarted (check logcat or b2g-ps).
In terms of determining if the problem is fixed, the thing to do would be to verify that none of the bugs that this blocks reproduce after a fix. Email may be the easiest to reproduce on since it starts up its worker at startup. Although a human can notice after a few seconds that things are broken, just doing "adb logcat | grep 'too much recursion'" should show you if/when the problem is recurring. One would keep triggering email and then force-closing it. One would also want to stop b2g and start b2g again to make sure the nuwa process ends up in slightly different configurations.
Note that the fix isn't completed yet, although it is being worked on (apart from CNY). (Updating to ASSIGNED, accordingly.)
Status: NEW → ASSIGNED
Comment 26•10 years ago
FWIW I tested it and it fixes the issue for me in the FxOS Studio [1] (installing a theme was broken without this patch).
I haven't noticed any issue so far.
[1] https://github.com/fxos/studio
Assignee
Comment 27•10 years ago
Changes from v1:
* Use a detached thread to clean up thread_info_t for detached threads.
TODO:
Still need to address potential use-after-free of pthread_t.
Attachment #8565363 - Attachment is obsolete: true
Attachment #8565363 - Flags: feedback?(khuey)
Assignee
Comment 28•10 years ago
(In reply to Andrew Sutherland [:asuth] from comment #22)
> Comment on attachment 8565363 [details] [diff] [review]
> Clean up thread info correctly (WIP)
>
> Review of attachment 8565363 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> Thanks for aggressively pursuing a fix for this! Assuming your TODO plans
> for the detached thread cleanup are like what I mentioned in my review
> comment, I think we're on the road to having this safely fixed! Hooray! :)
>
> ::: mozglue/build/Nuwa.cpp
> @@ +625,5 @@
> > + pthread_mutex_unlock(&sThreadCountLock);
> > +
> > + for (thread_info_t *tinfo = exitingThreads.getFirst();
> > + tinfo;
> > + tinfo = tinfo->getNext()) {
>
> (mooted by later comments, hopefully:) The use of a loop and getNext here
> seems to be suggesting that sExitingThreads.popFirst does something
> different than it actually does. As I read the implementation of popFirst,
> it returns the first element in the linked list after removing it from
> sExitingThreads. So you're just getting the first element, not the entire
> list. See
> https://dxr.mozilla.org/mozilla-central/source/mfbt/LinkedList.h#343
>
It's the move constructor used in the wrong way :). The updated patch doesn't need to iterate over sExitingThreads.
> @@ +627,5 @@
> > + for (thread_info_t *tinfo = exitingThreads.getFirst();
> > + tinfo;
> > + tinfo = tinfo->getNext()) {
> > + int detachState = 0;
> > + if (pthread_getattr_np(tinfo->recreatedThreadID, &tinfo->threadAttr)) {
>
> (mooted by later comments, hopefully:) The use of recreatedThreadID here
> does not seem safe. Your TODO comment in comment 21 references this, but I
> would suggest as a style guideline including a scary comment like "// TODO:
> this is still unsafe and may result in a use-after-free!!!" even for
> in-progress patches.
>
I'll address this in the next revision.
>
> @@ +704,5 @@
> > if (STACK_SENTINEL(reserved) != STACK_SENTINEL_VALUE(reserved)) {
> > abort(); // Did not reserve enough stack space.
> > }
> >
> > + tinfo = CUR_THREAD_INFO;
>
> I know this is existing code, but "arg" already has the correct value for
> tinfo. It seems like it would make more sense to just cast "arg" like
> thread_recreate_startup does.
It's kind of tricky here. arg could be passed in register r0 on ARM, but there is a longjmp() in between, which is hard to convey to the compiler. To be safe I'd rather re-fetch tinfo after we longjmp() back here.
Assignee
Comment 29•10 years ago
I'll address the unaddressed issues in comment #22 in the next revision.
Reporter
Comment 30•10 years ago
(In reply to Cervantes Yu from comment #28)
> > I know this is existing code, but "arg" already has the correct value for
> > tinfo. It seems like it would make more sense to just cast "arg" like
> > thread_recreate_startup does.
>
> It's kind of tricky here. Arg could be passed using register r0 on ARM, but
> there is longjmp() in between which is hard to tell to the compiler. To be
> safe I'd rather reget tinfo after we longjmp() back here.
I don't think we need to worry about this. The ARM calling convention uses r0-r3 for function call arguments and r0 for the returned value. The call to setjmp() is a function call, so the caller needs to save their contents to other registers to stack. And looking at the source for _setjmp.S, it does "stmia r1, {r4-r14}". Since r0-r3 are call arguments and r15 is the program counter, this seems like proper state preservation in keeping with the definition of setjmp() and longjmp()'s behavior. (The MIPS _setjmp.S likewise appears to thoroughly save register state.)
If we've witnessed implementation bugs on certain platforms, I think it would make sense to include a comment there calling out the exact observed behaviour of the bug and ideally a bug number and explain why we're doing something like that. But given the complexity of the nuwa implementation already, I think it just makes things more confusing if we're second-guessing the parts of the compiler/runtime that are not known to be broken.
Reporter
Comment 31•10 years ago
(In reply to Andrew Sutherland [:asuth] from comment #30)
> The call to
> setjmp() is a function call, so the caller needs to save their contents to
> other registers to stack.
I of course meant: to other registers or to the stack.
Assignee
Comment 32•10 years ago
(In reply to Andrew Sutherland [:asuth] from comment #30)
> I don't think we need to worry about this. The ARM calling convention uses
> r0-r3 for function call arguments and r0 for the returned value. The call to
> setjmp() is a function call, so the caller needs to save their contents to
> other registers to stack. And looking at the source for _setjmp.S, it does
> "stmia r1, {r4-r14}". Since r0-r3 are call arguments and r15 is the
> program counter, this seems like proper state preservation in keeping with
> the definition of setjmp() and longjmp()'s behavior. (The MIPS _setjmp.S
> likewise appears to thoroughly save register state.)
Sorry, comment #28 is misleading. The main reason is that thread_create_startup() and thread_recreate_startup() use the same area of the malloc()'d stack. The argument of thread_create_startup() is (very likely to be) saved on the stack before calling _thread_create_startup(), and that slot can be clobbered by thread_recreate_startup() in the forked process. So when we return from _thread_create_startup(), we can't assume arg still contains the original value. The same applies to any local state created before _thread_create_startup().
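The general C rule backs this caution up: after longjmp() returns control to setjmp(), only volatile automatic variables that were modified in between have defined values (C99 7.13.2.1), and a stack-saved argument slot enjoys no protection at all. A minimal self-contained sketch (function names here are mine, not from the patch):

```c
#include <setjmp.h>

/* Illustrative sketch (not Nuwa code): non-volatile automatic variables
 * modified after setjmp() are indeterminate once longjmp() returns, which
 * is why re-reading tinfo via CUR_THREAD_INFO is safer than trusting the
 * old 'arg' slot on a possibly reused stack. */
static jmp_buf env;

static void jump_back(void) {
    longjmp(env, 1);            /* transfer control back to setjmp() below */
}

int setjmp_demo(void) {
    volatile int value = 1;     /* volatile => well-defined after longjmp */
    if (setjmp(env) == 0) {
        value = 2;              /* modified between setjmp() and longjmp() */
        jump_back();
    }
    return value;               /* 2: the volatile local keeps its last value */
}
```

Without the volatile qualifier, the compiler is free to keep `value` in a register that longjmp() restores to its pre-setjmp contents.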
BTW I found that it's not necessary to move the pthread_cleanup_push/pop() call so I'll leave this function as original.
Assignee
Comment 33•10 years ago
(In reply to Cervantes Yu from comment #32)
> BTW I found that it's not necessary to move the pthread_cleanup_push/pop()
> call so I'll leave this function as original.
Scratch this; the comment is incorrect.
Assignee
Comment 34•10 years ago
This revision should address the race and use-after-free issues.
We get the thread attributes right before the thread exits. For a detached thread, we create another detached thread to clean it up. The cleanup function waits for the exiting thread using a tgkill()/sched_yield() loop, then cleans up the thread_info_t instance for the exited thread.
Time measurements show that the sched_yield() loop takes < 1 ms (only several microseconds on a Nexus 5). I think it's because the loop is always called against a joinable thread.
For a joinable thread, we get the thread_info_t instance, pthread_join() the thread, wait for it to really exit, and then clean up the thread_info_t instance.
I think this should work for both detached and joinable threads, and we don't need to wrap pthread_detach().
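The wait described above can be sketched as follows (a hypothetical, Linux-only illustration; `wait_for_thread_exit` and `tgkill_wait_demo` are names of my own, not from the patch):

```c
#include <pthread.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Probe the kernel task with tgkill(tgid, tid, 0): signal 0 delivers
 * nothing and only checks existence; the syscall fails with ESRCH once
 * the task is really gone, at which point freeing the thread_info_t
 * is safe. */
static void wait_for_thread_exit(pid_t tid) {
    while (syscall(SYS_tgkill, getpid(), tid, 0) == 0) {
        sched_yield();          /* task still alive in the kernel; retry */
    }
}

static pid_t worker_tid;

static void *worker(void *arg) {
    (void)arg;
    worker_tid = (pid_t)syscall(SYS_gettid);  /* record our kernel tid */
    return NULL;
}

/* Returns 0 once the worker thread has fully exited in the kernel. */
int tgkill_wait_demo(void) {
    pthread_t t;
    if (pthread_create(&t, NULL, worker, NULL) != 0)
        return -1;
    pthread_join(t, NULL);      /* join also publishes worker_tid to us */
    wait_for_thread_exit(worker_tid);
    return 0;
}
```

The demo uses a joinable thread purely so the test can synchronize; the same loop works after a detached thread signals its exit.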
Attachment #8571326 -
Attachment is obsolete: true
Attachment #8572501 -
Flags: review?(khuey)
Attachment #8572501 -
Flags: review?(bugmail)
Assignee
Comment 35•10 years ago
This patch doesn't appear to pass the Nuwa tests.
Reporter
Comment 37•10 years ago
Comment on attachment 8572501 [details] [diff] [review]
bug_1119157.patch
I agree that the revised logic avoids the need to wrap pthread_detach. We are vulnerable to a thread-id reuse race, but it's hard to see how we can avoid that, and it certainly doesn't seem particularly probable. And thank you for adding the comment about the stack rationale in thread_create_startup.
I'm concerned about the use of pthread_cleanup_push and pthread_cleanup_pop for a few reasons. If we look at the bionic implementation on flame-kk, it works like this:
=== bionic/libc/include/pthread.h
/* Believe or not, the definitions of pthread_cleanup_push and
* pthread_cleanup_pop below are correct. Posix states that these
* can be implemented as macros that might introduce opening and
* closing braces, and that using setjmp/longjmp/return/break/continue
* between them results in undefined behaviour.
*
* And indeed, GLibc and other C libraries use a similar definition
*/
#define pthread_cleanup_push(routine, arg) \
do { \
__pthread_cleanup_t __cleanup; \
__pthread_cleanup_push( &__cleanup, (routine), (arg) ); \
#define pthread_cleanup_pop(execute) \
__pthread_cleanup_pop( &__cleanup, (execute)); \
} while (0);
=== bionic/libc/bionic/pthread.c
/* CAVEAT: our implementation of pthread_cleanup_push/pop doesn't support C++ exceptions
* and thread cancelation
*/
void __pthread_cleanup_push( __pthread_cleanup_t* c,
__pthread_cleanup_func_t routine,
void* arg )
{
pthread_internal_t* thread = __get_thread();
c->__cleanup_routine = routine;
c->__cleanup_arg = arg;
c->__cleanup_prev = thread->cleanup_stack;
thread->cleanup_stack = c;
}
void __pthread_cleanup_pop( __pthread_cleanup_t* c, int execute )
{
pthread_internal_t* thread = __get_thread();
thread->cleanup_stack = c->__cleanup_prev;
if (execute)
c->__cleanup_routine(c->__cleanup_arg);
}
===
Specific concerns:
- Since we're concerned about the structure of the stack frame of thread_create_startup, if we use pthread_cleanup_push, it's probably better if it happens in a guaranteed non-overlapping stack frame, such as _thread_create_startup.
- We still have the problem that in post-fork nuwa, pthread_cleanup_pop is thinking about a different pthread_t. Under the flame-kk implementation, the entire cleanup mechanism will not be operational until a call to pthread_cleanup_pop re-establishes the linked-list chain. Under a different implementation, the result could be an infinite loop or a failure to run the cleanup handler.
pthread_cleanup_push/pop is effectively broken by NuWa. For it to work right, I think NuWa would need/want to wrap it (and its macros) based on its current strategy of wrapping. I'm not sure this is really worth it, since at least in mozilla-central per dxr I see no calls to pthread_cancel or pthread_exit. I do see a call to pthread_cleanup_push in GonkHal.cpp (https://dxr.mozilla.org/mozilla-central/source/hal/gonk/GonkHal.cpp#1016) however, which may or may not be concerning. Of course, there may be things linked into gecko on NuWa platforms where this matters.
For simplicity/sanity, I think it might make more sense to simply directly call invalidate_thread_info instead of using pthread_cleanup_push/pop.
(And I'm also concerned about the test failures too, of course :)
Attachment #8572501 -
Flags: ui-review+
Attachment #8572501 -
Flags: review?(bugmail)
Attachment #8572501 -
Flags: review-
Reporter
Updated•10 years ago
Attachment #8572501 -
Flags: ui-review+ → feedback+
Assignee
Comment 38•10 years ago
(In reply to Andrew Sutherland [:asuth] from comment #37)
> Comment on attachment 8572501 [details] [diff] [review]
> bug_1119157.patch
>
> Specific concerns:
>
> - Since we're concerned about the structure of the stack frame of
> thread_create_startup, if we use pthread_cleanup_push, it's probably better
> if it happens in a guaranteed non-overlapping stack frame, such as
> _thread_create_startup.
>
We need to move invalidate_thread_info to thread_create_startup(), or CUR_THREAD_INFO will return nullptr and the content process will then crash when using it to longjmp():
http://hg.mozilla.org/mozilla-central/file/0351fbdeb37c/mozglue/build/Nuwa.cpp#l675
> - We still have the problem that in post-fork nuwa, pthread_cleanup_pop is
> thinking about a different pthread_t. Under thie flame-kk implementation,
> the entire cleanup mechanism will not be operational until a call to
> pthread_cleanup_pop re-establishes the linked list chain. Under a different
> implementation, the result could be an infinite loop or a failure to run the
> cleanup handler.
>
In bionic's implementation, this happens to work because the stack of pthread cleanup handlers is maintained by chaining the __pthread_cleanup_t instances; pthread_internal_t only keeps a pointer to the stack top. In the content process, we create a new thread instance, which loses the link to the stack top, so pthread_exit() will fail to run the cleanup handlers. But the cleanup variables are still chained, so when we call pthread_cleanup_pop() it should still run the cleanup handler (if the __pthread_cleanup_t instance is not corrupted).
> pthread_cleanup_push/pop is effectively broken by NuWa. For it to work
> right, I think NuWa would need/want to wrap it (and its macros) based on its
> current strategy of wrapping. I'm not sure this is really worth it, since
> at least in mozilla-central per dxr I see no calls to pthread_cancel or
> pthread_exit. I do see a call to pthread_cleanup_push in GonkHal.cpp
> (https://dxr.mozilla.org/mozilla-central/source/hal/gonk/GonkHal.cpp#1016)
This runs in the chrome process, so I don't think it is a problem.
> however, which may or may not be concerning. Of course, there may be things
> linked into gecko on NuWa platforms where this matters.
>
> For simplicity/sanity, I think it might make more sense to simply directly
> call invalidate_thread_info instead of using pthread_cleanup_push/pop.
>
Agreed. We don't expect any thread to do pthread_exit() or pthread_cancel() in gecko, and bionic's implementation isn't exception-safe. I will just call the cleanup handler directly to make it simpler.
> (And I'm also concerned about the test failures too, of course :)
I will look into it.
Assignee
Comment 39•10 years ago
Let's see if it passes the tests.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1d52b0ec46d4
Attachment #8572501 -
Attachment is obsolete: true
Attachment #8572501 -
Flags: review?(khuey)
Reporter
Comment 40•10 years ago
(In reply to Cervantes Yu from comment #38)
> In bionic's implementation, this happens to work because the stack of
> pthread cleanup handlers are maintained in chaining the __pthread_cleanup_t
> instances. pthread_internal_t only maintains a pointer to the stack top. In
> the content process, we create a new thread instance, which will lose the
> link to the stack top, which will make pthread_exit() fail to run the
> cleanup handlers. But the cleanup variables are still chained so when we
> call pthread_cleanup_pop() it should still run the cleanup handler (if the
> __pthread_cleanup_t instance is not corrupted).
I agree with this analysis; things do end up working for our very relaxed needs, although they end up basically just being function calls in a post-fork NuWa. My concern was mainly that the design decisions for NuWa seem to be based around wrapping pthreads rather than assuming a specific internal implementation and monkeypatching things. Since the pthreads implementation is allowed an incredible amount of latitude by the spec, if we care about pthread_cleanup_push/pop, it would make sense to also wrap them.
(I don't think it's worth the effort, of course, and I think we're agreed here too :)
Assignee
Comment 41•10 years ago
(In reply to Cervantes Yu from comment #39)
> Created attachment 8575242 [details] [diff] [review]
> The updated patch.
>
> Let's see if it passes the tests.
>
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=1d52b0ec46d4
Still failed. I am trying to reproduce the failure locally, but so far it passes locally.
Comment 42•10 years ago
If it would help to have another test case to reproduce this "too much recursion" Worker error, I've got a work in progress gallery patch here: https://github.com/davidflanagan/gaia/tree/gallery-remove-webgl/apps/gallery
This branch modifies the Gallery app to do image editing with JS in a worker rather than using webgl. This bug occurs intermittently. It is easy to reproduce if I change this line in apps/gallery/js/ImageEditor.js:
```
const NUM_THREADS = 2;
```
to
```
const NUM_THREADS = 16;
```
Then, when I edit an image and save it, I often see the too-much-recursion error in the console output (and the save fails and the gallery editor gets messed up).
Comment 43•10 years ago
Actually, my wip patch that demonstrates this bug is more easily available via this PR instead of that github branch: https://github.com/mozilla-b2g/gaia/pull/26866
Assignee
Comment 45•10 years ago
I am finally able to reproduce the crash in mochitest locally. I will look into it.
Assignee
Comment 46•10 years ago
Conclusion first: we probably shouldn't try to free the thread stack in the forked process (as Nuwa originally did).
The call stack of the crash (which matches the decoded one on the try server):
Program received signal SIGSEGV, Segmentation fault.
pthread_key_delete (key=14) at bionic/libc/bionic/pthread.c:1731
1731 thr->tls[key] = NULL;
(gdb) bt
#0 pthread_key_delete (key=14) at bionic/libc/bionic/pthread.c:1731
#1 0x4002bbc0 in __wrap_pthread_key_delete (key=14) at /mnt/SSD/data/hg/mozilla-central/mozglue/build/Nuwa.cpp:818
#2 0x40d4d620 in base::ThreadLocalPlatform::FreeSlot (slot=@0x431c783c: 14) at /mnt/SSD/data/hg/mozilla-central/ipc/chromium/src/base/thread_local_posix.cc:21
#3 0x40d506d2 in base::ThreadLocalPointer<MessageLoop>::~ThreadLocalPointer (this=0x431c783c <_ZZL11get_tls_ptrvE7tls_ptr>, __in_chrg=<optimized out>) at /mnt/SSD/data/hg/mozilla-central/ipc/chromium/src/base/thread_local.h:81
#4 0x40096be4 in __cxa_finalize (dso=0x0) at bionic/libc/stdlib/atexit.c:158
#5 0x40096f80 in exit (status=0) at bionic/libc/stdlib/exit.c:57
#6 0x40096f80 in exit (status=0) at bionic/libc/stdlib/exit.c:57
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
And we are accessing a pthread_internal_t instance inherited from the Nuwa process. The crashing process has pid=1037, but the pthread_internal_t instance is for tid=1036:
(gdb) p thr->tls
$5 = (void **) 0x44960f00
(gdb) p thr->tls[14]
Cannot access memory at address 0x44960f38
(gdb) p getpid()
$6 = 1037
(gdb) p *thr
$4 = {
next = 0x2d6e8,
pref = 0x2e7d0,
attr = {
flags = 2,
stack_base = 0x44941000,
stack_size = 131072,
guard_size = 4096,
sched_policy = 0,
sched_priority = 0
},
kernel_id = 1036,
join_cond = {
value = 0
},
join_count = 0,
return_value = 0x0,
intern = 1,
cleanup_stack = 0x0,
tls = 0x44960f00
}
(gdb)
tid 1036 is the ImageDecoder thread in the Nuwa process:
APPLICATION SEC USER PID PPID VSIZE RSS WCHAN PC NAME
b2g 0 root 631 627 168568 90188 ffffffff 400855a8 S /system/b2g/b2g
(Nuwa) 0 root 638 631 79228 34140 ffffffff 400855a8 S /system/b2g/b2g
Mochitest 2 app_880 880 631 88684 49688 ffffffff 40084c70 S /system/b2g/plugin-container
(Preallocated a 2 app_1037 1037 638 75612 26024 ffffffff 00010078 T /system/b2g/b2g
(Nuwa) 0 root 1091 631 78328 39968 ffffffff 400855a8 S /system/b2g/plugin-container
(Preallocated a 0 app_1189 1189 1091 74456 24920 ffffffff 40084c70 S /system/b2g/plugin-container
1|root@android:/proc/638/task/1036 # cat stat
1036 (ImageDecoder #1) S 631 627 627 34819 627 4194368 30 0 0 0 1 3 0 0 20 0 15 0 29008 81129472 8535 4294967295 32768 172752 3203586672 1150684040 1074288040 0 0 69634 36072 3221596976 0 0 -1 0 0 0 0 0 0
The TLS is implemented using the stack space. After the thread exits, the stack is freed, but the atexit() handler registered in the Nuwa process doesn't know that. So when it runs the exit handler and touches the freed memory area, it crashes. That's why we didn't see it in release builds.
I think we might leave a comment there saying we intentionally let the thread stack leak (at least in the debug build) if there is no good solution to this.
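A simplified model of the failure mode (my approximation of bionic's behavior as seen in the backtrace above, not its actual source; all names prefixed `fake_` are mine):

```c
#include <stddef.h>

/* pthread_key_delete() in bionic clears the key's slot in EVERY thread on
 * a global thread list.  After fork(), the list still names Nuwa-process
 * threads whose stacks -- which hold the tls array -- have been unmapped,
 * so the write below becomes a use-after-free and SIGSEGVs, matching the
 * "Cannot access memory at address ..." gdb session above. */
#define FAKE_MAX_KEYS 64

typedef struct fake_thread {
    struct fake_thread *next;
    void **tls;                 /* in bionic this points into the stack */
} fake_thread;

static fake_thread *g_fake_thread_list;   /* inherited across fork() */

static void fake_key_delete(int key) {
    for (fake_thread *thr = g_fake_thread_list; thr; thr = thr->next)
        thr->tls[key] = NULL;   /* faults if thr's stack was unmapped */
}

/* Returns 1 when the slot was cleared (all stacks still mapped). */
int key_delete_demo(void) {
    static void *slots[FAKE_MAX_KEYS];
    fake_thread live = { NULL, slots };
    g_fake_thread_list = &live;
    slots[14] = &live;          /* key 14, as in the crashing backtrace */
    fake_key_delete(14);
    return slots[14] == NULL;
}
```

The model only shows the mapped-stack case; the crash happens precisely when one listed thread's tls pointer refers to a freed stack.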
Assignee
Comment 47•10 years ago
This should be green: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bca695060392
Assignee
Comment 48•10 years ago
Changes to the previous revision:
* Fix the crash in mochitest-11: for debug build, don't free the thread stack in the forked process.
Attachment #8575242 -
Attachment is obsolete: true
Attachment #8577975 -
Flags: review?(khuey)
Attachment #8577975 -
Flags: review?(bugmail)
Reporter
Comment 49•10 years ago
Comment on attachment 8577975 [details] [diff] [review]
Clean up thread info correctly.
Hm, yeah, bionic's gThreadList is a real Achilles' heel of the NuWa implementation since it will remember all of the live threads at the time of forking and never forget them.
Note that while the crash may only be reported for DEBUG builds:
- pthread_key_delete is fundamentally unsafe in post-fork NuWa processes where any threads were active at the time of the fork. The good news is code would have to be pretty wacky to ever call this.
- Unless we've changed the non-debug runtime so it doesn't actually run static destructors, then message_loop.cc's ThreadLocalPointer<MessageLoop> will always exist and be destroyed at shutdown. It seems like this will result in b2g processes always crashing.
This means that the only courses of action that are safe are to either:
- Have pthread_key_delete be a no-op in post-fork processes.
- Always leave the TLS portion of the created-in-the-prefork-process stack around. It should be safe to unmap the rest of the stack.
It's arguably safer to leak a single page per pre-fork nuwa thread than to leak TLS keys/slots, even though use of pthread_key_delete isn't something that should actually happen dynamically.
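The first option above can be sketched in a few lines (a hypothetical illustration only: `sIsNuwaForkedProcess` is an assumed flag, and the direct libc call stands in for Nuwa's actual REAL() dispatch):

```c
#include <pthread.h>

/* Sketch of option 1: make pthread_key_delete a no-op in a post-fork
 * Nuwa process.  Deliberately leaking the key slot avoids walking
 * bionic's inherited thread list, which still names pre-fork threads
 * whose stacks are gone. */
static int sIsNuwaForkedProcess;    /* set in the child after fork() */

int wrap_pthread_key_delete(pthread_key_t key) {
    if (sIsNuwaForkedProcess) {
        return 0;                   /* pretend success; leak the key */
    }
    return pthread_key_delete(key);
}

/* Returns 1 if the key survives a post-fork "delete" (value intact). */
int keydelete_noop_demo(void) {
    pthread_key_t key;
    static int value = 42;
    if (pthread_key_create(&key, NULL) != 0)
        return -1;
    pthread_setspecific(key, &value);
    sIsNuwaForkedProcess = 1;
    wrap_pthread_key_delete(key);   /* no-op: key stays valid */
    return pthread_getspecific(key) == &value;
}
```

The trade-off is exactly the one named above: a leaked key slot per deletion versus a potential use-after-free on every deletion.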
Based on the number of situations like this for NuWa it seems a lot like if there is ever effort to fix things, it could make more sense to just have NuWa assume specific pthreads implementations rather than wrapping everything required for true correctness (which is not currently done). Then NuWa could be mainly coordinating the fork point and resurrecting the existing pthread_t threads. (Or at least have simpler/safer hacks.)
r=asuth with one of those solutions above. Note that a spin-off bug still is needed for the comment 14 issue. It could also be nice to have a meta-ish bug that notes the limitations of the current implementation and the trade-offs to refactoring to be pthread-implementation-specific.
Attachment #8577975 -
Flags: review?(bugmail) → review+
Assignee
Comment 50•10 years ago
(In reply to Andrew Sutherland [:asuth] from comment #49)
> Comment on attachment 8577975 [details] [diff] [review]
> Clean up thread info correctly.
>
> Hm, yeah, bionic's gThreadList is a real achilles heal of the NuWa
> implementation since it will remember all of the live threads at the time of
> forking and never forget them.
>
Yes. In comparison, glibc registers an atfork handler to clean up auto-allocated thread stacks. It doesn't have gThreadList (or g_thread_list in L's bionic implementation) inherited in the forked process.
> Note that while the crash may only be reported for DEBUG builds:
> - pthread_key_destroy is fundamentally unsafe in post-fork NuWa processes
> where any threads were active at the time of the fork. The good news is
> code would have to be pretty wacky to ever call this.
> - Unless we've changed the non-debug runtime so it doesn't actually run
> static destructors, then message_loop.cc's ThreadLocalPointer<MessageLoop>
> will always exist and be destroyed at shutdown. It seems like this will
> result in b2g processes always crashing.
>
In release builds, we call _exit() to close the process, and it doesn't look like we are going to change that soon. But I agree that we need to make it robust against an exit() call.
> This means that the only courses of action that are safe are to either:
> - Have pthread_key_delete be a no-op in post-fork processes.
> - Always leave the TLS portion of the created-in-the-prefork-process stack
> around. It should be safe to unmap the rest of the stack.
>
> It's arguably safer to leak a single page per-pre-fork-nuwa-thread than to
> leak TLS keys/slots, even though use of pthread_key_destroy isn't something
> that should actually happen dynamically.
>
Unmapping everything except the TLS page would tie us too closely to the implementation (although the current Nuwa implementation is already like that). We don't want to worsen the situation. I'll take the pthread_key_delete() path.
Thinker has another (really hacky :)) idea: before fork(), the Nuwa process longjmp()s to thread_create_startup() and returns from it for each thread. We then pthread_join() the thread to let pthread clean up the pthread_t instance, leaving only the thread stack to be inherited by the child process. Then we recreate the threads with those stacks.
The good part is that this lets the pthread implementation itself clean up the pthread_t instances. But I am not sure it's worth spending so much effort to work around bionic's lack of cleanup at fork.
> Based on the number of situations like this for NuWa it seems a lot like if
> there is ever effort to fix things, it could make more sense to just have
> NuWa assume specific pthreads implementations rather than wrapping
> everything required for true correctness (which is not currently done).
> Then NuWa could be mainly coordinating the fork point and resurrecting the
> existing pthread_t threads. (Or at least have simpler/safer hacks.)
>
A sane way to do this is to place a copy of pthread implementation inside gecko, just like we do to jemalloc and chromium IPC code, but this makes room for divergence from upstream. Isn't that overkill?
> r=asuth with one of those solutions above. Note that a spin-off bug still
> is needed for the comment 14 issue. It could also be nice to have a
> meta-ish bug that notes the limitations of the current implementation and
> the trade-offs to refactoring to be pthread-implementation-specific.
Reporter
Comment 51•10 years ago
(In reply to Cervantes Yu from comment #50)
> A sane way to do this is to place a copy of pthread implementation inside
> gecko, just like we do to jemalloc and chromium IPC code, but this makes
> room for divergence from upstream. Isn't that overkill?
I think the cost/benefit analysis is a call for you and the other maintainers of NuWa to make :) I think the ideal thing would be if we could talk with the bionic developers about our use-case and do as much work in upstream as possible so that we could greatly reduce what NuWa has to do in our codebase. But unless we keep finding bugs in NuWa or it results in significant engineering difficulties for the rest of gecko, it's probably not worth overhauling anything.
Assignee
Comment 52•10 years ago
Attachment #8577975 -
Attachment is obsolete: true
Attachment #8577975 -
Flags: review?(khuey)
Assignee
Updated•10 years ago
Attachment #8579176 -
Flags: review+
Assignee
Comment 53•10 years ago
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
status-firefox39:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
Assignee
Comment 55•10 years ago
Comment on attachment 8579176 [details] [diff] [review]
Clean up thread info correctly. r=asuth
NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.
[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 999323. We tried to perform cleanup actions in the Nuwa-forked child processes (which is necessary), but didn't do it correctly and left room for races with the threading constructs we use.
User impact if declined: threading library calls in the process behave incorrectly. One user-observable symptom is that apps using JS workers (e.g. Email, Keyboard) fail to function properly.
Testing completed: Manually on a device and automatic regression tests on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8789ebb51c49
Risk to taking this patch (and alternatives if risky): Low to medium. This patch fixes a race condition in the cleanup actions we perform for processes/threads cloned from the Nuwa process. If it contains unforeseen side effects, threads cloned from Nuwa could exhibit other kinds of strange behavior.
Attachment #8579176 -
Flags: approval-mozilla-b2g34?
Reporter
Updated•10 years ago
Comment 56•10 years ago
Just a side note to add: for the past couple days, sometimes when I encounter the Too Much Recursion issue with Keyboard app, the Keyboard app actually crashed (and I'm greeted with the "Send report?" prompt).
Reporter
Comment 57•10 years ago
Comment on attachment 8579176 [details] [diff] [review]
Clean up thread info correctly. r=asuth
(we want this on v2.2 too, see the above request for v2.1/b2g34)
Attachment #8579176 -
Flags: approval-mozilla-b2g37?
Reporter
Comment 58•10 years ago
(In reply to John Lu [:mnjul] [MoCoTPE] from comment #56)
> Just a side note to add: for the past couple days, sometimes when I
> encounter the Too Much Recursion issue with Keyboard app, the Keyboard app
> actually crashed (and I'm greeted with the "Send report?" prompt).
Can you clarify if these builds had the fix that landed in mozilla-central in comment 54?
Comment 59•10 years ago
(In reply to Andrew Sutherland [:asuth] from comment #58)
> (In reply to John Lu [:mnjul] [MoCoTPE] from comment #56)
> > Just a side note to add: for the past couple days, sometimes when I
> > encounter the Too Much Recursion issue with Keyboard app, the Keyboard app
> > actually crashed (and I'm greeted with the "Send report?" prompt).
>
> Can you clarify if these builds had the fix that landed in mozilla-central
> in comment 54?
The crashing builds do *not* have the fix; they're from a couple of days ago (and the fix only landed on m-c less than 24 hours ago).
I have *not* encountered the Too Much Recursion issue or Keyboard app crashes with m-c builds containing the fix. :)
Assignee
Comment 60•10 years ago
NI :bajaj for uplift approval request.
Bhavana, this bug is being reported in several other bugs. May we have your review for the uplift? Thanks.
Flags: needinfo?(bbajaj)
Comment 61•10 years ago
Comment on attachment 8579176 [details] [diff] [review]
Clean up thread info correctly. r=asuth
Apologies on the delay, I was trying to sync with :asuth, to confirm the issue is resolved and he's verified it ;) I missed him on irc, so NI'ing him here and approving in parallel.
Flags: needinfo?(bbajaj)
Attachment #8579176 -
Flags: approval-mozilla-b2g37?
Attachment #8579176 -
Flags: approval-mozilla-b2g37+
Attachment #8579176 -
Flags: approval-mozilla-b2g34?
Attachment #8579176 -
Flags: approval-mozilla-b2g34+
Updated•10 years ago
Flags: needinfo?(bugmail)
Reporter
Comment 62•10 years ago
(In reply to bhavana bajaj [:bajaj] from comment #61)
> Apologies on the delay, I was trying to sync with :asuth, to confirm the
> issue is resolved and he's verified it ;) I missed him on irc, so NI'ing
> him here and approving in parallel.
My on-device testing is unable to reproduce the problem after a number of runs and restarts of b2g in a post-fix build. Given :mnjul's feedback in comment 59 and no new bugs being filed on this, I think we should indeed be good to uplift.
Flags: needinfo?(bugmail)
Comment 63•10 years ago
https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/a147aac92cb6
https://hg.mozilla.org/releases/mozilla-b2g34_v2_1/rev/a8cdbafef664
status-firefox37:
--- → wontfix
status-firefox38:
--- → wontfix
Comment 64•10 years ago
status-b2g-v2.1S:
--- → fixed
Comment 65•10 years ago
Backed out from v2.2 for test_NuwaProcessDeadlock.html crashes. Interestingly, v2.1 appears to be OK.
https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/7a9f2a248e57
https://treeherder.mozilla.org/logviewer.html#?job_id=80836&repo=mozilla-b2g37_v2_2
Assignee
Comment 66•10 years ago
Interesting, it's still pthread_key_delete() crashing when the process exits, but in another test case (dom/test/json/test_json.html). I'll check why it still crashes.
Assignee
Comment 67•10 years ago
It turns out that we missed the dependent patch https://hg.mozilla.org/mozilla-central/rev/30cc8c82eb21 needed to make the rebase of __wrap_pthread_key_delete() apply correctly. 2.1 also has the problem and needs to be fixed, too.
Comment 68•10 years ago
Try confirms green with bug 1121269 included :)