Closed Bug 335018 Opened 18 years ago Closed 18 years ago

valgrind warnings due to ordering dependency of thread local storage destructors [@ ntdll.dll] [@ js_SearchScope]

Categories

(Core :: JavaScript Engine, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: dbaron, Assigned: feng.qian.moz)

References

Details

(4 keywords)

Crash Data

Attachments

(3 files, 1 obsolete file)

Bug seen in:  Firefox trunk build, within past few days

Steps to reproduce:
 1. install an extension that needs upgrading, e.g., version 1.0 of https://addons.mozill a.org/firefox/2407/history/
 2. exit Firefox
 3. start Firefox
 4. Tools -> Extensions
 5. "Find Updates"
 6. "Update Now" on that extension

I get the attached valgrind warnings.  This seems to me to be showing that there's an unenforced ordering dependency between the thread local storage destructors in JavaScript and XPConnect: the XPConnect one needs to run first, but it is being run second.
With my patch for bug 344277 applied, both trunk and branch crash after the microsummary service handles a bad cert error and then the user quits the app.  Here are my valgrind warnings, which dbaron says demonstrate the same problem as reported in this bug.
A Firefox 2 blocker depends on this, so it seems like it should block 1.8.1.
Flags: blocking1.8.1?
(It also probably means that XPInstall is much less stable since this bug got pulled in with the JS 1.7 landing.)
Need to get traction on this ASAP.  Feng/Brendan/Blake, what's the best path forward here?
Flags: blocking1.8.1? → blocking1.8.1+
(In reply to comment #4)
> (It also probably means that XPInstall is much less stable since this bug got
> pulled in with the JS 1.7 landing.)
> 

This bug was introduced by fix of bug 312238. I take it. 
Status: NEW → ASSIGNED
Flags: blocking1.8.1+ → blocking1.8.1?
Assignee: general → feng.qian.moz
Status: ASSIGNED → NEW
Flags: blocking1.8.1? → blocking1.8.1+
I was wondering if we couldn't just make XPConnect eagerly allocate a TPI instead of doing it lazily -- and allocate eagerly before it calls JS_NewRuntime.  If NSPR runs the destructors in FIFO fashion, that would fix it.

/be
(In reply to comment #7)
> I was wondering if we couldn't just make XPConnect eagerly allocate a TPI
> instead of doing it lazily -- and allocate eagerly before it calls
> JS_NewRuntime.  If NSPR runs the destructors in FIFO fashion, that would fix
> it.

Brenden, this would rely on NSPR runs the destructors in FIFO fashion. I couldn't find such guarantee from NSPR documentation.

Another fix I propose is to link all allocated JSThread blocks and free them in JS_DestoryRuntime, at that point, all context should be freed. This will need a lock when a JSThread object is created and to be linked in the list.

> /be
> 

Flags: blocking1.8.1+ → blocking1.8.1?
NSPR is old and quite API-stable.  I hope Wan-Teh can say something here about guaranteeing FIFO thread private index allocation/destruction.

KISS really wants a small fix here that relies on well-ordered destruction, IMHO.

If we can't count on this, another thought I had that would not require a runtime lock would be to add JS API that XPConnect can use to piggy-back its thread-local storage on top of JSThread.  XPConnect would stop using PR_NewThreadPrivateIndex, etc., and instead use JS_NewThreadPrivateIndex, etc., parameterized by runtime and allocating from a small table in JSThread.

/be
(In reply to comment #9)
> thread-local storage on top of JSThread.  XPConnect would stop using
> PR_NewThreadPrivateIndex, etc., and instead use JS_NewThreadPrivateIndex, etc.,
> parameterized by runtime and allocating from a small table in JSThread.

Oops, I meant "parameterized by context", from which the API implementation can get cx->thread.

/be
Flags: blocking1.8.1? → blocking1.8.1+
Attached patch patch (v1) (obsolete) — Splinter Review
This patch eliminate the usage of deconstructor callbacks in SpideMonkey. All JSThread structs are linked in JSRuntime. When JSRuntime is destroyed, JSThread's are freed. It assumes all contexts are freed before JSRuntime is freed, so that none usage of JSThread after that.
Attachment #230191 - Flags: review?
dbaron and Myk, can you verify my patch using valgrind if it is easy for you to do  so? I don't have valgrind installed yet.
Comment on attachment 230191 [details] [diff] [review]
patch (v1)

This is ok for Mozilla, probably, but it'll leak JSThreads badly in embeddings (tellme.com, whose multithreaded VXML service handled all AT&T/Cingular 411/1-800-555-1212 automated voice response -- in SpiderMonkey!) that join or exit threads a lot over the life of a long-lived process/runtime.

Still looking for the simplest fix that works and can be relied on to work in the future -- where is wtc? ;-).

/be
Attachment #230191 - Flags: review? → review-
Brendan,

Here is the NSPR function that invokes the thread local storage destructors:
http://lxr.mozilla.org/nspr/ident?i=_PR_DestroyThreadPrivate

Note that it runs the destructors in a nested loop.  You can take advantage
of that to enforce the desired ordering of your two destructors.  For example,
destructor A can do nothing if it detects that destructor B hasn't been called
yet, and wait until the second invocation to do the destruction.

I don't know what you mean by the FIFO order.  What's "First In"?  NSPR
invokes the destructors in the order of the indices for thread local storage.
I believe this is what you want.  I guess you just want NSPR to commit to
this order in its documentation.
First in == first allocated via PR_NewThreadPrivateIndex.

First out == first destructor invoked.

Yeah, documentation is pretty much all I would want.

/be
Blocks: 344214
*** Bug 344214 has been marked as a duplicate of this bug. ***
Feng, if you think we can rely on NSPR calling thread-private destructor callbacks in the same order their indexes were added, then we just need a small change to xpconnect.  Otherwise, we need an alternative that avoids bloating JSThreads (in some embeddings, this is tantamount to a bad leak) in the runtime.  Either one of:

- The JS_NewThreadPrivateIndex, etc. JS API that copies NSPR's API but implements in a small TPI-indexed array in JSThread, so that XPConnect thread-local storage lifetime depends on JS thread-local storage lifetime, as proposed in comment 9.

- Your patch amended to GC threads that have empty contextLists.

I still prefer the simplest possible patch, provided we can count on NSPR not changing its behavior (and we can -- documentation would just formalize this). What do you think?

/be
(In reply to comment #13)
> (From update of attachment 230191 [details] [diff] [review] [edit])
> This is ok for Mozilla, probably, but it'll leak JSThreads badly in embeddings
> (tellme.com, whose multithreaded VXML service handled all AT&T/Cingular
> 411/1-800-555-1212 automated voice response -- in SpiderMonkey!)

Off-topic:

Myk asked whether I meant to write "handled" -- I didn't.  AFAIK, Cingular 411 and AT&T 1.800 directory assistance still use tellme.com's VXML service, which is built on multi-threaded SpiderMonkey.

I asked an insider about that the other day, cuz I used it and the AVR worked great, but I was stuck with a $3 charge.  What a rip!  But my source said Google, Yahoo!, and MSFT are all coming out with free 411 services.  Good news.  I hope someone still uses thread-safe SpiderMonkey ;-).

/be
Attached patch patch (v2)Splinter Review
Brendan, I think your suggestion in comment 7 with detailed documentation is a better solution. This is another patch does the simple fix. While submitting this patch, I am still trying to reproduce the crash using the testcase from bug 344214. 

Martijn Wargers, could you please test this patch in your 1.8.1 build? Thanks.
Attachment #230191 - Attachment is obsolete: true
Attachment #230441 - Flags: review?
(In reply to comment #19)
> Created an attachment (id=230441) [edit]
> patch (v2)

> Martijn Wargers, could you please test this patch in your 1.8.1 build? Thanks.

No, that patch doesn't fix the crash with the testcase from bug 344214.

Attachment #230441 - Flags: review? → review?(brendan)
Comment on attachment 230441 [details] [diff] [review]
patch (v2)

It didn't fix the problem, withdraw the patch
Attachment #230441 - Attachment is obsolete: true
Attachment #230441 - Flags: review?(brendan)
(In reply to comment #20)
> (In reply to comment #19)
> > Created an attachment (id=230441) [edit]
> > patch (v2)
> 
> > Martijn Wargers, could you please test this patch in your 1.8.1 build? Thanks.
> 
> No, that patch doesn't fix the crash with the testcase from bug 344214.

Martijn, on my Linux build, not crashes happened after applying patch v2. 
I saw 4 tread private destructors called for each thread:
1) xpcom/threads/nsThread.cpp nsThread::Exit
2) xpcom/base/nsExceptionService.cpp nsExceptionService::ThreadDestruct
   which throws following exception from line 97

###!!! ASSERTION: nsExceptionManager not thread-safe: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file /usr/local/google/home/fqian/workspace/firefox_src_bug335018/mozilla/xpcom/base/nsExceptionService.cpp, line 97

3) js/src/xpconnect/src/xpcthreadcontext.cpp xpc_ThreadDataDtorCB
4) js/src/jscntxt.c js_ThreadDestructorCB

I saw the same exceptions thrown with patch v1. It seems to me unreleated to this bug.

Attachment #230441 - Flags: review?(brendan)
Attachment #230441 - Attachment is obsolete: false
Comment on attachment 230441 [details] [diff] [review]
patch (v2)

>+        // We rely on the implementation of NSPR that calls destructors at 
>+        // the same order of calling PR_NewThreadPrivateIndex.

Nit: "in the same order...."

Looks good, should work.  Simplest patch, keeps GetData intact so it could be used lazily.

Anyone still having problems confirming that this patch fixes the valgrind warnings?

/be
Attachment #230441 - Flags: review?(brendan) → review+
> Anyone still having problems confirming that this patch fixes the valgrind
> warnings?

I haven't run it with valgrind yet, but I've tested it against the crash exposed by my patch for bug 344277.  With this patch, I no longer crash.
Comment on attachment 230441 [details] [diff] [review]
patch (v2)

It seems that you can also add a reference count to the
thread local storage indexed by JSRuntime::threadTPIndex,
and have the destructors for XPCPerThreadData::gTLSIndex
and JSRuntime::threadTPIndex decrement the reference count,
destroying thread local storage when the reference count
becomes 0.

The order in which NSPR calls the thread local storage
destructors is not specified and the current implementation
made an arbitrary choice to call the destructors in increasing
index order.  Since this order is as reasonable as any other
order, we are not going to change it even though it is not
documented.  It's fine by me to document this order officially.
But I encourage you to come up with a fix that doesn't depend
on this order.
(In reply to comment #25)
> (From update of attachment 230441 [details] [diff] [review] [edit])
> It seems that you can also add a reference count to the
> thread local storage indexed by JSRuntime::threadTPIndex,
> and have the destructors for XPCPerThreadData::gTLSIndex
> and JSRuntime::threadTPIndex decrement the reference count,
> destroying thread local storage when the reference count
> becomes 0.

We could do that, but I still don't see the point of the extra code if we can count on destruction being well-ordered.

> The order in which NSPR calls the thread local storage
> destructors is not specified and the current implementation
> made an arbitrary choice to call the destructors in increasing
> index order.  Since this order is as reasonable as any other
> order, we are not going to change it even though it is not
> documented.  It's fine by me to document this order officially.
> But I encourage you to come up with a fix that doesn't depend
> on this order.

Maybe later ;-).

/be
(In reply to comment #25)
> (From update of attachment 230441 [details] [diff] [review] [edit])
> It seems that you can also add a reference count to the
> thread local storage indexed by JSRuntime::threadTPIndex,
> and have the destructors for XPCPerThreadData::gTLSIndex
> and JSRuntime::threadTPIndex decrement the reference count,
> destroying thread local storage when the reference count
> becomes 0.

It is hard to make refcount work with TLS for JSRuntime::threadTPIndex, I think.  Code can call JS_SetContextThread at any time.  

> 
> The order in which NSPR calls the thread local storage
> destructors is not specified and the current implementation
> made an arbitrary choice to call the destructors in increasing
> index order.  Since this order is as reasonable as any other
> order, we are not going to change it even though it is not
> documented.  It's fine by me to document this order officially.
> But I encourage you to come up with a fix that doesn't depend
> on this order.
> 
(In reply to comment #26)
> (In reply to comment #25)
> > (From update of attachment 230441 [details] [diff] [review] [edit] [edit])
> > It seems that you can also add a reference count to the
> > thread local storage indexed by JSRuntime::threadTPIndex,
> > and have the destructors for XPCPerThreadData::gTLSIndex
> > and JSRuntime::threadTPIndex decrement the reference count,
> > destroying thread local storage when the reference count
> > becomes 0.
> 
> We could do that, but I still don't see the point of the extra code if we can
> count on destruction being well-ordered.

To say a bit more: if we did refcount JSThread, and the reference from its TPI went away before other refs, it's possible that the JSThread might still be in use by XPConnect when JS has not only stopped using it, but possibly allocated a new TPI and started using it.  Some embeddings go through "zero contexts" and even "zero runtimes" alive in a process.  Could this lead to bugs?  I don't know, but it seems better to arrange for JS TPI lifetime to equal or exceed XPConnect TPI lifetime.

Mid-air with feng, who just said the same thing more concisely ;-).

/be
Has anyone confirmed that this fixes the problem on Windows?
*** Bug 344214 has been marked as a duplicate of this bug. ***
No longer blocks: 344214
Anything else this needs, or can the patch be checked in?
(In reply to comment #31)
> Anything else this needs, or can the patch be checked in?

Martijn Wargers said it still crashes on Windows. I am waiting for a Windows machine with VC7 installed to debug. Anyone else tested the latest patch on Windows?

Ok, (In reply to comment #32)
> (In reply to comment #31)
> > Anything else this needs, or can the patch be checked in?
> 
> Martijn Wargers said it still crashes on Windows. I am waiting for a Windows
> machine with VC7 installed to debug. Anyone else tested the latest patch on
> Windows?

Ok, just saw Martijn Wargers' comments on bug 344214. No crash on Windows. Can someone help me check in the patch? 

I checked the patch in.
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Comment on attachment 230441 [details] [diff] [review]
patch (v2)

a=dbaron for checkin to MOZILLA_1_8_BRANCH.  Please mark fixed1.8.1 once you have checked in there.
Attachment #230441 - Flags: approval1.8.1+
Fix checked into the 1.8 branch.
Keywords: fixed1.8.1
Keywords: crash, topcrash
Summary: valgrind warnings due to ordering dependency of thread local storage destructors → valgrind warnings due to ordering dependency of thread local storage destructors [@ ntdll.dll] [@ js_SearchScope]
Flags: in-testsuite-
Keywords: valgrind
Crash Signature: [@ ntdll.dll] [@ js_SearchScope]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: