335018 - valgrind warnings due to ordering dependency of thread local storage destructors [@ ntdll.dll] [@ js_SearchScope]

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Description

•

18 years ago

Bug seen in:  Firefox trunk build, within past few days

Steps to reproduce:
 1. install an extension that needs upgrading, e.g., version 1.0 of https://addons.mozill a.org/firefox/2407/history/
 2. exit Firefox
 3. start Firefox
 4. Tools -> Extensions
 5. "Find Updates"
 6. "Update Now" on that extension

I get the attached valgrind warnings.  This seems to me to be showing that there's an unenforced ordering dependency between the thread local storage destructors in JavaScript and XPConnect: the XPConnect one needs to run first, but it is being run second.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 1

•

18 years ago

Attached file valgrind warnings — Details

Myk Melez [:myk] [@mykmelez]

Updated

•

18 years ago

Blocks: 344277

Myk Melez [:myk] [@mykmelez]

Comment 2

•

18 years ago

Attached patch valgrind warnings - another set — Details — Splinter Review

With my patch for bug 344277 applied, both trunk and branch crash after the microsummary service handles a bad cert error and then the user quits the app.  Here are my valgrind warnings, which dbaron says demonstrate the same problem as reported in this bug.

Myk Melez [:myk] [@mykmelez]

Comment 3

•

18 years ago

A Firefox 2 blocker depends on this, so it seems like it should block 1.8.1.

Flags: blocking1.8.1?

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 4

•

18 years ago

(It also probably means that XPInstall is much less stable since this bug got pulled in with the JS 1.7 landing.)

Mike Connor [:mconnor]

Comment 5

•

18 years ago

Need to get traction on this ASAP.  Feng/Brendan/Blake, what's the best path forward here?

Flags: blocking1.8.1? → blocking1.8.1+

Feng Qian (Google)

Assignee

Comment 6

•

18 years ago

(In reply to comment #4)
> (It also probably means that XPInstall is much less stable since this bug got
> pulled in with the JS 1.7 landing.)
> 

This bug was introduced by fix of bug 312238. I take it.

Status: NEW → ASSIGNED

Flags: blocking1.8.1+ → blocking1.8.1?

Mike Connor [:mconnor]

Updated

•

18 years ago

Assignee: general → feng.qian.moz

Status: ASSIGNED → NEW

Flags: blocking1.8.1? → blocking1.8.1+

Brendan Eich [:brendan]

Comment 7

•

18 years ago

I was wondering if we couldn't just make XPConnect eagerly allocate a TPI instead of doing it lazily -- and allocate eagerly before it calls JS_NewRuntime.  If NSPR runs the destructors in FIFO fashion, that would fix it.

/be

Feng Qian (Google)

Assignee

Comment 8

•

18 years ago

(In reply to comment #7)
> I was wondering if we couldn't just make XPConnect eagerly allocate a TPI
> instead of doing it lazily -- and allocate eagerly before it calls
> JS_NewRuntime.  If NSPR runs the destructors in FIFO fashion, that would fix
> it.

Brenden, this would rely on NSPR runs the destructors in FIFO fashion. I couldn't find such guarantee from NSPR documentation.

Another fix I propose is to link all allocated JSThread blocks and free them in JS_DestoryRuntime, at that point, all context should be freed. This will need a lock when a JSThread object is created and to be linked in the list.

> /be
>

Flags: blocking1.8.1+ → blocking1.8.1?

Brendan Eich [:brendan]

Comment 9

•

18 years ago

NSPR is old and quite API-stable.  I hope Wan-Teh can say something here about guaranteeing FIFO thread private index allocation/destruction.

KISS really wants a small fix here that relies on well-ordered destruction, IMHO.

If we can't count on this, another thought I had that would not require a runtime lock would be to add JS API that XPConnect can use to piggy-back its thread-local storage on top of JSThread.  XPConnect would stop using PR_NewThreadPrivateIndex, etc., and instead use JS_NewThreadPrivateIndex, etc., parameterized by runtime and allocating from a small table in JSThread.

/be

Brendan Eich [:brendan]

Comment 10

•

18 years ago

(In reply to comment #9)
> thread-local storage on top of JSThread.  XPConnect would stop using
> PR_NewThreadPrivateIndex, etc., and instead use JS_NewThreadPrivateIndex, etc.,
> parameterized by runtime and allocating from a small table in JSThread.

Oops, I meant "parameterized by context", from which the API implementation can get cx->thread.

/be

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Updated

•

18 years ago

Flags: blocking1.8.1? → blocking1.8.1+

Feng Qian (Google)

Assignee

Comment 11

•

18 years ago

Attached patch patch (v1) (obsolete) — Details — Splinter Review

This patch eliminate the usage of deconstructor callbacks in SpideMonkey. All JSThread structs are linked in JSRuntime. When JSRuntime is destroyed, JSThread's are freed. It assumes all contexts are freed before JSRuntime is freed, so that none usage of JSThread after that.

Attachment #230191 - Flags: review?

Feng Qian (Google)

Assignee

Comment 12

•

18 years ago

dbaron and Myk, can you verify my patch using valgrind if it is easy for you to do  so? I don't have valgrind installed yet.

Brendan Eich [:brendan]

Comment 13

•

18 years ago

Comment on attachment 230191 [details] [diff] [review]
patch (v1)

This is ok for Mozilla, probably, but it'll leak JSThreads badly in embeddings (tellme.com, whose multithreaded VXML service handled all AT&T/Cingular 411/1-800-555-1212 automated voice response -- in SpiderMonkey!) that join or exit threads a lot over the life of a long-lived process/runtime.

Still looking for the simplest fix that works and can be relied on to work in the future -- where is wtc? ;-).

/be

Attachment #230191 - Flags: review? → review-

Wan-Teh Chang

Comment 14

•

18 years ago

Brendan,

Here is the NSPR function that invokes the thread local storage destructors:
http://lxr.mozilla.org/nspr/ident?i=_PR_DestroyThreadPrivate

Note that it runs the destructors in a nested loop.  You can take advantage
of that to enforce the desired ordering of your two destructors.  For example,
destructor A can do nothing if it detects that destructor B hasn't been called
yet, and wait until the second invocation to do the destruction.

I don't know what you mean by the FIFO order.  What's "First In"?  NSPR
invokes the destructors in the order of the indices for thread local storage.
I believe this is what you want.  I guess you just want NSPR to commit to
this order in its documentation.

Brendan Eich [:brendan]

Comment 15

•

18 years ago

First in == first allocated via PR_NewThreadPrivateIndex.

First out == first destructor invoked.

Yeah, documentation is pretty much all I would want.

/be

Martijn Wargers (dead)

Updated

•

18 years ago

Blocks: 344214

Brendan Eich [:brendan]

Comment 16

•

18 years ago

*** Bug 344214 has been marked as a duplicate of this bug. ***

Brendan Eich [:brendan]

Comment 17

•

18 years ago

Feng, if you think we can rely on NSPR calling thread-private destructor callbacks in the same order their indexes were added, then we just need a small change to xpconnect.  Otherwise, we need an alternative that avoids bloating JSThreads (in some embeddings, this is tantamount to a bad leak) in the runtime.  Either one of:

- The JS_NewThreadPrivateIndex, etc. JS API that copies NSPR's API but implements in a small TPI-indexed array in JSThread, so that XPConnect thread-local storage lifetime depends on JS thread-local storage lifetime, as proposed in comment 9.

- Your patch amended to GC threads that have empty contextLists.

I still prefer the simplest possible patch, provided we can count on NSPR not changing its behavior (and we can -- documentation would just formalize this). What do you think?

/be

Brendan Eich [:brendan]

Comment 18

•

18 years ago

(In reply to comment #13)
> (From update of attachment 230191 [details] [diff] [review] [edit])
> This is ok for Mozilla, probably, but it'll leak JSThreads badly in embeddings
> (tellme.com, whose multithreaded VXML service handled all AT&T/Cingular
> 411/1-800-555-1212 automated voice response -- in SpiderMonkey!)

Off-topic:

Myk asked whether I meant to write "handled" -- I didn't.  AFAIK, Cingular 411 and AT&T 1.800 directory assistance still use tellme.com's VXML service, which is built on multi-threaded SpiderMonkey.

I asked an insider about that the other day, cuz I used it and the AVR worked great, but I was stuck with a $3 charge.  What a rip!  But my source said Google, Yahoo!, and MSFT are all coming out with free 411 services.  Good news.  I hope someone still uses thread-safe SpiderMonkey ;-).

/be

Feng Qian (Google)

Assignee

Comment 19

•

18 years ago

Attached patch patch (v2) — Details — Splinter Review

Brendan, I think your suggestion in comment 7 with detailed documentation is a better solution. This is another patch does the simple fix. While submitting this patch, I am still trying to reproduce the crash using the testcase from bug 344214. 

Martijn Wargers, could you please test this patch in your 1.8.1 build? Thanks.

Attachment #230191 - Attachment is obsolete: true

Attachment #230441 - Flags: review?

Martijn Wargers (dead)

Comment 20

•

18 years ago

(In reply to comment #19)
> Created an attachment (id=230441) [edit]
> patch (v2)

> Martijn Wargers, could you please test this patch in your 1.8.1 build? Thanks.

No, that patch doesn't fix the crash with the testcase from bug 344214.

Feng Qian (Google)

Assignee

Updated

•

18 years ago

Attachment #230441 - Flags: review? → review?(brendan)

Feng Qian (Google)

Assignee

Comment 21

•

18 years ago

Comment on attachment 230441 [details] [diff] [review]
patch (v2)

It didn't fix the problem, withdraw the patch

Attachment #230441 - Attachment is obsolete: true

Attachment #230441 - Flags: review?(brendan)

Feng Qian (Google)

Assignee

Comment 22

•

18 years ago

(In reply to comment #20)
> (In reply to comment #19)
> > Created an attachment (id=230441) [edit]
> > patch (v2)
> 
> > Martijn Wargers, could you please test this patch in your 1.8.1 build? Thanks.
> 
> No, that patch doesn't fix the crash with the testcase from bug 344214.

Martijn, on my Linux build, not crashes happened after applying patch v2. 
I saw 4 tread private destructors called for each thread:
1) xpcom/threads/nsThread.cpp nsThread::Exit
2) xpcom/base/nsExceptionService.cpp nsExceptionService::ThreadDestruct
   which throws following exception from line 97

###!!! ASSERTION: nsExceptionManager not thread-safe: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file /usr/local/google/home/fqian/workspace/firefox_src_bug335018/mozilla/xpcom/base/nsExceptionService.cpp, line 97

3) js/src/xpconnect/src/xpcthreadcontext.cpp xpc_ThreadDataDtorCB
4) js/src/jscntxt.c js_ThreadDestructorCB

I saw the same exceptions thrown with patch v1. It seems to me unreleated to this bug.

Feng Qian (Google)

Assignee

Updated

•

18 years ago

Attachment #230441 - Flags: review?(brendan)

Feng Qian (Google)

Assignee

Updated

•

18 years ago

Attachment #230441 - Attachment is obsolete: false

Brendan Eich [:brendan]

Comment 23

•

18 years ago

Comment on attachment 230441 [details] [diff] [review]
patch (v2)

>+        // We rely on the implementation of NSPR that calls destructors at 
>+        // the same order of calling PR_NewThreadPrivateIndex.

Nit: "in the same order...."

Looks good, should work.  Simplest patch, keeps GetData intact so it could be used lazily.

Anyone still having problems confirming that this patch fixes the valgrind warnings?

/be

Attachment #230441 - Flags: review?(brendan) → review+

Myk Melez [:myk] [@mykmelez]

Comment 24

•

18 years ago

> Anyone still having problems confirming that this patch fixes the valgrind
> warnings?

I haven't run it with valgrind yet, but I've tested it against the crash exposed by my patch for bug 344277.  With this patch, I no longer crash.

Wan-Teh Chang

Comment 25

•

18 years ago

Comment on attachment 230441 [details] [diff] [review]
patch (v2)

It seems that you can also add a reference count to the
thread local storage indexed by JSRuntime::threadTPIndex,
and have the destructors for XPCPerThreadData::gTLSIndex
and JSRuntime::threadTPIndex decrement the reference count,
destroying thread local storage when the reference count
becomes 0.

The order in which NSPR calls the thread local storage
destructors is not specified and the current implementation
made an arbitrary choice to call the destructors in increasing
index order.  Since this order is as reasonable as any other
order, we are not going to change it even though it is not
documented.  It's fine by me to document this order officially.
But I encourage you to come up with a fix that doesn't depend
on this order.

Brendan Eich [:brendan]

Comment 26

•

18 years ago

(In reply to comment #25)
> (From update of attachment 230441 [details] [diff] [review] [edit])
> It seems that you can also add a reference count to the
> thread local storage indexed by JSRuntime::threadTPIndex,
> and have the destructors for XPCPerThreadData::gTLSIndex
> and JSRuntime::threadTPIndex decrement the reference count,
> destroying thread local storage when the reference count
> becomes 0.

We could do that, but I still don't see the point of the extra code if we can count on destruction being well-ordered.

> The order in which NSPR calls the thread local storage
> destructors is not specified and the current implementation
> made an arbitrary choice to call the destructors in increasing
> index order.  Since this order is as reasonable as any other
> order, we are not going to change it even though it is not
> documented.  It's fine by me to document this order officially.
> But I encourage you to come up with a fix that doesn't depend
> on this order.

Maybe later ;-).

/be

Feng Qian (Google)

Assignee

Comment 27

•

18 years ago

(In reply to comment #25)
> (From update of attachment 230441 [details] [diff] [review] [edit])
> It seems that you can also add a reference count to the
> thread local storage indexed by JSRuntime::threadTPIndex,
> and have the destructors for XPCPerThreadData::gTLSIndex
> and JSRuntime::threadTPIndex decrement the reference count,
> destroying thread local storage when the reference count
> becomes 0.

It is hard to make refcount work with TLS for JSRuntime::threadTPIndex, I think.  Code can call JS_SetContextThread at any time.  

> 
> The order in which NSPR calls the thread local storage
> destructors is not specified and the current implementation
> made an arbitrary choice to call the destructors in increasing
> index order.  Since this order is as reasonable as any other
> order, we are not going to change it even though it is not
> documented.  It's fine by me to document this order officially.
> But I encourage you to come up with a fix that doesn't depend
> on this order.
>

Brendan Eich [:brendan]

Comment 28

•

18 years ago

(In reply to comment #26)
> (In reply to comment #25)
> > (From update of attachment 230441 [details] [diff] [review] [edit] [edit])
> > It seems that you can also add a reference count to the
> > thread local storage indexed by JSRuntime::threadTPIndex,
> > and have the destructors for XPCPerThreadData::gTLSIndex
> > and JSRuntime::threadTPIndex decrement the reference count,
> > destroying thread local storage when the reference count
> > becomes 0.
> 
> We could do that, but I still don't see the point of the extra code if we can
> count on destruction being well-ordered.

To say a bit more: if we did refcount JSThread, and the reference from its TPI went away before other refs, it's possible that the JSThread might still be in use by XPConnect when JS has not only stopped using it, but possibly allocated a new TPI and started using it.  Some embeddings go through "zero contexts" and even "zero runtimes" alive in a process.  Could this lead to bugs?  I don't know, but it seems better to arrange for JS TPI lifetime to equal or exceed XPConnect TPI lifetime.

Mid-air with feng, who just said the same thing more concisely ;-).

/be

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 29

•

18 years ago

Has anyone confirmed that this fixes the problem on Windows?

Martijn Wargers (dead)

Comment 30

•

18 years ago

*** Bug 344214 has been marked as a duplicate of this bug. ***

No longer blocks: 344214

Myk Melez [:myk] [@mykmelez]

Comment 31

•

18 years ago

Anything else this needs, or can the patch be checked in?

Feng Qian (Google)

Assignee

Comment 32

•

18 years ago

(In reply to comment #31)
> Anything else this needs, or can the patch be checked in?

Martijn Wargers said it still crashes on Windows. I am waiting for a Windows machine with VC7 installed to debug. Anyone else tested the latest patch on Windows?

Feng Qian (Google)

Assignee

Comment 33

•

18 years ago

Ok, (In reply to comment #32)
> (In reply to comment #31)
> > Anything else this needs, or can the patch be checked in?
> 
> Martijn Wargers said it still crashes on Windows. I am waiting for a Windows
> machine with VC7 installed to debug. Anyone else tested the latest patch on
> Windows?

Ok, just saw Martijn Wargers' comments on bug 344214. No crash on Windows. Can someone help me check in the patch?

Blake Kaplan (:mrbkap) (inactive)

Comment 34

•

18 years ago

I checked the patch in.

Status: NEW → RESOLVED

Closed: 18 years ago

Resolution: --- → FIXED

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 35

•

18 years ago

Comment on attachment 230441 [details] [diff] [review]
patch (v2)

a=dbaron for checkin to MOZILLA_1_8_BRANCH.  Please mark fixed1.8.1 once you have checked in there.

Attachment #230441 - Flags: approval1.8.1+

Blake Kaplan (:mrbkap) (inactive)

Comment 36

•

18 years ago

Fix checked into the 1.8 branch.

Keywords: fixed1.8.1

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Updated

•

18 years ago

Keywords: crash, topcrash

Summary: valgrind warnings due to ordering dependency of thread local storage destructors → valgrind warnings due to ordering dependency of thread local storage destructors [@ ntdll.dll] [@ js_SearchScope]

Bob Clary [:bc] (inactive)

Updated

•

18 years ago

Flags: in-testsuite-

Reed Loden [:reed]

Updated

•

17 years ago

Keywords: valgrind

Nobody; OK to take it and work on it

Updated

•

13 years ago

Crash Signature: [@ ntdll.dll] [@ js_SearchScope]

valgrind warnings 18 years ago David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla) 6.06 KB, text/plain; charset=utf-8		Details
valgrind warnings - another set 18 years ago Myk Melez [:myk] [@mykmelez] 6.83 KB, patch		Details \| Diff \| Splinter Review
patch (v1) 18 years ago Feng Qian (Google) 5.64 KB, patch	brendan : review-	Details \| Diff \| Splinter Review
patch (v2) 18 years ago Feng Qian (Google) 1.90 KB, patch	brendan : review+ dbaron : approval1.8.1+	Details \| Diff \| Splinter Review