Closed Bug 555124 Opened 14 years ago Closed 14 years ago

Intermittent "ASSERTION: AsyncStatementSpinner not thread-safe: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file .../mozilla/storage/test/test_true_async.cpp, line 162"

Categories: Toolkit :: Storage (defect)
Hardware: x86
OS: All
Priority: Not set
Severity: major

Status: VERIFIED FIXED
Target Milestone: mozilla1.9.3a4

People: Reporter: sgautherie; Assignee: asuth

References: Blocks 1 open bug

Keywords: assertion, intermittent-failure

First time I've noticed this...

http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1269557223.1269565797.21996.gz
WINNT 5.2 comm-central-trunk leak test build on 2010/03/25 15:47:03
{
Running binding string params tests...
TEST-PASS | e:/builds/slave/comm-central-trunk-win32-debug/build/mozilla/storage/test/test_binding_params.cpp
39 of 39 tests passed
Finished running binding string params tests.
nsStringStats
 => mAllocCount:           1334
 => mReallocCount:            3
 => mFreeCount:            1334
 => mShareCount:           1293
 => mAdoptCount:              3
 => mAdoptFreeCount:          3
nsXPConnect::CommenceShutdown()
###!!! ASSERTION: AsyncStatementSpinner not thread-safe: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file e:/builds/slave/comm-central-trunk-win32-debug/build/mozilla/storage/test/test_true_async.cpp, line 162

command timed out: 300 seconds without output
program finished with exit code 1
}
Some previous occurrences:

http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1269539345.1269544118.25629.gz
Linux comm-central-trunk leak test build on 2010/03/25 10:49:05
{
Running true async statement tests...
###!!! ASSERTION: AsyncStatementSpinner not thread-safe: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file /builds/slave/comm-central-trunk-linux-debug/build/mozilla/storage/test/test_true_async.cpp, line 162
NS_DebugBreak+0x00000039 [../../dist/bin/libxpcom.so +0x000025C0]
UNKNOWN [../../dist/bin/test_true_async +0x000023EE]
UNKNOWN [/builds/slave/comm-central-trunk-linux-debug/build/objdir/mozilla/dist/bin/components/libstoragecomps.so +0x00026624]
[...]
}

http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1269535451.1269545038.28045.gz
WINNT 5.2 comm-central-trunk leak test build on 2010/03/25 09:44:11
blocking2.0: --- → ?
OS: Windows Server 2003 → All
So, are these rare, or common/reliable with this just being a sampling?
(In reply to comment #2)

I didn't check all the builds since this test's check-in,
but as that landed only two days ago, I'd bet this (new) failure will prove more common than rare.
http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1269574125.1269579921.27576.gz
WINNT 5.2 comm-central-trunk leak test build on 2010/03/25 20:28:45
http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1269577985.1269586937.11460.gz
WINNT 5.2 comm-central-trunk leak test build on 2010/03/25 21:33:05
Some older occurrences:
http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1269500442.1269504437.5743.gz
Linux comm-central-trunk leak test build on 2010/03/25 00:00:42
http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1269492933.1269505800.9197.gz
WINNT 5.2 comm-central-trunk leak test build on 2010/03/24 21:55:33

The newer occurrences:
http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1269505806.1269512128.29140.gz
WINNT 5.2 comm-central-trunk leak test build on 2010/03/25 01:30:06
(In reply to comment #5)
> The newer occurrences:

Older too actually.
So, in terms of the explicit error message, using NS_IMPL_THREADSAFE_ISUPPORTS2 instead of NS_IMPL_ISUPPORTS2 for AsyncStatementSpinner would shut it up, but...

My concern is that it's not immediately clear to me why an addref or release would be happening off of the main thread.  The management of the reference count for mCallback in mozStorageAsyncStatementExecution.cpp is manual and coordinated so that the add/release happens on the main thread only.  I just built Firefox with the settings of your builder, and I do not experience the problem.  (Although the deadlock detector does appear to screw up...)

It's also worth noting that a brief survey of the mozilla-central builders does not show this problem either...

I'll spin up a SeaMonkey build and see if I can reproduce it.
I can't reproduce it and the logs are sorta useless.  Looking more closely at the output, I see that the failure looks like so:

---
Running binding string params tests...
TEST-PASS | e:/builds/slave/comm-central-trunk-win32-debug/build/mozilla/storage/test/test_binding_params.cpp
39 of 39 tests passed
Finished running binding string params tests.
nsStringStats
 => mAllocCount:           1334
 => mReallocCount:            3
 => mFreeCount:            1334
 => mShareCount:           1293
 => mAdoptCount:              3
 => mAdoptFreeCount:          3
nsXPConnect::CommenceShutdown()
###!!! ASSERTION: AsyncStatementSpinner not thread-safe: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file e:/builds/slave/comm-central-trunk-win32-debug/build/mozilla/storage/test/test_true_async.cpp, line 162
---

However, in my runs and the Firefox runs we see things like this (the following is from http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269613219.1269615928.7733.gz&fulltext=1 ):
---
Running binding string params tests...
TEST-PASS | /builds/slave/mozilla-central-linux64-debug/build/storage/test/test_binding_params.cpp
39 of 39 tests passed
nsXPConnect::CommenceShutdown()
Finished running binding string params tests.
nsStringStats
 => mAllocCount:            395
 => mReallocCount:          223
 => mFreeCount:             395
 => mShareCount:            480
 => mAdoptCount:              2
 => mAdoptFreeCount:          2
Running true async statement tests...
TEST-PASS | /builds/slave/mozilla-central-linux64-debug/build/storage/test/test_true_async.cpp
16 of 16 tests passed
nsXPConnect::CommenceShutdown()
Finished running true async statement tests.
nsStringStats
 => mAllocCount:            391
 => mReallocCount:          223
 => mFreeCount:             391
 => mShareCount:            472
 => mAdoptCount:              2
 => mAdoptFreeCount:          2
---

Which is to say: normally we see "Running true async statement tests...", which ScopedXPCOM prints out.  We would also expect the nsXPConnect::CommenceShutdown to show up before it says "Finished running BLAH".  Unfortunately, the nsXPConnect::CommenceShutdown message and the assertions all come out on stderr, while everything else is just a printf.

Are the SeaMonkey test boxes massively overloaded?  Is it possible they are running with fallible malloc and getting into byzantine failure modes because they run out of memory?

The fact that this happens reliably on SeaMonkey test boxes but it doesn't seem to be happening on Firefox or Thunderbird boxes is somewhat concerning to me...
(In reply to comment #8)

> Which is to say, normally we see "Running true async statement tests..." which
> ScopedXPCOM prints out.

I assume it's the "usual" stdout/stderr "race" on Windows (comment 0).
The Linux (comment 1) log has this line.

NB: So far I haven't noticed this bug on MacOSX.

> We also would expect the nsXPConnect::CommenceShutdown
> to show up before it says "Finished running BLAH".  Unfortunately the
> nsXPConnect::CommenceShutdown message and the assertions all come out on stderr
> and everything else is just a printf.

May be the stdout/stderr "race" again?
Or the assertion happens before that?

Could it be possible to add (more) debugging code?

> Are the seamonkey test boxes massively overloaded?  Is it possible they are

Could be: we have limited resources...

> running with fallible malloc and getting into byzantine failure models because
> they do run out of memory?

Could be: I'm not sure where comm-central stands wrt the recent "infallible malloc" landing.

> The fact that this happens reliably on SeaMonkey test boxes but it doesn't seem
> to be happening on Firefox or Thunderbird boxes is somewhat concerning to me...

One could always suspect Parallels!?
We should be running exactly the same setup as Firefox is running. And I wouldn't just blame Parallels, as we don't have any evidence that it really is to blame. I take that as an excuse rather than a search for a solution.
(In reply to comment #10)
> We should be running exactly the same as Firefox is running. And I wouldn't
> just blame Parallels, as we don't have any evidence that it really is to blame.
> I take that as being an excuse instead of the search for a solution.
Nobody is blaming anything.  The word used was "suspect".
Also, I want to note that I looked into the code paths pretty extensively and ran through the unit tests with a debugger breakpoint on the AsyncStatementSpinner::Release function to make sure my assumptions were being borne out.  I also tried running it locally with processor core affinity as well as under valgrind to serialize the execution in a potentially useful way.  I am not trying to punt on this problem.

My concern is that there is a real oversight in the logic that I'm missing and I don't just want to paper over it by making the nsISupports implementation threadsafe.

The fact that this appears to be happening only on SeaMonkey boxes suggests there is something unique to SeaMonkey's situation and I was hoping it was something known...
Is it possible the seamonkey boxes haven't been clobbered since I landed the patch?  Can a clobber be triggered for them to eliminate that from the equation?
(In reply to comment #12)
> The fact that this appears to be happening only on SeaMonkey boxes suggests
> there is something unique to SeaMonkey's situation and I was hoping it was
> something known...

Nothing well-known at least. All SeaMonkey builds running those tests on trunk currently are debug, by the way, and while Firefox should be running them on debug, I don't think Thunderbird is doing that right now, right?

Still, I don't dismiss that it might be possible that Parallels does strange things, but we haven't seen systematic problems on them - and I just checked the reports listed here and the ones on Windows are from at least 4 different slaves, including non-Parallels ones (i.e. the ones on ESX), so we can fairly safely rule that out as a cause.
Shawn, sorry if I sounded like I was overreacting, but I guess I've seen Serge blaming/suspecting Parallels a few times too often. Granted, we had problems with the OS X VMs on that platform, but it has worked pretty well for Linux and Windows so far (let's not turn this bug into a discussion of what VM/hardware platforms are better or worse; I have now established that it happens on both platforms we have right now).

(In reply to comment #13)
> Is it possible the seamonkey boxes haven't been clobbered since I landed the
> patch?  Can a clobber be triggered for them to eliminate that from the
> equation?

That is surely possible. A clobber unfortunately cannot "be triggered", only done manually. Nobody has yet been able to explain to me easily enough how to get a clobberer server set up and running, or to do that for me.

I'll be able to look into it tomorrow if someone pings me on IRC, but right now it's 3am and I should get to bed. ;-)
(Logging into 5 Linux and 5 Windows boxes, potentially even the 5 Mac ones, issuing a rm -rf on each and waiting for completion takes up too much time at this hour.)
(In reply to comment #12)
> I am not trying to punt on this problem.

This is appreciated :-)

(In reply to comment #14)
> I've seen Serge blaming/suspecting Parallels a few time too often.

I merely wrote it seemed possible in this case.
(It wasn't clear in my mind atm that we had Windows builds both on & off Parallels.)

> Nobody could yet explain to me easily enough how to get a
> clobberer server set up and running or even do that for me.

Sad. (Bug 489483)
http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1269639253.1269645027.26657.gz
Linux comm-central-trunk leak test build on 2010/03/26 14:34:13
s: cb-seamonkey-linux-02
http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1269640506.1269647603.369.gz
WINNT 5.2 comm-central-trunk leak test build on 2010/03/26 14:55:06
s: cb-seamonkey-win32-01
http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1269668440.1269673249.21010.gz
Linux comm-central-trunk leak test build on 2010/03/26 22:40:40
(In reply to comment #15)
> (In reply to comment #12)
> > Nobody could yet explain to me easily enough how to get a
> > clobberer server set up and running or even do that for me.
> 
> Sad. (Bug 489483)

Bug 555449, actually.

And trunk debug is now clobbered on all Linux and Windows boxes.
(In reply to comment #14)
> All SeaMonkey builds running those tests on trunk currently are debug,
> by the way, and while Firefox should be running them on
> debug, I don't think Thunderbird is doing that right now, right?

Thunderbird (still) runs its tests on (non-packaged) --disable-debug builds.
So we can indeed compare Firefox and SeaMonkey only atm.
http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1269776975.1269785897.25494.gz
WINNT 5.2 comm-central-trunk leak test build on 2010/03/28 04:49:35
s: cb-seamonkey-win32-02
http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1269791307.1269795204.15834.gz
Linux comm-central-trunk leak test build on 2010/03/28 08:48:27
s: cb-seamonkey-linux-03
http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1269815957.1269821954.12662.gz
WINNT 5.2 comm-central-trunk leak test build on 2010/03/28 15:39:17
s: cb-seamonkey-win32-02
http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1269817338.1269819001.5412.gz
Linux comm-central-trunk leak test build on 2010/03/28 16:02:18
s: cb-sea-linux-tbox
Right, so I think I figured out what is happening:

- we call asyncClose.
- AsyncCloseConnection gets run on the async thread; it re-dispatches itself to the main thread.  *The async thread goes to sleep before the processing of the event completes, leaving the runnable alive there.*
- AsyncCloseConnection gets run on the main thread, dispatching mCallbackEvent (which it does not forget about).  It runs to completion, and the main thread's event loop forgets about it.
- mCallbackEvent runs to completion on the main thread, and the thread's event loop forgets about it.
- At this point, AsyncCloseConnection is still alive on the async thread only because of that thread.  It holds the only live reference to mCallbackEvent, which is holding a reference to the AsyncStatementSpinner.
- The async thread resumes enough to complete running AsyncCloseConnection.  AsyncCloseConnection destructs, causing a Release on mCallbackEvent, which destructs and calls Release on AsyncStatementSpinner.  (It need not be the last AsyncStatementSpinner reference.)

While AsyncStatementSpinner potentially getting torn down is generally harmless, this ridiculous series of events is capable of causing the Connection to be destroyed on the asynchronous thread, which we do not want.

I propose we have AsyncCloseConnection::Run null out mConnection and mCallbackEvent (and, I guess, mCallingThread for consistency) in its on-main-thread case, so we can ensure that, should AsyncCloseConnection get prolonged on the async thread, its death there will be harmless.

sdwilsh, yes/no?
(In reply to comment #23)
> sdwilsh, yes/no?
Wow, good catch.  r=sdwilsh for that change.
proposed fix pushed to mozilla-central:
http://hg.mozilla.org/mozilla-central/rev/8901abb7fddb

I decided not to null out mCallingThread, since it is possible in the pathological case that the async thread could still be in a call on that object.  Even though the chances of the thread object being destroyed are basically nil (storage would need to be used from a non-main thread), this seems like the safest course of action.

I'm going to wait on marking this fixed for a bit until we're sure it's dead.
I'm marking this fixed; I don't see any SeaMonkey failures with this signature in builds on/after my push yesterday.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
V.Fixed
Assignee: nobody → bugmail
Status: RESOLVED → VERIFIED
blocking2.0: ? → ---
Target Milestone: --- → mozilla1.9.3a4
Whiteboard: [orange]