Closed Bug 496019 Opened 15 years ago Closed 14 years ago

mozilla::storage::Connection::Close can spin a nested event loop

Categories

(Toolkit :: Storage, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla1.9.3a3
Tracking Status
blocking2.0 --- beta1+
status1.9.2 --- wontfix

People

(Reporter: jruderman, Assigned: sdwilsh)

References

Details

(Keywords: crash, dev-doc-complete)

Attachments

(1 file, 11 obsolete files)

mozilla::storage::Connection::Close can spin a nested event loop.  bent says this is bad.

(I noticed this while looking at a crash stack trace, but I don't know if this bug is directly responsible for the crash.)
The static analysis proposed in bug 477432 would have caught this.
Connection::Close is only intending to shutdown its own async execution thread, which should not have anything on it but async storage things (and no user code).  How did that timer event get in there?
Apparently when we shutdown a thread, it runs the event loop on the thread that is currently running.  We'll have to protect this with a Monitor to make sure we do not spin the main thread's event loop (sadface).
bent: please explain why this is bad (for the record).

shutdown has to spin a loop in case the thread in question has some stuff it needs to send across to the joining thread.

An evil hack which would mostly work is to create a reaper thread and have it do the shutdown call.
the less evil hack here is to use a monitor and prevent the main thread from do anything else while we wait.
(In reply to comment #4)
> bent: please explain why this is bad

It's bad because storage js can cause a nested event loop at any time it feels like it even though we know there are times where a nested event loop is unsafe.

> An evil hack which would mostly work is to create a reaper thread and have it
> do the shutdown call.

Unfortunately sdwilsh says that the thread shutdown is responsible for ensuring that sqlite is done and that the main thread shouldn't be allowed to proceed until that happens.
(In reply to comment #4)
> shutdown has to spin a loop in case the thread in question has some stuff it
> needs to send across to the joining thread.
That is only an issue if we send something to the joining thread synchronously, right?  I'm certainly OK with an invariant saying we can't do that.
Assignee: nobody → sdwilsh
Status: NEW → ASSIGNED
Keywords: crash
Priority: -- → P1
Target Milestone: --- → mozilla1.9.3a1
Attached patch v0.1 (obsolete) — Splinter Review
This patch isn't good enough.  I need to think more about this problem.
This blocks what seems to be a frequent shutdown crash on Tinderbox, so ... blocker!
Flags: blocking1.9.2+
Let's try a different approach.  Track pending statements in the connection, and cancel them at shutdown.
Attached patch v2.0 (obsolete) — Splinter Review
This even works with my test.  Yay!
Attachment #404677 - Attachment is obsolete: true
Attachment #408866 - Flags: review?(bugmail)
Whiteboard: [needs review asuth]
I'm concerned this solution is changing semantics without any justification or rationalization.  The state of the platform is that you won't always get a quit-application-requested; if someone closes the last active XUL window via the 'x' button you will just get a quit-application notification, at least on some platform...

If we always cancel all outstanding asynchronous statements when shutting down the thread, this leaves user-code that wants to make sure its database statements get flushed to disk having to spin their own event loop... or move back to using the synchronous API all the time.

I understand that it is desirable to neither crash nor take forever in shutdown... but I'd rather like it if we could keep the ability to make sure all existing async statements are flushed to disk.  I could imagine there are users that don't need this, so maybe some sort of flag?  (Which doesn't make the patch's life any easier.)

The counterargument could be made we should just fix the quit-application-requested issue in the cases where we can, but at the very least I'd like some very verbose documentation on close and executeAsync explaining the ways in which they can fail to actually execute.
Yeah, this patch isn't desirable during shutdown.  We could add a method that closes the database when all background work is done.  Something like mozIStorageConnection::closeWhenDone()?

Does that alleviate your concerns?
(In reply to comment #13)
> Yeah, this patch isn't desirable during shutdown.  We could add a method that
> closes the database when all background work is done.  Something like
> mozIStorageConnection::closeWhenDone()?
> 
> Does that alleviate your concerns?

Maybe-ish?  Are you proposing that, assuming I can get a reliable quit-application-requested, I would cancel the quit and when closeWhenDone(aCallback) notifies me I would restart the quit?  Short answer if so: yes.


Rationale/long answer:

That arguably would make the most sense, since there are the two separate issues of 1) actually letting SQLite run the statements and 2) generating the callback notifications.

The set of things we can do synchronously during shutdown is:

a) The current situation.  The loop spins and anyone and anything can run.  This provides 1 (intentionally) and 2 (accidentally), with the downside of being extremely dangerous.

b) Spin the loop letting only our async callbacks run.  Not sure if this is entirely feasible, but if it is we get 1 and 2 intentionally.  Sadly it requires diligence and familiarity to not be dangerous which just means dangerous.

c) Spin the loop without letting anyone run.  This gets us 1 but not 2 and I think is what I proposed in comment #12.  This will probably work most of the time in most cases but seems likely to end in tears.

My interpretation of your proposal is to avoid the synchronous issue entirely and just make it easier to provide an asynchronous way to shutdown the connection while guaranteeing that it will eventually shutdown (since storage can easily stop new asynchronous statements from being dispatched).

This sounds reasonable; we (Thunderbird devs) will just need to help out platform.  I find it unlikely Firefox would get itself into the kind of situation where it cares about this shutdown scenario, so this may just need a minor fix to platform and then Thunderbird can use its own fancy-quit logic it already possesses.  (I made the previous post as my plane was about to board and forgot TB is already partially down this (slightly broken) road.)
(In reply to comment #14)
> Maybe-ish?  Are you proposing that, assuming I can get a reliable
> quit-application-requested, I would cancel the quit and when
> closeWhenDone(aCallback) notifies me I would restart the quit?  Short answer if
> so: yes.
Yeah, we could have a callback there, but it would be 100% optional.  Sounds good then?
Whiteboard: [needs review asuth]
(In reply to comment #15)
> Yeah, we could have a callback there, but it would be 100% optional.  Sounds
> good then?

Sounds good.
Attached patch v3.0 (obsolete) — Splinter Review
This ended up being a much bigger patch than originally planned, but this is per discussion in the bug.
Attachment #408866 - Attachment is obsolete: true
Attachment #409806 - Flags: superreview?(vladimir)
Attachment #409806 - Flags: review?(bugmail)
Attachment #408866 - Flags: review?(bugmail)
Whiteboard: [needs review asuth][needs sr vlad]
Comment on attachment 409806 [details] [diff] [review]
v3.0

Looks good to me.
Attachment #409806 - Flags: superreview?(vladimir) → superreview+
Whiteboard: [needs review asuth][needs sr vlad] → [needs review asuth]
I am getting an intermittent test failure in test_storage_connection.js at cleanup() time with this patch:

*** Storage Tests: Trying to close!
TEST-UNEXPECTED-FAIL | (xpcshell/head.js) | [Exception... "Component returned failure code: 0x80630001 [mozIStorageConnection.close]"  nsresult: "0x80630001 (<unknown>)"  location: "JS frame :: /home/visbrero/rev_control/hg/mozilla-central/obj-firefox-normal/_tests/xpcshell/test_storage/unit/head_storage.js :: cleanup :: line 65"  data: no]
That's all sorts of sadfaces since that is NS_ERROR_STORAGE_BUSY.  I'll look at the code tomorrow, but I was not seeing that locally :/
I'm on a laptop with an old-school non-SSD drive... on linux.  So I am probably seeing a different event sequence from you.

It looks like the laptop's system version of gdb has serious issues with trying to debug the amalgamation, breakpoints and stepping don't make a lot of sense.  (Line mapping issues?)  I don't have time to spin an archer gdb build right now, so I'm going to have to punt on the debugging assist...

However, inspection would suggest that the problem is that test_close_succeeds_with_finalized_async_statement issues a "SELECT * FROM TEST" which will be a multi-step execution as long as there is at least one row (which there is)... so the "/* If there are any outstanding VMs, return SQLITE_BUSY. */" case in sqlite3_close probably applies.

I suspect you can increase the probability of your encountering the problem if you insert 10,000 extra rows or something like that.

But the inspection conclusion is that you may need some form of Cancel (for the shutdown case only) that either calls sqlite3_reset itself on the statement or waits for a notification that the async thread has gotten far enough to take care of it itself.
Comment on attachment 409806 [details] [diff] [review]
v3.0

Per my previous comment, I think the central issue of the patch is resurgent and a revised patch is in order.

Given that the problem the patch is trying to solve is more serious than this new wrinkle, I'm okay with an r=asuth for an interim landing of this patch that does not close this bug until the wrinkle is dealt with in an additional patch.
Attachment #409806 - Flags: review?(bugmail) → review-
Whiteboard: [needs review asuth]
Yeah, I came to the same conclusion last night while falling asleep.  I need to think more about the threadsafety bits of this patch before I post a new iteration.  Expect something late today or tomorrow.
And, for what it's worth, we are hitting http://www.sqlite.org/hlr10000.html#H12014
So, another option is to no longer allow the calling of close on a database connection that uses asynchronous statements.  Those consumers would *have* to call closeWhenDone.  Close would return an error when that case arises.

We'd still have the case when the destructor calls close, but that would be OK if we add an internal only method that does the work of close without the check.  The connection is held onto as long as any statement exists, and the statement exists up until we call StatementData::finalize.  At this point, it'd be safe to finalize the asynchronous statement, too.  I think my logic works out here, but I need to sleep on it.  asuth, if you want to error check me too, that'd be awesome.
(In reply to comment #25)
> We'd still have the case when the destructor calls close, but that would be OK
> if we add an internal only method that does the work of close without the
> check.  The connection is held onto as long as any statement exists, and the
> statement exists up until we call StatementData::finalize.  At this point, it'd
> be safe to finalize the asynchronous statement, too.  I think my logic works
> out here, but I need to sleep on it.  asuth, if you want to error check me too,
> that'd be awesome.

AsyncExecuteStatements holds a reference to the Connection too.  In the case where there is no registered callback, the AsyncExecuteStatement is going to live until its Run method completes.  In the case where it has a callback, it will never die because CallbackResultNotifier will never get run and thus never die.

Which is to say, if we forbid people from calling close() when async stuff is involved, it's more likely than not that the connection will never actually be shutdown if they don't do the asyncClose() thing properly.

I don't have a problem with that.

So then the plan is that you will 1) add a comment on close(), 2) make close() explode if an async statement was ever dispatched, and 3) fix the unit test in question to use asyncClose.  That's all pretty straightforward, so r=asuth for that.
Attachment #409806 - Flags: review- → review+
Whiteboard: [patch has reviews, needs minor fixes]
(In reply to comment #26)
> So then the plan is that you will 1) add a comment on close(), 2) make close()
> explode if an async statement was ever dispatched, and 3) fix the unit test in
> question to use asyncClose.  That's all pretty straightforward, so r=asuth for
> that.
Well, you might be able to call close still if there are no more pending async statements, but maybe I shouldn't worry about that.  If I feel the changes are big enough to have review again, I'll request it.
(In reply to comment #27)
> Well, you might be able to call close still if there are no more pending async
> statements, but maybe I shouldn't worry about that.  If I feel the changes are
> big enough to have review again, I'll request it.

I guess it depends on the purpose of the error.  I was thinking the purpose was to say: "Hey, you!  You are using asynchronous statements but are not closing things correctly.  This may work for you now, but this is not the right way to close a connection used for async things, and you will likely get burned if you ever try and quit the program when there are still async statements in flight."

The other potential purpose of the error could be to say: "You currently still have asynchronous statements outstanding, so the close is failing for now, you should maybe call asyncClose or maybe not shutdown yet or something."

The latter error scenario allows for extremely clever code to deal with that rare case, but the counter-argument is that the rareness of the case means it is more likely to simply bite people.

I would vote for the former rather than the latter.
Attached patch v4.0 (obsolete) — Splinter Review
Too much code churn here for me to be comfortable with carrying forward review.  This code is hard :(
Attachment #409806 - Attachment is obsolete: true
Attachment #410323 - Flags: superreview?(vladimir)
Attachment #410323 - Flags: review?(bugmail)
Whiteboard: [patch has reviews, needs minor fixes] → [needs review asuth][needs sr vlad]
Comment on attachment 410323 [details] [diff] [review]
v4.0

you attached v3.0 as v4.0 again.  qrefresh or what not :)
Attachment #410323 - Attachment is obsolete: true
Attachment #410323 - Flags: superreview?(vladimir)
Attachment #410323 - Flags: review?(bugmail)
Attached patch v4.0 (obsolete) — Splinter Review
Ugh, I swear I had qrefreshed.  It's even in my online mq.  This only has three lines of context, but this is probably fine for sr.  I know asuth will use review board and he'll get as much context as he wants.
Attachment #410437 - Flags: superreview?(vladimir)
Attachment #410437 - Flags: review?(bugmail)
Comment on attachment 410437 [details] [diff] [review]
v4.0

Your main doc comment on close() is now contradictory to the new note about exploding if you have ever used an asynchronous statement. ("Closes a database connection and cancels the execution of any pending or in-progress asynchronous statements.")

test_asyncClose_does_not_complete_before_statements has a noble goal but provides no guarantee of actually accomplishing the goal.  Perhaps it should be marked with an XXX comment about it being limited by our inability to compel thread ordering?  (The async statement could easily and reliably run to completion before asyncClose ever gets a chance to run, allowing asyncClose to be written in a way that would complete before the statements.
Attachment #410437 - Flags: review?(bugmail) → review+
(In reply to comment #32)
> (From update of attachment 410437 [details] [diff] [review])
> Your main doc comment on close() is now contradictory to the new note about
> exploding if you have ever used an asynchronous statement. ("Closes a database
> connection and cancels the execution of any pending or in-progress asynchronous
> statements.")
Whoops - I thought I had proofread that.  I'll fix it before checking in.

> test_asyncClose_does_not_complete_before_statements has a noble goal but
> provides no guarantee of actually accomplishing the goal.  Perhaps it should be
> marked with an XXX comment about it being limited by our inability to compel
> thread ordering?  (The async statement could easily and reliably run to
> completion before asyncClose ever gets a chance to run, allowing asyncClose to
> be written in a way that would complete before the statements.
While this is true (and I could devise a way to ensure thread ordering, but it'd have to be a native code test), I think this test will still be sufficient in that if it ever fails (ie, is random), we have a bug.  I could write up the native code test if you want though.
Whiteboard: [needs review asuth][needs sr vlad] → [needs sr vlad]
(In reply to comment #33)
> While this is true (and I could devise a way to ensure thread ordering, but
> it'd have to be a native code test), I think this test will still be sufficient
> in that if it ever fails (ie, is random), we have a bug.  I could write up the
> native code test if you want though.

Yeah, I imagine it wouldn't be too hard to dispatch events to the async thread that block further execution until the main thread releases them via a semaphore or the like.  I don't think this test is very important/worth the effort, and agree it could conceivably do something useful if the storage implementation regressed.

The request for the comment was just because a lot of time if people are reading tests to understand the system or investigate a bug, they may take them at their name and make a wrong conclusion about what the tests actually tests/assume the async dispatching works differently, etc.
Attachment #410437 - Flags: superreview?(vladimir) → superreview+
Blocks: 526601
Whiteboard: [needs sr vlad]
Attached patch v4.1 (obsolete) — Splinter Review
for checkin

Need to wait to land until bug 526601 lands.
Attachment #410437 - Attachment is obsolete: true
Whiteboard: [can land with bug 526601]
http://hg.mozilla.org/mozilla-central/rev/15a2aadd04dd
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Flags: in-testsuite+
OS: Mac OS X → All
Hardware: x86 → All
Resolution: --- → FIXED
Whiteboard: [can land with bug 526601]
Whiteboard: [baking]
weird compiler error on windows, so I had to back this out
http://hg.mozilla.org/mozilla-central/rev/146dbbf6421b
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [baking]
I managed to leave the new idl file in when I backed out, so the new patch will be missing that.
Attached patch v4.2 (obsolete) — Splinter Review
Pushed this one to the try server.  I suspect it will work this time.
Attachment #381160 - Attachment is obsolete: true
Attachment #410579 - Attachment is obsolete: true
Attachment #411539 - Flags: review?(bugmail)
Attachment #411539 - Flags: review?(bugmail) → review+
OK, so this is causing a crash in a places unit test still.  I understand the reasoning, so this patch needs to change, but it'll be a trivial change.  I'll make the diff on top of the currently reviewed patch.
Was able to reproduce the crash in an OPT build only.  What's happening the SQLite database (sqlite3 object) is getting closed on the background thread, while we end up trying to create a statement in the main thread.  I'm trying to establish what is being created, but we need to add protection for this case in our API.  We have two options:

1) Once our event runs on the background thread to close, dispatch an event to the calling thread to actually do the close.  Also apply the restriction that close and asyncClose must be called on the same thread that created the databases.  Otherwise, we'll have to lock all access to mDBConn, which sorts sucks.
2) check mAsyncExecutionThread every time we check mDBConn when entering a method.  This would require obtaining a mutex for every method call, which is sucky.

I'm inclined to go with (1), but looking for feedback.
(In reply to comment #42)
> 1) Once our event runs on the background thread to close, dispatch an event to
> the calling thread to actually do the close.  Also apply the restriction that
> close and asyncClose must be called on the same thread that created the
> databases.  Otherwise, we'll have to lock all access to mDBConn, which sorts
> sucks.

This sounds resasonable (including the invariant).
will this fix bug 525356?
(In reply to comment #44)
> will this fix bug 525356?
It's not clear to me why that bug is failing yet.  I highly doubt that it will fix it though.
And just so everything is crystal clear, the statement that was being created was this one:
http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/tests/sync/head_sync.js#214

In fact, sync/test_database_sync_after_shutdown.js is only passing currently on mozilla-central because we spin the event loop.  It won't pass with the changes I proposed in comment 42, so I'll have to fix that test as well.
Attached patch v4.2 addendum (obsolete) — Splinter Review
Per comment 42.  This is on top of attachment 411539 [details] [diff] [review].
Attachment #412256 - Flags: review?(bugmail)
Whiteboard: [needs review asuth]
All good on the try server this time too.
Comment on attachment 412256 [details] [diff] [review]
v4.2 addendum

I loved the v3.0 patch as much as everyone else, but please stop posting it.

Since your alleged posting of the patch is more recent than anything in your patch queue, I'm going to wait for you to attach your revised patch for realsies.
Attachment #412256 - Attachment is obsolete: true
Attachment #412256 - Flags: review?(bugmail)
(In reply to comment #49)
> (From update of attachment 412256 [details] [diff] [review])
> I loved the v3.0 patch as much as everyone else, but please stop posting it.
Really?  Man...fail on my part.  I'm home sick today, so this will sadly have to wait until tomorrow.
Whiteboard: [needs review asuth] → [needs new patch]
Patch repost ping (assuming you are recovered).
(In reply to comment #51)
> Patch repost ping (assuming you are recovered).
Within the hour!  I'm not fully recovered, but my roommate grabbed my work machine yesteday, so I'll have it up shortly.
Attached patch v4.3 (obsolete) — Splinter Review
No easy way for me to get the added bits now, sorry.
Attachment #411539 - Attachment is obsolete: true
Attachment #413085 - Flags: review?(bugmail)
Attachment #413085 - Flags: review?(bugmail) → review+
Whiteboard: [needs new patch] → [can land]
This no longer blocks as bug 519438 no longer blocks; we'll definitely want this on mozilla-central once that tree re-opens, though, but thankfully we don't have to take the late-compat hit.
blocking2.0: --- → ?
Flags: blocking1.9.2+ → blocking1.9.2-
Whiteboard: [can land]
Blocks: 525356
http://hg.mozilla.org/mozilla-central/rev/f91a016416d1
Status: REOPENED → RESOLVED
Closed: 15 years ago15 years ago
Resolution: --- → FIXED
No longer blocks: 525356
That log says comm-central... what is the actual test run from?  It would not be surprising that gloda would leak given the change in semantics of close without a change in gloda...
(In reply to comment #57)
> That log says comm-central... what is the actual test run from?  It would not
> be surprising that gloda would leak given the change in semantics of close
> without a change in gloda...
The same thing was showing up on mozilla-central, but this log was shorter and easier to find.  zpao just indicated that maybe this is failure across the board due to build issues.  It's a leak build test, which isn't run on the try server sadly.
Attached patch v4.4 (obsolete) — Splinter Review
Unbit rot v4.3.  This passes tests locally, and I'll try pushing it tomorrow morning.
Attachment #413085 - Attachment is obsolete: true
Looked OK on the try server.
What's the word on this patch, sdwilsh?  Can it land or is it believed there are unit tests in the mozilla tree that blind-fire executeAsync calls without waiting for them to complete and so the unit tests need to be adapted to use asyncClose and spin an event loop waiting for the close so shutdown won't leak?
It should be able to land, I just haven't had time to watch the tree and push this :/
Comment on attachment 419585 [details] [diff] [review]
v4.4

Funny story.  We're not calling nsIThread::shutdown anymore with this patch and that's the only way to actually make the thread go away.

The thread manager has our back and masks the issue because it shuts down all the threads when it gets shutdown.

I think the root motivation of this bug was that we only spin a nested event loop in strictly controlled situations rather than in arbitrary situations at the whim of chrome code.  As such, we should probably be having the AsyncCloseConnection runnable call shutdown.  If we modify it to not assume it has a living connection (and remember the async thread) we can also use it to provide a controlled situation to shutdown the async thread in for Connection::Close too.

Since bug 507414 depends on this patch I will make such changes and provide a revised version of this patch.
Attachment #419585 - Flags: review-
Attached patch v4.5Splinter Review
minor rev on your patch with changes as described.  I'm uploading the straight patch with 3 lines of context in the hope it makes interdiff's life easier.
Attachment #419585 - Attachment is obsolete: true
Attachment #426208 - Flags: review?(sdwilsh)
Comment on attachment 426208 [details] [diff] [review]
v4.5

r=sdwilsh
Attachment #426208 - Flags: review?(sdwilsh) → review+
Blocks: 536978
http://hg.mozilla.org/mozilla-central/rev/1fa9d63fa84b
Status: REOPENED → RESOLVED
Closed: 15 years ago14 years ago
Flags: in-litmus-
Resolution: --- → FIXED
Target Milestone: mozilla1.9.3a1 → mozilla1.9.3a3
Depends on: 551917
blocking2.0: ? → beta1+
After review, this doesn't look like it actually impacts docs. Removing doc needed keyword.
Keywords: dev-doc-needed
(In reply to comment #67)
> After review, this doesn't look like it actually impacts docs. Removing doc
> needed keyword.
Er, it adds a method, namely asyncClose that isn't documented, and better documents close.
Keywords: dev-doc-needed
Hm. Apparently I was looking at the wrong patch when I looked at this yesterday. I'll get that written up.
As best I can tell it looks like v4.4 of this patch landed, not v4.5, so the brokenness described in comment 63 is still with us 2 years later :(
(which is now tracked in bug 802239)
Depends on: 802239
You need to log in before you can comment on or make changes to this bug.