Closed Bug 728653 Opened 12 years ago Closed 12 years ago

Crash in mozilla::storage::Connection::stepStatement @ sqlite3_extended_result_codes

Categories

Toolkit :: Storage (defect)
Version: 13 Branch
Priority: Not set
Severity: critical

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla13

People

(Reporter: scoobidiver, Assigned: espindola)

References

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(1 file, 4 obsolete files)

It's a low-volume crash, but there's a spike in crashes from 13.0a1/20120218. The regression range is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=2271cb92cc05&tochange=550779e6bab4

Signature 	sqlite3_extended_result_codes
UUID	af5e2efa-2bdf-4183-bcac-cd8622120219
Date Processed	2012-02-19 03:49:57
Uptime	5180
Last Crash	1.6 hours before submission
Install Age	2.3 hours since version was first installed.
Install Time	2012-02-19 01:31:57
Product	Firefox
Version	13.0a1
Build ID	20120218031156
Release Channel	nightly
OS	Windows NT
OS Version	6.1.7601 Service Pack 1
Build Architecture	x86
Build Architecture Info	AuthenticAMD family 15 model 31 stepping 0
Crash Reason	EXCEPTION_ACCESS_VIOLATION_READ
Crash Address	0x44
App Notes 	
AdapterVendorID: 0x10de, AdapterDeviceID: 0x01d3, AdapterSubsysID: 140110b0, AdapterDriverVersion: 8.17.12.8562
D2D? D2D-
D3D10 Layers? D3D10 Layers-
D3D9 Layers? D3D9 Layers+
EMCheckCompatibility	True	
Total Virtual Memory	4294836224
Available Virtual Memory	3806265344
System Memory Use Percentage	83
Available Page File	4242325504
Available Physical Memory	362340352

Frame 	Module 	Signature 	Source
0 	mozsqlite3.dll 	sqlite3_extended_result_codes 	db/sqlite3/src/sqlite3.c:114877
1 	xul.dll 	mozilla::storage::Connection::stepStatement 	storage/src/mozStorageConnection.cpp:881
2 	xul.dll 	mozilla::storage::BindingParams::`scalar deleting destructor' 	
3 	mozalloc.dll 	mozalloc.dll@0x10af 	
4 	xul.dll 	mozilla::storage::Statement::ExecuteStep 	storage/src/mozStorageStatement.cpp:618
5 	xul.dll 	mozilla::storage::Statement::Execute 	storage/src/mozStorageStatement.cpp:582
6 	xul.dll 	nsAnnotationService::RemoveAnnotationInternal 	toolkit/components/places/nsAnnotationService.cpp:1437
7 	xul.dll 	nsAnnotationService::RemoveItemAnnotation 	toolkit/components/places/nsAnnotationService.cpp:1465
8 	xul.dll 	NS_InvokeByIndex_P 	xpcom/reflect/xptcall/src/md/win32/xptcinvoke.cpp:102
9 	xul.dll 	XPC_WN_CallMethod 	js/xpconnect/src/XPCWrappedNativeJSOps.cpp:1542
10 	xul.dll 	xul.dll@0x10861f 	
11 	mozjs.dll 	js::InvokeKernel 	js/src/jsinterp.cpp:499
12 	mozjs.dll 	js::Interpret 	js/src/jsinterp.cpp:2694
...

More reports at:
https://crash-stats.mozilla.com/report/list?signature=sqlite3_extended_result_codes
It's currently #2 top browser crasher in 13.0a1 over the last day.

It might be a regression from bug 720258.
Keywords: topcrash
If I had to blindly point to a changeset in that range I'd probably say bug 721603, so it may be an ownership bug that we hit when spinning. It may be hard to reproduce. The call comes from js and we don't have the js stack; something is invoking removeItemAnnotation on shutdown.

(In reply to Scoobidiver from comment #1)
> It's currently #2 top browser crasher in 13.0a1 over the last day.
> 
> It might be a regression from bug 720258.

Hard to tell, though that bug doesn't touch annotations at all.
Some stacks have "unrelated" je_free or mozalloc calls in the middle, which may confirm that looping is related.
We may try to disable spinning for a Nightly and see if the crashes disappear, just to get some additional data.
I also got this crash this morning when I closed Firefox: https://crash-stats.mozilla.com/report/index/bp-22bf4442-c1ad-48cf-84ac-eac322120221
thanks Marco, this confirms it's a crash on shutdown, so my suspicion is probably correct.
Blocks: 721603
Assignee: nobody → respindola
Looks like a case where spinning the loop makes us run code we were not running before, and that code expects the database to be open. Investigating.
I added asserts to try to make this more reproducible and pushed to

https://tbpl.mozilla.org/?tree=Try&rev=a6fbecc27538
The try looks green :-(

This is my first time trying to debug a nightly crash. Any tips? A way of getting a js stack or at least the values passed to RemoveItemAnnotation would be very helpful.

I tried proofreading the js code that calls it, but quickly got to PlacesUIUtils.jsm, which has many possible callers. I downloaded the build from that try run and I am currently trying to change bookmarks, set up sync, etc., to try to make it crash.
btw, I agree with mak that this would go away if we stopped spinning the loop. Should we revert that patch for now?
I'd just comment out the Spin() call with a comment pointing to this bug for now, that would be enough right?
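
For context, a blocking Spin() of this kind is essentially a nested event loop. A minimal sketch (hypothetical shape, assuming an mDone flag set by the close callback; the real helper lives in toolkit/components/places/Database.cpp):

  // Requires nsThreadUtils.h; names other than Spin() are assumptions.
  void BlockingConnectionCloseCallback::Spin() {
    // Process main-thread events until the connection-closed notification
    // arrives. This is what lets already-queued events run during shutdown.
    nsCOMPtr<nsIThread> thread = do_GetCurrentThread();
    while (!mDone)
      NS_ProcessNextEvent(thread);
  }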
(In reply to Rafael Ávila de Espíndola (:espindola) from comment #6)
> Looks like a case where spinning the loop makes us run code we were not
> running before, and that code expects the database to be open. Investigating.
This is exactly why we should avoid spinning the event loop at all costs.
> This is exactly why we should avoid spinning the event loop at all costs.

No, there are other reasons, but this is not one of them.

We just stay alive longer, and code is trying to use the places db after it has been shut down.

I will upload a patch commenting out the spin and pointing to this bug.
Comment on attachment 599572 [details] [diff] [review]
disable the spinning for now

Review of attachment 599572 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/places/Database.cpp
@@ +1768,5 @@
>      new BlockingConnectionCloseCallback();
> +  // The spinning is temporarily disabled. See bug 728653.
> +  (void)mMainConn->AsyncClose(NULL);
> +  //  closeListener->Spin();
> +  closeListener->Complete();

just move the comment before the commented out Spin

You can still pass closeListener to AsyncClose and it will notify on completion by itself; why do you have to remove that and call it manually? As it is, this will notify too early.
Attached patch new patch (obsolete) — Splinter Review
I was afraid of the function returning before the notification, but that is what we used to do, so it is probably the best way of disabling this.
Attachment #599572 - Attachment is obsolete: true
Attachment #599576 - Flags: review?(mak77)
Attachment #599572 - Flags: review?(mak77)
Comment on attachment 599576 [details] [diff] [review]
new patch

Review of attachment 599576 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/places/Database.cpp
@@ +1767,5 @@
>    nsRefPtr<BlockingConnectionCloseCallback> closeListener =
>      new BlockingConnectionCloseCallback();
>    (void)mMainConn->AsyncClose(closeListener);
> +  // The spinning is temporarily disabled. See bug 728653.
> +  //  closeListener->Spin();

really-nit: all those whitespaces look useless!
Attachment #599576 - Flags: review?(mak77) → review+
looking at the stacks, they look like

nsAnnotationService::RemoveItemAnnotation
js garbage
mozilla::net::HttpBaseChannel::DoNotifyListener (calls OnStartRequest)
mozilla::net::HttpAsyncAborter<nsHttpChannel>::HandleAsyncAbort
nsRunnableMethodImpl<void , true>::Run

the last runnable is created by AsyncAbort (via AsyncCall).

And the AsyncClose Run method looks like

    (void)mConnection->internalClose(); <- sets mDBConn to NULL
    if (mCallbackEvent)
      (void)mCallingThread->Dispatch(mCallbackEvent, NS_DISPATCH_NORMAL);

What I think is happening is that the TODO list for our thread looks like

...| AsyncCloseConnection | HttpAsyncAborter<nsHttpChannel>::HandleAsyncAbort |

When AsyncCloseConnection::Run executes, it will set mDBConn to NULL and dispatch the callback. The TODO list now looks like

| HttpAsyncAborter<nsHttpChannel>::HandleAsyncAbort | CallBack

When the HandleAsyncAbort executes mDBConn is already null and we get our segmentation fault.

I can see two ways we can fix this:

* Make sure all http requests are done by the time we try to close the places db (and make sure no new ones are created)
* Make anything that has a callback from an http realize that the db might already be closed.

The second option is probably the best unless we have something that really must be written to disk because an http connection got canceled (really?).

Now, any guesses on which onStartRequest implementation we have on the stack?
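
To make the event ordering concrete, here is a toy model of the queue described above (illustration only; the names come from the stacks, and the mechanics are deliberately simplified):

  #include <cstdio>
  #include <deque>
  #include <functional>

  int main() {
    int conn = 0;
    int* mDBConn = &conn;                    // stands in for the sqlite3 handle
    std::deque<std::function<void()>> queue; // the thread's TODO list

    // ...| AsyncCloseConnection | HandleAsyncAbort |
    queue.push_back([&] {                    // AsyncCloseConnection::Run
      mDBConn = nullptr;                     // internalClose() nulls the handle
      queue.push_back([] { std::puts("close callback runs last"); });
    });
    queue.push_back([&] {                    // HandleAsyncAbort -> OnStartRequest
      // By now mDBConn is already null: this is the crash point.
      std::printf("abort handler sees mDBConn=%p\n", (void*)mDBConn);
    });

    while (!queue.empty()) {
      auto task = queue.front();             // copy first: tasks may enqueue more
      queue.pop_front();
      task();
    }
  }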
Attached patch don't notify during shutdown (obsolete) — Splinter Review
Mak, what do you think?

The idea is that notifications about connections are not interesting for anything we have to record on disk, so we skip them during shutdown.

https://tbpl.mozilla.org/?tree=Try&rev=e27bd3c076a6
Attachment #599716 - Flags: review?(mak77)
https://hg.mozilla.org/mozilla-central/rev/529fdd0c82a1
https://hg.mozilla.org/mozilla-central/rev/44b5705eeb21

Leaving open for unreviewed patch. In the future, can you please add [leave open] to the whiteboard (along the lines of https://wiki.mozilla.org/Tree_Rules/Inbound#Please_do_the_following_after_pushing_to_inbound)? I only happened to spot this by chance, and once I create something automated, the whiteboard will be the only thing it looks at.
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla13
(In reply to Rafael Ávila de Espíndola (:espindola) from comment #20)
> looking at the stacks, they look like

So, based on IRC discussion, I think what was hitting annotations on shutdown were livemarks. The description of the events matches perfectly what they do. Bug 613588 removes all of this, though add-ons may still do something similar, so we still need a more general fix.

> What I think is happening is that the TODO list for our thread looks like

> ...| AsyncCloseConnection |
> HttpAsyncAborter<nsHttpChannel>::HandleAsyncAbort |
> 
> When AsyncCloseConnection::Run executes, it will set mDBConn to NULL and
> dispatch the callback. The TODO list now looks like
> 
> | HttpAsyncAborter<nsHttpChannel>::HandleAsyncAbort | CallBack
> 
> When the HandleAsyncAbort executes mDBConn is already null and we get our
> segmentation fault.

This is very plausible, though I don't see how spinning the loop makes this different from before. I suppose that event was executed before too, on merging the threads. Why were we not crashing? And why do we see destructors and free calls in the middle of the stack? My fear is that we really are spinning, and one of the events is spinning as well (my supposition here is Sync).

We may need to add some additional crash protection in Storage in addition to fixes here, to avoid entering SQLite in a bad state.

> I can see two ways we can fix this:
> 
> * Make sure all http requests are done by the time we try to close the
> places db (and make sure no new ones are created)
> * Make anything that has a callback from an http realize that the db might
> already be closed.
> 
> The second option is probably the best unless we have something that really
> must be written to disk because an http connection got canceled (really?).

Well, considering the kind of data we handle at the end of those requests, I see no problem in Places and many other components. Though I can't speak globally for the whole codebase; there may be something that relies on these notifications.

Thus I think the healthier approach is to start getting feedback from Necko peers; I don't know that code well enough to tell what the implications of the change are.
> So, based on IRC discussion, I think what was hitting annotations on
> shutdown were livemarks. The description of the events matches perfectly
> what they do. Bug 613588 removes all of this, though add-ons may still do
> something similar, so we still need a more general fix.

So that should fix this. I still think we should handle the general case if we can.
 
> This is very plausible, though I don't see how spinning the loop makes this
> different from before.

It makes us get here. Without it we would quit with the last TODO items scheduled but not executed. In particular, we would not execute the notification that an http connection was canceled.

> I suppose that event was executed before too, on
> merging the threads. Why were we not crashing?

This part I do not know, but before we spun the loop we would quit without internalClose running, so we know that some items were not being run. The http notifications were probably one of them.

> And why do we see destructors
> and free calls in the middle of the stack?

Like this:

2	xul.dll	mozilla::storage::BindingParams::`vector deleting destructor'	
3	mozalloc.dll	mozalloc.dll@0x10af

Looks like it is just breakpad getting confused with destroying BindingParams. Note that this doesn't show up with other compilers.

> My fear is that we really are
> spinning, and one of the events is spinning as well (my supposition here is
> Sync).

Then that would show up in the stack. The spin point is way up from that destructor call.

> We may need to add some additional crash protection in Storage in addition
> to fixes here, to avoid entering SQLite in a bad state.

Something like what I added in

https://hg.mozilla.org/try/rev/a6fbecc27538

with the "what do we do" replaced by a return? I would like to avoid that if we can, as I am afraid it will just hide harder-to-find bugs.
 
> Thus I think the healthier approach is to start getting feedback from Necko
> peers; I don't know that code well enough to tell what the implications of
> the change are.

cced biesi.
Attachment #599716 - Flags: feedback?(jduell.mcbugs)
I don't think there is a way to have multiple feedback? flags, but we need someone to comment on the Necko changes.
Blocks: 662444
No longer blocks: 721603
Attachment #599716 - Flags: feedback?(cbiesinger)
I just realized that the behavior for NeckoChild.cpp is unchanged:

+ HttpChannelChild* httpChannel = new HttpChannelChild(NULL);

....

+ if (mListener && !(mConnMgr && mConnMgr->isShuttingDown())) {

mak, does this really need to block on Necko feedback?
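
Spelled out, the second fragment presumably guards the listener callbacks like this (assumed surrounding context, in the spirit of HttpBaseChannel::DoNotifyListener; only the if-condition is from the patch):

  // Skip the OnStart/OnStopRequest callbacks once the connection manager
  // reports shutdown, instead of notifying on a channel being torn down.
  if (mListener && !(mConnMgr && mConnMgr->isShuttingDown())) {
    mListener->OnStartRequest(this, mListenerContext);
    mListener->OnStopRequest(this, mListenerContext, mStatus);
  }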
Comment on attachment 599716 [details] [diff] [review]
don't notify during shutdown

This makes me incredibly scared and I really don't want to see this checked in. Not sending any listener notifications seems like a recipe for disaster to me, it is breaking a longstanding necko contract that listener notifications will always be sent. Can you just handle this in the listener somehow?
Attachment #599716 - Flags: feedback?(cbiesinger) → feedback-
(In reply to Rafael Ávila de Espíndola (:espindola) from comment #26)
> mak, does this really need to block on Necko feedback?

I'm not sure what the question is; I surely can't approve any changes at the Necko level.
I'd be fine if we figured out a fix in Storage to avoid crashing (and possibly a testcase).
> This makes me incredibly scared and I really don't want to see this checked
> in. Not sending any listener notifications seems like a recipe for disaster
> to me, it is breaking a longstanding necko contract that listener
> notifications will always be sent. Can you just handle this in the listener
> somehow?

May I ask what the use is of getting a notification of an http connection being closed during shutdown? What are you doing with the notification?

I can use a check along the lines of

if (we are shutting down)
   drop write request

in the annotation service. That is effectively what is happening today, since we exit without even waiting for the places db connection to be closed. Do you really find that a better solution? If so, I can upload a patch for review.
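
A standalone illustration of that check (hypothetical names; in the real tree the flag and the early return would live inside the annotation service):

  #include <cstdio>

  static bool gShuttingDown = false;  // assumed flag, flipped during shutdown

  static bool RemoveAnnotationGuarded(const char* aName) {
    if (gShuttingDown) {
      // Drop the write: the places db may already be closed, and this data
      // does not need to survive shutdown anyway.
      return false;
    }
    std::printf("would run the DELETE for annotation '%s'\n", aName);
    return true;
  }

  int main() {
    RemoveAnnotationGuarded("livemark/feedURI");  // normal operation
    gShuttingDown = true;
    RemoveAnnotationGuarded("livemark/feedURI");  // silently dropped
  }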
> I'd be fine if we figured out a fix in Storage to avoid crashing (and
> possibly a testcase).

I can easily implement the test in comment 29. Are you ok with it? A testcase without stubbing the shutdown process is hard to write, as we need a connection shutdown propagating to an annotation request at just the right time. From comment 20, the thread TODO list has to look like

...| AsyncCloseConnection | HttpAsyncAborter<nsHttpChannel>::HandleAsyncAbort |
Killing the annotation writes is not a general enough fix imo; it's just one thing out of the thousands add-ons may do, which would still crash at the same exact point. And it doesn't prevent anyone from adding other problematic writes in the future. It would not even be useful, since we already fixed this specific crash instance by rewriting livemarks.

We have uncovered a possible crashing point in Storage, and we should stop the crash from Storage, so that trying to use a statement this late would just abruptly fail without crashing. Figuring out a stop-crash strategy in Storage would also probably allow us to write a proper test.
Attached patch "fix" the problem (obsolete) — Splinter Review
Attachment #599576 - Attachment is obsolete: true
Attachment #599716 - Attachment is obsolete: true
Attachment #601113 - Flags: review?(mak77)
Attachment #599716 - Flags: review?(mak77)
Attachment #599716 - Flags: feedback?(jduell.mcbugs)
(In reply to Christian :Biesinger (don't email me, ping me on IRC) from comment #27)
> Comment on attachment 599716 [details] [diff] [review]
> don't notify during shutdown
> 
> This makes me incredibly scared and I really don't want to see this checked
> in. Not sending any listener notifications seems like a recipe for disaster
> to me, it is breaking a longstanding necko contract that listener
> notifications will always be sent. Can you just handle this in the listener
> somehow?

We need to change how we do shutdown for bug 662444. That involves changing some semantics. I don't see why it's incredibly scary to avoid telling someone about our connections on shutdown.
So, at the time of the review I wasn't aware that the plan was to shut down using exit(0), but still, it does not seem like the right fix is to avoid sending status updates when you could easily change the recipient of the update to not crash. Some listeners may want to do stuff, even on shutdown.
(In reply to Christian :Biesinger (don't email me, ping me on IRC) from comment #34)
> So, at the time of the review I wasn't aware that the plan was to shut down
> using exit(0), but still, it does not seem like the right fix is to avoid
> sending status updates when you could easily change the recipient of the
> update to not crash. Some listeners may want to do stuff, even on shutdown.

*may* is the key. It is possible to imagine something that breaks any interesting patch, but to get to exit(0) we will have to change the semantics of which messages are available during shutdown and have code adapt accordingly.

Having said that, let's try to keep this bug to the minimum possible to avoid accessing a null db during shutdown. I don't like the current patch but it seems to be what most prefer, so be it.
Comment on attachment 601113 [details] [diff] [review]
"fix" the problem

Review of attachment 601113 [details] [diff] [review]:
-----------------------------------------------------------------

I sent a mail to sdwilsh, asuth and Rafael with some deeper thoughts on the situation at shutdown and how we may crash, just extracting what happens here, imo

1. all cached statements are finalized
2. asyncClose runs
3. something asks the StatementCache for removeItemAnnotationStatement, it makes a new one
4. internalClose runs and sets mDBConn to null, it fails to close the connection since there is an existing statement, but regardless we set mDBConn to null!
5. something asks the StatementCache for removeItemAnnotationStatement, it exists in the cache, so it is just returned
6. the statement gets executed: Statement::ExecuteStep() is invoked, and it invokes mDBConnection->stepStatement() (the statement's internal pointer to the connection is still valid, so this doesn't crash)
7. stepStatement tries to use mDBConn, null dereference, crash 

Based on this I think this patch is correct, but it wants a much nicer comment explaining how we may enter here with a null mDBConn: that is, when closing the connection fails due to non-finalized statements and one of these non-finalized statements is executed after internalClose.

Though we also want an additional bug to make Storage less prone to this kind of shutdown crash, since this is one possible crash point but there may be others hitting similar paths. For this I'm asking for feedback from long-experienced Storage peers.
Attachment #601113 - Flags: review?(mak77) → feedback+
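
For reference, step 4 above corresponds to a shape like this (assumed, not the actual mozStorageConnection.cpp source): sqlite3_close() fails with SQLITE_BUSY while unfinalized statements exist, but the member pointer is cleared anyway.

  #include <sqlite3.h>

  struct ConnectionSketch {
    sqlite3* mDBConn;

    int internalClose() {
      int srv = ::sqlite3_close(mDBConn);  // SQLITE_BUSY if statements remain
      mDBConn = nullptr;                   // cleared even on failure (step 4)
      return srv;
    }
  };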
> 1. all cached statements are finalized
> 2. asyncClose runs
> 3. something asks the StatementCache for removeItemAnnotationStatement, it
> makes a new one
> 4. internalClose runs and sets mDBConn to null, it fails to close the
> connection since there is an existing statement, but regardless we set
> mDBConn to null!
> 5. something asks the StatementCache for removeItemAnnotationStatement, it
> exists in the cache, so it is just returned
> 6. the statement gets executed: Statement::ExecuteStep() is invoked, and it
> invokes mDBConnection->stepStatement() (the statement's internal pointer to
> the connection is still valid, so this doesn't crash)
> 7. stepStatement tries to use mDBConn, null dereference, crash 

Looks like a reasonable hypothesis, and in line with comment 20.

> Based on this I think this patch is correct, but it wants a much nicer
> comment explaining how we may enter here with a null mDBConn: that is, when
> closing the connection fails due to non-finalized statements and one of
> these non-finalized statements is executed after internalClose.

I still think the patch is wrong, but will add a comment.

> Though we also want an additional bug to make Storage less prone to this
> kind of shutdown crash, since this is one possible crash point but there may
> be others hitting similar paths. For this I'm asking for feedback from
> long-experienced Storage peers.
Attachment #601113 - Attachment is obsolete: true
Attachment #601612 - Flags: review?(mak77)
Comment on attachment 601612 [details] [diff] [review]
now with comments

Review of attachment 601612 [details] [diff] [review]:
-----------------------------------------------------------------

::: storage/src/mozStorageConnection.cpp
@@ +886,5 @@
> +  // the list looking like
> +  // AsyncCloseConnection | HttpAsyncAborter<nsHttpChannel>::HandleAsyncAbort
> +  // When AsyncCloseConnection::Run executes the second time (now on the calling
> +  // thread), it will set mDBConn to null. It knows there will be no other
> +  // async user, but a sync one might still try to execute.

I'd prefer something more related to Storage internals, and less verbose.
The bug is always here for more insight.

r=me with a comment like (please fix any English mistakes I may have introduced):

// mDBConn may be null if the executing statement has been created and cached
// after a call to asyncClose() but before the connection has been nullified
// by internalClose().  In such a case closing the connection fails due to
// the existence of prepared statements, but mDBConn is set to null regardless.
// This usually happens when other tasks using cached statements are asynchronously
// scheduled for execution and any of them ends up running after asyncClose.
// See bug 728653 for details.
Attachment #601612 - Flags: review?(mak77) → review+
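
In standalone form, the guard being reviewed amounts to something like this (assumed shape; the landed change is attachment 601612). It bails out before touching SQLite when the connection pointer has already been nulled by internalClose():

  #include <sqlite3.h>

  static int StepStatementGuarded(sqlite3* aDBConn, sqlite3_stmt* aStatement) {
    // aDBConn (mDBConn in the real code) may be null if the statement was
    // created and cached after asyncClose() but before internalClose() ran;
    // see the suggested comment above and bug 728653.
    if (!aDBConn)
      return SQLITE_MISUSE;
    ::sqlite3_extended_result_codes(aDBConn, 1);  // the call that used to crash
    return ::sqlite3_step(aStatement);
  }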
btw, it is not a particularly good use of my time to be asked to write a comment just to have it be discarded.

https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=b92e7499019e
right, next time I will directly write a suggested comment in the review, it's a good suggestion.
https://hg.mozilla.org/mozilla-central/rev/b92e7499019e
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED