Last Comment Bug 654573 - Crash [@ mozilla::storage::StorageMemoryReporter::GetPath ] [@ mozilla::storage::Connection::getFilename() ] [@ mozilla::storage::Connection::getFilename ] after accessing about:permissions
: Crash [@ mozilla::storage::StorageMemoryReporter::GetPath ] [@ mozilla::stora...
Status: RESOLVED FIXED
[about:memory, test pilot, and teleme...
: crash, mlk
Product: Toolkit
Classification: Components
Component: Storage (show other bugs)
: Trunk
: x86 All
: -- critical with 2 votes (vote)
: mozilla7
Assigned To: Nicholas Nethercote [:njn]
:
:
Mentors:
: 655170 658649 660213 (view as bug list)
Depends on:
Blocks: 573176 mlk-fx7
  Show dependency treegraph
 
Reported: 2011-05-03 14:53 PDT by Jono Xia
Modified: 2013-12-27 14:34 PST (History)
34 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
fixed


Attachments
valgrind log (2.89 KB, text/plain)
2011-05-03 17:36 PDT, Robert Sayre
no flags Details
Better Valgrind log (52.25 KB, text/plain)
2011-05-29 21:41 PDT, Nicholas Nethercote [:njn]
no flags Details
Patch (v1) (3.94 KB, patch)
2011-05-31 15:08 PDT, :Ehsan Akhgari
n.nethercote: review-
Details | Diff | Splinter Review
logging (268.58 KB, text/plain)
2011-06-07 18:09 PDT, Nicholas Nethercote [:njn]
no flags Details
patch (895 bytes, patch)
2011-06-07 19:26 PDT, Nicholas Nethercote [:njn]
mak77: review+
gavin.sharp: approval‑mozilla‑aurora+
Details | Diff | Splinter Review

Description Jono Xia 2011-05-03 14:53:02 PDT
I just had three crashes in a row on 5.0a2 (Aurora) running Test Pilot 1.1.1.

One was when I hit the "Notify Me" button on the debug page, one was when I hit the "Reload All Studies" button on the debug page, and one was when I wasn't doing anything.

Hitting "reload all studies" or "notify me" does not reliably cause the crash, however.

https://crash-stats.mozilla.com/report/index/bp-ef735228-4aeb-453e-accb-d10232110503

https://crash-stats.mozilla.com/report/index/bp-c63744e2-9ee2-4318-8ada-162a72110503

https://crash-stats.mozilla.com/report/index/4b6394ef-777b-41fa-a79c-6330d2110503

All three crashes happened here:

mozilla::storage::StorageMemoryReporter::GetPath 	storage/src/mozStorageConnection.cpp:751 

In two cases the "crash reason" is EXC_BAD_ACCESS / 0x0000000d
in the last case it is EXC_BAD_ACCESS / KERN_INVALID_ADDRESS
Comment 1 Shawn Wilsher :sdwilsh 2011-05-03 15:14:35 PDT
Can you catch it in a debugger and see if the backtrace in those stacks is right?
Comment 2 Jono Xia 2011-05-03 15:34:49 PDT
The debug page is at chrome://testpilot/content/debug.html   if anyone wants to try reproducing it by clicking the "Notify me" or "reload all studies" buttons.

I had my profile set to use the index-dev.json channel, so it's possible it was one of the unreleased studies that triggered the crash.
Comment 3 Jono Xia 2011-05-03 15:35:29 PDT
(In reply to comment #1)
> Can you catch it in a debugger and see if the backtrace in those stacks is
> right?

I don't know how to do that.  Can you point me at some documentation?
Comment 4 Shawn Wilsher :sdwilsh 2011-05-03 15:45:13 PDT
(In reply to comment #3)
> I don't know how to do that.  Can you point me at some documentation?
Nevermind; probably only useful if you have your own build.  Can you disable any study that does anything with about:memory or the memory reporters please and see if you can reproduce it?
Comment 5 Jono Xia 2011-05-03 16:15:21 PDT
I disabled the the week-in-the-life study (which is the only study that does anything with memory reporters; it basically duplicates the code of about:memory) and haven't been able to get it to crash since, although this isn't conclusive since the crash was only intermittent to begin with.

Loading about:memory causes a crash with an identical stack trace pretty reliably.

This suggests pretty strongly that the the memory reporting code is the culprit.

I think we've seen this problem before.  A very similar crash led us to disable the week-in-the-life study for Mac OS 10.6 users on Firefox 4 beta 10.  (I'm on Mac OS 10.6.5).

I thought the underlying bug was fixed in Fx4b11 or b12.  It is possible there could have been a regression and the bug could have snuck back in to Fx5.0a2 ?
Comment 6 Shawn Wilsher :sdwilsh 2011-05-03 16:22:28 PDT
(In reply to comment #5)
> Loading about:memory causes a crash with an identical stack trace pretty
> reliably.
Interesting...

> I think we've seen this problem before.  A very similar crash led us to disable
> the week-in-the-life study for Mac OS 10.6 users on Firefox 4 beta 10.  (I'm on
> Mac OS 10.6.5).
I've not seen this in storage code before (or at least nobody told me about it).

> I thought the underlying bug was fixed in Fx4b11 or b12.  It is possible there
> could have been a regression and the bug could have snuck back in to Fx5.0a2 ?
Nothing was fixed that I know of that was the same as this, FWIW.
Comment 7 Dave Townsend [:mossop] 2011-05-03 16:25:41 PDT
(In reply to comment #6)
> (In reply to comment #5)
> > Loading about:memory causes a crash with an identical stack trace pretty
> > reliably.
> Interesting...

Does the same happen with test pilot totally disabled? Or any other extensions you might have installed?
Comment 8 Robert Sayre 2011-05-03 17:34:18 PDT
I got this to crash under valgrind, but only with the index-dev.json file selected. Can Jono or anyone else reproduce with the regular index.json file selected?
Comment 9 Robert Sayre 2011-05-03 17:36:04 PDT
Created attachment 529905 [details]
valgrind log

given the stack on the invalid read, I'm guessing this is related to the test pilot study under development that uses about:memory.
Comment 10 Shawn Wilsher :sdwilsh 2011-05-06 09:06:50 PDT
*** Bug 655170 has been marked as a duplicate of this bug. ***
Comment 11 Tony Mechelynck [:tonymec] 2011-05-07 01:35:16 PDT
Duplicate bug 655170 (seeing the same crash after typing about:memory in the URL bar) seems to indicate that this bug is not specific to Test Pilot.

See also bug 655170 comment #2 (debugging the crash source)
Comment 12 Shawn Wilsher :sdwilsh 2011-05-09 09:42:40 PDT
If someone happens to catch this crash in the debugger, I'd love to know.
Comment 13 :Margaret Leibovic 2011-05-20 10:56:26 PDT
This happened in my Nightly build. I have telemetry enabled, so that may be affecting this.

Marking as tracking-firefox6 in case this is caused by telemetry.
Comment 14 Shawn Wilsher :sdwilsh 2011-05-20 14:27:06 PDT
*** Bug 658649 has been marked as a duplicate of this bug. ***
Comment 15 (mostly gone) XtC4UaLL [:xtc4uall] 2011-05-21 03:03:22 PDT
bp-e27557ff-771a-4e5c-b51e-9f0fc2110521 [@ mozilla::storage::Connection::getFilename() ] 
Also have Telemetry enabled.
Comment 16 :Ehsan Akhgari 2011-05-22 09:50:28 PDT
This happened to me too, on Nightly with telemetry enabled.
Comment 17 Hugh Nougher [:Hughman] 2011-05-23 03:05:36 PDT
Just got this in Aurora (before the merge) when reloading about:memory.
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0a2) Gecko/20110521 Firefox/5.0a2

bp-9253ab7a-4cd3-4f24-98fa-c4a252110523
mozilla::storage::Connection::getFilename()
Comment 18 :Margaret Leibovic 2011-05-23 07:23:10 PDT
I just noticed Taras isn't cc'd to this bug, and he should be if it turns out this crash is related to telemetry.
Comment 19 (dormant account) 2011-05-23 09:16:58 PDT
(In reply to comment #13)
> This happened in my Nightly build. I have telemetry enabled, so that may be
> affecting this.
> 
> Marking as tracking-firefox6 in case this is caused by telemetry.

The only connection that telemetry has to storage is when it polls about:memory values(telemetry has to be enabled for that). So given what Jono reported, about:memory is the most likely culprit.
Comment 20 Nicholas Nethercote [:njn] 2011-05-23 16:11:15 PDT
(In reply to comment #19)
>
> Jono reported, about:memory is the most likely culprit.

No, SQLite memory reporters are the culprit.  They're called from aboutMemory.js and from the Test Pilot add-on and from telemetry code.  It looks like Connection::mDatabaseFile is set to a bogus value, and it crashes when it's dereferenced in Connection::getFilename().
Comment 21 Shawn Wilsher :sdwilsh 2011-05-23 16:35:16 PDT
(In reply to comment #20)
> No, SQLite memory reporters are the culprit.  They're called from
> aboutMemory.js and from the Test Pilot add-on and from telemetry code.  It
> looks like Connection::mDatabaseFile is set to a bogus value, and it crashes
> when it's dereferenced in Connection::getFilename().
It's just unclear at this time how mDatabaseFile is getting a bogus value.  It's an nsCOMPtr, which means we don't need to initialize it to NULL, and it is a strong reference (in the case where we have a database file).  It's also only ever set in one spot: https://hg.mozilla.org/mozilla-central/annotate/b26980e3622a/storage/src/mozStorageConnection.cpp#l479
Comment 22 Nicholas Nethercote [:njn] 2011-05-23 17:48:47 PDT
So presumably one of the callers of Connection::initialize() is passing in a bogus aDatabaseFile.
Comment 23 Robert Sayre 2011-05-23 19:05:21 PDT
so, --track-origins should lead right to it. I can investigate tomorrow.
Comment 24 Nicholas Nethercote [:njn] 2011-05-23 19:09:58 PDT
(In reply to comment #23)
> so, --track-origins should lead right to it. I can investigate tomorrow.

If it's undefined, yes.

The hard part of this bug is reproducing, AFAICT.  I've never seen this crash and I look at about:memory a lot.
Comment 25 Nicholas Nethercote [:njn] 2011-05-23 19:17:20 PDT
(In reply to comment #24)
> 
> The hard part of this bug is reproducing, AFAICT.  I've never seen this
> crash and I look at about:memory a lot.

Oh, but I haven't tried it with Test Pilot installed.
Comment 26 Nicholas Nethercote [:njn] 2011-05-23 21:56:41 PDT
I can reproduce *a* crash with Test Pilot installed.  I have to click around in chrome://testpilot/content/debug.html a few times for it to happen.  But the stack trace is useless:

#0  0x00007fe0db94f39d in nanosleep ()
    at ../sysdeps/unix/syscall-template.S:82
#1  0x00007fe0db94f210 in __sleep (seconds=<value optimized out>)
    at ../sysdeps/unix/sysv/linux/sleep.c:138
#2  0x00007fe0dd22c141 in ah_crap_handler (signum=11)
    at /home/njn/moz/ws7/toolkit/xre/nsSigHandlers.cpp:119
#3  0x000000000000000b in ?? ()
#4  0x00007fe0dd230bba in nsProfileLock::FatalSignalHandler (signo=11, 
    info=0x7fff1f6521b0, context=0x7fff1f652080)
    at /home/njn/moz/ws7/d64/toolkit/profile/nsProfileLock.cpp:226
#5  0x0000000000000000 in ?? ()

I tried under Valgrind, it crashed but Valgrind didn't give a warning(!)  But it did say earlier on:

--10138-- WARNING: Serious error when reading debug info
--10138-- When reading debug info from /home/njn/moz/ws7/v64/toolkit/library/libxul.so:
--10138-- Can't make sense of .eh_frame section mapping

which I haven't seen before, and I got heaps of warnings like this:

==10138== Invalid read of size 8
==10138==    at 0x6815B90: ??? (in /home/njn/moz/ws7/v64/toolkit/library/libxul.
so)
==10138==    by 0x64AB9E1: ??? (in /home/njn/moz/ws7/v64/toolkit/library/libxul.
so)
==10138==    by 0x1BDBEE7F: ???
==10138==    by 0x6: ???
==10138==    by 0x19340F9F: ???
==10138==    by 0x193F9C4F: ???
==10138==    by 0x1E40306F: ???
==10138==    by 0x1E3F533F: ???
==10138==  Address 0x1cb093b8 is not stack'd, malloc'd or (recently) free'd

All very uninformative.
Comment 27 Wes Kocher (:KWierso) 2011-05-24 19:47:21 PDT
I had a crash that lists this bug as related: https://crash-stats.mozilla.com/report/index/bp-41d23f4d-2cb5-4aa1-9515-141892110524

I wasn't doing anything in Firefox when it crashed. (Was actually AFK taking out the trash and came back to find the crash reporter screen showing.)

I _was_ looking at the new about:permissions page earlier, but I'm pretty sure the tab was closed before I stepped away from the computer (it wasn't restored when I clicked "Restore previous session").

I don't have Test Pilot installed. Using the latest nightly build from today.

Am I seeing this same bug or something else?

(In reply to comment #20)
> They're called from aboutMemory.js and from the Test Pilot add-on and from telemetry code.

I do have Taras's telemetry extension installed.
Comment 28 Ed Morley [:emorley] 2011-05-25 01:37:04 PDT
Just experienced this with Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a1) Gecko/20110524 Firefox/6.0a1.

bp-b2a31f3b-5460-41d6-b42d-065722110525

Test pilot is not installed, however telemetry is enabled (and the telemetry addon installed).

That session I had not used about:memory, but had looked at about:permissions (though it was closed some time before the crash).

Open tabs at the time of the crash were gmail, bugzilla, Phil's TBPL - with TBPL being the active tab at the time; however Fx was minimised when the crash occurred.
Comment 29 Nicholas Nethercote [:njn] 2011-05-25 03:58:53 PDT
Wes, Ed, thanks for the extra data points.  Comment 20 still holds.
Comment 30 Daniel Cater 2011-05-25 06:55:01 PDT
None of the crash reports already linked to in this bug is on Linux so I thought I'd add one in case it's useful: bp-47f90daf-ef62-421c-a13c-8a26a2110525

The extensions tab says "No extensions were installed" for some reason but this was with Test Pilot enabled and telemetry enabled.

I can fairly confidently say that no tabs were open at the time (except a blank one) and that it crashed while I was away from the computer.
Comment 31 Nicholas Nethercote [:njn] 2011-05-27 15:14:49 PDT
Bug 660213 is very similar, probably has the same root cause, but has a different stack trace so I'm keeping it separate for now.
Comment 32 Daniel Cater 2011-05-27 15:18:44 PDT
bp-039eeeac-713d-4b87-98f2-049f12110527 is another one I had today with a different top frame.
Comment 33 Nicholas Nethercote [:njn] 2011-05-27 16:54:35 PDT
(In reply to comment #32)
> bp-039eeeac-713d-4b87-98f2-049f12110527 is another one I had today with a
> different top frame.

Looks like the same problem -- still a crash that goes through GetPath.
Comment 34 Nicholas Nethercote [:njn] 2011-05-29 21:41:38 PDT
Created attachment 536001 [details]
Better Valgrind log

I got this to crash under Valgrind too by turning on Telemetry;  I couldn't get the Test Pilot crashes to occur.

There are a ton of complaints (see attached file), but they are all very similar to the first one:

 Thread 1:
 Invalid read of size 4
    at 0x4C2D457: pthread_mutex_lock (pthread_mutex_lock.c:84)
    by 0x40C0471: pthreadMutexEnter (sqlite3.c:17074)
    by 0x40B0CBC: sqlite3_mutex_enter (sqlite3.c:16314)
    by 0x40C21EB: sqlite3_db_status (sqlite3.c:12555)
    by 0x5E9EE30: mozilla::storage::StorageMemoryReporter::GetMemoryUsed(long*) (mozStorage
Connection.cpp:407)
    by 0x6107ED1: NS_InvokeByIndex_P (xptcinvoke_x86_64_unix.cpp:195)
    by 0x5C8287E: XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode)
 (xpcwrappednative.cpp:3141)
    by 0x5C8A04F: XPC_WN_GetterSetter(JSContext*, unsigned int, unsigned long*) (xpcprivate
.h:2624)
    by 0x6368060: js::Invoke(JSContext*, js::CallArgs const&, js::ConstructOption) (jscntxt
inlines.h:277)
    by 0x6368A67: js::ExternalInvoke(JSContext*, js::Value const&, js::Value const&, unsign
ed int, js::Value*, js::Value*) (jsinterp.cpp:816)
    by 0x6368B58: js::ExternalGetOrSet(JSContext*, JSObject*, long, js::Value const&, JSAcc
essMode, unsigned int, js::Value*, js::Value*) (jsinterp.cpp:856)
    by 0x63858C4: js::Shape::get(JSContext*, JSObject*, JSObject*, JSObject*, js::Value*) c
onst (jsscopeinlines.h:269)

  Address 0xe783ea0 is 16 bytes inside a block of size 48 free'd
    at 0x402604A: free (vg_replace_malloc.c:370)
    by 0x40C043D: sqlite3MemFree (sqlite3.c:14314)
    by 0x40B1275: sqlite3_free (sqlite3.c:17997)
    by 0x40C0487: pthreadMutexFree (sqlite3.c:17032)
    by 0x40B0CA8: sqlite3_mutex_free (sqlite3.c:16304)
    by 0x40EF7E7: sqlite3_close (sqlite3.c:105765)
    by 0x5E9D672: mozilla::storage::Connection::internalClose() (mozStorageConnection.cpp:7
33)
    by 0x5E9D762: mozilla::storage::Connection::Close() (mozStorageConnection.cpp:784)
    by 0x5E9EC7C: mozilla::storage::Connection::~Connection() (mozStorageConnection.cpp:440
)
    by 0x5E9ED64: mozilla::storage::Connection::Release() (mozStorageConnection.cpp:443)
    by 0x60B60BE: nsCOMPtr_base::assign_with_AddRef(nsISupports*) (nsCOMPtr.h:479)
    by 0x5DC1396: nsUrlClassifierDBServiceWorker::CloseDb() (nsCOMPtr.h:663)



A use-after-free bug.  It appears that a DB connection is closed but the
memory reporter for it is still alive.  Looks like the DB is getting closed when the refcount for the database connection reaches zero.  But Connection::internalClose() unregisters the relevant reporters, so I don't understand the problem.
Comment 35 :Ehsan Akhgari 2011-05-31 14:58:10 PDT
I think I know what causes the bug.  internalClose doesn't remove the memory watchers it unregisters from the array, which causes them to be re-registered if the connection is opened again.  This causes the old reporters to be registered again.  But they are associated with the old sqlite connection, which causes crashes like this, I think.
Comment 36 Shawn Wilsher :sdwilsh 2011-05-31 15:00:25 PDT
(In reply to comment #35)
> I think I know what causes the bug.  internalClose doesn't remove the memory
> watchers it unregisters from the array, which causes them to be re-registered
> if the connection is opened again.  This causes the old reporters to be
> registered again.  But they are associated with the old sqlite connection,
> which causes crashes like this, I think.
Once internalClose is called you cannot reopen a connection using the same Connection object.
Comment 37 :Ehsan Akhgari 2011-05-31 15:08:27 PDT
Created attachment 536431 [details] [diff] [review]
Patch (v1)

This array doesn't even need to be dynamic.  This patch limits the array length to 3, and also fixes the multiple registration problem.
Comment 38 :Ehsan Akhgari 2011-05-31 15:14:11 PDT
(In reply to comment #36)
> (In reply to comment #35)
> > I think I know what causes the bug.  internalClose doesn't remove the memory
> > watchers it unregisters from the array, which causes them to be re-registered
> > if the connection is opened again.  This causes the old reporters to be
> > registered again.  But they are associated with the old sqlite connection,
> > which causes crashes like this, I think.
> Once internalClose is called you cannot reopen a connection using the same
> Connection object.

Oh, hmm, if that's the case, I don't know what's going on here though...

But still, can someone with a valgrind build see if this patch improves the valgrind output?
Comment 39 Nicholas Nethercote [:njn] 2011-05-31 17:47:54 PDT
Comment on attachment 536431 [details] [diff] [review]
Patch (v1)

Review of attachment 536431 [details] [diff] [review]:
-----------------------------------------------------------------

I think the dynamic allocation is fine, it's less error-prone.

More importantly, I don't see any changes that would affect registration in the way you suggested might help.

::: storage/src/mozStorageConnection.cpp
@@ +334,5 @@
>  public:
>    NS_DECL_ISUPPORTS
>  
> +  // If you add more reporter types, make sure to update
> +  // Connection::MemoryReporterCount.

Ugh.  Adding a sentinel enum value with a name like "Limit" or "Last" is a better way of doing this.
Comment 40 :Ehsan Akhgari 2011-06-01 10:25:09 PDT
(In reply to comment #39)
> Comment on attachment 536431 [details] [diff] [review] [review]
> Patch (v1)
> 
> Review of attachment 536431 [details] [diff] [review] [review]:
> -----------------------------------------------------------------
> 
> I think the dynamic allocation is fine, it's less error-prone.
> 
> More importantly, I don't see any changes that would affect registration in
> the way you suggested might help.

The way that this patch does what I was talking about is through the fact that the array entries will be overwritten on each initialization, so older memory reporters would not remain around after the second re-init.

But considering what Shawn said, it doesn't seem that this is the source of the bug (and I can't see what else could be going wrong).

> ::: storage/src/mozStorageConnection.cpp
> @@ +334,5 @@
> >  public:
> >    NS_DECL_ISUPPORTS
> >  
> > +  // If you add more reporter types, make sure to update
> > +  // Connection::MemoryReporterCount.
> 
> Ugh.  Adding a sentinel enum value with a name like "Limit" or "Last" is a
> better way of doing this.

That would've required me to move the reporter class to the mozStorageConnection.h header, which was more change than I was willing to make.

Throwing this back into the pool.
Comment 41 msth67 2011-06-03 00:38:39 PDT
Seeing this repeatedly on current nightly build (9a6c139a4e58):opening about:permissions,then clicking "forget about this site" (sometimes this in itself makes the browser hang but not crash,warning "Script: jar:file:///e:/builds/moz2_slave/cen-w32-ntly/build/obj-firefox/dist/firefox/omni.jar!/components/nsPrivateBrowsingService.js:578" is displayed),then going in about:memory immediately after triggers the crash.
Comment 42 msth67 2011-06-03 01:44:36 PDT
edit:forgot to add,no testpilot installed.
Comment 43 Tim Taubert [:ttaubert] 2011-06-03 05:26:11 PDT
(In reply to comment #41)
> Seeing this repeatedly on current nightly build (9a6c139a4e58):opening
> about:permissions,then clicking "forget about this site" (sometimes this in
> itself makes the browser hang but not crash,warning "Script:
> jar:file:///e:/builds/moz2_slave/cen-w32-ntly/build/obj-firefox/dist/firefox/
> omni.jar!/components/nsPrivateBrowsingService.js:578" is displayed),then
> going in about:memory immediately after triggers the crash.

Same here. Crashed two times before with another STR (that I can't remember). Seems to crash a lot with today's nightly (just happened again while typing this - same signature) :/
Comment 44 AndreiD[QA] 2011-06-03 07:06:36 PDT
I could reproduce an "about:memory" page crash on:
Mozilla/5.0 (Windows NT 6.1; rv:6.0a2) Gecko/20110602 Firefox/6.0a2

Here is the link to the crash:
http://crash-stats.mozilla.com/report/index/bp-669b7070-967a-4f86-8c28-def5d2110603

Steps to reproduce:
1. Go to the "about:" page
2. From the "about:" page go to "about:memory"
Comment 45 Thomas Ahlblom 2011-06-03 07:36:58 PDT
Steps to Reproduce with a new clean profile:

1. Open one tab with about:memory
2. Open one tab with about:permissions
3. Press and hold F5
4. Left click, alternating fast, between the two tabs

That usually makes my Firefox crash within 5 seconds.

It's also reproducible with for example about:blank instead of about:permissions, but the combination with about:permissions seems to crash easily.
Comment 46 Nicholas Nethercote [:njn] 2011-06-03 14:12:55 PDT
Thanks for the improved steps-to-reproduce, everyone.  I'll look at this again on Monday.
Comment 47 :aceman 2011-06-03 14:19:14 PDT
I get this when I toggle "more verbose" and "less verbose" in about:memory.
Crash report: bp-98f1650b-d51a-4e55-8caf-85c082110603 .
Comment 48 Nicholas Nethercote [:njn] 2011-06-05 23:42:51 PDT
The steps-to-reproduce in comment 45 are very reliable for me, I've been recreating this crash all afternoon.

I'm seeing this:

  WARNING: NS_ENSURE_TRUE(!(mAsyncExecutionThread)) failed: file /home/njn/moz/mc1/storage/src/mozStorageConnection.cpp, line 799

which comes from this code:

  Connection::Close()
  {
    if (!mDBConn)
      return NS_ERROR_NOT_INITIALIZED;
  
    { // Make sure we have not executed any asynchronous statements.
      MutexAutoLock lockedScope(sharedAsyncExecutionMutex);
      NS_ENSURE_FALSE(mAsyncExecutionThread, NS_ERROR_UNEXPECTED);
    } 
    
    nsresult rv = setClosedState();
    NS_ENSURE_SUCCESS(rv, rv);
    
    return internalClose();
  }

It's awfully suspicious, because in one crash I just had, the reporter that made the crashing GetPath call belonged to a Connection for which this assertion failed, and somehow the internalClose() call didn't happen, which means the reporter wasn't unregistered.

Connection has both Close() and AsyncClose() methods, and each one calls internalClose().  Looks like maybe Close() is being called in a circumstance where AsyncClose() should be, or something like that?
Comment 49 Nicholas Nethercote [:njn] 2011-06-06 00:04:24 PDT
Ok, it's definitely a problem with the async/sync close.  I've checked several crashes and every time the crash occurs the first time the memory reporters are invoked after the assertion in Connection::Close is triggered.
Comment 50 Andrew Sutherland [:asuth] 2011-06-06 10:00:47 PDT
Can you get a call stack on the caller that is doing bad things?

I would propose we make the destructor not use the public API for fail-safe closure, but do generate an assertion if the destructor has to close up shop and AsyncClose should have been sued...
Comment 51 Nicholas Nethercote [:njn] 2011-06-06 18:32:40 PDT
(In reply to comment #50)
> Can you get a call stack on the caller that is doing bad things?

Here's the call stack when the assertion fires.  |foobar| is a function I inserted to break on, it's called just before the assertion fails.  Entries #5 and #6 look interesting.

#0  mozilla::storage::foobar () at /home/njn/moz/mc1/storage/src/mozStorageConnection.cpp:792
#1  0x00007ffff6578c38 in mozilla::storage::Connection::Close (this=0x7fffd63bd2c0)
    at /home/njn/moz/mc1/storage/src/mozStorageConnection.cpp:808
#2  0x00007ffff657a421 in mozilla::storage::Connection::~Connection (this=0x7fffd63bd2c0, 
    __in_chrg=<value optimised out>) at /home/njn/moz/mc1/storage/src/mozStorageConnection.cpp:455
#3  0x00007ffff657a64c in mozilla::storage::Connection::Release (this=0x7fffd63bd2c0)
    at /home/njn/moz/mc1/storage/src/mozStorageConnection.cpp:459
#4  0x00007ffff6587beb in ~nsRefPtr (this=0x7fffd60a5680) at ../../dist/include/nsAutoPtr.h:969
#5  ~AsyncExecuteStatements (this=0x7fffd60a5680)
    at /home/njn/moz/mc1/storage/src/mozStorageAsyncStatementExecution.h:66
#6  mozilla::storage::AsyncExecuteStatements::Release (this=0x7fffd60a5680)
    at /home/njn/moz/mc1/storage/src/mozStorageAsyncStatementExecution.cpp:531
#7  0x00007ffff620c684 in DoDeferredRelease<nsISupports*> (cx=0x7fffe034e400, status=JSGC_END)
    at /home/njn/moz/mc1/js/src/xpconnect/src/xpcjsruntime.cpp:625
#8  XPCJSRuntime::GCCallback (cx=0x7fffe034e400, status=JSGC_END)
    at /home/njn/moz/mc1/js/src/xpconnect/src/xpcjsruntime.cpp:935
#9  0x00007ffff5d187b9 in DOMGCCallback (cx=0x7fffe034e400, status=JSGC_END)
    at /home/njn/moz/mc1/dom/base/nsJSEnvironment.cpp:3481
#10 0x00007ffff6ca1e2e in js_GC (cx=0x7fffe034e400, comp=0x0, gckind=GC_NORMAL)
    at /home/njn/moz/mc1/js/src/jsgc.cpp:2741
#11 0x00007ffff6c0fa21 in JS_GC (cx=0x7fffe034e400) at /home/njn/moz/mc1/js/src/jsapi.cpp:2600
#12 0x00007ffff61e0e0e in nsXPConnect::Collect (this=<value optimised out>)
    at /home/njn/moz/mc1/js/src/xpconnect/src/nsXPConnect.cpp:406
#13 0x00007ffff61da5ed in nsXPConnect::GarbageCollect (this=<value optimised out>)
    at /home/njn/moz/mc1/js/src/xpconnect/src/nsXPConnect.cpp:414
#14 0x00007ffff5d152a6 in nsJSContext::GarbageCollectNow ()
    at /home/njn/moz/mc1/dom/base/nsJSEnvironment.cpp:3252
#15 0x00007ffff5d152cc in GCTimerFired (aTimer=<value optimised out>, 
    aClosure=<value optimised out>) at /home/njn/moz/mc1/dom/base/nsJSEnvironment.cpp:3302
#16 0x00007ffff6978a7b in nsTimerImpl::Fire (this=0x7fffe02d72b0)
    at /home/njn/moz/mc1/xpcom/threads/nsTimerImpl.cpp:424
#17 0x00007ffff6978de7 in nsTimerEvent::Run (this=0x7fffc5888be0)
    at /home/njn/moz/mc1/xpcom/threads/nsTimerImpl.cpp:520
#18 0x00007ffff6971b95 in nsThread::ProcessNextEvent (this=0x7fffec12aaa0, mayWait=0, 
    result=0x7fffffffd2ec) at /home/njn/moz/mc1/xpcom/threads/nsThread.cpp:618
#19 0x00007ffff6909568 in NS_ProcessNextEvent_P (thread=<value optimised out>, mayWait=0)
    at /home/njn/moz/mc1/d64/xpcom/build/nsThreadUtils.cpp:245
#20 0x00007ffff67e423b in mozilla::ipc::MessagePump::Run (this=0x7fffe9a53440, 
    aDelegate=0x7fffec1b18f0) at /home/njn/moz/mc1/ipc/glue/MessagePump.cpp:110
#21 0x00007ffff69b770c in MessageLoop::RunInternal (this=0x7fffec1b18f0)
    at /home/njn/moz/mc1/ipc/chromium/src/base/message_loop.cc:218
#22 0x00007ffff69b771b in MessageLoop::RunHandler (this=<value optimised out>)
    at /home/njn/moz/mc1/ipc/chromium/src/base/message_loop.cc:202
#23 0x00007ffff69b7a71 in MessageLoop::Run (this=0x7fffec1b18f0)
    at /home/njn/moz/mc1/ipc/chromium/src/base/message_loop.cc:176
#24 0x00007ffff668c5c4 in nsBaseAppShell::Run (this=0x7fffe47bf0f0)
    at /home/njn/moz/mc1/widget/src/xpwidgets/nsBaseAppShell.cpp:189
#25 0x00007ffff63e8c73 in nsAppStartup::Run (this=0x7fffe334a6a0)
    at /home/njn/moz/mc1/toolkit/components/startup/nsAppStartup.cpp:222
#26 0x00007ffff5421cfa in XRE_main (argc=<value optimised out>, argv=<value optimised out>, 
    aAppData=<value optimised out>) at /home/njn/moz/mc1/toolkit/xre/nsAppRunner.cpp:3697
#27 0x0000000000401cb7 in main (argc=4, argv=0x7fffffffdd18)
    at /home/njn/moz/mc1/browser/app/nsBrowserApp.cpp:158
Comment 52 Andrew Sutherland [:asuth] 2011-06-06 23:48:25 PDT
Dang.  I was hoping it would be an obvious C++ misuse.  That is a JS caller with a presumably short-lived connection (since it's getting GC'ed and I presume you're not in xpcom-shutdown based on the rest of the stack.)

You will want to look at the fields on the Connection to see if it names names.  Also fast may just be to use the NSPR logging support of mozStorage to watch the SQL statements go by, although XPConnect's deferred release mechanism may blur things depending on how frequently GC's are happening.

In regards to the blog post about assertions/what not and my comment above, I should clarify that the specific decision to make Close capable of failing was to return an error so that when a caller called Close (as they must per the documentation), it would fail, they would look at the documentation again, and see the note that they have to be using AsyncClose.  This was determined better than all the possible race-conditions that would otherwise arise when people screwed up.  The memory reporters, as one might expect, came afterwards and the new correctness constraint obviously did not propagate sufficiently.

I'm assuming the current failure is in-tree logic that has unit tests but our lack of sufficiently loud complaining in the destructor case never made it obvious that the JS storage-user was not obeying the required contract.  (An interesting counterpoint is that in this most specific case, depending on GC to clean up the connection probably turns out fine for correctness if we didn't bail on the close, except for the bit where it's not a great way to make sure expensive resources are cleaned up in a timely fashion, plus the potential for a few quirky side-effects with other connections to the same database.)
Comment 53 Marcia Knous [:marcia - use ni] 2011-06-07 10:24:12 PDT
Adding the Linux specific sig so it gets picked up in crash stats.
Comment 54 Nicholas Nethercote [:njn] 2011-06-07 18:00:44 PDT
AFAICT the real underlying problem is that some JS code that uses an async DB connection fails to call asyncClose() on it before it becomes garbage.  (But I don't really have any idea about all this storage stuff, that's just a guess.)
If I'm right, how can we track down that defect?


(In reply to comment #52)
> Dang.  I was hoping it would be an obvious C++ misuse.  That is a JS caller
> with a presumably short-lived connection (since it's getting GC'ed and I
> presume you're not in xpcom-shutdown based on the rest of the stack.)

It's not in shutdown, the crash occurs when switching back and forth between about:memory and about:permissions while holding down F5 to trigger frequent reloads (i.e. the steps to reproduce from comment 45).


> You will want to look at the fields on the Connection to see if it names
> names. 

I just triggered the crash three times in a row;  each time Connection::getFilename() is "places.sqlite".

How/where are these Connections created from JS code?


> Also fast may just be to use the NSPR logging support of mozStorage
> to watch the SQL statements go by, although XPConnect's deferred release
> mechanism may blur things depending on how frequently GC's are happening.

I used NSPR_LOG_MODULES=mozStorage to do this, the log file is attached.  I have no idea how to interpret the output.


> In regards to the blog post about assertions/what not and my comment above,
> I should clarify that the specific decision to make Close capable of failing
> was to return an error so that when a caller called Close (as they must per
> the documentation), it would fail, they would look at the documentation
> again, and see the note that they have to be using AsyncClose.

By "the documentation" I take it you mean this comment on mozIStorageConnection::close():  "It is illegal to use call this method if any asynchronous statements have been executed on this connection."  Is that right?

That seems odd, because ~Connection() calls Close(), so it's impossible to avoid calling Close().  Or is the idea that the JS code should have called AsyncClose() on the Connection before it became garbage, in which case internalClose() sets Connection::mDBConn to NULL and so the subsequent call to Close() from Connection() will return NS_ERROR_NOT_INITIALIZED immediately upon entry?  That sequence of events isn't obvious.


> This was
> determined better than all the possible race-conditions that would otherwise
> arise when people screwed up.  The memory reporters, as one might expect,
> came afterwards and the new correctness constraint obviously did not
> propagate sufficiently.

AFAICT the memory reporters aren't doing anything wrong -- The memory reporters just exposed the underlying defect.  And that defect is that Close() is being called on an async connection, instead of AsyncClose(), right?  And that's happening because asyncClose() isn't called before the 
Connection becomes garbage on the JS side.  


> I'm assuming the current failure is in-tree logic that has unit tests but
> our lack of sufficiently loud complaining in the destructor case never made
> it obvious that the JS storage-user was not obeying the required contract. 

Am I right that the contract is "JS code using an async DB connection must call AsyncClose() on that connection before it becomes garbage?"

Louder complaints would be good here, it seems.


> (An interesting counterpoint is that in this most specific case, depending
> on GC to clean up the connection probably turns out fine for correctness if
> we didn't bail on the close, except for the bit where it's not a great way
> to make sure expensive resources are cleaned up in a timely fashion, plus
> the potential for a few quirky side-effects with other connections to the
> same database.)

What are those side-effects?  Given that the current interface has been shown to be error-prone, I wonder if a more forgiving interface that auto-closes a connection in the appropriate fashion (sync or async) would be better?  After all, sync connections that aren't closed will be GC'd ok, so why shouldn't async connections be the same?
Comment 55 Nicholas Nethercote [:njn] 2011-06-07 18:09:58 PDT
Created attachment 537928 [details]
logging

Here's the logging output which I forgot to attach to the previous comment.
Comment 56 Nicholas Nethercote [:njn] 2011-06-07 18:13:00 PDT
> Given that the current interface has been
> shown to be error-prone, I wonder if a more forgiving interface that
> auto-closes a connection in the appropriate fashion (sync or async) would be
> better?  After all, sync connections that aren't closed will be GC'd ok, so
> why shouldn't async connections be the same?

Hmm, I think I see why.  If we call ~Connection() on an async connection that hasn't been AsyncClose'd, we're in trouble -- the Connection is about to be deallocated, and it's too late to call AsyncClose() because that event gets dispatched and will occur an unknown time in the future.  In contrast, a sync connection can be immediately closed.

So a more obvious failure mode would be nice, along with detection of the JS code that's failing to call asyncClose().
Comment 57 Nicholas Nethercote [:njn] 2011-06-07 19:18:59 PDT
\
Comment 58 Nicholas Nethercote [:njn] 2011-06-07 19:26:35 PDT
Created attachment 537938 [details] [diff] [review]
patch

mak found that about:permissions opens an async DB connection to places.sqlite and never asyncClose()s it.  This patch adds an asyncClose();  I pretty much guessed where it was needed.

With this patch, I cannot reproduce the crash using the steps from comment 45.  
This patch probably also fixes the crashes mentioned in comments 27, 28, 41, 42 and 45, all of which mentioned about:permissions.  The crashes in comment 44 and 47 are unclear, but note that if you visit about:permissions even once, this crash can happen subsequently -- I can reproduce it by visiting about:permissions, closing that tab, and then switching between two about:memory tabs quickly while holding down F5.

So, I think this bug should just be about this about:permissions problem, and if any subsequent crashes like this show up, they should go in new bug reports.  It wouldn't surprise me if there were more crashes, I saw urlclassifier.sqlite mentioned a few times in early investigation, and this does appear to be an error-prone API.

(Jono, you said in comment 5 you saw this months ago, before about:permissions was implemented, so it might be worth looking for async DB connections in your Test Pilot code that aren't asyncClose()d.)
Comment 59 Andrew Sutherland [:asuth] 2011-06-07 19:53:33 PDT
(In reply to comment #54)
> AFAICT the real underlying problem is that some JS code that uses an async
> DB connection fails to call asyncClose() on it before it becomes garbage. 

Yes.  Which is not to say that it's reasonable for storage to be causing a crash or endangering the heap when this sequence of events occurs; storage shouldn't do that, it's a bug.  Hopefully mak or someone on the places team can provide a fix for storage.  (I am only hanging around storage as a consultant type to help spin up other people so I can stop hanging around at all.)

> By "the documentation" I take it you mean this comment on
> mozIStorageConnection::close():  "It is illegal to use call this method if
> any asynchronous statements have been executed on this connection."  Is that
> right?

Right, that is also in the MDC docs for mozIStorageConnection and a similar comment also exists in the overview documentation: https://developer.mozilla.org/en/storage#Closing_a_connection
"To close a connection on which only synchronous transactions were performed, use the mozIStorageConnection.close() method. If you performed any asynchronous transactions, you should instead use the mozIStorageConnection.asyncClose() method."

 
> That seems odd, because ~Connection() calls Close(), so it's impossible to
> avoid calling Close().  Or is the idea that the JS code should have called
> AsyncClose() on the Connection before it became garbage, in which case
> internalClose() sets Connection::mDBConn to NULL and so the subsequent call
> to Close() from Connection() will return NS_ERROR_NOT_INITIALIZED
> immediately upon entry?  That sequence of events isn't obvious.

For legacy reasons I believe storage was trying to be friendly to C++ code that relied on reference counting rather than explicit closes.

Yes, storage gained a bug when the memory reporter code got added but the destructor kept using the public API that is meant to provide feedback.
 

> What are those side-effects?  Given that the current interface has been
> shown to be error-prone, I wonder if a more forgiving interface that
> auto-closes a connection in the appropriate fashion (sync or async) would be
> better?  After all, sync connections that aren't closed will be GC'd ok, so
> why shouldn't async connections be the same?

Connections manually maintaining transactions can hang around with active locks which could starve out other connections or cause extreme growth of the write-ahead-log.

Ideally we could generate a new type of error console spew called "illegal usage" that extension and mozilla developers could check during development that would cause unit test failures and allow addons.mozilla.org reviewers to fail the add-on.
Comment 60 Nicholas Nethercote [:njn] 2011-06-07 19:58:40 PDT
> this does appear to be an error-prone API.

I did a very crude API check -- I searched MXR for JS files containing calls to executeAsync() that lack calls to asyncClose().  I got 7 matches.  Excluding aboutPermissions.js, they are:

- http://mxr.mozilla.org/mozilla-central/source/services/sync/modules/async.js

- http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/PlacesDBUtils.jsm

- http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/PlacesUtils.jsm

- http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/nsPlacesExpiration.js

- http://mxr.mozilla.org/mozilla-central/source/toolkit/components/contentprefs/nsContentPrefService.js

- http://mxr.mozilla.org/mozilla-central/source/browser/app/profile/extensions/testpilot@labs.mozilla.com/modules/experiment_data_store.js

I looked through all of these, they all looked suspect.  I'd appreciate if anyone who knows about these files could take a look.
Comment 61 Marco Bonardo [::mak] 2011-06-08 02:27:36 PDT
All the Sync and Places code there is fine, since they use the main Places connection that is AsyncClose()d by Places itself. Sync was evaluating moving to a read-only clone, but we can keep an eye on that for future.

nsContentPrefService seems wrong, it does indeed run async statements and then invokes close().

I can't find where Test Pilot closes the connection, indeed looks like it never does.
Comment 62 :Ehsan Akhgari 2011-06-08 09:13:52 PDT
(In reply to comment #61)
> All the Sync and Places code there is fine, since they use the main Places
> connection that is AsyncClose()d by Places itself. Sync was evaluating moving
> to a read-only clone, but we can keep an eye on that for future.
> 
> nsContentPrefService seems wrong, it does indeed run async statements and then
> invokes close().
> 
> I can't find where Test Pilot closes the connection, indeed looks like it never
> does.

This might explain why I've been seeing this crash without ever opening about:permissions...
Comment 63 :Margaret Leibovic 2011-06-08 09:17:45 PDT
Comment on attachment 537938 [details] [diff] [review]
patch

I'm passing this review off to mak, since he knows more about the consequences of this kind of change.

This is the right place to close the connection, but gPlacesDatabase is a clone, so I want to make sure that still closes the connection correctly.
Comment 64 Marco Bonardo [::mak] 2011-06-08 12:31:58 PDT
Comment on attachment 537938 [details] [diff] [review]
patch

Review of attachment 537938 [details] [diff] [review]:
-----------------------------------------------------------------

it's ok, clones are just connections, and they have to be closed, doing that at unload seems fine as well.
Comment 65 Nicholas Nethercote [:njn] 2011-06-08 17:57:36 PDT
*** Bug 660213 has been marked as a duplicate of this bug. ***
Comment 66 Nicholas Nethercote [:njn] 2011-06-08 23:36:28 PDT
http://hg.mozilla.org/mozilla-central/rev/60b4d8afcfa9
Comment 67 Nicholas Nethercote [:njn] 2011-06-09 15:50:02 PDT
Comment on attachment 537938 [details] [diff] [review]
patch

Requesting approval-mozilla-aurora.  This patch fixes a major bug in about:permissions -- if you view about:permissions and then close or reload it, you're likely to crash (a) you visit about:memory, or (b) telemetry code measures memory usage, as it does periodically, or (c) Test Pilot measures memory usage.

The risk is low, it's a one-line change that asyncCloses a connection to the places DB when the page is unloaded.
Comment 68 :Ehsan Akhgari 2011-06-09 16:08:28 PDT
(In reply to comment #61)
> nsContentPrefService seems wrong, it does indeed run async statements and
> then invokes close().
> 
> I can't find where Test Pilot closes the connection, indeed looks like it
> never does.

These are still not fixed...  We should either fix them here or file followups.
Comment 69 Ed Morley [:emorley] 2011-06-09 16:13:43 PDT
> nsContentPrefService seems wrong, it does indeed run async statements and
> then invokes close().
Bug 662986

> I can't find where Test Pilot closes the connection, indeed looks like it
> never does.
Bug 662985

And to avoid crashing in the first place if the DB connection isn't asyncClose()d:
Bug 662989

:-)
Comment 70 Shawn Wilsher :sdwilsh 2011-06-11 15:14:26 PDT
Sorry folks.  I should have caught this in review in the first place.  The fact that I didn't catch it is telling that this API is a huge footgun in its current form (me being the owner of Storage code and all...).
Comment 71 Nicholas Nethercote [:njn] 2011-06-21 23:48:57 PDT
http://hg.mozilla.org/releases/mozilla-aurora/rev/c0b591190f7a
Comment 72 Sheila Mooney 2011-08-02 10:23:02 PDT
I am still seeing this crash in the latest builds. I am going to reopen this so we can investigate more.
Comment 73 :Ehsan Akhgari 2011-08-02 11:59:25 PDT
(In reply to comment #72)
> I am still seeing this crash in the latest builds. I am going to reopen this so
> we can investigate more.

Sheila, please see comment 69.
Comment 74 Nicholas Nethercote [:njn] 2011-08-02 12:56:39 PDT
Yeah, I'll close this again.  This is about this crash after viewing about:permissions.

Sheila, can you open a new bug if one of the ones in comment 69 don't cover what you've seen?
Comment 75 Sheila Mooney 2011-08-02 14:09:31 PDT
Ok, sorry guys. I will double check the other 2 bugs cover the additional crashes and log something new if appropriate.
Comment 76 Robert Kaiser 2011-08-03 05:41:46 PDT
What Sheila has seen is that https://crash-stats.mozilla.com/report/list?signature=mozilla%3A%3Astorage%3A%3AConnection%3A%3AgetFilename()&version=Firefox%3A7.0a2 still lists this crash in more current builds than the comment #71 checkin would point to.
Interestingly, I didn't see it in the last two days or so on https://crash-stats.mozilla.com/report/list?signature=mozilla%3A%3Astorage%3A%3AConnection%3A%3AgetFilename%28%29&version=Firefox%3A8.0a1 but that doesn't necessarily mean too much with the low rates there.

In any case, there's a crash left and we need a bug to track it. You know better which one that is, but please care to have the crash signature field for that one filled appropriately.
Comment 77 Nicholas Nethercote [:njn] 2011-08-03 17:26:56 PDT
Bug 662989 is open to prevent these crashes, is that enough?  I added the crash signature to it.
Comment 78 Robert Kaiser 2011-08-04 15:43:03 PDT
(In reply to comment #77)
> Bug 662989 is open to prevent these crashes, is that enough?  I added the
> crash signature to it.

Yes, thanks, we just need some open bug to be connected to the crash signature(s) so we know where it's being tracked.

Note You need to log in before you can comment on or make changes to this bug.