Closed Bug 662989 Opened 13 years ago Closed 13 years ago

Avoid crashing if a DB connection isn't asyncClose()d when it should be

Categories

(Core :: SQLite and Embedded Database Bindings, defect)

x86
All
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla8
Tracking Status
firefox7 + fixed

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

Details

(Keywords: crash, Whiteboard: [inbound])

Crash Data

Attachments

(1 file, 4 obsolete files)

From bug 654573 comment 59:

>> AFAICT the real underlying problem is that some JS code that uses an async
>> DB connection fails to call asyncClose() on it before it becomes garbage. 
>
> Yes.  Which is not to say that it's reasonable for storage to be causing a
> crash or endangering the heap when this sequence of events occurs; storage
> shouldn't do that, it's a bug.

The problem is that if a connection on which async transactions were executed is not explicitly closed with asyncClose(), its destructor will run Close() instead, which will bail out early (because of the async-ness), so Connection::internalClose() isn't run, and the memory reporters for the connection are not unregistered.  The end result is that memory reporters are left pointing to a Connection that has been freed, which is a sure recipe for a subsequent hard-to-debug crash (eg. bug 654573, bug 660213).

To make things more robust, we just need to ensure that the memory reporters are always freed when ~Close() is run.
This patch unregisters the memory reporters for a Connection in ~Connection() if they haven't already been unregistered.

I re-ran the stress test from bug 654573 (having removed the corresponding fix to aboutPermissions.js from that bug):

- Open about:memory?verbose and about:permissions
- Hold down F5 to trigger constant reloads
- Click back and forth between the two tabs, once per second

Without this patch, this causes a crash every time within a few seconds.  With this patch, I couldn't reproduce the crash.  However, I did see the SQLite memory usage go up a lot.  In a 30 second run, I got this warning 107 times:

  WARNING: NS_ENSURE_TRUE(!(mAsyncExecutionThread)) failed: file /home/njn/moz/mi1/storage/src/mozStorageConnection.cpp, line 797

which means that 107 connections to places.sqlite were not asyncClosed as they should be.

Before doing my stress-test, the relevant parts of about:memory looked like this:

├───4,192,088 B (09.10%) -- storage
│   └──4,192,088 B (09.10%) -- sqlite
│      ├──1,110,936 B (02.41%) -- places.sqlite
│      │  ├────960,616 B (02.09%) -- cache-used
│      │  ├─────95,440 B (00.21%) -- schema-used
│      │  └─────54,880 B (00.12%) -- stmt-used
│      ├────865,448 B (01.88%) -- other

After 30 seconds, they looked like this:

├───24,052,136 B (22.35%) -- storage
│   └──24,052,136 B (22.35%) -- sqlite
│      ├──20,537,184 B (19.09%) -- other
│      ├───1,221,296 B (01.13%) -- places.sqlite
│      │   ├──1,059,664 B (00.98%) -- cache-used
│      │   ├─────95,440 B (00.09%) -- schema-used
│      │   └─────66,192 B (00.06%) -- stmt-used

(The reason the extra memory is all in the storage/sqlite/other bucket is because the memory reporters are unregistered for each destroyed Connection even though the SQLite connection is still open.)

107 connections led to a ~20,000,000 byte increase in memory usage, which is about 187,000 bytes per unclosed connection.
Attachment #538168 - Flags: review?(mak77)
This patch takes an alternative approach -- it avoids the crashes, but also
keeps the memory reporters alive, but appends "-LEAKED" to their name so
it's obvious from looking at about:memory if any connections have leaked!

For example:

├──10,088,872 B (12.59%) -- storage
│  └──10,088,872 B (12.59%) -- sqlite
│     ├───4,552,864 B (05.68%) -- places.sqlite-LEAKED
│     │   ├──3,977,792 B (04.96%) -- cache-used
│     │   ├────575,072 B (00.72%) -- schema-used
│     │   └──────────0 B (00.00%) -- stmt-used

This is better -- the leak is findable in optimized builds, even from 
reports from normal users.  You don't have to spot the warning message in 
stderr (but that's still there).

This required a few changes:

- We detect in ~Connection() if the connection has been leaked (due to a
  missing asyncClose) and set StorageMemoryReporter::mHasLeaked.  "-LEAKED"
  is then appended onto the name of the connection as given by the memory
  reporter.

- StorageMemoryReporter no longer has a reference to the Connection, but
  rather a sqlite3*.  This means that the crashes are avoided when
  Connection is destroyed, because it's communicating directly with the
  sqlite connection, rather than going through the Connection.

  (Nb: up until now we've been passing a Connection& as the first argument
  to sqlite3_db_status().  I have no idea how that even compiled, let alone
  worked.  This patch passes an sqlite3* instead.)

- I also tweaked the descriptions on the memory reporters.

Note that I'm very new to Mozilla strings and am completely uncertain if my
usage of nsCString is appropriate;  thorough checking would be appreciated.
Attachment #538168 - Attachment is obsolete: true
Attachment #538168 - Flags: review?(mak77)
Attachment #538197 - Flags: review?(mak77)
(In reply to comment #2)
>   (Nb: up until now we've been passing a Connection& as the first argument
>   to sqlite3_db_status().  I have no idea how that even compiled, let alone
>   worked.  This patch passes an sqlite3* instead.)
C++ type operator on the Connection class.
Comment on attachment 538197 [details] [diff] [review]
better patch (against 70732:96fac50b57ed)

Review of attachment 538197 [details] [diff] [review]:
-----------------------------------------------------------------

It looks good to me.

Shawn, since this is still a transition and I'm not yet officially a peer, would you mind checking if I missed something?

::: storage/src/mozStorageConnection.cpp
@@ +357,5 @@
>    }
>  
>    NS_IMETHOD GetPath(char **memoryPath)
>    {
>      nsCString path;

Why not a nsCAutoString? I'd expect its buffer to be sufficient most of the time.

@@ +363,5 @@
>      path.AppendLiteral("explicit/storage/sqlite/");
> +    path.Append(mFileName);
> +
> +    if (mHasLeaked) {
> +      path.AppendLiteral("-LEAKED");

fwiw, -LEAKED sounds scary to me (And I guess to users), looks like everything there leaked, not just a connection. What about -HAS-LEAKS?

@@ +607,5 @@
>        break;
>    }
>  
> +  nsRefPtr<StorageMemoryReporter> reporter;
> +  nsCString filename = this->getFilename();

nsCAutoString here as well

@@ +766,5 @@
>  #endif
>  
>    for (PRUint32 i = 0; i < mMemoryReporters.Length(); i++) {
>      (void)::NS_UnregisterMemoryReporter(mMemoryReporters[i]);
> +    mMemoryReporters[i] = NULL;

nsnull

::: storage/src/mozStorageConnection.h
@@ +203,5 @@
>  
>    sqlite3 *mDBConn;
>    nsCOMPtr<nsIFile> mDatabaseFile;
>  
> +  nsTArray<nsCOMPtr<StorageMemoryReporter> > mMemoryReporters;

Shouldn't this rather be a nsRefPtr since not anymore an interface?
Attachment #538197 - Flags: review?(mak77) → review+
(In reply to comment #4)
> Shawn, since this is still a transition and I'm not yet officially a peer,
> would you mind checking if I missed something?
If you need me to do reviews, just flag me still.  I should start doing them again this weekend.
(In reply to comment #4)
> >    NS_IMETHOD GetPath(char **memoryPath)
> >    {
> >      nsCString path;
> 
> Why not a nsCAutoString? I'd expect its buffer to be sufficient most of the
> time.

"Most of the time" makes me nervous.  And that code pre-dates my patch.


> @@ +363,5 @@
> >      path.AppendLiteral("explicit/storage/sqlite/");
> > +    path.Append(mFileName);
> > +
> > +    if (mHasLeaked) {
> > +      path.AppendLiteral("-LEAKED");
> 
> fwiw, -LEAKED sounds scary to me (And I guess to users), looks like
> everything there leaked, not just a connection. What about -HAS-LEAKS?

The entire connection has leaked, and this memory reporter is describing the connection, so I think it's a reasonable name.  It's supposed to sound scary :)


> > +  nsRefPtr<StorageMemoryReporter> reporter;
> > +  nsCString filename = this->getFilename();
> 
> nsCAutoString here as well

Again, I'd rather not make assumptions about the length of the filenames.


> > +    mMemoryReporters[i] = NULL;
> 
> nsnull

I changed it.


> > +  nsTArray<nsCOMPtr<StorageMemoryReporter> > mMemoryReporters;
> 
> Shouldn't this rather be a nsRefPtr since not anymore an interface?

I'm not sure.  StorageMemoryReporter inherits from nsIMemoryReporter...
Attached patch better patch, v2 (obsolete) — Splinter Review
For sdwilsh to review.
Attachment #538197 - Attachment is obsolete: true
Attachment #540380 - Flags: review?(sdwilsh)
(In reply to comment #6)
> (In reply to comment #4)
> > > +  nsTArray<nsCOMPtr<StorageMemoryReporter> > mMemoryReporters;
> > 
> > Shouldn't this rather be a nsRefPtr since not anymore an interface?
> 
> I'm not sure.  StorageMemoryReporter inherits from nsIMemoryReporter...

When the template type is a concrete class, you should always use nsRefPtr.
Comment on attachment 540380 [details] [diff] [review]
better patch, v2

Review of attachment 540380 [details] [diff] [review]:
-----------------------------------------------------------------

This code feels like it should be very easy to write a test for, so I'd like to see that before giving it r+ (and I should have a faster review turn-around time moving forward here).

::: storage/src/mozStorageConnection.cpp
@@ +363,5 @@
>      path.AppendLiteral("explicit/storage/sqlite/");
> +    path.Append(mFileName);
> +
> +    if (mHasLeaked) {
> +      path.AppendLiteral("-LEAKED");

I think I'd prefer to not do "-LEAKED" but rather " (LEAKED)".  Thoughts?

@@ +424,5 @@
>      return convertResultCode(rc);
>    }
> +
> +  // We call this when we know we've leaked a connection.
> +  void hasLeaked()

hasLeaked sounds like it is querying status.  How about setWillLeak?

@@ +819,2 @@
>      MutexAutoLock lockedScope(sharedAsyncExecutionMutex);
>      NS_ENSURE_FALSE(mAsyncExecutionThread, NS_ERROR_UNEXPECTED);

Because this error isn't very telling in it's current state, let's add a local variable to help make the debug output better:
bool asyncCloseWasCalled = !mAsyncExecutionThread;
NS_ENSURE_TRUE(asyncCloseWasCalled, NS_ERROR_UNEXPECTED);

::: storage/src/mozStorageConnection.h
@@ +203,5 @@
>  
>    sqlite3 *mDBConn;
>    nsCOMPtr<nsIFile> mDatabaseFile;
>  
> +  nsTArray<nsCOMPtr<StorageMemoryReporter> > mMemoryReporters;

s/nsCOMPtr/nsRefPtr/
Attachment #540380 - Flags: review?(sdwilsh) → review-
(In reply to comment #9)
> 
> I think I'd prefer to not do "-LEAKED" but rather " (LEAKED)".  Thoughts?

No other memory reporters have a space in them.  But I guess it doesn't matter, and we shouldn't even be seeing this in practice.  I'll update.

> hasLeaked sounds like it is querying status.  How about setWillLeak?

setWillLeak sounds funny to me.  I originally had markAsLeaked(), I'll go back to that.

Thanks for reviewing, I'll do the test and make the changes.
(In reply to comment #10)
> No other memory reporters have a space in them.  But I guess it doesn't matter,
> and we shouldn't even be seeing this in practice.  I'll update.
Yeah, but other memory reporters aren't indicating state like this either AFIAK, right?

> setWillLeak sounds funny to me.  I originally had markAsLeaked(), I'll go back
> to that.
markAsLeaked sounds better than my suggestion (:
(In reply to comment #6)
> > @@ +363,5 @@
> > >      path.AppendLiteral("explicit/storage/sqlite/");
> > > +    path.Append(mFileName);
> > > +
> > > +    if (mHasLeaked) {
> > > +      path.AppendLiteral("-LEAKED");
> > 
> > fwiw, -LEAKED sounds scary to me (And I guess to users), looks like
> > everything there leaked, not just a connection. What about -HAS-LEAKS?
> 
> The entire connection has leaked, and this memory reporter is describing the
> connection, so I think it's a reasonable name.  It's supposed to sound scary
> :)

Yes, I meant that to me sounds possible there are multiple connections, and maybe some of them leaked and some not. Btw, I don't mind :)

Regarding the autostrings, they should avoid the heap allocation if the string is shorter and not be much more expensive when longer (they'll just allocate on the heap the rare database names longer than 64 chars). So I'm not sure what is preventing not using them, apart their advantages.
Blocks: 670953
Attached patch patch, v3, with test (obsolete) — Splinter Review
Well, that wasn't much fun, but it's done.

I ended up going with "foo-LEAKED" because I couldn't bear "foo (LEAKED)" since it was a different style to every other memory reporter.  Hope you don't mind.  Hopefully it'll never be seen by anyone :)
Attachment #540380 - Attachment is obsolete: true
Attachment #546094 - Flags: review?(sdwilsh)
Comment on attachment 546094 [details] [diff] [review]
patch, v3, with test

Review of attachment 546094 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/aboutmemory/tests/chrome/test_asyncClose_leak.xul
@@ +22,5 @@
> +  const Ci = Components.interfaces;
> +  const Cu = Components.utils;
> +
> +  var db = Cc["@mozilla.org/browser/nav-history-service;1"].
> +           getService(Ci.nsPIPlacesDatabase).DBConnection.clone(true);

having a test in another component depending on Places is a bit strange esp. since one may compile without it, I suppose it may be better to create a db for this test, it should just be matter of opening a connection to a new database.

@@ +28,5 @@
> +  // A statement, the exact form doesn't matter, it just has to be
> +  // asynchronous.
> +  var stmt = db.createAsyncStatement(
> +               "SELECT SUM(visit_count) AS count " +
> +               "FROM moz_places WHERE rev_host = :rev_host");

and as a query to simplify things you may select from sqlite_master table that always exists

@@ +42,5 @@
> +    }
> +  });
> +  stmt.finalize();
> +  //db.asyncClose();
> +  stmt = null;

this commented out line looks bogus as it is, I know why it is there, but a comment would clarify that for everyone.
> I suppose it may be better to create a db
> for this test, it should just be matter of opening a connection to a new
> database.

How do I do that?
just make a nsIFile and storage.openDatabase(file) it
Attached patch patch, v4Splinter Review
This patch address mak's comments on the previous patch.
Attachment #546094 - Attachment is obsolete: true
Attachment #546094 - Flags: review?(sdwilsh)
Attachment #548063 - Flags: review?(mak77)
Try run for e1c8cbe45980 is complete.
Detailed breakdown of the results available here:
    http://tbpl.mozilla.org/?tree=Try&rev=e1c8cbe45980
Results:
    success: 128
    warnings: 16
    failure: 2
Total buildrequests: 146
Sigh, I got a try failure on the Linux64 opt build, which I cannot reproduce locally.  Presumably the GC is not running in time and so the DB isn't being collected, which means the leak doesn't occur in time.
Comment on attachment 548063 [details] [diff] [review]
patch, v4

Review of attachment 548063 [details] [diff] [review]:
-----------------------------------------------------------------

it looks fine to me, you should clearly resolve the random failures in the test before pushing, but that's obvious.

::: storage/src/mozStorageConnection.cpp
@@ +469,5 @@
> +  // with asyncClose(), the connection is about to leak and it's too late to do
> +  // anything about it.  So we mark the memory reporters accordingly so that
> +  // the leak will be obvious in about:memory.
> +  for (PRUint32 i = 0; i < mMemoryReporters.Length(); i++) {
> +    if (mMemoryReporters[i] != nsnull) {

any specific reasons you can't do if (mMemoryReporters[i]) ? I think the general good practice is to avoid nsnull or boolean comparisons

@@ +820,3 @@
>      MutexAutoLock lockedScope(sharedAsyncExecutionMutex);
> +    bool asyncCloseWasCalled = !mAsyncExecutionThread;
> +    NS_ENSURE_TRUE(asyncCloseWasCalled, NS_ERROR_UNEXPECTED);

I think there is no need of a warning here, based on recent discussion about removing useless warnings (benjamin would like to remove NS_ENSURE_ usage in new patches), I'd rather
if (mAsyncExecutionThread)
  return NS_ERROR_UNEXPECTED;
and better explain in the comment above that this may fail if asyncClosed has not been called before, rather than adding another bool assignment.

::: toolkit/components/aboutmemory/tests/chrome/test_asyncClose_leak.xul
@@ +24,5 @@
> +
> +  // Make a fake DB file.
> +  var file = Cc["@mozilla.org/file/directory_service;1"].
> +             getService(Ci.nsIProperties).
> +             get("TmpD", Ci.nsIFile);

you may create a file in ProfD I guess that should be removed at the end of the test harness run, not sure if creating in tmp is fine on all platforms, esp because that file may stick there and never be removed across test runs.

@@ +26,5 @@
> +  var file = Cc["@mozilla.org/file/directory_service;1"].
> +             getService(Ci.nsIProperties).
> +             get("TmpD", Ci.nsIFile);
> +  file.append("asyncClose_leak-fake-DB-tmp.sqlite");
> +  file.createUnique(Ci.nsIFile.NORMAL_FILE_TYPE, 0666);

and then you wouldn't need this createUnique thing (lust let sqlite create the file and harness cleanup, bonus points if the test deletes the db before finishing, that is usually a good choice)

@@ +29,5 @@
> +  file.append("asyncClose_leak-fake-DB-tmp.sqlite");
> +  file.createUnique(Ci.nsIFile.NORMAL_FILE_TYPE, 0666);
> +
> +  var service = Cc["@mozilla.org/storage/service;1"].
> +                getService(Ci.mozIStorageService);

nit: s/service/storage/ since service is so generic I have no idea which service is

@@ +44,5 @@
> +    },
> +    handleCompletion: function(aReason) {
> +    }
> +  });
> +  stmt.finalize();

nit: try {} around execution and put finalize() in a finally {} (people copies from our tests, let's give them good practices)
Attachment #548063 - Flags: review?(mak77) → review+
We're still getting occasional crashes, see bug 654573 comment 76.
Crash Signature: [@ mozilla::storage::StorageMemoryReporter::GetPath ] [@ @0x0 | mozilla::storage::StorageMemoryReporter::GetPath ] [@ mozilla::storage::Connection::getFilename() ] [@ mozilla::storage::Connection::getFilename ]
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
With combined signatures, it is #3 top browser crasher in 7.0a2.
OS: Linux → All
Hardware: x86_64 → x86
(In reply to Scoobidiver from comment #25)
> With combined signatures, it is #3 top browser crasher in 7.0a2.

Ugh.  If we want to fix it in FF7, either we land this patch, which tolerates the non-asyncClosed connections -- i.e. avoids the crash but the leak still occurs.  Alternatively, we find the connection that isn't being asyncClosed;  that's harder but would truly fix the problem.
Leaking is better than crashing.  We should fix the crash, and then we can try to find the bad connection after the fact to plug the leak.
Comment on attachment 548063 [details] [diff] [review]
patch, v4

We decided we want this for aurora rather than hunt down every case where they are mismatched. Can we make sure there is a followup for finding all the mismatched calls?
Attachment #548063 - Flags: approval-mozilla-aurora+
http://hg.mozilla.org/releases/mozilla-aurora/rev/d05ba21b434b

> Can we make sure there is a followup for finding all
> the mismatched calls?

Bug 677809.
Severity: normal → critical
Keywords: crash
How can this be verified?
Thanks.
(In reply to Vlad from comment #30)
> How can this be verified?

Run the chrome test that landed with the patch.
Based on today's TBPL, this test is passing. http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Aurora/1313761786.1313766031.6313.gz&fulltext=1

Setting resolution to Verified Fixed.
Status: RESOLVED → VERIFIED
Product: Toolkit → Core
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: