Bug 802239 - mozStorage leaks one thread per connection until shutdown
Status: RESOLVED FIXED
Whiteboard: [MemShrink:P2], [qa-]
Product: Toolkit
Classification: Components
Component: Storage
Version: Trunk
Platform: All All
Importance: -- major
Target Milestone: mozilla19
Assigned To: Kyle Huey [:khuey] (khuey@mozilla.com) (Away until 6/13)
Mentors:
Depends on:
Blocks: 496019 798849
Reported: 2012-10-16 10:29 PDT by Boris Zbarsky [:bz]
Modified: 2012-12-18 05:19 PST
CC: 23 users
ryanvm: in-testsuite-
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Patch (4.00 KB, patch)
2012-10-16 11:19 PDT, Kyle Huey [:khuey] (khuey@mozilla.com) (Away until 6/13)
bugmail: review+
Details | Diff | Review
Patch (4.53 KB, patch)
2012-10-18 11:03 PDT, Kyle Huey [:khuey] (khuey@mozilla.com) (Away until 6/13)
bugmail: review+
bajaj.bhavana: approval-mozilla-aurora+
Details | Diff | Review

Description Boris Zbarsky [:bz] 2012-10-16 10:29:45 PDT
See discussion in bug 798849.  To quote that bug:

  That's insane.
Comment 1 Ben Turner (not reading bugmail, use the needinfo flag!) 2012-10-16 10:51:26 PDT
Looks like this regressed in bug 496019, see comment 71.
Comment 2 Kyle Huey [:khuey] (khuey@mozilla.com) (Away until 6/13) 2012-10-16 11:19:00 PDT
Created attachment 671942 [details] [diff] [review]
Patch

This is mostly the interdiff between v4.4 and v4.5 in Bug 496019, with one major exception.  This does not handle the case where someone forgets to call AsyncClose before the connection is destroyed.  That already leaks stuff (Bug 704030) and it seems like we should deal with both problems at the same time.  It does add a fatal assertion that we don't do that, which should catch any mistakes in the tree.
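In rough outline, the change looks something like the sketch below. This is a minimal illustration assuming the member names discussed later in this bug (mAsyncExecutionTarget and a shutting-down flag); it is not the attached patch.

  #include "mozilla/Assertions.h"
  #include "nsCOMPtr.h"
  #include "nsIEventTarget.h"

  // Minimal sketch only, not attachment 671942.  Everything except
  // AsyncClose, mAsyncExecutionTarget and mAsyncExecutionThreadShuttingDown
  // is an illustrative assumption.
  class Connection {
  public:
    Connection() : mAsyncExecutionThreadShuttingDown(false) {}

    nsresult AsyncClose() {
      // Hand the async thread over to the close operation so it can be shut
      // down once the pending work (including the close itself) finishes,
      // instead of being kept alive until xpcom-shutdown-threads.
      nsCOMPtr<nsIEventTarget> target = mAsyncExecutionTarget.forget();
      mAsyncExecutionThreadShuttingDown = true;
      // ... dispatch the async close event, which takes ownership of
      // `target` and ultimately shuts the underlying thread down from the
      // main thread ...
      return NS_OK;
    }

    ~Connection() {
      // The fatal assertion described above: destroying a connection that
      // still owns a live async thread means AsyncClose was never called.
      MOZ_ASSERT(!mAsyncExecutionTarget,
                 "Connection destroyed without calling AsyncClose()");
    }

  private:
    nsCOMPtr<nsIEventTarget> mAsyncExecutionTarget;
    bool mAsyncExecutionThreadShuttingDown;
  };

The point of the sketch is the ownership change: the thread's lifetime is tied to the completion of AsyncClose rather than to xpcom shutdown, which is what stops the one-thread-per-connection leak.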
Comment 3 Andrew Sutherland [:asuth] 2012-10-16 11:43:48 PDT
Comment on attachment 671942 [details] [diff] [review]
Patch

It's sad that the wrong patch landed way back when.

Thanks for the updated patch, including the improved nsCOMPtr use!

If you/anyone reading this is going to handle bug 704030 too, I think mak would be the best reviewer for that.
Comment 4 Karl Tomlinson (ni?:karlt) 2012-10-16 16:20:18 PDT
Sounds like we should consider back porting to other releases for our poor 32-bit users, unless there's some reason why there won't be many connections in the life of the app?
Comment 5 Nicholas Nethercote [:njn] 2012-10-16 17:11:05 PDT
Additional clarification:  khuey said this wasn't a major problem because in the browser these connections are mostly for-the-lifetime-of-the-process anyway, and so fixing the leak won't make much difference.

However, it may be an issue for some add-ons, and also in tests that open and close many connections.
Comment 6 Kyle Huey [:khuey] (khuey@mozilla.com) (Away until 6/13) 2012-10-17 11:57:21 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/df42fa3fb07f
Comment 7 Alex Keybl [:akeybl] 2012-10-17 16:04:08 PDT
This is a longstanding issue with no known critical user impact. No need to track for upcoming releases. If there's still *significant* value in uplifting, please nominate for aurora/beta approval.
Comment 8 Ryan VanderMeulen [:RyanVM] 2012-10-17 19:23:02 PDT
https://hg.mozilla.org/mozilla-central/rev/df42fa3fb07f
Comment 9 Ryan VanderMeulen [:RyanVM] 2012-10-17 19:43:57 PDT
So this got backed out on inbound without any comments as to why...
https://hg.mozilla.org/integration/mozilla-inbound/rev/cabf5819a565

I transplanted the backout to m-c rather than waiting for the next merge.
https://hg.mozilla.org/mozilla-central/rev/d9520848b410
Comment 10 Kyle Huey [:khuey] (khuey@mozilla.com) (Away until 6/13) 2012-10-17 21:29:46 PDT
It got backed out because it appeared to be causing a bunch of asserts (but not the one I added) in xpcshell.
Comment 11 Kyle Huey [:khuey] (khuey@mozilla.com) (Away until 6/13) 2012-10-18 11:03:08 PDT
Created attachment 672858 [details] [diff] [review]
Patch

Now with less crashes.

The only change is in Connection::isAsyncClosing.  Service::Observe contains code to observe xpcom-shutdown-threads and spin the event loop while isAsyncClosing is true on any of the set of extant Connections.  Because I used mAsyncExecutionTarget.forget(), mAsyncExecutionTarget is null during the closing process now and isAsyncClosing always returns false.  Thus we don't spin the event loop enough and we trip the assertion that after xpcom-shutdown-threads all of our connections are shut down.  This code is the only consumer of isAsyncClosing, so this should be safe.
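To make the interaction concrete, a simplified sketch: isAsyncClosing, mAsyncExecutionThreadShuttingDown and mDBConn are the names used in this bug, while the spin loop below is an illustrative stand-in for the relevant branch of Service::Observe, not the real code.

  #include "nsTArray.h"
  #include "nsThreadUtils.h"   // NS_ProcessNextEvent

  // The predicate, in effect: an async close is in flight and the underlying
  // database handle (mDBConn) has not been torn down yet.  Keying this off
  // mAsyncExecutionTarget instead is what broke the first landing, because
  // forget() nulls that member out as soon as the close starts.
  bool Connection::isAsyncClosing() {
    return mAsyncExecutionThreadShuttingDown && !!mDBConn;
  }

  // Illustrative stand-in for the xpcom-shutdown-threads handling in
  // Service::Observe: spin the main-thread event loop until no extant
  // connection is still async-closing, so the pending close events (and the
  // thread shutdowns they trigger) get a chance to run.
  static void SpinUntilConnectionsClosed(nsTArray<Connection*>& aConnections) {
    for (;;) {
      bool anyClosing = false;
      for (uint32_t i = 0; i < aConnections.Length(); i++) {
        if (aConnections[i]->isAsyncClosing()) {
          anyClosing = true;
          break;
        }
      }
      if (!anyClosing) {
        break;
      }
      NS_ProcessNextEvent(nullptr);   // let a pending close event run
    }
  }

Per comments 12 and 13, internalClose runs on the main thread, so the main thread observes mDBConn going null without needing a lock and the loop terminates.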
Comment 12 Andrew Sutherland [:asuth] 2012-10-18 11:47:42 PDT
Comment on attachment 672858 [details] [diff] [review]
Patch

Looks good.  Although mAsyncExecutionThreadShuttingDown never transitions to false, mDBConn will become null, and this will eventually (it's not done under a lock, but a lock is not required for correctness) result in isAsyncClosing returning false and the event-loop spinning terminating.  The old 4.5 rev patch would have terminated prematurely depending on the race.
Comment 13 Andrew Sutherland [:asuth] 2012-10-18 11:50:17 PDT
Er, and the internalClose happens on the main thread anyways, so strike the bit about "eventually".  Man, it's been a while.

Thanks for all y'all making the platform a better place!
Comment 14 Kyle Huey [:khuey] (khuey@mozilla.com) (Away until 6/13) 2012-10-18 11:51:48 PDT
Thanks for dredging your memory here.

https://hg.mozilla.org/integration/mozilla-inbound/rev/30a31da7097f
Comment 15 Ryan VanderMeulen [:RyanVM] 2012-10-18 18:38:48 PDT
https://hg.mozilla.org/mozilla-central/rev/30a31da7097f
Comment 16 Kyle Huey [:khuey] (khuey@mozilla.com) (Away until 6/13) 2012-10-19 09:51:05 PDT
Comment on attachment 672858 [details] [diff] [review]
Patch

[Approval Request Comment]
Bug caused by (feature/regressing bug #): A long time ago
User impact if declined: Moth test runs will go orange for silly reasons more often, lowering our test coverage.
Testing completed (on m-c, etc.): On m-c
Risk to taking this patch (and alternatives if risky): This is moderately risky.  We think we've figured this out correctly, but this is complex multithreaded code.  We haven't seen any problems on m-c since the second landing.
String or UUID changes made by this patch: None.
Comment 17 Alex Keybl [:akeybl] 2012-10-19 16:01:08 PDT
Comment on attachment 672858 [details] [diff] [review]
Patch

[Triage Comment]
The risk/reward here doesn't play out from the user perspective. Apologies.
Comment 18 Ed Morley [:emorley] 2012-10-19 16:05:34 PDT
(In reply to Alex Keybl [:akeybl] from comment #17)
> Comment on attachment 672858 [details] [diff] [review]
> Patch
> 
> [Triage Comment]
> The risk/reward here doesn't play out from the user perspective. Apologies.

Fair enough, but Linux32 mochitest-browser-chrome now hidden on TBPL for aurora.
Comment 19 Ryan VanderMeulen [:RyanVM] 2012-10-20 07:36:00 PDT
(In reply to Ed Morley (PTO/travelling until 24th Oct) [:edmorley UTC+1] from comment #18)
> Fair enough, but Linux32 mochitest-browser-chrome now hidden on TBPL for
> aurora.

Alex, please reconsider this nomination. I really don't think we want no mozilla18 browser-chrome test coverage on Linux and this patch is the only way to get it working again.
Comment 20 Kyle Huey [:khuey] (khuey@mozilla.com) (Away until 6/13) 2012-10-21 16:07:36 PDT
Comment on attachment 672858 [details] [diff] [review]
Patch

Renominating for comments 18/19.
Comment 21 Lukas Blakk [:lsblakk] use ?needinfo 2012-10-22 16:15:30 PDT
Comment on attachment 672858 [details] [diff] [review]
Patch

We don't want to end up with no mochitest coverage.  Is there user-facing risk here if there are regressions from this patch?  That's not clear from the comments so far, and if the risk is just more wonkiness in the tests that we will a) catch quickly and b) fix with a simple backout of this patch, we could reconsider this for uplift.
Comment 22 Karl Tomlinson (ni?:karlt) 2012-10-22 16:20:44 PDT
The line of discussion here seems to imply that tests are doing something that our users are not doing.  If that is the case, then the alternative may be to disable those particular tests.

There is also a risk in that approach: some testing of features that users do use may be disabled.
Comment 23 Masatoshi Kimura [:emk] 2012-10-22 23:14:07 PDT
Disabling particular tests will not work around the problem. It will only move the failure to other tests (see bug 798849 comment #0). The only remaining option would be hiding the entire browser-chrome suite on Linux opt if the patch cannot be accepted.
Comment 24 Karl Tomlinson (ni?:karlt) 2012-10-23 00:07:52 PDT
From comment 5, the difference between the browser and the tests is that some tests are starting new mozStorage connections, while the browser (re-)uses lifetime connections.  Only the tests that use new mozStorage connections would need to be disabled.  I'm assuming not every test in browser-chrome uses new mozStorage connections.
Comment 25 Lukas Blakk [:lsblakk] use ?needinfo 2012-10-24 16:20:56 PDT
Kyle: In your risk assessment you say: "This is moderately risky.  We think we've figured this out correctly, but this is complex multithreaded code.  We haven't seen any problems on m-c since the second landing."  It's been on central for 5 days now; is that enough time to re-evaluate the risk and know whether we're doing this right?  Given comment 17 and comment 18 we want to be able to approve this for aurora, but we are concerned about any potential regressions that could go undetected.  What is a decent amount of time (and usage) to give us more confidence in uplifting here?
Comment 26 Kyle Huey [:khuey] (khuey@mozilla.com) (Away until 6/13) 2012-10-24 16:39:48 PDT
I think we can be more confident now that the fix is correct.  We have not seen any confirmed or suspected regressions from this.
Comment 27 Alex Keybl [:akeybl] 2012-10-25 15:07:57 PDT
(In reply to Ryan VanderMeulen from comment #19)
> (In reply to Ed Morley (PTO/travelling until 24th Oct) [:edmorley UTC+1]
> from comment #18)
> > Fair enough, but Linux32 mochitest-browser-chrome now hidden on TBPL for
> > aurora.
> 
> Alex, please reconsider this nomination. I really don't think we want no
> mozilla18 browser-chrome test coverage on Linux and this patch is the only
> way to get it working again.

I didn't suggest we hide the tests, just that we leave this unfixed. I don't see this bug suggesting we're permaorange; the approval request only said "Moth test runs will go orange for silly reasons more often".
Comment 28 Kyle Huey [:khuey] (khuey@mozilla.com) (Away until 6/13) 2012-10-25 15:11:44 PDT
This bug causes us to run out of virtual address space and crash, which turns the test run orange.
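For a rough sense of scale (approximate numbers, not measurements from these runs): a 32-bit Linux process has on the order of 3 GB of usable address space, and each leaked helper thread keeps its stack reservation alive for as long as it lives. If each thread reserves the common glibc default of 8 MB of stack address space, then roughly 3 GB / 8 MB ≈ 380 leaked connection threads would exhaust the address space on their own, and in practice the limit is hit much sooner because the rest of the browser competes for the same space.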
Comment 29 Phil Ringnalda (:philor) 2012-10-25 15:35:43 PDT
To put it absolutely as clearly as possible:

The "bc" job, the browser-chrome tests, are already hidden on the "Linux pgo" row of tbpl on https://tbpl.mozilla.org/?tree=Mozilla-Aurora. It's no longer Moth, browser-chrome got separated out from Moth. That means that the Linux 32-bit on-push and nightly builds will continue to run browser-chrome tests, but no sheriff will look at them, for the lifespan of Gecko 18. edmorley spoke for us in comment 18 - we're done with looking at OOM-destroyed runs.

It's still possible to see them in https://tbpl.mozilla.org/?tree=Mozilla-Aurora&noignore=1, though they hardly ever run to completion, but we will not be looking at them and we will not be sorting out the dozens of different expected OOM failures resulting from this bug from any other failures, so if (as is not uncommon) someone breaks a test on 32-bit Linux, we will not know, and we will not back them out. It has run green 4 times since Tuesday morning, and orange 27 times. I didn't look, since they are not now our responsibility, but based on how it went on the trunk, I would expect that more than 20 of those oranges were not just "orange for silly reasons," they were instead "orange from OOM, causing the test run to halt and fail, so there is no way that even someone who knows all of the OOM failures can know whether or not something bad happened to another test."
Comment 30 Boris Zbarsky [:bz] 2012-10-25 17:37:38 PDT
Alex, the bug covering the permaorange is bug 798849, which is blocked on this one.  The patch for this bug fixes that one.
Comment 31 Phil Ringnalda (:philor) 2012-10-26 22:59:09 PDT
https://hg.mozilla.org/releases/mozilla-aurora/rev/fb18c6ecbb99
Comment 32 Ed Morley [:emorley] 2012-12-18 05:19:36 PST
Note that:
status-firefox-esr17: affected → wontfix

Means I've just hidden Linux32 opt/pgo mochitest-browser-chrome on ESR17.
