See discussion in bug 798849. To quote that bug: That's insane.
5 years ago
5 years ago
Created attachment 671942 [details] [diff] [review] Patch This is mostly the interdiff between v4.4 and v4.5 in Bug 496019, with one major exception. This does not handle the case where someone forgets to call AsyncClose before the connection is destroyed. That already leaks stuff (Bug 704030) and it seems like we should deal with both problems at the same time. It does add a fatal assertion that we don't do that, which should catch any mistakes in the tree.
Comment on attachment 671942 [details] [diff] [review] Patch It's sad that the wrong patch landed way back when. Thanks for the updated patch, including the improve nsCOMPtr use! If you/anyone reading this is going to handle bug 704030 too, I think mak would be the best reviewer for that.
Sounds like we should consider back porting to other releases for our poor 32-bit users, unless there's some reason why there won't be many connections in the life of the app?
Additional clarification: khuey said this wasn't a major problem because in the browser these connections are mostly for-the-lifetime-of-the-process anyway, and so fixing the leak won't make much difference. However, it may be an issue for some add-ons, and also in tests that open and close many connections.
This is a longstanding issue with no known critical user impact. No need to track for upcoming releases. If there's still *significant* value in uplifting, please nominate for aurora/beta approval.
So this got backed out on inbound without any comments as to why... https://hg.mozilla.org/integration/mozilla-inbound/rev/cabf5819a565 I transplanted the backout to m-c rather than waiting for the next merge. https://hg.mozilla.org/mozilla-central/rev/d9520848b410
It got backed out because it appeared to be causing a bunch of asserts (but not hte one I added) in xpcshell.
Created attachment 672858 [details] [diff] [review] Patch Now with less crashes. The only change is in Connection::isAsyncClosing. Service::Observe contains code to observe xpcom-shutdown-threads and spin the event loop why isAsyncClosing is true on any of the set of extant Connections. Because I used mAsyncExecutionTarget.forget(), mAsyncExecutionTarget is null during the closing process now and isAsyncClosing always returns false. Thus we don't spin the event loop enough and we trip the assertion that after xpcom-shutdown-threads all of our connections are shutdown. This code is the only consumer of isAsyncClosing, so this should be safe.
Comment on attachment 672858 [details] [diff] [review] Patch Looks good. Although mAsyncExecutionThreadShuttingDown never transitions to false, mDBConn will become false and this will be eventually (it's not done in a lock, but a lock is not required for correctness) result in isAsyncClosing returning false and the loop spinning terminating. The old 4.5 rev patch would have terminated prematurely depending on the race.
Er, and the internalClose happens on the main thread anyways, so strike the bit about "eventually". Man, it's been a while. Thanks for all y'all making the platform a better place!
Thanks for dredging your memory here. https://hg.mozilla.org/integration/mozilla-inbound/rev/30a31da7097f
Comment on attachment 672858 [details] [diff] [review] Patch [Approval Request Comment] Bug caused by (feature/regressing bug #): A long time ago User impact if declined: Moth test runs will go orange for silly reasons more often, lowering our test coverage. Testing completed (on m-c, etc.): On m-c Risk to taking this patch (and alternatives if risky): This is moderately risky. We thing we've figured this out correctly but this is complex multithreaded code. We haven't seen any problems on m-c since the second landing. String or UUID changes made by this patch: None.
Comment on attachment 672858 [details] [diff] [review] Patch [Triage Comment] The risk/reward here doesn't play out from the user perspective. Apologies.
(In reply to Alex Keybl [:akeybl] from comment #17) > Comment on attachment 672858 [details] [diff] [review] > Patch > > [Triage Comment] > The risk/reward here doesn't play out from the user perspective. Apologies. Fair enough, but Linux32 mochitest-browser-chrome now hidden on TBPL for aurora.
(In reply to Ed Morley (PTO/travelling until 24th Oct) [:edmorley UTC+1] from comment #18) > Fair enough, but Linux32 mochitest-browser-chrome now hidden on TBPL for > aurora. Alex, please reconsider this nomination. I really don't think we want no mozilla18 browser-chrome test coverage on Linux and this patch is the only way to get it working again.
Comment on attachment 672858 [details] [diff] [review] Patch Renominating for comments 18/19.
Comment on attachment 672858 [details] [diff] [review] Patch We don't want to end up with no mochitest coverage, is there user-facing risk here if there are regressions from this patch? That's not clear from the comments so far, and if the risk is just more wonkiness in the tests that we will a) catch quickly and b) fix with a simple backout of this patch, we could reconsider this for uplift.
The line of discussion here seems to imply that tests are doing something that our users are not doing. If that is the case, then the alternative may be to disable those particular tests. There is also risk in that approach that some testing of features that users do use may be disabled.
Disabling particular tests will not workaround the problem. It will only move the failure to other tests (see bug 798849 comment #0). The only way would be hiding the entire browser-chrome on Linux opt if the patch can not be accepted.
From comment 5, the difference between the browser and the tests is that some tests are starting new mozStorage connections, while the browser (re-)uses lifetime connections. Only the tests that use new mozStorage connections would need to be disabled. I'm assuming not every test in browser-chrome uses new mozStorage connections.
Kyle: In your risk assessment you say: "This is moderately risky. We thing we've figured this out correctly but this is complex multithreaded code. We haven't seen any problems on m-c since the second landing." It's been on central 5 days now, is that an enough time now to re-evaluate the risk again and know if we're doing this right? Given comment 17 and comment 18 we want to be able to approve this for aurora but we are concerned about any potential regressions that could go undetected. What is a decent amount of time (and usage) to give us more confidence in uplifting here?
I think we can be more confident now that the fix is correct. We have not seen any confirmed or suspected regressions from this.
(In reply to Ryan VanderMeulen from comment #19) > (In reply to Ed Morley (PTO/travelling until 24th Oct) [:edmorley UTC+1] > from comment #18) > > Fair enough, but Linux32 mochitest-browser-chrome now hidden on TBPL for > > aurora. > > Alex, please reconsider this nomination. I really don't think we want no > mozilla18 browser-chrome test coverage on Linux and this patch is the only > way to get it working again. I didn't suggest we hide the tests, just that we leave this unfixed. I don't see this bug suggesting we're permaorange "Moth test runs will go orange for silly reasons more often".
This bug causes us to run out of virtual address space and crash, which turns the test run orange.
To put it absolutely as clearly as possible: The "bc" job, the browser-chrome tests, are already hidden on the "Linux pgo" row of tbpl on https://tbpl.mozilla.org/?tree=Mozilla-Aurora. It's no longer Moth, browser-chrome got separated out from Moth. That means that the Linux 32-bit on-push and nightly builds will continue to run browser-chrome tests, but no sheriff will look at them, for the lifespan of Gecko 18. edmorley spoke for us in comment 18 - we're done with looking at OOM-destroyed runs. It's still possible to see them in https://tbpl.mozilla.org/?tree=Mozilla-Aurora&noignore=1, though they hardly ever run to completion, but we will not be looking at them and we will not be sorting out the dozens of different expected OOM failures resulting from this bug from any other failures, so if (as is not uncommon) someone breaks a test on 32-bit Linux, we will not know, and we will not back them out. It has run green 4 times since Tuesday morning, and orange 27 times. I didn't look, since they are not now our responsibility, but based on how it went on the trunk, I would expect that more than 20 of those oranges were not just "orange for silly reasons," they were instead "orange from OOM, causing the test run to halt and fail, so there is no way that even someone who knows all of the OOM failures can know whether or not something bad happened to another test."
Alex, the bug covering the permaorange is bug 798849, which is blocked on this one. The patch for this bug fixes that one.
Note that: status-firefox-esr17: affected → wontfix Means I've just hidden Linux32 opt/pgo mochitest-browser-chrome on ESR17.