Open Bug 988464 Opened 6 years ago Updated 5 months ago

Flag threads that aren't shut down before nsThreadManager::Shutdown()

Categories

(Core :: XPCOM, defect)

defect
Not set

Tracking

()

ASSIGNED
mozilla35

People

(Reporter: jesup, Assigned: jesup)

References

(Depends on 2 open bugs)

Details

Attachments

(1 file, 8 obsolete files)

This is to catch nsThread leaks like the one in bug 986764.  Threads should be shut down in xpcom-shutdown-threads.

However, this may not be always worthwhile (and will cost code just for cleanup), so we could easily mark known-want-to-shutdown-at-threadmanager-shutdown threads so they don't get flagged and just ::Shutdown().
It's an API contract that client code must shut down all threads by the time xpcom-shutdown-threads finishes, so all of that code ought to exist already.
Note: some threads with ShutdownNeeded() == false have PRThread pointers that go to freed memory!
https://tbpl.mozilla.org/?tree=Try&rev=8ac1c136a95f
For some reason I wasn't getting notices on Try runs with nsThread:3 in automation.py.in, though local mochitest runs do show it.  So I added an printf() for good measure.
Attachment #8397304 - Attachment is obsolete: true
shuts down MediaThread in xpcom-shutdown (which is before xpcom-shutdown-threads....).  Perhaps we should shut it down later in xpcom-shutdown-threads.  Also it appeared there were several extant refs to MediaManager itself
Note: automation.py is no longer used in Mochitest.
Attachment #8397379 - Attachment is obsolete: true
Blocks: 988872
Blocks: 988871
Blocks: 988874
Blocks: 988877
Blocks: 988881
Blocks: 988884
Attachment #8397840 - Attachment is obsolete: true
Comment on attachment 8398015 [details] [diff] [review]
log nsThreads still active at nsThreadManager::Shutdown()

I can remove automation.py.in if we're not using it anymore (seemed to be used for a local ./mach mochitest-plain)
Attachment #8398015 - Flags: review?(benjamin)
Attachment #8397382 - Attachment is obsolete: true
I think we need to lock around the PRThread access
Attachment #8398015 - Attachment is obsolete: true
Attachment #8398015 - Flags: review?(benjamin)
Attachment #8398041 - Flags: review?(benjamin)
(In reply to Randell Jesup [:jesup] from comment #9)
> I can remove automation.py.in if we're not using it anymore (seemed to be
> used for a local ./mach mochitest-plain)

We're not using it for mochitest, it's still being used for reftest. It should not be used in a local mochitest run, I'm not sure where you're seeing that.
Comment on attachment 8398041 [details] [diff] [review]
log nsThreads still active at nsThreadManager::Shutdown()

Why is the top part #ifdef PR_LOGGING but the lower part #ifdef DEBUG?

Is this code so performance-critical that we couldn't use PR logging in all builds and use PR_LOG_TEST to guard whether we enter those logging loops?
Flags: needinfo?(rjesup)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #13)
> Comment on attachment 8398041 [details] [diff] [review]
> log nsThreads still active at nsThreadManager::Shutdown()
> 
> Why is the top part #ifdef PR_LOGGING but the lower part #ifdef DEBUG?

History of how I wrote it; originally it was leeching off nsThread's logs.  It should be PR_LOGGING I think

> Is this code so performance-critical that we couldn't use PR logging in all
> builds and use PR_LOG_TEST to guard whether we enter those logging loops?

It's not perf-critical, and only does two PR_Log() calls anyways in shutdown.  (Partly I was initially cautious about letting this code run in opt/release builds because I know half-shut-down is a tricky state.)  And there should be few or no threads here anyways.)  We can use PR_LOG_TEST, but I don't think it's important here.
Flags: needinfo?(rjesup)
Comment on attachment 8398041 [details] [diff] [review]
log nsThreads still active at nsThreadManager::Shutdown()

ok r=me with #ifdef PR_LOGGING instead of DEBUG
Attachment #8398041 - Flags: review?(benjamin) → review+
Attachment #8398041 - Attachment is obsolete: true
Comment on attachment 8457444 [details] [diff] [review]
log nsThreads still active at nsThreadManager::Shutdown()

Corry r+ forward
Attachment #8457444 - Flags: review+
sorry had to back out this changes, seems this caused xpcshell test failures like 

https://tbpl.mozilla.org/php/getParsedLog.php?id=43999486&tree=Mozilla-Inbound

TEST-UNEXPECTED-FAIL | /builds/slave/talos-slave/test/build/tests/xpcshell/tests/toolkit/modules/tests/xpcshell/test_sqlite.js | test failed (with xpcshell return code: 1), see following log:
PROCESS-CRASH | /builds/slave/talos-slave/test/build/tests/xpcshell/tests/toolkit/modules/tests/xpcshell/test_sqlite.js | application crashed [@ 0x7fff82857c00]


Operating system: Mac OS X
23:48:35     INFO -                    10.6.8 10K549
23:48:35     INFO -  CPU: amd64
23:48:35     INFO -       family 6 model 23 stepping 10
23:48:35     INFO -       2 CPUs
23:48:35     INFO -  Crash reason:  EXC_BAD_ACCESS / 0x0000000d
23:48:35     INFO -  Crash address: 0x0
23:48:35     INFO -  Thread 0 (crashed)
23:48:35     INFO -   0  0x7fff82857c00
23:48:35     INFO -      rbx = 0x0000000000000008   r12 = 0x0000000000000002
23:48:35     INFO -      r13 = 0x74537a6f6d206461   r14 = 0x74537a6f6d206461
23:48:35     INFO -      r15 = 0x00007fff829a5da4   rip = 0x00007fff82857c00
23:48:35     INFO -      rsp = 0x00007fff5fbfd228   rbp = 0x00007fff5fbfd250
23:48:35     INFO -      Found by: given as instruction pointer in context
23:48:35     INFO -  Thread 1
23:48:35     INFO -   0  0x7fff8286dc0a
23:48:35     INFO -      rbx = 0x0000000106280e50   r12 = 0x00007fff7037a998
23:48:35     INFO -      r13 = 0x00007fff7037ad68   r14 = 0xffffffffffffffff
23:48:35     INFO -      r15 = 0x00007fff7037ada8   rip = 0x00007fff8286dc0a
23:48:35     INFO -      rsp = 0x0000000106280ce8   rbp = 0x0000000106280e90
23:48:35     INFO -      Found by: given as instruction pointer in context
23:48:35     INFO -  Thread 2
23:48:35     INFO -   0  0x7fff8286ea2a
23:48:35     INFO -      rbx = 0x0000000106704000   r12 = 0x00000001061593c0
23:48:35     INFO -      r13 = 0x0000000106165000   r14 = 0x0000000106704000
23:48:35     INFO -      r15 = 0x0000000000000000   rip = 0x00007fff8286ea2a
23:48:35     INFO -      rsp = 0x0000000106703f28   rbp = 0x0000000106703f50
23:48:35     INFO -      Found by: given as instruction pointer in context
23:48:35     INFO -  Thread 3
23:48:35     INFO -   0  0x7fff8286dc0a
23:48:35     INFO -      rbx = 0x0000000106147490   r12 = 0x0000000106127000
23:48:35     INFO -      r13 = 0x0000000106126800   r14 = 0x0000000106147490
23:48:35     INFO -      r15 = 0x0000000000000001   rip = 0x00007fff8286dc0a
23:48:35     INFO -      rsp = 0x0000000107280ac8   rbp = 0x0000000107280b30
23:48:35     INFO -      Found by: given as instruction pointer in context
23:48:35     INFO -  Thread 4
23:48:35     INFO -   0  0x7fff8288fa6a
23:48:35     INFO -      rbx = 0x0000000107f81000   r12 = 0x0000000106117308
I should dig out the cause and re-land this
Target Milestone: mozilla33 → mozilla35
We need to get this landed again. We're leaking threads in async storage and we didn't notice...
Minor unbitrotting of jesup's patch.
Attachment #8457444 - Attachment is obsolete: true
That sql XPCshell test still fails.  Locally, the log contains a bunch of lines that look like this:

0:04.76 LOG: Thread-1 INFO "CONSOLE_MESSAGE: (warn) [JavaScript Error: "Warning: Sqlite connection 'gc_99.sqlite#0' was not properly closed. Auto-close triggered by garbage collection.
" {file: "resource://gre/modules/Sqlite.jsm" line: 67}]"

The OSX crash report says that there's a stack overflow, though I'm not sure in what.
FYI, this is the bug we were discussing at Whistler
Simplified version of the patch, which simply NS_ASSERTION()s on any leaked nsThread.  Note that simple tests (load Youtube, quit (linux)) show 2 leaked threads: Image Scaler, and FileBlockCache (which helpfully doesn't set the threadname; I'll file a bug on that).  We likely won't want to land this without fixing those.  Note that GDB shows that FileBlockCache::Close() isn't even being called in my test.
Attachment #8649759 - Flags: review?(nfroyd)
Attachment #8586281 - Attachment is obsolete: true
Assignee: nobody → rjesup
Status: NEW → ASSIGNED
cpearce, bholley - note comment 25 about FileBlockCache leaking
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c2a8c5cf4b35

Mostly green; several C++ unit tests and XPCShell tests don't bother to shut down some of their threads.  Interesting that in mochitests, timing works out that Image Scaler and FileBlockCache get shut down (albeit with a sample size of only 1 run), while running a local linux64 debug build they fail always when closing a window after a simple test of Youtube.  I can't say I'm surprised at some of our thread shutdowns having timing races, though - one of the bugs that poked me to push this forward again has a strongly implied timing-caused thread leak.  Wouldn't be shocked if turning this on exposed a handful of intermittents in automation.

Right now it just uses NS_ASSERTION; we could make it log in opt builds (though I'd prefer something that caused orange/red on treeherder; perhaps use some env var to react more harshly if we're in a mochitest).  MOZ_CRASH would seem overkill however.
Comment on attachment 8649759 [details] [diff] [review]
log nsThreads still active at nsThreadManager::Shutdown()

Review of attachment 8649759 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with changes below.

::: xpcom/threads/nsThread.h
@@ +173,5 @@
>      }
>    };
>  
> +  // This lock protects access to mObserver, mEvents and mEventsAreDoomed
> +  // and mShutdownRequired.  All of those fields (except mShutdownRequired)

Nit: this should now read "...mEvents, mEventsAreDoomed, and mShutdownRequired"  Feel free to rewrap the comment if necessary.

::: xpcom/threads/nsThreadManager.cpp
@@ +3,5 @@
>  /* This Source Code Form is subject to the terms of the Mozilla Public
>   * License, v. 2.0. If a copy of the MPL was not distributed with this
>   * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
>  
> +#include "prprf.h"

Can we just use mozilla/Snprintf.h and use snprintf_literal directly below?  Best to avoid introducing more NSPR-isms.
Attachment #8649759 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj][:nfroyd] from comment #28)
> Can we just use mozilla/Snprintf.h and use snprintf_literal directly below? 
> Best to avoid introducing more NSPR-isms.

Sure!  I hadn't noticed that had landed; good!
See Also: → 1203869
Blocks: 1270575
Depends on: 1271402
Depends on: 1271429
Depends on: 1271514
Depends on: 1271791
tracking-fennec: --- → ?
relnote-firefox: --- → ?
Flags: sec-bounty?
Flags: in-testsuite+
Flags: behind-pref+
Flags: a11y-review+
tracking-fennec: ? → ---
relnote-firefox: ? → ---
Flags: sec-bounty?
Flags: in-testsuite+
Flags: behind-pref+
Flags: a11y-review+

"PERF" key word?

You need to log in before you can comment on or make changes to this bug.