Closed Bug 1163021 Opened 10 years ago Closed 6 years ago

NS_GetSpecialDirectory is intermittently main-thread-only (crash)

Categories

(Core :: XPCOM, defect)

defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla71
Tracking Status
firefox71 --- fixed

People

(Reporter: jib, Assigned: Gijs)

References

(Depends on 1 open bug)

Details

Attachments

(2 files, 3 obsolete files)

TL;DR; Broken faulty-use detector in NS_GetSpecialDirectory is a landmine. A 39 android crash reported in Bug 1162720 was caused by (lack of testing and) calling NS_GetSpecialDirectory off main-thread, which crashed here: http://mxr.mozilla.org/mozilla-central/source/js/xpconnect/src/XPCWrappedJS.cpp?rev=7bc6ca149561&mark=521-521#514 NS_IMETHODIMP nsXPCWrappedJS::CallMethod(uint16_t methodIndex, const XPTMethodDescriptor* info, nsXPTCMiniVariant* params) { // Do a release-mode assert against accessing nsXPCWrappedJS off-main-thread. if (!MOZ_LIKELY(NS_IsMainThread())) => MOZ_CRASH(); > (gdb) bt > #0 0x9b9ac0e0 in nsXPCWrappedJS::CallMethod (this=0x97f27500, methodIndex=<optimized out>, info=0x97fd1908, params=0x911098c8) at /Volumes/GeckoDev/gecko/js/xpconnect/src/XPCWrappedJS.cpp:521 > #1 0x9b6828e6 in PrepareAndDispatch (self=0x97984290, methodIndex=<optimized out>, args=<optimized out>) at /Volumes/GeckoDev/gecko/xpcom/reflect/xptcall/md/unix/xptcstubs_arm.cpp:93 > #2 0x9b6820a8 in SharedStub () from /Volumes/GeckoDev/objdirs/objdir-android-opt/dist/bin/libxul.so > #3 0x9b66b102 in FindProviderFile (aElement=0x97984290, aData=0x911099c8) at /Volumes/GeckoDev/gecko/xpcom/io/nsDirectoryService.cpp:344 > #4 0x9b66da1a in nsDirectoryService::Get (this=0xaf901f40, aProp=0x9cdcabba "ProfD", aUuid=..., aResult=0x85b41e94) at /Volumes/GeckoDev/gecko/xpcom/io/nsDirectoryService.cpp:374 > #5 0x9c096216 in NS_GetSpecialDirectory (aResult=0x85b41e94, aSpecialDirName=0x9cdcabba "ProfD") at ../../../dist/include/nsDirectoryServiceUtils.h:28 The question is why did NS_GetSpecialDirectory not crash on desktop? The answer is here: http://mxr.mozilla.org/mozilla-central/source/xpcom/io/nsDirectoryService.cpp?rev=2a019a588c26&mark=365-365#353 NS_IMETHODIMP nsDirectoryService::Get(const char* aProp, const nsIID& aUuid, void** aResult) { if (NS_WARN_IF(!aProp)) { return NS_ERROR_INVALID_ARG; } nsDependentCString key(aProp); nsCOMPtr<nsIFile> cachedFile = mHashtable.Get(key); if (cachedFile) { nsCOMPtr<nsIFile> cloneFile; cachedFile->Clone(getter_AddRefs(cloneFile)); => return cloneFile->QueryInterface(aUuid, aResult); } It did not crash because by luck some unrelated code must have peeked at NS_APP_USER_PROFILE_50_DIR from main thread earlier (AND nobody cleared the cache since). Whether this is always true on desktop or only some of the time, who knows? This seems bad https://www.youtube.com/watch?v=4F4qzPbcFiA so I thought it'd be interesting to turn off the nsDirectoryService caching and see what happens. Turns out Firefox does not even start. See try for details: https://treeherder.mozilla.org/#/jobs?repo=try&revision=130f1f7e691f I think we need tests in the tree that run without caching to rid us of code that depends on cache hit.
If the intent is to just enforce that people don't do this off the main thread, seems like we can just explicitly assert or crash if not mainthread on directory service entrypoints...
The caching isn't safe either because the main thread could be modifying the hash table while another thread is reading from it. nsDirectoryService is just not threadsafe at all.
Indeed, that's more useful. A non-caching mode is probably not especially valuable, and it's likely to cause shutdown issues that aren't important in practice.
Oops where "that" means moving the main-thread assertion to the very top of the method.
Agree, if nsDirectoryService is main-thread only. I've gone to great lengths to avoid file IO on main thread, as I was told that was bad, so I'm a bit surprised to learn this API is main-only.
nsIFiles are compartment-threaded and providers are typically main-thread, so the directory service is inherently forced to be main-thread. Typically you get the path you want on the main thread and pass that to your worker threads to do the actual I/O (or use OS.File to do the abstraction).
Thanks, I'll follow that pattern.
I moved the assert up, and Firefox still fails to start. Time to file some bugs I guess: > [35088] WARNING: NS_ENSURE_SUCCESS(EnsureScriptEnvironment(), nullptr) failed with result 0x80040111: file /Users/Jan/moz/mozilla-central/docshell/base/nsDocShell.cpp, line 4585 > Hit MOZ_CRASH() at /Users/Jan/moz/mozilla-central/xpcom/io/nsDirectoryService.cpp:363 > #01: nsDirectoryService::Get(char const*, nsID const&, void**)[/Users/Jan/moz/mozilla-central/obj-x86_64-apple-darwin12.2.1-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x13906a] > #02: non-virtual thunk to nsDirectoryService::Get(char const*, nsID const&, void**)[/Users/Jan/moz/mozilla-central/obj-x86_64-apple-darwin12.2.1-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x139a0a] > #03: AddAppDirToCommandLine(std::vector<std::string, std::allocator<std::string> >&)[/Users/Jan/moz/mozilla-central/obj-x86_64-apple-darwin12.2.1-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x82bef5] > #04: mozilla::ipc::GeckoChildProcessHost::PerformAsyncLaunchInternal(std::vector<std::string, std::allocator<std::string> >&, base::ProcessArchitecture)[/Users/Jan/moz/mozilla-central/obj-x86_64-apple-darwin12.2.1-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x82b644] > #05: mozilla::ipc::GeckoChildProcessHost::PerformAsyncLaunch(std::vector<std::string, std::allocator<std::string> >, base::ProcessArchitecture)[/Users/Jan/moz/mozilla-central/obj-x86_64-apple-darwin12.2.1-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x82ad5f] > #06: mozilla::ipc::GeckoChildProcessHost::RunPerformAsyncLaunch(std::vector<std::string, std::allocator<std::string> >, base::ProcessArchitecture)[/Users/Jan/moz/mozilla-central/obj-x86_64-apple-darwin12.2.1-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x82a7ed] > #07: void DispatchToMethod<mozilla::ipc::GeckoChildProcessHost, bool (mozilla::ipc::GeckoChildProcessHost::*)(std::vector<std::string, std::allocator<std::string> >, base::ProcessArchitecture), std::vector<std::string, std::allocator<std::string> >, base::Proc[/Users/Jan/moz/mozilla-central/obj-x86_64-apple-darwin12.2.1-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x831ccd] > #08: RunnableMethod<mozilla::ipc::GeckoChildProcessHost, bool (mozilla::ipc::GeckoChildProcessHost::*)(std::vector<std::string, std::allocator<std::string> >, base::ProcessArchitecture), Tuple2<std::vector<std::string, std::allocator<std::string> >, base::Proc[/Users/Jan/moz/mozilla-central/obj-x86_64-apple-darwin12.2.1-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x831b8e] > #09: MessageLoop::RunTask(Task*)[/Users/Jan/moz/mozilla-central/obj-x86_64-apple-darwin12.2.1-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x7b3c80] > #10: MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const&)[/Users/Jan/moz/mozilla-central/obj-x86_64-apple-darwin12.2.1-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x7b41ff] > #11: MessageLoop::DoWork()[/Users/Jan/moz/mozilla-central/obj-x86_64-apple-darwin12.2.1-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x7b4424] > #12: base::MessagePumpLibevent::Run(base::MessagePump::Delegate*)[/Users/Jan/moz/mozilla-central/obj-x86_64-apple-darwin12.2.1-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x7b5fab] > #13: MessageLoop::RunInternal()[/Users/Jan/moz/mozilla-central/obj-x86_64-apple-darwin12.2.1-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x7b3b65] > #14: MessageLoop::RunHandler()[/Users/Jan/moz/mozilla-central/obj-x86_64-apple-darwin12.2.1-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x7b3a75] > #15: MessageLoop::Run()[/Users/Jan/moz/mozilla-central/obj-x86_64-apple-darwin12.2.1-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x7b3a1d] > #16: base::Thread::ThreadMain()[/Users/Jan/moz/mozilla-central/obj-x86_64-apple-darwin12.2.1-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x7d9129] > #17: ThreadFunc(void*)[/Users/Jan/moz/mozilla-central/obj-x86_64-apple-darwin12.2.1-debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x7da36c] > #18: _pthread_body[/usr/lib/system/libsystem_pthread.dylib +0x1899] > #19: _pthread_struct_init[/usr/lib/system/libsystem_pthread.dylib +0x172a]
typo
Attachment #8603409 - Attachment is obsolete: true
Sorry, it's on Gecko_IOThread (5)
Depends on: 1163079
Depends on: 1163103
Attachment #8603411 - Flags: review?(bent.mozilla)
Comment on attachment 8603411 [details] [diff] [review] Enforce main-thread-only in nsDirectoryService (2) Review of attachment 8603411 [details] [diff] [review]: ----------------------------------------------------------------- Thanks! ::: xpcom/io/nsDirectoryService.cpp @@ +359,5 @@ > > nsDependentCString key(aProp); > > + if (!MOZ_LIKELY(NS_IsMainThread())) { > + MOZ_CRASH(); This could just be |MOZ_RELEASE_ASSERT(NS_IsMainThread());|
Attachment #8603411 - Flags: review?(bent.mozilla) → review+
With MOZ_RELEASE_ASSERT. Carrying forward r=bent.
Attachment #8603388 - Attachment is obsolete: true
Attachment #8603411 - Attachment is obsolete: true
Attachment #8606488 - Flags: review+
Can't land until Bug 1163079 and Bug 1163103 are fixed.
Depends on: 1575045

(In reply to :Gijs (he/him) from comment #16)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9806c46c363c9112446e7ae56ead5420fa71a57b

It looks like this problem has only gotten worse since this bug got filed. :-(

I've patches for the existing deps and will put those up in a second, and then take a look how many different cases are the cause of all the linux orange...

Depends on: 1575564
Depends on: 1575566
Depends on: 1575569
Depends on: 1575692
Depends on: 1576292

Even with the fix from bug 1576292 (all the others have landed on central by now) this seems to break the profiling "run" task, but it seems we don't get crash dumps there so it's hard to know why that is. :-\

https://treeherder.mozilla.org/#/jobs?repo=try&revision=604b8f3ee872f8842fb6c472050d54303149216b

Depends on: 1579859

(In reply to :Gijs (he/him) from comment #19)

Even with the fix from bug 1576292 (all the others have landed on central by now) this seems to break the profiling "run" task, but it seems we don't get crash dumps there so it's hard to know why that is. :-\

https://treeherder.mozilla.org/#/jobs?repo=try&revision=604b8f3ee872f8842fb6c472050d54303149216b

OK, a fix for this is now on autoland (from bug 1579859, which had a successful try run run on https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=265745898&revision=26824c36907b2b8e829f60212adb231a92f59ef9 (the broken [B] job seems to be an infra issue where there's no screen session or something).

Jib, I think this means you could land your patch here (once updated) with r=bent and/or re-upload to phab, get it stamped by someone so you can autoland. Would you like to do the honours or would you like me to take over the bug? :-)

Flags: needinfo?(jib)

Gijs if you can take this over that would be great! Thanks for fixing this!

Assignee: nobody → gijskruitbosch+bugs
Flags: needinfo?(jib)

I don't have a good idea for fixing the jar logging thing without either rearchitecting it (which I'm not sure is useful) or adding non-mainthread-usable accessors for the gredir (which I'm not sure are a good idea). But either way we only use jar logging on pgo builds so adding a debug assert will work already, I think, and will hopefully stop people making the same mistakes in new code, while we figure out what to do with bug 1579859.

Pushed by gijskruitbosch@gmail.com: https://hg.mozilla.org/integration/autoland/rev/45abb8f9b9b3 add debug assert that directory service usage happens only on the main thread, r=froydnj
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla71
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: