Closed Bug 669050 Opened 13 years ago Closed 8 years ago

xpcshell hangs when generating cache during install

Categories

(Core :: Security: PSM, defect)

Other
OpenBSD
defect
Not set
normal

RESOLVED WORKSFORME

People

(Reporter: gaston, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(5 files)

On my buildbot, xpcshell hangs during make install on OpenBSD when GENERATE_CACHE is called. This has been happening since 26/06.

http://buildbot.rhaalovely.net/builders/mozilla-central-amd64/builds/76 was fine with rev 68253b0ab50d7a7956783d145dbfd6a3f1113fc0; the failure started with
http://buildbot.rhaalovely.net/builders/mozilla-central-amd64/builds/77 at rev 82b9558a9eeb1011206f3a3abeaef370d1f5a061.

(gdb) bt
#0  _atomic_lock (lock=0x219fc8ad0) at /usr/src/lib/libpthread/arch/amd64/_atomic_lock.c:21
#1  0x000000020736e3a0 in _spinlock_debug (lck=0x219fc8ad0, fname=0x2074714c8 "/usr/src/lib/libpthread/uthread/uthread_mutex.c", lineno=791)
    at /usr/src/lib/libpthread/uthread/uthread_spinlock.c:90
#2  0x000000020736a3ae in mutex_unlock_common (mutex=0x20c543400, add_reference=0) at /usr/src/lib/libpthread/uthread/uthread_mutex.c:791
#3  0x000000020dd0609c in PR_Unlock (lock=0x20c543400) at /var/buildslave-mozilla/mozilla-central-amd64/build/nsprpub/pr/src/pthreads/ptsynch.c:237
#4  0x000000020f93ff48 in nsPSMBackgroundThread::requestExit (this=0x219fc8400) at Mutex.h:115
#5  0x000000020f945feb in nsNSSComponent::deleteBackgroundThreads (this=0x219fa7000)
    at /var/buildslave-mozilla/mozilla-central-amd64/build/security/manager/ssl/src/nsNSSComponent.cpp:400
#6  0x000000020f94a721 in ~nsNSSComponent (this=0x219fa7000) at /var/buildslave-mozilla/mozilla-central-amd64/build/security/manager/ssl/src/nsNSSComponent.cpp:439
#7  0x000000020f9472f5 in nsNSSComponent::Release (this=0x219fa7000)
    at /var/buildslave-mozilla/mozilla-central-amd64/build/security/manager/ssl/src/nsNSSComponent.cpp:2070
#8  0x000000020fc3f5d3 in nsCOMPtr_base::assign_with_AddRef (this=0x219a634f8, rawPtr=0x0) at nsCOMPtr.h:479
#9  0x000000020fc79cd4 in FreeFactoryEntries (aCID=Variable "aCID" is not available.
) at nsCOMPtr.h:998
#10 0x000000020fc79b07 in nsBaseHashtable<nsIDHashKey, nsFactoryEntry*, nsFactoryEntry*>::s_EnumReadStub (table=Variable "table" is not available.
) at nsBaseHashtable.h:345
#11 0x000000020fc3da02 in PL_DHashTableEnumerate (table=0x20a23b868, 
    etor=0x20fc79af2 <nsBaseHashtable<nsIDHashKey, nsFactoryEntry*, nsFactoryEntry*>::s_EnumReadStub(PLDHashTable*, PLDHashEntryHdr*, unsigned int, void*)>, 
    arg=0x7f7ffffe4f40) at /usr/obj/buildslave-m-c/xpcom/build/pldhash.c:754
#12 0x000000020fc7ac53 in nsComponentManagerImpl::FreeServices (this=Variable "this" is not available.
) at nsBaseHashtable.h:206
#13 0x000000020fc47394 in mozilla::ShutdownXPCOM (servMgr=0x0) at /var/buildslave-mozilla/mozilla-central-amd64/build/xpcom/build/nsXPComInit.cpp:654
#14 0x000000020fc475b7 in NS_ShutdownXPCOM_P (servMgr=Variable "servMgr" is not available.
) at /var/buildslave-mozilla/mozilla-central-amd64/build/xpcom/build/nsXPComInit.cpp:564
#15 0x000000020422e1eb in NS_ShutdownXPCOM (svcMgr=Variable "svcMgr" is not available.
) at /var/buildslave-mozilla/mozilla-central-amd64/build/xpcom/stub/nsXPComStub.cpp:167
#16 0x00000000004068b7 in main (argc=4, argv=0x7f7ffffe5230, envp=0x7f7ffffe5258)
    at /var/buildslave-mozilla/mozilla-central-amd64/build/js/src/xpconnect/shell/xpcshell.cpp:2023
Almost certainly a result of bug 468736.
Blocks: 468736
What is the stack of the other threads (the PSM thread in particular)?
Component: XPCOM → Security: PSM
QA Contact: xpcom → psm
Here's the bt. I'm surprised gdb only lists 3 threads...

Btw, as thread 6 comes from ipc/chromium, I should mention that my builds need patches from bug #648735 to succeed.
Oh, and I can confirm that reverting http://hg.mozilla.org/mozilla-central/rev/1d2879d39b2a fixes the issue: xpcshell doesn't go into an infinite loop and make package works again, so this is definitely related.
And as can be seen on http://buildbot.rhaalovely.net/builders/mozilla-aurora-amd64/builds/116/steps/package/logs/stdio this also seems to affect mozilla-aurora. But strangely, this doesn't seem to affect comm-central, where make package goes just fine, see http://buildbot.rhaalovely.net/builders/comm-central-amd64/builds/97/steps/package/logs/stdio.
(In reply to Landry Breuil from comment #6)
> And as can be seen on
> http://buildbot.rhaalovely.net/builders/mozilla-aurora-amd64/builds/116/
> steps/package/logs/stdio this also seems to affect mozilla-aurora. But
> strangely, this doesnt seem to affect comm-central, where make package goes
> just fine, see
> http://buildbot.rhaalovely.net/builders/comm-central-amd64/builds/97/steps/
> package/logs/stdio.

comm-central doesn't do cache generation.
Blocks: openbsdmeta
Hmm, dunno what changed wrt comm-central; probably cache generation was enabled recently there too. It started failing to package a while ago (sorry, no exact date, as I didn't look at the buildbot for a while), and I had to apply the same revert to fix it.
Without the revert, it fails:
http://buildbot.rhaalovely.net/builders/comm-central-amd64/builds/141

The error is:
uncaught exception: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIObserverService.removeObserver]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: resource:///components/steelApplication.js :: app_observe :: line 687"  data: no]

And then xpcshell hangs in an infinite loop, taking 100% CPU (traces incoming).
With the revert, packaging is ok:
http://buildbot.rhaalovely.net/builders/comm-central-amd64/builds/143
And for the record, this is right now the only patch needed to make m-c tip build and package fine on OpenBSD, so I'd be glad if someone could help find a fix for this issue :)
Brian has looked at PSM hangs in the past.
I've been away for the past few months, but the issue is still present in m-c. I need to refresh the backout patch, as it no longer applies.
Since nsSSLThread was removed in bug #674147, it's now practically impossible for me to back out the commit and get back to a working state. xpcshell still hangs after:

resource:///modules/HUDService.jsm
resource:///modules/reflect.jsm
resource:///modules/devtools/TiltGL.jsm
resource:///modules/devtools/StyleEditorUtil.jsm
resource:///modules/CSPUtils.jsm

Several threads are stuck in pthread_cond_wait / pthread_cond_timedwait. Can someone look at the attached backtraces and decipher them?
Now that nsSSLThread has been removed, we don't need nsPSMBackgroundThread except for nsCertVerificationThread. nsCertVerificationThread is almost never used, and its uses aren't performance-critical, so we should just create the thread when needed and have it die when it has finished its task. That is, nsCertVerificationThread::addJob() should just create the thread with a single job to run, and the thread should die after dispatching the result. This will eliminate all the complexity here and will solve the problem.

Luckily (or sadly), I can write the patch to do that faster than I could probably figure out why this is deadlocking.
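
The create-on-demand idea described above can be illustrated with a minimal sketch. This is Python rather than the actual PSM C++ code, and `add_job` and `on_done` are hypothetical names standing in for nsCertVerificationThread::addJob() and the result dispatch; it only shows the shape of "one thread per job, which dies when the job is done".

```python
import threading


def add_job(job, on_done):
    """Run `job` on a fresh thread and hand its result to `on_done`.

    Hypothetical stand-in for the proposed addJob(): the thread exists only
    for this one job and exits right after dispatching the result, so there
    is no long-lived thread that shutdown has to ask to exit.
    """
    def run():
        on_done(job())

    worker = threading.Thread(target=run, name="one-shot-job")
    worker.start()
    return worker


if __name__ == "__main__":
    results = []
    t = add_job(lambda: "verified", results.append)
    t.join()          # the thread ends on its own once the result is dispatched
    print(results)
```

With this shape there is no exit flag, no condition variable and no requestExit() step at shutdown, which is the complexity the comment proposes to remove.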
I looked at the 'common denominators' across all those thread backtraces, and all/most of them have:
- a thread with the following call stack, shutting down XPCOM and destroying an nsNSSComponent:
main() -> NS_ShutdownXPCOM() -> NS_ShutdownXPCOM_P() -> mozilla::ShutdownXPCOM() -> nsComponentManagerImpl::FreeServices() -> PL_DHashTableEnumerate() -> nsBaseHashtable<nsIDHashKey, nsFactoryEntry*, nsFactoryEntry*>::s_EnumReadStub() -> FreeFactoryEntries() -> nsCOMPtr_base::assign_with_AddRef() -> nsNSSComponent::Release() -> ~nsNSSComponent() -> nsNSSComponent::deleteBackgroundThreads() -> nsPSMBackgroundThread::requestExit() -> various thread lock calls
- a thread in ipc/chromium/src/base/message_loop.cc on kevent():
_thread_start() -> ThreadFunc() -> base::Thread::ThreadMain() -> MessageLoop::Run() -> MessageLoop::RunHandler() -> MessageLoop::RunInternal() -> base::MessagePumpLibevent::Run() -> event_base_loop() -> kq_dispatch() -> kevent()
- a thread in the hang monitor?
mozilla::HangMonitor::ThreadMain() -> PR_WaitCondVar() -> pthread_cond_wait()
- a thread in the js gc?
js::GCHelperThread::threadMain() -> js::GCHelperThread::threadLoop() -> PR_WaitCondVar() -> pthread_cond_wait()
- a thread in another watchdog?
XPCJSRuntime::WatchdogMain() -> PR_WaitCondVar() -> pt_timedwait() -> pthread_cond_timedwait()
- a thread in PSM?
nsPSMBackgroundThread::nsThreadRunner() -> nsCertVerificationThread::Run() -> PR_WaitCondVar() -> pthread_cond_wait()

I hope this sheds some light...
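
For readers unfamiliar with the pattern these stacks show (one thread parked in a condition-variable wait, another asking it to exit during shutdown), here is a minimal sketch of how such a handshake is usually meant to work. It is Python, not the PSM code, and every name in it (`BackgroundThread`, `request_exit`, `add_job`) is hypothetical. It does not reproduce or explain the hang; it only shows the intended sequence: set a flag under the lock, wake the waiter, then join.

```python
import threading


class BackgroundThread:
    """Hypothetical analogue of a worker parked in a condition-variable wait."""

    def __init__(self):
        self._cond = threading.Condition()   # one mutex + condvar pair
        self._exit_requested = False
        self._jobs = []
        self.processed = []
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        while True:
            with self._cond:
                # Sleep until there is work or an exit request.
                while not self._jobs and not self._exit_requested:
                    self._cond.wait()
                if not self._jobs:
                    return                   # exit requested and queue drained
                job = self._jobs.pop(0)
            # Run the job outside the lock so add_job() is never blocked by it.
            self.processed.append(job())

    def add_job(self, job):
        with self._cond:
            self._jobs.append(job)
            self._cond.notify()

    def request_exit(self):
        with self._cond:
            self._exit_requested = True
            self._cond.notify_all()          # wake the waiter so it sees the flag
        # Join only after releasing the lock; joining while holding it would
        # keep the worker from ever re-acquiring the lock to exit.
        self._thread.join()

    def is_alive(self):
        return self._thread.is_alive()


if __name__ == "__main__":
    bg = BackgroundThread()
    bg.add_job(lambda: "ok")
    bg.request_exit()
    print(bg.processed, bg.is_alive())
```

The predicate loop around `wait()` matters: the worker re-checks the flag and the queue after every wakeup, so a notification sent before it started waiting is not lost.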
Also, the resulting firefox binary runs fine from dist/bin in a quick test, so only xpcshell hangs, during make package. I can try fiddling with certs to see if I can reproduce the issue in firefox.
Landry, can you reproduce this by just running any (single) test in xpcshell? Does it only happen when you run specific tests? Are the tests you mention in comment 13 the only tests that cause this?

Also, is this OpenBSD-specific or does this happen also on any other platform?
I'm not running any tests, I'm just doing 'make package', which from my understanding calls GENERATE_CACHE in packager.mk, which in turn calls xpcshell, asking it to populate_startupcache(). And yes, it always stalls after printing 'resource:///modules/CSPUtils.jsm'. I don't know how to reproduce it with tests, but I can try if you give me the details.

I've only seen it on OpenBSD, since I'm only running OpenBSD. Note that our pthread library is in userland, if that matters...
I believe the problem may be that GENERATE_CACHE is invoking xpcshell in such a way that the profile-change-net-teardown event is never fired. In particular, I bet that GENERATE_CACHE never actually starts up the network.

To see if this is correct, please try the patch in bug 706955 to see if it resolves the bug. It moves the startup/shutdown of this thread to be based on profile startup/teardown instead of network startup/teardown.
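
The hypothesis above can be illustrated with a toy model. This is a Python sketch, not Mozilla code: `ObserverService`, `Worker` and `run_session` are hypothetical, and "profile-teardown" is a made-up topic name standing in for whatever profile teardown notification the patch in bug 706955 uses. It shows that a worker whose stop is tied to a topic that is never fired is still running when shutdown begins.

```python
import threading
from collections import defaultdict


class ObserverService:
    """Toy observer service: maps a topic name to a list of callbacks."""

    def __init__(self):
        self._observers = defaultdict(list)

    def add_observer(self, topic, callback):
        self._observers[topic].append(callback)

    def notify(self, topic):
        for callback in list(self._observers[topic]):
            callback()


class Worker:
    """Background thread that is only told to stop when `stop_topic` fires."""

    def __init__(self, observers, stop_topic):
        self._stop_event = threading.Event()
        # daemon=True so this sketch can never hang the interpreter itself.
        self._thread = threading.Thread(target=self._stop_event.wait, daemon=True)
        self._thread.start()
        observers.add_observer(stop_topic, self.stop)

    def stop(self):
        self._stop_event.set()
        self._thread.join()

    def is_alive(self):
        return self._thread.is_alive()


def run_session(stop_topic, fired_topics):
    """Return True if the worker is still running once all topics have fired."""
    observers = ObserverService()
    worker = Worker(observers, stop_topic)
    for topic in fired_topics:
        observers.notify(topic)
    still_running = worker.is_alive()
    worker.stop()  # explicit cleanup for the sketch
    return still_running


if __name__ == "__main__":
    # Session that never starts the network: only the (made-up) profile topic fires.
    fired = ["profile-teardown"]
    print(run_session("profile-change-net-teardown", fired))  # True
    print(run_session("profile-teardown", fired))             # False
```

In the first call the worker is still alive at shutdown because its stop topic never fired; in the second, tying it to a topic that does fire stops it before shutdown.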

If that doesn't work, you can try the patch I am attaching to this bug for now, which should let you get on with your testing. It simply removes the PSM background threads completely. This will cause the certificate manager UI (and related dialog boxes) to stop working and/or crash, so it isn't the final solution. But, it should allow you to get through the test suite.
Attachment #587176 - Flags: feedback?(landry)
(In reply to Brian Smith (:bsmith) from comment #21)
> Created attachment 587176 [details] [diff] [review]
> [NOT FOR CHECKIN] Do not create the PSM background threads
> 
> I believe the problem may be that GENERATE_CACHE is invoking xpcshell in
> such a way that the profile-change-net-teardown event is never fired. In
> particular, I bet that GENERATE_CACHE never actually starts up the network.
> 
> To see if this is correct, please try the patch in bug 706955 to see if it
> resolves the bug. It moves the startup/shutdown of this thread to be based
> on profile startup/teardown instead of network startup/teardown.

I can confirm that the patch in #706955 fixes my issue, yay!
Comment on attachment 587176 [details] [diff] [review]
[NOT FOR CHECKIN] Do not create the PSM background threads

Unsetting the feedback flag, since that diff wasn't needed in the end.
Attachment #587176 - Flags: feedback?(landry)
Fwiw, I rechecked and this still happens even after the recent commits to that area (#712363, jsruntime being singlethread..). https://bugzilla.mozilla.org/attachment.cgi?id=593767 fixes the issue.
I can see the error message from comment 8 on Linux x86_64.

Could Mike Hommey help us if this is related to GENERATE_CACHE?
Oh, and very interestingly, when doing make install for firefox 11.0 on powerpc (but not i386/amd64), xpcshell hangs at the same stage: resource:///modules/CSPUtils.jsm

So now it's not only related to make package (something changed in nss threads between 10 and 11?), and I'm puzzled as to why it fails on ppc and not on other archs. I'll try with the workaround in 706955.
(In reply to Landry Breuil from comment #26)
You are right. I got the error message for XPIProvider.jsm, not steelApplication.js.
I rechecked my build logs from this year and I always got the message below:

Failed to import resource:///modules/XPIProvider.jsm:[Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProperties.get]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: resource://gre/modules/FileUtils.jsm :: FileUtils_getDir :: line 60"  data: no]

At least, I can see the error message after 2012-01-04 (200a8d1fb452 in m-c).
I'm not sure we're talking about the same issue. You're seeing a warning, I'm seeing an xpcshell lockup.
Confirmed: I now need https://bug706955.bugzilla.mozilla.org/attachment.cgi?id=593767 backported to 11.0 to get xpcshell past CSPUtils.jsm, and only on powerpc. I rechecked amd64/i386, and it's not needed there. The only 'important patch' I have for ppc is the one from #691898.

Brian, any progress on #706955, which would fix this issue?
And I've tested with thunderbird 11.0 (still on ppc): the hang is not at the exact same place, but it also happens during make install (in fact, with the same message as in comment 8):

resource:///modules/reflect.jsm
resource:///modules/CSPUtils.jsm
resource:///modules/subprocess.jsm
uncaught exception: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIObserverService.removeObserver]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: resource:///components/steelApplication.js :: app_observe :: line 687"  data: no]

.. and then, xpcshell at 100%, doin' nothing..

I'm puzzled by two things:
- why only on ppc?
- what changed between 10 and 11 wrt the 'install' target? Previously (starting from 7, when 468736 landed), I was only seeing the issue during gmake -C objdir package and there was no issue for install; now with 11 I'm seeing it during 'make install'.
Fwiw, ffx 17.0b2 packages fine on ppc & amd64 without the patchset from 706955. I will test tb 17.0beta on ppc to confirm the new status.
Is this still an issue?
Flags: needinfo?(landry)
I think this one is gone, and since it was 3+ years ago now....
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(landry)
Resolution: --- → WORKSFORME