Closed Bug 353962 Opened 18 years ago Closed 16 years ago

Firefox 2.0 often hangs in Intel Mac OS X 10.4.7

Categories

(Core :: JavaScript Engine, defect)

PowerPC
macOS
defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla1.9alpha1

People

(Reporter: fishywang, Assigned: brendan)

References

Details

(5 keywords, Whiteboard: [Fx 2.0.0.1])

Attachments

(18 files, 1 obsolete file)

200.96 KB, application/octet-stream
Details
169.55 KB, application/octet-stream
Details
10.36 KB, text/plain
Details
745 bytes, text/plain
Details
5.63 KB, text/plain
Details
2.63 KB, text/plain
Details
7.11 KB, text/plain
Details
8.36 KB, text/plain
Details
7.99 KB, text/plain
Details
9.69 KB, text/plain
Details
1.19 KB, patch
Details | Diff | Splinter Review
8.91 KB, text/plain
Details
6.58 KB, text/plain
Details
39.35 KB, text/plain
Details
1.61 KB, patch
brendan
: review+
brendan
: approval1.9+
Details | Diff | Splinter Review
2.20 KB, patch
brendan
: review+
dveditz
: superreview+
beltzner
: approval1.9b3+
brendan
: approval1.9+
Details | Diff | Splinter Review
1.40 KB, text/plain
Details
1.13 KB, patch
brendan
: review+
Details | Diff | Splinter Review
User-Agent:       Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
Build Identifier: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6

Firefox 2.0 (beta 2 and RC1 candidate) often hangs in Intel Mac OS X 10.4.7.
The hang may happen after opening Firefox, open a external url, etc.

Reproducible: Sometimes

Steps to Reproduce:
1.Open Firefox.app, or set Firefox as the default browser, and open a url
2.
3.

Actual Results:  
Firefox hangs, the mouse icon keeps on the rolling ball, and use cmd+alt+esc will show Firefox "not response" after a few seconds

Expected Results:  
Firefox starts and show the homepage or the specified url correctly

Firefox 1.5 works well
I created a new profile and it seems don't happen again. But after I copied the "signons.txt" file from my old profile to the new profile, it hangs again
Version: unspecified → 2.0 Branch
Thanks for your report. Since we can't ask you to give us your passwords, could you maybe try to minimize the file to see if it's a particular entry causing the trouble? Also if possible, please attach a profile from Shark.app or Sampler.app to see if we can figure out where does Firefox hang.

Also, same signons.txt (with key3.db) works fine in Firefox 1.5, correct?
Keywords: hang
maybe signons.txt isn't the problem. I've just copied signons.txt but without the key3.db, sorry for that.
But the new profile I created, after some normal use (just visit website, etc.) and empty signons.txt (but in the normal use I've stored some passwords) also hangs sometimes, but not as often as my old profile.
Firefox 1.5 never hangs on my machine with the same old profile
Well if you could figure out what causes the hang and/or figure out what hangs (using the abovementioned tools), it would help.
Attached file A Sampler.app trace
I just open Firefox.app to display my homepage(about:mozilla) and firefox hangs.
I just sent this to the ff mozilla support group, fyi
---
ff 2.0 hangs a lot for me.  About 4 times a day.  I'm on macbook w/ 2 gig mem.  I keep a lot of tabs open (~20) accross crashes/hangs/reboots; else, I have noticed no pattern on this.

Mostly, when it hangs, I "force quit" ff, and it recovers nicely.

Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1) Gecko/20061003 Firefox/2.0
my problem is pretty much identical to comment #6; only differences are macbook pro; 10.4.8 and 1.5 gb of memory; never had the problem on 1.5; i trashed everything and started over with no luck. i'm about to try that again, but i'm not confident it will do anything 
here is a shark "timing profile" taken after a ff hang.  ff rc3.  Macbook 2mb.  Should I set some other options?  Seems I can get it to hang within a day...
1) In order for Shark output to be interesting (to non-mac devs), you need to export it as text. For it to be truly useful, a developer need to first mine the data because it usually contains lots of redundant info.

2) Sampler output also needs to be exported as text, if you wnat non-mac devs to look at it. Either Sampler.app or Shark.app can do this conversion, I think.

I will try to write more about on the developer wiki about this in the near future.
This is the text version for attachment #241686 [details]
afaik the code in question comes from here: http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&root=/cvsroot&subdir=mozilla/js/src/xpconnect/src&command=DIFF_FRAMESET&file=xpccallcontext.cpp&rev2=1.1.2.5&rev1=1.1.2.4

That follows is a /slightly/ shortened view of the problem, i've used {x} to indicate the number of spaces i've removed.
Call graph:
{3} 285 Thread_5907
{11} 285 XRE_main
{15} 285 nsAppShell::Run()
{35} 285 nsTimerImpl::Fire()
{37} 285 nsGlobalWindow::TimerCallback(nsITimer*, void*)
{39} 285 nsGlobalWindow::RunTimeout(nsTimeout*)
{41} 285 nsJSContext::CallEventHandler(JSObject*, JSObject*, unsigned, long*, long*)
{69} 285 nsJSCID::GetService(nsISupports**)
{71} 285 nsComponentManagerImpl::GetService(nsID const&, nsID const&, void**)
{73} 285 nsComponentManagerImpl::CreateInstance(nsID const&, nsISupports*, nsID const&, void**)
{81} 285 nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS*, unsigned short, nsXPTMethodInfo const*, nsXPTCMiniVariant*)
{87} 285 js_InvokeConstructor
{97} 285 XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode)
{105} 285 nsXPCWrappedJSClass::GetRootJSObject(XPCCallContext&, JSObject*)
{107} 285 nsXPCWrappedJSClass::CallQueryInterfaceOnJSObject(XPCCallContext&, JSObject*, nsID const&)
{109} 285 AutoScriptEvaluate::~AutoScriptEvaluate [in-charge]()
{111} 285 nsJSContext::ScriptExecuted()
{113} 285 nsJSContext::ScriptEvaluated(int)
{115} 285 JS_GC
{117} 285 js_GC
{119} 285 PR_WaitCondVar

{3} 285 Thread_6003
{9} 285 nsHostResolver::ThreadFunc(void*)
{11} 285 nsHostResolver::OnLookupComplete(nsHostRecord*, unsigned, PRAddrInfo*)
{13} 285 nsDNSAsyncRequest::OnLookupComplete(nsHostResolver*, nsHostRecord*, unsigned)
{15} 285 nsCOMPtr_base::assign_with_AddRef(nsISupports*)
{17} 285 nsProxyEventObject::Release()
{21} 285 nsProxyEventObject::Release()
{27} 285 nsQueryInterface::operator()(nsID const&, void**) const
{29} 285 nsXPCWrappedJS::QueryInterface(nsID const&, void**)
{31} 285 nsXPCWrappedJSClass::DelegatedQueryInterface(nsXPCWrappedJS*, nsID const&, void**)
{33} 285 XPCCallContext::XPCCallContext[in-charge](XPCContext::LangType, JSContext*, JSObject*, JSObject*, long, unsigned, long*, long*)
{35} 285 XPCJSContextStack::GetSafeJSContext(JSContext**)
{37} 285 nsXPConnect::InitClasses(JSContext*, JSObject*)
{39} 285 XPCNativeWrapper::AttachNewConstructorObject(XPCCallContext&, JSObject*)
{41} 285 JS_InitClass
{49} 285 js_AtomizeString
{51} 285 js_Lock
{53} 285 PR_WaitCondVar

unless i'm missing something we're basically *not* grabbing a lock on a context just because we know no one else has it, which seems wrong/broken.
Assignee: nobody → dbradley
Component: General → XPConnect
Product: Firefox → Core
QA Contact: general → xpconnect
Version: 2.0 Branch → Trunk
I should note that XPCJSContextStack::GetSafeJSContext starts a request before calling InitClasses.  So is the bug really in XPConnect or in JS?

In any case, enough people are seeing this, and we have confirmed deadlock traces, so confirming bug and maybe we should block 2.0 on this...
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: blocking1.8.1?
Given that this didn't happen with 1.5, according to comments, adding "regression" keyword.
Keywords: regression
are you configured to use proxy auto config? the pref name is network.proxy.autoconfig_url in case you need to check prefs.js manually.

unfortunately other possibilities include sessionstore and safebrowsing, if someone could explain how to use user.js to disable these, that'd be appreciated.

also, i presume you've tried running in safe mode?
Assignee: dbradley → general
Component: XPConnect → JavaScript Engine
QA Contact: xpconnect → general
(In reply to comment #14)
> unfortunately other possibilities include sessionstore and safebrowsing, if
> someone could explain how to use user.js to disable these, that'd be
> appreciated.

Add:
user_pref("browser.safebrowsing.enabled", false);
In prefs.js to disable safebrowsing.

If it's hanging on startup, try adding that and seeing if it helps.
Timeless, you wrote: "unless i'm missing something we're basically *not* grabbing a lock on a context just because we know no one else has it, which seems wrong/broken."  Presumably you are talking about jband's code to call JS_BeginRequest only if JS_GetContextThread returns non-zero.  That's old code dating from when requests were not used on the main thread, and it was thought that main-thread contexts had zero thread ids.  They don't.  The test of JS_GetContextThread should be removed, but this code isn't implicated in the stacks you show.

There must be another stack for there to be a deadlock.  The two stacks you show do not complete a cycle in the lock dependency graph.

Does anyone have full stack backtraces for a hang associated with this bug?  Does the shark data attached show thread stacks?

/be

Assignee: general → dbradley
Component: JavaScript Engine → XPConnect
QA Contact: general → xpconnect
I've did it and it hangs again

(In reply to comment #15)
> (In reply to comment #14)
> > unfortunately other possibilities include sessionstore and safebrowsing, if
> > someone could explain how to use user.js to disable these, that'd be
> > appreciated.
> 
> Add:
> user_pref("browser.safebrowsing.enabled", false);
> In prefs.js to disable safebrowsing.
> 
> If it's hanging on startup, try adding that and seeing if it helps.
> 

Session restore can be disabled by setting browser.sessionstore.enabled = false.

Is there anything in particular I should look for in the session restore code that might cause this?
I choose to reopen a closed tab and firefox hangs again, this is the Shark report.
I will disable sessionsaver and try if it hangs again
This is a report generated by Sampler.app
This looks more like a stuck lock (missing unlock) than a deadlock.  Fun.  If someone can catch it in gdb and find me on irc.mozilla.org, I can help debug.  Otherwise we'll have to go looking for a needle in a haystack.

/be
Moving back to JS engine based on latest data.  If this is a stuck JS lock then we could try to narrow down where the missing unlock might be based on novel code usage.  Does anti-phishing use new JS1.7 features?

/be
Assignee: dbradley → general
Component: XPConnect → JavaScript Engine
QA Contact: xpconnect → general
Just to get some additional data points, I checked the hendrix feedback to see if there is more widespread reports of hang problems on the Mac. Looks as if most all of the hang reports in hendrix are from Windows, and there were not that many reports that were specific to either RC2 or RC3.
Flags: blocking1.8.1? → blocking1.8.1.1?
(In reply to comment #22)
> Moving back to JS engine based on latest data.  If this is a stuck JS lock then
> we could try to narrow down where the missing unlock might be based on novel
> code usage.  Does anti-phishing use new JS1.7 features?

Nope, based on code that works in FF 1.5.
It hangs again and i use gdb to attach this process, execute the "fin" command.
It hangs on ReceiveNextEventCommon, but after a while, it exited from this function and didn't hangs anymore.
I don't know if it's not a real hang, or gdb changed the time-order so resolved the deadlock. Next time it hangs, i'll wait for some minutes to ensure it's a real hang, and then use gdb to attach.
Hope this helps.
Another: Firefox hangs on start-up, here's the "bt" and "fin" command in gdb:
---
(gdb) bt
#0  0x90024427 in semaphore_wait_signal_trap ()
#1  0x90028414 in pthread_cond_wait ()
#2  0x00fbe931 in PR_WaitCondVar ()
#3  0x00fbeb91 in PR_Wait ()
#4  0x002ff17a in nsUrlClassifierDBService::EnsureThreadStarted ()
#5  0x002ff7f9 in nsUrlClassifierDBService::CancelStream ()
#6  0x00301b5f in TableUpdateListener::OnStopRequest ()
#7  0x005c96a6 in nsHttpChannel::OnStopRequest ()
#8  0x0034c7f8 in nsInputStreamPump::OnStateStop ()
#9  0x0034cd7d in nsInputStreamPump::OnInputStreamReady ()
#10 0x01872b55 in nsInputStreamReadyEvent::EventHandler ()
#11 0x01847585 in PL_HandleEvent ()
#12 0x0184783e in PL_ProcessPendingEvents ()
#13 0x0184895c in nsEventQueueImpl::ProcessPendingEvents ()
#14 0x0180e439 in NS_ShutdownXPCOM_P ()
#15 0x00003468 in ScopedXPCOMStartup::~ScopedXPCOMStartup ()
#16 0x00006171 in XRE_main ()
#17 0x00003258 in main ()
(gdb) fin
Run till exit from #0  0x90024427 in semaphore_wait_signal_trap ()
0x90028414 in pthread_cond_wait ()
(gdb) 
Run till exit from #0  0x90028414 in pthread_cond_wait ()
0x00fbe931 in PR_WaitCondVar ()
(gdb) 
Run till exit from #0  0x00fbe931 in PR_WaitCondVar ()
0x00fbeb91 in PR_Wait ()
(gdb) 
Run till exit from #0  0x00fbeb91 in PR_Wait ()
0x002ff17a in nsUrlClassifierDBService::EnsureThreadStarted ()
(gdb) 
Run till exit from #0  0x002ff17a in nsUrlClassifierDBService::EnsureThreadStarted ()
---
and this is the sampler.app report:
---
# Report 0 - Timed Samples for firefox-bin (process 7610)
Stacks at 2006-10-22 00:54:24 +0800
Samples (displayed/total): 560/560
Call graph:
    280 Thread_444f
      280 _pthread_body
        280 PR_Select
          280 nsThread::Main(void*)
            280 TimerThread::Run()
              280 PR_WaitCondVar
                280 semaphore_wait_signal_trap
                  280 semaphore_wait_signal_trap
    280 Thread_5a07
      280 start
        280 start
          280 main
            280 XRE_main
              280 ScopedXPCOMStartup::~ScopedXPCOMStartup [in-charge]()
                280 NS_ShutdownXPCOM_P
                  280 nsEventQueueImpl::ProcessPendingEvents()
                    280 PL_ProcessPendingEvents
                      280 PL_HandleEvent
                        280 nsInputStreamReadyEvent::EventHandler(PLEvent*)
                          280 nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream*)
                            280 nsInputStreamPump::OnStateStop()
                              280 nsHttpChannel::OnStopRequest(nsIRequest*, nsISupports*, unsigned)
                                280 TableUpdateListener::OnStopRequest(nsIRequest*, nsISupports*, unsigned)
                                  280 nsUrlClassifierDBService::CancelStream()
                                    280 nsUrlClassifierDBService::EnsureThreadStarted()
                                      280 PR_Wait
                                        280 PR_WaitCondVar
                                          280 semaphore_wait_signal_trap
                                            280 semaphore_wait_signal_trap

Total number in stack (recursive counted multiple, when >=5):

Sort by top of stack, same collapsed (when >= 5):
        semaphore_wait_signal_trap        560
---
hope this helps
It hangs many times today, but nearly every time it hangs, I use gdb to attach it, and execute "fin", it will hang at the function ReceiveNextEventCommon and then get out of it and the hang just disappears.
There're 2 exceptions: 1 is comment #26 that hangs on start-up(the ReceiveNextEventCommon hangs are during surfing but not starting up), another is also during surfing:
(gdb) bt
#0  0x90024427 in semaphore_wait_signal_trap ()
#1  0x90028414 in pthread_cond_wait ()
#2  0x00fbe931 in PR_WaitCondVar ()
#3  0x00f0fdd6 in js_GC ()
#4  0x00ee42c6 in JS_GC ()
#5  0x0051e060 in nsJSContext::Notify ()
#6  0x0184a6ae in nsTimerImpl::Fire ()
#7  0x0184ae8f in handleTimerEvent ()
#8  0x01847585 in PL_HandleEvent ()
#9  0x0184783e in PL_ProcessPendingEvents ()
#10 0x90828379 in CFRunLoopRunSpecific ()
#11 0x90827eb5 in CFRunLoopRunInMode ()
#12 0x92dcdb90 in RunCurrentEventLoopInMode ()
#13 0x92dcd297 in ReceiveNextEventCommon ()
#14 0x92e15929 in _AcquireNextEvent ()
#15 0x92e15774 in RunApplicationEventLoop ()
#16 0x0023feb3 in nsAppShell::Run ()
#17 0x002d7f8a in nsAppStartup::Run ()
#18 0x000064f9 in XRE_main ()
#19 0x00003258 in main ()
(gdb) fin
Run till exit from #0  0x90024427 in semaphore_wait_signal_trap ()
0x90028414 in pthread_cond_wait ()
(gdb) 
Run till exit from #0  0x90028414 in pthread_cond_wait ()
0x00fbe931 in PR_WaitCondVar ()
(gdb) 
Run till exit from #0  0x00fbe931 in PR_WaitCondVar ()
0x00f0fdd6 in js_GC ()
(gdb) 
Run till exit from #0  0x00f0fdd6 in js_GC ()
(In reply to comment #26)

Hmm, this looks like anti-phishing code.  Are you able to get this trace to trigger with anti-phishing off?  It shouldn't invoke TableUpdateListener unless it's on.
Whew, no stuck JS lock (I searched the source and couldn't find any bad paths).  Moving to anti-phishing, cc'ing darin.

/be
Assignee: general → nobody
Component: JavaScript Engine → Phishing Protection
Product: Core → Firefox
QA Contact: general → phishing.protection
(In reply to comment #28)
> (In reply to comment #26)
> 
> Hmm, this looks like anti-phishing code.  Are you able to get this trace to
> trigger with anti-phishing off?  It shouldn't invoke TableUpdateListener unless
> it's on.

Actually, it doesn't really matter, since it seems like having it on can cause a hang during startup.  Investigating an idea that might be causing this . . .
I've disabled anti-phishing(in preferences->Security->Tell me blah blah...) but it also hangs.
(In reply to comment #31)
> I've disabled anti-phishing(in preferences->Security->Tell me blah blah...) but
> it also hangs.

So I think the startup hang (comment #26) is due to anti-phishing (bug 355399).  The hangs while running seem unrelated, especially if you're not seeing it with anti-phishing off.
*** Bug 356721 has been marked as a duplicate of this bug. ***
On Macbook Pro (Intel 2.16 Core Duo) running OS X 1.4.8, Firefox 2.0 frequently hangs when opening a new page (often from a hyperlink.)  There is no way to unfreeze the program other than to quit or force quit.  Sometimes, resume session re-freezes the program but not always.
#0  0x9002449f in semaphore_wait_trap ()
#1  0x9000110f in pthread_mutex_lock ()
#2  0x00fbe46c in PR_Lock ()
#3  0x00fbeaf8 in PR_EnterMonitor ()
#4  0x0184e2bf in nsProxyEventObject::GetNewOrUsedProxy ()
#5  0x0184f0e4 in nsProxyObjectManager::GetProxyForObject ()
#6  0x0184ee31 in NS_GetProxyForObject ()
#7  0x002ffd56 in nsUrlClassifierDBService::Exists ()
#8  0x0185d98d in XPTC_InvokeByIndex ()
#9  0x0038ccfb in XPCWrappedNative::CallMethod ()
#10 0x0037ef93 in XPC_WN_CallMethod ()
#11 0x00f22443 in js_Invoke ()
#12 0x00f13316 in js_Interpret ()
#13 0x00f2299b in js_Invoke ()
#14 0x005e6a84 in nsXPCWrappedJSClass::CallMethod ()
#15 0x005e3dcf in nsXPCWrappedJS::CallMethod ()
#16 0x0185da46 in XPTC_InvokeByIndex ()
#17 0x0185dc60 in nsXPTCStubBase::Stub7 ()
#18 0x0185d98d in XPTC_InvokeByIndex ()
#19 0x0038ccfb in XPCWrappedNative::CallMethod ()
#20 0x0037ef93 in XPC_WN_CallMethod ()
#21 0x00f22443 in js_Invoke ()
#22 0x00f13316 in js_Interpret ()
#23 0x00f2299b in js_Invoke ()
#24 0x005e6a84 in nsXPCWrappedJSClass::CallMethod ()
#25 0x005e3dcf in nsXPCWrappedJS::CallMethod ()
#26 0x0185da46 in XPTC_InvokeByIndex ()
#27 0x0185dc92 in nsXPTCStubBase::Stub8 ()
#28 0x0185d98d in XPTC_InvokeByIndex ()
#29 0x0038ccfb in XPCWrappedNative::CallMethod ()
#30 0x0037ef93 in XPC_WN_CallMethod ()
#31 0x00f22443 in js_Invoke ()
#32 0x00f13316 in js_Interpret ()
#33 0x00f2299b in js_Invoke ()
#34 0x005e6a84 in nsXPCWrappedJSClass::CallMethod ()
#35 0x005e3dcf in nsXPCWrappedJS::CallMethod ()
#36 0x0185da46 in XPTC_InvokeByIndex ()
#37 0x0185db98 in nsXPTCStubBase::Stub3 ()
#38 0x002d5b8c in nsDocNavStartProgressListener::Observe ()
#39 0x0184a6ee in nsTimerImpl::Fire ()
#40 0x0184ae8f in handleTimerEvent ()
#41 0x01847585 in PL_HandleEvent ()
#42 0x0184783e in PL_ProcessPendingEvents ()
#43 0x90828379 in CFRunLoopRunSpecific ()
#44 0x90827eb5 in CFRunLoopRunInMode ()
#45 0x92dcdb90 in RunCurrentEventLoopInMode ()
#46 0x92dcd297 in ReceiveNextEventCommon ()
#47 0x92e15929 in _AcquireNextEvent ()
#48 0x92e15774 in RunApplicationEventLoop ()
#49 0x0023feb3 in nsAppShell::Run ()
#50 0x002d7f8a in nsAppStartup::Run ()
#51 0x000064f9 in XRE_main ()
#52 0x00003258 in main ()

So maybe something wrong with how we're using proxy events?
See also bug 343033 (Firefox trunk builds always hang on quit), where we also see the phishing service in the stack. 

Might it be the same dead-lock causing that bug?
iirc proxy events aren't safe before the thread manager rewrite, if 2.0 doesn't have that rewrite, then they aren't safe :).

can you walk the stack and figure out where the monitor/lock is that's already holding it?
thread 9 + thread 1 seems suspicious
(In reply to comment #38)
> Created an attachment (id=244130) [edit]
> backtrace of all threads
> 
> thread 9 + thread 1 seems suspicious

Thread 2 is stuck on the proxy event manager monitor too, but it looks like it's not holding anything.

Thread 9 has the proxy event manager monitor and seems to want the JS atom table lock.  Is it stuck waiting indefinitely, or if you finish in gdb after switching to thread 9, does the thread make forward progress and end up meta-stable (i.e., is this thread live-locking somehow)?

I can't see how the atom table lock is held by any thread shown.  What does 'info threads' disclose?  Any other threads not consecutively numbered, with a hole for threads that exited in the thread id space?

If you have this in gdb, find me on irc.

/be
Attached file another backtrace (obsolete) —
Another gdb backtrace.  I have this held in gdb, so find me if you want more information (info threads only shows the 11 shown here and finish doesn't complete on thread 11).
Attachment #244251 - Attachment is obsolete: true
FWIW, I am able to consistently repro the hang when I turn on auto-proxy config (pointing to a local .pac file).  When I turn off the auto-proxy config, the hang goes away.  [MacBook Pro, 10.4.8, only on FF 2.0)
In my case turning off auto-proxy and changing to manual doesn't help. FireFox still hangs...
I'm on a Mac Mini running 10.4.8

(In reply to comment #42)
> FWIW, I am able to consistently repro the hang when I turn on auto-proxy config
> (pointing to a local .pac file).  When I turn off the auto-proxy config, the
> hang goes away.  [MacBook Pro, 10.4.8, only on FF 2.0)
> 

Attached file A true deadlock in gdb
This is no doubt a deadlock, the attachment is the threads and backtraces
I did a total fresh install (deleting profile, etc) and I seem to be rid of the constant hangs that I had (on macbook pro running 10.4.8). I did notice that after i reinstalled the Google toolbar, I had the same problem again. So, I uninstalled that. Is anyone using the Google toolbar?
(In reply to comment #45)
> I did a total fresh install (deleting profile, etc) and I seem to be rid of the
> constant hangs that I had (on macbook pro running 10.4.8). I did notice that
> after i reinstalled the Google toolbar, I had the same problem again. So, I
> uninstalled that. Is anyone using the Google toolbar?
> 

Yes I was using Google toolbar. Now I uninstalled it and seems OK now.
But in FX 1.5 with the same version of Google toolbar, FX never hangs.
google toolbar worked fine for me in FX 1.5 as well.
I've disabled Google Toolbar for more than 24-hours and Fx never hangs after that so far.
Ran ff 11/3 -> today (11/6) on Mac (Intel) without Google toolbar.  No hangs so far.
I fear that this is nothing more than a "me too" posting but I have have yet to have a hang since I disabled the Google toolbar.  I did, however, send a message to Google support for the toolbar bringing this issue to their attention.  So far, all that I have heard back is an auto-reply email.  I hope that someone with more knowledge and/or better contacts can pursue this with them.
Just to be clear, it sounds like everyone is saying that this only happens with google toolbar, which may mean it's a bug in toolbar, not firefox.  Does this hang happen to anyone not using google toolbar?
(In reply to comment #51)
> Just to be clear, it sounds like everyone is saying that this only happens with
> google toolbar, which may mean it's a bug in toolbar, not firefox.  Does this
> hang happen to anyone not using google toolbar?
> 

I thought I would try FF2 again after moving back to FF1.7.  FF2 would hang on startup nearly everytime I launched it.  I am and was NOT using Google Toolbar.  I installed using only approved Mozilla extensions and NO Google Toolbar.  I turned off Phishing, reran and FF2 immediately hung with the spinning beach ball.  I force quit and reran.  FF2 started up fine and had done so for the last day now.  I do get that pesky spinning beach ball on startup but it doesn't seem to hang anymore.  FF2 seems quite a bit slower than FF1.7 too.  I am using an Intel iMac, 2ghz, 512 mb ram, OS X 10.4.8.
Not enough progress to realistically block 1.8.1.1 unfortunately.
Flags: wanted1.8.1.x+
Flags: blocking1.8.1.1?
Flags: blocking1.8.1.1-
(In reply to comment #53)
> Not enough progress to realistically block 1.8.1.1 unfortunately.

It seems that the problem is caused specifically by the pagerank feature in toolbar.  One possible work around is on bug 337492, but I agree that this shouldn't block 1.8.1.1.  We're also trying to work around the hang on the toolbar end.
I've enabled google toolbar, and disabled PageRank, and fx hangs once again.
This time it hangs on js_GC
(In reply to comment #55)
> I've enabled google toolbar, and disabled PageRank, and fx hangs once again.
> This time it hangs on js_GC

Please attach stack backtraces for all threads.

/be
This is the backtraces for all threads while a hang occurred in js_GC function
I am able to reproduce this on Windows XP SP2 on a core duo with a debug build of firefox 2.0, but ONLY when the google toolbar is installed.  I had a dump of this on Windows but unfortunately lost it when I swapped source and destination designations with a tar -czf command.  Sorry.
I'd like to get to the bottom of this.  If it's *only* happening in Mactels, it could be a bug in how compare-and-swap is implemented by jslock.c.  Anyone who can reproduce under gdb, please ping me on IRC.  Tony did last time, we spent a while, but there's more we can try.  It may be worth adding a double-check patch before hand, if you can do your own build.  I'll try to come up with one.

/be
(In reply to comment #58)
> I am able to reproduce this on Windows XP SP2 on a core duo with a debug build
> of firefox 2.0, but ONLY when the google toolbar is installed.  I had a dump of
> this on Windows but unfortunately lost it when I swapped source and destination
> designations with a tar -czf command.  Sorry.

Sorry, I misread this comment.  So if it happens on Windows too, there is probably a cross-platform bug to-do with thin locks (compare-and-swap-optimized locks).  I think Mike Moening saw this in his embedding, cured it via JS_USE_ONLY_NSPR_LOCKS. Cc'ing him.

/be

Building JS with JS_USE_ONLY_NSPR_LOCKS (or adding JS_NO_THIN_LOCKS = 1 to the js/src makefile) did it!  Hopefully a patched 2.0 will be available soon...?  In the meantime, I'll just use my freshly-built version.  Thanks.
> Building JS with JS_USE_ONLY_NSPR_LOCKS (or adding JS_NO_THIN_LOCKS = 1 to the
> js/src makefile) did it!

Thanks for rebuilding and testing, michael!  Did you rebuild on Mac or Windows XP?

> Not enough progress to realistically block 1.8.1.1 unfortunately.

Re-asking for blocking 1.8.1.1, since brendan/michael have made some progress.
Flags: blocking1.8.1.1- → blocking1.8.1.1?
Whiteboard: [Fx 2.0.0.1][potential fix in hand]
Built on and run on XP, I had forgotten to add all the plugins in the first run and now that I have I'm now getting a crash in nsViewManager... seemingly when flash content is init'd.
(In reply to comment #63)
> Built on and run on XP, I had forgotten to add all the plugins in the first run
> and now that I have I'm now getting a crash in nsViewManager... seemingly when
> flash content is init'd.
> 

Rebuilt view and the layout manager, relinked browser and voila, no more crash.  This looks to be a good fix.
Not blocking for 1.8.1.1, but if we get some more traction on this (looks like Brendan and Tony are already investigating), we should try to get a patch together for 1.8.1.2.
Flags: blocking1.8.1.1? → blocking1.8.1.1-
Attached patch hackaroundSplinter Review
Ride-along hack-around.

/be
Assignee: nobody → brendan
Status: NEW → ASSIGNED
Attachment #246741 - Flags: review?(mrbkap)
Attachment #246741 - Flags: approval1.8.1.1?
Attachment #246741 - Flags: review?(mrbkap) → review+
Comment on attachment 246741 [details] [diff] [review]
hackaround

approved for 1.8 branch, a=dveditz
Attachment #246741 - Flags: approval1.8.1.1? → approval1.8.1.1+
Hackaround checked into trunk and 1.8 branch:

Checking in jslock.h;
/cvsroot/mozilla/js/src/jslock.h,v  <--  jslock.h
new revision: 3.32; previous revision: 3.31
done

Checking in jslock.h;
/cvsroot/mozilla/js/src/jslock.h,v  <--  jslock.h
new revision: 3.27.22.2; previous revision: 3.27.22.1
done

May as well hit the 1.8.0 branch too.

/be
Comment on attachment 246741 [details] [diff] [review]
hackaround

This is probably needed on the 1.8.0 branch for Firefox 1.5.0.x.  If anyone knows otherwise, please say so quickly.

/be
Attachment #246741 - Flags: approval1.8.0.9?
Comment on attachment 246741 [details] [diff] [review]
hackaround

approved for 1.8.0 branch, a=dveditz for drivers
Attachment #246741 - Flags: approval1.8.0.9? → approval1.8.0.9+
(In reply to comment #64)
> (In reply to comment #63)
> > Built on and run on XP, I had forgotten to add all the plugins in the first run
> > and now that I have I'm now getting a crash in nsViewManager... seemingly when
> > flash content is init'd.
> > 
> 
> Rebuilt view and the layout manager, relinked browser and voila, no more crash.
>  This looks to be a good fix.

Michael, thanks for confirming -- would you please recite your PC's hardware details, particularly CPU type(s)?  I'd like to fix the compare-and-swap code, if it's sound but just has a bug (possibly in the asm).  If it's unsound, first time we've heard of it in a long time, and many, *many* 411/1-800-555-1212 VXML-based AVR calls (tellme.com uses this code, I believe).

/be
Keywords: fixed1.8.1.1
(In reply to comment #71)
> (In reply to comment #64)
> > (In reply to comment #63)
> > > Built on and run on XP, I had forgotten to add all the plugins in the first run
> > > and now that I have I'm now getting a crash in nsViewManager... seemingly when
> > > flash content is init'd.
> > > 
> > 
> > Rebuilt view and the layout manager, relinked browser and voila, no more crash.
> >  This looks to be a good fix.
> 
> Michael, thanks for confirming -- would you please recite your PC's hardware
> details, particularly CPU type(s)?  I'd like to fix the compare-and-swap code,
> if it's sound but just has a bug (possibly in the asm).  If it's unsound, first
> time we've heard of it in a long time, and many, *many* 411/1-800-555-1212
> VXML-based AVR calls (tellme.com uses this code, I believe).
> 
> /be
> 

Built an 'official' release build and things have been going great for the last week, here's the hardware/os:
OS Name	Microsoft Windows XP Professional
Version	5.1.2600 Service Pack 2 Build 2600
OS Manufacturer	Microsoft Corporation
System Name	***********
System Manufacturer	LENOVO
System Model	2613EAU
System Type	X86-based PC
(T2400 Intel CoreDuo)
Processor	x86 Family 6 Model 14 Stepping 8 GenuineIntel ~1828 Mhz
Processor	x86 Family 6 Model 14 Stepping 8 GenuineIntel ~1828 Mhz
BIOS Version/Date	LENOVO 79ETC1WW (2.01 ), 9/29/2006
SMBIOS Version	2.4
Windows Directory	C:\WINDOWS
System Directory	C:\WINDOWS\system32
Boot Device	\Device\HarddiskVolume1
Locale	United States
Hardware Abstraction Layer	Version = "5.1.2600.2180 (xpsp_sp2_rtm.040803-2158)"
User Name	*****************
Time Zone	Central Standard Time
Total Physical Memory	1,536.00 MB
Available Physical Memory	754.69 MB
Total Virtual Memory	2.00 GB
Available Virtual Memory	1.96 GB
Page File Space	3.35 GB
Page File	C:\pagefile.sys
This seems to be an Intel-specific problem. I upgraded from a G4 Powerbook to an Intel Core 2 Duo MacBook Pro yesterday. I used Apple's migration utility to copy everything over, including FF, plugins, and prefs, so the FF setup was identical. FF2 on the G4 never crashed; on the Intel it's frozen 5x today, always when opening a new window.

Hope this helps.
Assignee: brendan → general
Status: ASSIGNED → NEW
Component: Phishing Protection → JavaScript Engine
Flags: review+
Product: Firefox → Core
QA Contact: phishing.protection → general
Target Milestone: --- → mozilla1.9alpha
Assignee: general → brendan
Status: NEW → ASSIGNED
Whiteboard: [Fx 2.0.0.1][potential fix in hand] → [Fx 2.0.0.1]
Yeah, I think it's expected.  The thin locks are better optimized.  Hopefully another patch will come along to restore them to working properly and the preprocessor flag will go away again.
Comment on attachment 246741 [details] [diff] [review]
hackaround

Firefox 1.5.0.x reportedly works well so this isn't critical to get into 1.8.0.9. Time's up, push this request into the next release
Attachment #246741 - Flags: approval1.8.0.9+ → approval1.8.0.10?
The sooner the better. Why wait?  I'm pretty sure I've see the same problems in the 1.5 version of the engine with heavily threaded apps on multi-CPU boxes.
*** Bug 362877 has been marked as a duplicate of this bug. ***
Not critical, but we'd like to win back the lost performance by fixing how compare-and-swap is used.

/be
Severity: critical → major
Flags: blocking1.8.0.10?
Flags: blocking1.8.0.10? → blocking1.8.0.10+
*** Bug 360167 has been marked as a duplicate of this bug. ***
This bug still exist with Firefox 2.0.0.1 on Intel Mac OS X 10.4.8.
(In reply to comment #83)
> This bug still exist with Firefox 2.0.0.1 on Intel Mac OS X 10.4.8.

Evidence?  Need thread stack backtraces to prove it's not a different hang.

/be
Igor:  Can you provide more details?  We would like to know if this is being seen by others... so any info to help us reproduce will be nice.  Thanks!
I reproduced bug on library Intel Mac OS 10.4.8n system in Europe, but not able to run gdb thread stack trace. I was running Firefox 2.0.0.1 with Firetools Google toolbar 3.0 beta.
I will try to reproduce it again when I will be back and attached stack thread backtraces.
Comment on attachment 246741 [details] [diff] [review]
hackaround

Approved for 1.8.0 branch, a=jay for drivers.


Let's get this fix on the 1.8.0 branch since it appears to have fixed the major issue originally reported.

Igor is seeing a hang still, but we should probably log a new bug if evidence is found.
Attachment #246741 - Flags: approval1.8.0.10? → approval1.8.0.10+
brendan:  Can we get this landed on the 1.8.0 branch asap?  Want to get this for 1.5.0.10.  Thanks!
Fixed on the 1.8.0 branch:

js/src/jslock.h rev 3.27

/be
Keywords: fixed1.8.0.10
gen@mozilla.com has a contact who is also experience freeze.  Gen, can you check if he is using fx 2.0.0.1?
is this fix a candidate for bob's javascript in-testsuite?   setting flag to ?
Flags: in-testsuite?
(In reply to comment #91)

not really.
Flags: in-testsuite? → in-testsuite-
hi Gen, are you still seeing this repro on your mac os 10.4.7 or 10.4.8?  can you try verifying it and seeing if the fix works for you?   

We in QA here haven been unable to reproduce it on 10.4.8, and dont have a machine running 10.4.7. 

Greetings,

Firefox did often freeze with all Firefox 2 alphas, betas and 2.0. However, this bug was resolved in 2.0.0.1 RCs and the Final release.

I'm running a MacBook Pro with 10.4.8 and I don't think the freezing is related specificly to 10.4.7.

I'm sure you can safely mark this bug as fixed.
Well well well... I should have tested the new candidate build sooner.

Warning: This bug comes back loud and clear in 2.0.0.2 candidate build:
Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.2) Gecko/20070208 Firefox/2.0.0.2.

I had to backtrack to 2.0.0.1 loosing all my passwords in the process.

I'll try other 2.0.0.2 candidates hoping one of them will have this fixed.
(In reply to comment #95)
> Well well well... I should have tested the new candidate build sooner.
> 
> Warning: This bug comes back loud and clear in 2.0.0.2 candidate build:
> Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.2) Gecko/20070208
> Firefox/2.0.0.2.

Nothing reverted the source code fix. See comment 83 through comment 87, and please report the hang (with any more info, esp. your addons inventory) in a new bug. As always with hangs, the only way to be sure is to catch a deadlock via thread stack backtraces, which won't be easy.

/be
As I said, I've deleted 2.0.0.2 RC1 and reverted to 2.0.0.1.

I will wait for another 2.0.0.2 RC before trying again and creating a new bug with complete info, if necessary.

As for 2.0.0.1, I have 20 extensions enabled and I have not experienced a single hang since I'm using this version.

Mozilla/5.0 (Macintosh; U; Intel Mac OS X; fr; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1

I'm currently using a French 2.0.0.1 version, but I don't think it's language related.

Is there something specific I should look for in a newer 2.0.0.2 RC2?
hi jacques,

Can you try loading a new profile with 2.0.0.2 RC1 on your mac osx 10.4.7, and repro it then?   That will help us take out the 20 extensions victim if that is the cause of the crash.  
To create multiple profiles, you will need to set MOZ_NO_REMOTE=1 in your mac environment.   Thanks, Tony
Tony's comment is right on, and if you see no hangs on a 2.0.0.2 with a fresh profile, try adding the extensions you use. You could add the first ten, see whether you hang, and binary search those ten if so, else add the rest, confirm the hang, and try removing the last five, etc.

This really should go into a new bug, because the fix for this bug is still in place and there are lots of ways to hang Firefox. So please file one when you get a chance. Thanks,

/be
Attached file Firefox 2 failure
Another failure log.
(In reply to comment #100)
> Created an attachment (id=254892) [details]
> Firefox 2 failure
> 
> Another failure log.

Hi Gen -- that has nothing to do with this bug.  Please file a new bug and get some layout folks (dbaron at least) to have a look. Thanks,

/be

Brendan- thanks for the clarification.  I will do that.
(In reply to comment #97)
> I will wait for another 2.0.0.2 RC before trying again and creating a new bug
> with complete info, if necessary.
> 

As mentionned, I've waited for 2.0.0.2 RC2. I have it in hand and I can report happily that RC2 works like a charm.

I can now confirm that both 2.0.0.1 and 2.0.0.2 versions for Mac Intel do not hang anymore. Can someone else verify this too?

BTW I was 21 extension loaded.

Cheers,

-J
BTW I have 21 extensions loaded. [Sorry for the mistake]

- J
hi Jacques, do you mind trying your same profile in 2.002 RC2 and rerun it back on 2.002 RC1?   I am curious why it would start working now since i dont think there has been any code change between these builds regarding this bug.  At least i dont think /be checked anything new in right?
Nothing to do with JS lock optimizations changed since the hackaround fix for this bug went into the branches.  It's impossible to know what caused a hang, but if it happened in a branch build that had this bug's patch (attachment 246741 [details] [diff] [review]), then you should file a new bug. Adding comments here does not help get to the bottom of the (almost certainly) new and different problem.

If by some bad luck the patch was reverted, we'd end up back here when noticing. But I just updated my MOZILLA_1_8_BRANCH tree and still see

/* FIXME: bug 353962 hackaround */
#define JS_USE_ONLY_NSPR_LOCKS  1

in jslock.h.

/be
Hanging / crashing here too: see bug 368907 for trace etc. Sometimes up to 10x a day (*VERY annoying*). 

Maybe contact apple as it does send reports to apple, but does not send reports to mozilla?
Btw I started noticing the crashes around RC-1 or so... and it didn't go away with any new version. Using intel mac 10.4.9 now with 2.0.0.2. 
(In reply to comment #107)
> Hanging / crashing here too: see bug 368907 for trace etc. Sometimes up to 10x
> a day (*VERY annoying*). 

Thanks for reporting in bug 368907. Please don't bother commenting here. The stack trace there shows a bug completely unrelated to this one.

> Maybe contact apple as it does send reports to apple, but does not send reports
> to mozilla?

That's right -- IIRC we don't have talkback for mactel machines, but breakpad will support x86 macs.

/be
(In reply to comment #73)
> This seems to be an Intel-specific problem. I upgraded from a G4 Powerbook to
> an Intel Core 2 Duo MacBook Pro yesterday. I used Apple's migration utility to
> copy everything over, including FF, plugins, and prefs, so the FF setup was
> identical. FF2 on the G4 never crashed; on the Intel it's frozen 5x today,
> always when opening a new window.
> 
> Hope this helps.
> 
The powerpc version of Mozilla/FF does not use a thin locks implementation in js. So if this is related to thin locks in js it will not show up on ppc.

I have spent the better part of today surfing with 2.0.0.1 with a ppc thin locks implementation and have had no problems. System:

PowerMac 8600, 750GX processor (250 MHz speed)
Linux 2.4.31
dialup connection

I'll bump the speed on the 750GX. Would it be of any help to see what happens on a GigE (7455 @ 1.1 GHz x2) and Linux 2.6.17.7? Or MacOSX Panther? What about x86 Linux?

Should I have seen some hangs? Has this been reproduced on anything other than one of those multi-core beasties?

At the risk of getting smacked, why did the "fix" disable thin locks for everything (i.e. Sparc, PPC (still waiting on 296878 or whatever it is), x86 Linux)?
(In reply to comment #110)
> At the risk of getting smacked, why did the "fix" disable thin locks for
> everything (i.e. Sparc, PPC (still waiting on 296878 or whatever it is), x86
> Linux)?

The "fix" was explicitly labeled a hackaround, so let's call it that. This bug is still open.

Time was pressing and I wasn't willing to bet that the bug in jslock.c was really peculiar to x86 dual core machines. I suspect you are right that there is no such bug for PPC with a thinlock impl -- did you want to post a patch of the support you added?

/be
(In reply to comment #89)
> Fixed on the 1.8.0 branch:

js/src/jslock.h rev 3.27.26.1
(In reply to comment #111)
 
> The "fix" was explicitly labeled a hackaround, so let's call it that. This bug
> is still open.
> 
Did not mean to step on any toes. I don't know what a hackaround is? I spent way longer than I should have trying to figure out why my compare and swap code was not showing in jslock.o.

> I suspect you are right that there
> is no such bug for PPC with a thinlock impl

Do you think it is possible that this only shows up with dohickeys (extensions & addons) that are only available with MacOS or Windows? (i.e. I don't use many extension and addons.)

Do you think it would be worthwhile to see if any of the MacOS users that see this problem might be able to try to reproduce it on a PPC MacOS box (especially a dual core G5)? Wouldn't that help suggest whether it is a problem with the compare and swap implementation on x86 or something with thinlocks and their use?

Wonder if it would be worth trying the 386 implementation? Assuming (mm? where is a spelling bee winner when you need one?) those instructions are even supported anymore?

> did you want to post a patch of
> the support you added?
> 
If I understand you correctly then it has been languishing in bug 296878 for 2 years.

> /be
> 

Mac OSX has a set of atomic compare and swap functions for 32-bits and 64-bits that probably implements what is desired here. There are variants for the normal
operations and there are also "barrier" operations.

From the Mac OSX doc:

     These functions are thread and multiprocessor safe.  For each function,
     there is a version that does and anoother that does not incorporate a
     memory barrier.  Barriers strictly order memory access on a weakly-ordered weaklyordered
     ordered architecture such as PPC.  All loads and stores executed in
     sequential program order before the barrier will complete before any load
     or store executed after the barrier.  On a uniprocessor, the barrier
     operation is typically a nop.  On a multiprocessor, the barrier can be
     quite expensive.

     Most code will want to use the barrier functions to insure that memory
     shared between threads is properly synchronized.  For example, if you
     want to initialize a shared data structure and then atomically increment
     a variable to indicate that the initialization is complete, then you MUST
     use OSAtomicIncrement32Barrier() to ensure that the stores to your data
     structure complete before the atomic add.  Likewise, the consumer of that
     data structure MUST use OSAtomicDecrement32Barrier(), in order to ensure
     that their loads of the structure are not executed before the atomic
     decrement.  On the other hand, if you are simply incrementing a global
     counter, then it is safe and potentially much faster to use OSAtomicIn-crement32(). OSAtomicIncrement32().
     crement32().  If you are unsure which version to use, prefer the barrier
     variants as they are safer.

http://developer.apple.com/documentation/Darwin/Reference/ManPages/man3/OSAtomicCompareAndSwap32.3.html

It sounds like the barrier form should be used for generic Mac OSX builds. These functions are callouts so there is the call overhead vs inline assembler but it may be better than whatever we're doing now.
Here's another data point:  I turned off the phishing feature as well as the autodetect I had set up for my library's proxy server.  I immediately got less hanging up with FF 2.0.0.7.

I'm a novice just trying to figure out how to make FF work better on my Mac.  Thanks to everyone here for sharing your much greater expertise and helping me learn.
Here's another data point:  I turned off the phishing feature as well as the autodetect I had set up for my library's proxy server.  I immediately got less hanging up with FF 2.0.0.7 on my MacBook (2 GHz Core 2 Duo; 1.6 MHz RAM).

I'm a novice just trying to figure out how to make FF work better on my Mac.  Thanks to everyone here for sharing your much greater expertise and helping me learn.
(In reply to comment #114)
> Mac OSX has a set of atomic compare and swap functions for 32-bits and 64-bits
> that probably implements what is desired here. There are variants for the
> normal
> operations and there are also "barrier" operations.

Thanks for the pointers, but I'd like to know why our circa-1998 x86 asm coding in jslock.c, to implement compare-and-swap correctly on multiprocessors, is incorrect now on new Intel multicores.

Someone with detailed uarch knowledge could probably eyeball our code and say. I'm years out of date and pretty busy elsewhere, but I do want to "fix" this bug (should have filed a subordinate bug for the brute-force disabling change).

So, help diagnosing the actual problem in our jslock.c code on modern multicore x86 systems is appreciated.

/be
Keywords: helpwanted
mmoy, could this bug depend on bug 331545 (if that bug were fixed, would it cure what ails us on Windows)? Hoping you'll take this bug, and let me know what you need to get that bug (331545) moving.

/be
Attached patch use OSAtomic.hSplinter Review
This doesn't turn these functions back on in the header file, since comment 121 isn't addressed. I would like to get this reenabled for the beta, since the JSON encoder/decoder can bump pthread_mutex_(un)lock to the top of a profile through js_AtomizeString. See bug 410890 comment 3 for data.
Attachment #296600 - Flags: review?(brendan)
Comment on attachment 296600 [details] [diff] [review]
use OSAtomic.h

r/a=me if this inlines, otherwise reach for a macro. Thanks!

/be
Attachment #296600 - Flags: review?(brendan)
Attachment #296600 - Flags: review+
Attachment #296600 - Flags: approval1.9+
From private communication with mmoy, I have hopes that the hangs here are due to a bug in early Core 2 Duo processors covered by published Intel errata -- that would be good to confirm. If so, do the OS CAS routines cope?

/be
(In reply to comment #125)
> From private communication with mmoy, I have hopes that the hangs here are due
> to a bug in early Core 2 Duo processors covered by published Intel errata --
> that would be good to confirm. If so, do the OS CAS routines cope?
> 
> /be
> 

I'm using Core Duo but not Core 2 Duo, and I've experienced this bug
I also have a core duo and experience this a fair amount
(In reply to comment #126)
> (In reply to comment #125)
> > From private communication with mmoy, I have hopes that the hangs here are due
> > to a bug in early Core 2 Duo processors covered by published Intel errata --
> > that would be good to confirm. If so, do the OS CAS routines cope?
> > 
> > /be
> > 
> 
> I'm using Core Duo but not Core 2 Duo, and I've experienced this bug
> 

(In reply to comment #127)
> I also have a core duo and experience this a fair amount
> 


No, you don't. I turned off compare-and-swap over a year ago. See comment 68. If you see hangs with fresh 2.0.0.x and 3 beta or nightly builds, you are seeing some other bug. Please try to get a stack (luser, is there a way to force a stack via breakpad?), file a new bug if you can get some information other than "hung". Do not bother this bug.

/be
(In reply to comment #124)
> (From update of attachment 296600 [details] [diff] [review])
> r/a=me if this inlines, otherwise reach for a macro. Thanks!

It inlines at -Os, and it's checked in.
(In reply to comment #128)
> Please try to get a stack (luser, is there a way to
> force a stack via breakpad?)

Argh, I forgot to cc: Ted (AKA "luser" on IRC -- my calling his nick in the last comment could be misread as an abusive remark -- sorry about that!).

/be
You may be able to kill the process with the right signal, but I'm not terribly sure about that on Mac.
Blocks: 331545
(In reply to comment #125)
> From private communication with mmoy, I have hopes that the hangs here are due
> to a bug in early Core 2 Duo processors covered by published Intel errata --
> that would be good to confirm. If so, do the OS CAS routines cope?

It would be good to identify users that can reproduce this problem and exactly
what hardware and operating system they use. If this is multicore cache invalidation problem or other synchronization problem (either in hardware or software), then setting processor affinity should be a workaround. If setting processor affinity does not solve the problem, then I would conclude that it isn't a multicore problem. On Windows, it is possible to set processor affinity for a running process using the task manager. The process is described at http://articles.techrepublic.com.com/5100-10877-6168870.html

It can be set permanently using the imagecfg.exe tool from Microsoft.

Mac/OSX doesn't have a mechanism to set processor affinity until Leopard and it's an API from what I read. I do not know if there's a tool that you can just run on a process to set affinity. The Apple CHUD tools have a component called CPU pref that will allow the user to turn off one core on the fly. The CHUD tools are located on the installation disks that come with Macs. I think that they can also be downloaded from the Apple Developer Center.

It would be interesting to see if this problem goes away with either processor affinity or with turning a core off on a system. The former is a reasonable though pain-in-the neck workaround while the latter really isn't.
Again, the patch to turn off compare-and-swap landed over a year ago, so to test this bug you have to make your own build re-enabling it (comment out jslock.h line 190).

I have the CHUD CPU pref tool in my menubar and use it often, e.g., to get gdb to work better on firefox.

Comment 106 through comment 111 are worth re-reading, along with comment 39. We have good testimony that this bug only bit dual core Intel machines, and that its symptoms (not other hang bugs of course) went away when I turned off thin locks for all platforms on trunk and 1.8 branch.

Short of turning thin locks back on for Firefox 3 and making sure the hang gets reported unambiguously, I'm not flush with good ideas. If this is a known Intel bug I'd love to read the errata.

/be
The one that looks suspicious to me is AH39 at http://download.intel.com/design/mobile/SPECUPDT/31407918.pdf

The Windows builds at Vector64.com use the intrinsic that translate to compare and swap. I can do a release build on Mac OSX with the old processing for those unable to do their own builds. The Mac OSX builds at Vector64.com use the OSX system call for interlock compare and exchange.
I went back and reread the thread and it's clear that there were more Core Duo users with the problem than Core 2 Duo users. There is an errata that sounds
like it could cause this problem in Core Duo as well: AE44: Logical Processors
May Not Detect Write Back Memory Writes.

Multiprocessor systems may use polling of memory semaphores to sychronize
software activity. Because of this erratum, if a logical processor is
polling a WB memory location while it is being updated by another logical
processor, the update may not be detected. Implication: system may livelock
due to polling loop and undetected semaphore change.
Blocks: 312238
Attached patch the bug fix!Splinter Review
This bug was caused by the patch for bug 312238 -- early in the evolution of that bug's patch, a critical js_CompareAndSwap was removed (attachment 213358 [details]). Shame on me for not catching this during code review. This bug was found by Arjan Van De Ven of Intel, to whom I am profoundly grateful. A great day!

/be
Attachment #300723 - Flags: review+
Attachment #300723 - Flags: approval1.9b3?
Attachment #300723 - Flags: approval1.9+
Flags: blocking1.9+
Comment on attachment 300723 [details] [diff] [review]
the bug fix!

a=beltzner
Attachment #300723 - Flags: approval1.9b3? → approval1.9b3+
Fixed:

js/src/jslock.c 3.74
js/src/jslock.h 3.36

We should fix on the 1.8 branch, too.

/be
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Flags: blocking1.8.1.13?
Resolution: --- → FIXED
Comment on attachment 300723 [details] [diff] [review]
the bug fix!

Patch applies cleanly to the 1.8 branch.

/be
Attachment #300723 - Flags: approval1.8.1.13?
For those interested in the detailed race scenario (and I hope that bugzilla leaves the ascii art in tact):


Scenario; two CPUs, CPU0 and CPU1. Two threads, with say a "me" of 4 and 6 respectively, 4 runs on CPU0, 6 on CPU1.
Ascii art time flow (wide lines, sorry about that)

Time   fl->owner 		CPU0 / thread 4		CPU1 / thread 6		Comment
0	 0                Entering js_Lock		..  				Start situation, lock is not held by anyone
1	 0			is_CompareAndSwap(0,4)	..				This succeeds, "4" gets the lock
2	 4			return from js_Lock	..				all is well, "4" has the lock
3	 4			<actual work>		Entering js_Lock		"6" now tries to get the lock
4	 4			<actual work>		is_CompareAndSwap(0,6)	Lock is contended, CAS fails; going to slowpath
5	 4			<actual work>		entering js_Enqueue	slowpath entered, taking global lock
6	 4			<actual work>		o = 4; n = 5		(lines 978 and 979 of jslock.c)
7	 4			entering js_Unlock	..				"4" is done and is going to unlock the lock
8	 4			if (owner == me) 		..				line 1030 of jslock.c... this returns true
9	 4			..				jsCompareAndSwap(4,5)	ln 980: CompareAndSwap done, wait flag set
10	 5			owner = 0			..				line 1031 of jslock.c: we set the lock to 0
11	 0			..				calling js_SuspendThread	Going to sleep until woken		
12	 0                return;			..				line 1032 of jslock.c: we're done 
13 to eternity		..				waiting for a waker that never comes since the lock is free and nobody
								knows or cares that anyone is waiting for the lock

Ok that didn't work very well; adding as attachement instead
MikeM: you want the last patch in this bug.

Bob: we should make sure this gets into the js1.7 release (which is not out yet, IIRC -- please correct if wrong). If you are feeling generous, it could go into 1.6.x and probably help some folks, but it's up to you.

/be
(In reply to comment #142)

1.7.0 has been released but we can do a 1.7.1 shortly. I'll add this to the 1.6.1 blocker list. 
Blocks: 371572
Flags: blocking1.8.0.15?
(In reply to comment #136)
> Created an attachment (id=300723) [details]
> the bug fix!
> 
> This bug was caused by the patch for bug 312238 -- early in the evolution of
> that bug's patch, a critical js_CompareAndSwap was removed (attachment 213358 [details]).
> Shame on me for not catching this during code review. This bug was found by
> Arjan Van De Ven of Intel, to whom I am profoundly grateful. A great day!
> 
> /be
> 

So this was not related at all to those duo beasties, right? I should (or could?) have bumped into this with any of my various x86 builds or the thin lock enabled ppc builds. All on Linux. Any thoughts on why this wasn't seen more widely (more cpus and/or OSes)? I just checked my js_Unlock() code and it has the bad code. Think my being on dial-up could have changed timing enough to prevent me from seeing this?

kevin
(In reply to comment #144)
> So this was not related at all to those duo beasties, right? I should (or
> could?) have bumped into this with any of my various x86 builds

No, not if you did not re-enable thin locks by removing the #define from jslock.h.

> or the thin
> lock enabled ppc builds.

Yes, if you enabled thin locks after the patch for bug 312238 went in, you could see the bug's symptoms :-/.

> All on Linux. Any thoughts on why this wasn't seen
> more widely (more cpus and/or OSes)? I just checked my js_Unlock() code and it
> has the bad code. Think my being on dial-up could have changed timing enough to
> prevent me from seeing this?

Again, for over a year we have simply disabled thin locks to work around the undiagnosed bug.

/be
> 
> kevin
> 

If you really had thinlocks enabled in your ppc builds for a while, then I don't know why you didn't see hangs (or more hangs; perhaps a few went by without being analyzed?). The bug introduced by the patch for bug 312238 is a race condition, with unknown likelihood of bad outcomes.

Since we don't share JS objects cross-thread in Firefox, the only source of contention I know of is the JSRuntime's atom table lock (see comment 39). The anti-phishing feature uses cross-thread XPCOM proxying and JS on a background thread (again with no object sharing AFAIK). If you never used anti-phishing, had it disabled, then you probably wouldn't see the symptom.

/be
As to why Duo's saw this more often; that I can explain. It has to do with cpu caches and cross-cpu cache line bouncing. But first lets start with some lottery/statistics theory ;-)

This is a very tight race, it's on a few instructions in size.
Now, if you had a 1 CPU system, you can still hit this, due to the OS scheduler preempting you. The probability that this happens in the race window (2 instructions) in thread one is about one in 10.000.000 (assume 2Ghz cpu and 10msec timeslice). But for this race to happen, the 2nd thread also had to be preempted exactly in the right spot, so another one in 10.000.000 chance. (this is finger-in-the-air statistics, if you nitpick hard you may be able to argue a zero away or a zero extra, doesn't matter).
So the chance this happens on a UP system is about one in 100.000.000.000.000... or in other words, its more likely that you win the lottery two weeks in a row than hit this bug.

Now on a multi cpu system (or a dual core where each core has its own cache), the rules change a little. First of all, things run in parallel so no need to be preempted just right. Second, the timing rules for the race change. Because in the race scenario, you're bouncing cachelines back and forth between the 2 cpus, the window is suddenly dozens to hundreds of cycles (depending on various hardware things). It is still very unlikely that you'll see this on such a system, and that has to do with how cpu caches work. For the race to happen, cpu 0 has the cacheline with the lock in step 14, and needs to have it in step 16. While cpu 1 wants it for step 15 to happen. In a SMP situation, this effectively means that at the time of 15/16, the cache of cpu 0 has a request for the cacheline from cpu 1, but also a request from it's own cpu (I sweep these two together because the other-cpu request is a slower one). Now... caches are generally set up to give preferences to their own cpu in such cases (and the link to the own cpu is orders of magnitude faster anyway so this will happen naturally even without explicit choices)... which means that step 15 generally (but not always of course!) is unlikely to be timed just right for the race. In other words, the fundamental unfairness of the caches (bias to local) saves mozillas bacon here.

Now, enter Core2 DUO... here the 2 cores of the cpu share the cache... which means that the "bias to local" that saved mozilla above suddenly doesn't matter. The cache now has 2 requests, both from local cpus (and both requests are fast, not one slow as above), and who gets it is a coinflip, not a strong bias to cpu 0 -> race is suddenly practical to happen.
This isn't a cpu bug; in fact it's just caused because the cpus can do very fast cache sharing... so the speed of the cpu works against you in this race ;(
This bug is a mess. After fixing one problem a year ago, please file new bugs rather than resurrecting old ones. Our creaky branch-management flags and keywords simply can't handle it.

(In reply to comment #138)
> We should fix on the 1.8 branch, too.

Why? according to comment 136 this is caused by bug 312238 which does not appear to have been fixed on the 1.8 branch. Also, who wrote or reviewed the patch? From the bug it looks like a one-man affair which makes us nervous on the stable branch.
(In reply to comment #148)
> This bug is a mess.

It's worse than you think -- the bug where js_Unlock used an unlocked compare and store instead of js_CompareAndSwap was injected into the 1.8 branch, with this checkin:

revision 3.55.20.1
date: 2006/07/07 02:12:02;  author: mrbkap%gmail.com;  state: Exp;  lines: +63 -30
JS1.7 landing. r=brendan a=schrep

I found this using cvs annotate -rMOZILLA_1_8_BRANCH jslock.c | less. You can confirm using cvs diff.

> After fixing one problem a year ago, please file new bugs
> rather than resurrecting old ones. Our creaky branch-management flags and
> keywords simply can't handle it.

You're right, this bug has been mismanaged a long time. Sorry about that -- I use it as a cautionary tale now, when others think about rolling multiple fixes in under one bug.

> (In reply to comment #138)
> > We should fix on the 1.8 branch, too.
> 
> Why? according to comment 136 this is caused by bug 312238 which does not
> appear to have been fixed on the 1.8 branch.

See above -- bugzilla is not the source of ultimate truth, the cvs repo is.

> Also, who wrote or reviewed the
> patch? From the bug it looks like a one-man affair which makes us nervous on
> the stable branch.

C'mon, that's silly. The patch undoes an incorrect change and the #define that worked around that chnage; it's well understood. No superstitious invocation of crowd wisdom, please.

/be
Comment on attachment 300723 [details] [diff] [review]
the bug fix!

(In reply to comment #149)
> C'mon, that's silly.

To be honest what makes me nervous is that appearing to play favorites for the MoCo CTO might breed disrespect for the tree rules. But on inspection this is a back-out to previously good code, and I can wave sr=dveditz around so I guess it isn't going to be a problem.

approved for 1.8.1.13, a=dveditz
Attachment #300723 - Flags: superreview+
Attachment #300723 - Flags: approval1.8.1.13?
Attachment #300723 - Flags: approval1.8.1.13+
Dan: I think this is strange, and I'm saying so in public. No one invoked "MoCo" or my title (which reporters seem to forget, returning me to "Software Engineer", my starting job position out of grad school -- hey, I'm not proud -- it's what I put on my tax return).

The patch was from Arjan, not me -- he found the problem without knowing the bug history. I don't see why you should disrespect him to avoid "breed[ing] direspect" for tree rules that we are all following. Where is it written that branch fixes must be the product of a committee?

/be
(In reply to comment #151)
> Dan: I think this is strange, and I'm saying so in public. No one invoked
> "MoCo" or my title (which reporters seem to forget, returning me to "Software
> Engineer", my starting job position out of grad school -- hey, I'm not proud --
> it's what I put on my tax return).

I might have misunderstood, but I think Dan just meant that this was going outside of how we normally treat branch patches; that is, requiring a new bug be filed so we can track the progress better. Bugzilla wasn't made to handle multiple patches landing on multiple branches and multiple times.

Personally, my initial gut feeling was to not approve this patch in this bug but ask that a new bug be filed to help lessen the confusion. Just because we've done it previously in this bug doesn't mean we should keep doing behavior we know is wrong and hard to follow.

> The patch was from Arjan, not me -- he found the problem without knowing the
> bug history. I don't see why you should disrespect him to avoid "breed[ing]
> direspect" for tree rules that we are all following. Where is it written that
> branch fixes must be the product of a committee?

Again, it could just be the mess in this bug (which is why a new bug would be great in the future), but...

With apologies to Arjan, it's not clear in this bug that the patch was from him. I can track it back to bug 331545, but apparently it's not really a fix for that bug and was instead attached here (instead of a new bug getting filed). There was no comment saying that's where it came from though and, 147 comments in, it's hard to tell that's what happened.

The triage team has a lot of bugs to go through and we have to make quick judgement calls on both the bugs and their patches. Without clear explanations (and clear bugs!) it's hard to do that successfully. There's nothing "written that branch fixes must be the product of a committee" nor do I think that's what has happened here. But some form of driving has always (in my time here) occurred for stability/security branches and, again, it's a lot of bugs and a lot of work to make determines of risk vs reward.

Anyway, my two cents. Regardless, let's get the patch checked in. :)
(In reply to comment #152)
> I might have misunderstood, but I think Dan just meant that this was going
> outside of how we normally treat branch patches; that is, requiring a new bug
> be filed so we can track the progress better.

Excuse me... I meant to say... "outside of how we normally treat *multiple* branch patches in the same bug."
(In reply to comment #152)
> (In reply to comment #151)
> > Dan: I think this is strange, and I'm saying so in public. No one invoked
> > "MoCo" or my title (which reporters seem to forget, returning me to "Software
> > Engineer", my starting job position out of grad school -- hey, I'm not proud --
> > it's what I put on my tax return).
> 
> I might have misunderstood, but I think Dan just meant that this was going
> outside of how we normally treat branch patches; that is, requiring a new bug
> be filed so we can track the progress better.

That's fine but it has *nothing* to do with the job title on my business card, and I never invoked that title. Hence my b.s. detector still going off.

> Bugzilla wasn't made to handle
> multiple patches landing on multiple branches and multiple times.

You're right, and I said so myself (comment 149). It's fine to ask for a new bug, although that is mostly pro-forma overhead without actual added value that traces the chain of cause and effect to the actual malfunction being fixed -- tracing that would take you back into the swamp of the original, overloaded regressing bug (bug 312238), which took a composite patch *only part of which contained the regression*; and then to the overloaded tracking bug (this bug).

Yes, it's confusing. No, adding another bug doesn't really help except by summarizing in a fresh page, instead of adding comment 154 and counting. If that kind of summary bug would help, I will file it. I hope to avoid it.

> Personally, my initial gut feeling was to not approve this patch in this bug
> but ask that a new bug be filed to help lessen the confusion.

Your feelings are sound, young Skywalker :-P.

> Just because
> we've done it previously in this bug doesn't mean we should keep doing behavior
> we know is wrong and hard to follow.

Agreed again (did I write something about using this bug as a cautionary tale? I don't need it to be any more cautionary ;-).

> > The patch was from Arjan, not me -- he found the problem without knowing the
> > bug history. I don't see why you should disrespect him to avoid "breed[ing]
> > direspect" for tree rules that we are all following. Where is it written that
> > branch fixes must be the product of a committee?
> 
> Again, it could just be the mess in this bug (which is why a new bug would be
> great in the future), but...
> 
> With apologies to Arjan, it's not clear in this bug that the patch was from
> him. I can track it back to bug 331545,

That bug 331545 has nothing to do with the problem Arjan fixed. No one is nominating that bug's patch for the branch.

> but apparently it's not really a fix
> for that bug and was instead attached here (instead of a new bug getting
> filed).

This bug was about a specific hang, which Arjan's patch precisely fixes. I do not agree that a new bug is needed to track that self-same hang.

The error here lay in patching this bug with the hackaround that #defined JS_USE_ONLY_NSPR_LOCKS (attachment 246741 [details] [diff] [review] "hackaround"), then leaving this bug open. That was the time to close this bug and file a followup bug, or (better) to file a dependent bug to hold the hackaround patch. That was my fault, and once again: mea culpa.

> There was no comment saying that's where it came from though and, 147
> comments in, it's hard to tell that's what happened.

Please re-read comment 136, which completely diagrams the chain of events including a link to the patch that introduced the hang.

> The triage team has a lot of bugs to go through and we have to make quick
> judgement calls on both the bugs and their patches.

I know, and again, my apologies. This *is* confusing, but comment 136 is the summary you want.

> Without clear explanations
> (and clear bugs!) it's hard to do that successfully. There's nothing "written
> that branch fixes must be the product of a committee" nor do I think that's
> what has happened here.

That's a slight exaggeration of what Dan seemed to be arguing for in comment 148 (by arguing against a "one-man affair"). Really, there have been three JS hackers involved in the history of this bug (Feng, Igor, myself), and then Arjan independently identifying the flaw in the code, followed by my mangled bug body forensics.

> But some form of driving has always (in my time here)
> occurred for stability/security branches and, again, it's a lot of bugs and a
> lot of work to make determines of risk vs reward.

If the fundamentals here, which involve hairy things like multicore race conditions, are not easy for anyone to grok, then there's not a lot that new bug window-dressing can do. I would be unhappy if triagers were judging surface and not substance.

I'd rather have an exchange that points out the messy etiology, even if it rehashes the bug mismanagement, than let either my bug management mistakes or (what look to me like) cargo cult or strangly Moco-inside-baseball theories ("CTO"? Why is that not an ad-hominem argument? I'll resign if someone is coloring their judgment of my work based on a damn job title) of 1.8 branch bug risk management slide.

/be
Fixed on the 1.8 branch -- thanks for approval in the face of messy, overloaded bugs (one bug per patch, I've learned my lesson already).

js/src/jslock.c 3.55.20.4
js/src/jslock.h 3.27.22.3

/be
> The patch was from Arjan, not me

Thanks, that's the answer to my original question.

> tree rules that we are all following.

It was not clear that was the case, at a hurried glance it looked like a
self-review. I'm sorry we're not able to spend much time trying to understand
each bug, and this bug is already confusing.
Bob: which patch is still wanted for 1.8.0.15?  I see something got landed...  Are you willing to champion the rest in?
For some reason the only thing that landed on the 1.8.0 (Firefox 1.5) branch was the #define JS_USE_ONLY_NSPR_LOCKS 1 hackaround to jslock.h. js_Unlock in jslock.c on the 1.8.0 branch correctly uses js_CompareAndSwap to swap 0 for me in tl->owner. So this is all you need, caillon.

/be
Attachment #304263 - Flags: review+
Attachment #304263 - Flags: approval1.8.0.15?
Comment on attachment 304263 [details] [diff] [review]
1.8.0 branch patch

Thanks for the quick followup, brendan!  a=caillon for 1.8.0.15
Attachment #304263 - Flags: approval1.8.0.15? → approval1.8.0.15+
Fixed on the MOZILLA_1_8_0_BRANCH:

js/src/jslock.h 3.27.26.2

/be
Flags: blocking1.8.1.13? → blocking1.8.1.13+
Flags: blocking1.8.0.15? → blocking1.8.0.15-
Flags: blocking1.8.0.15- → blocking1.8.0.15+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: