353962 - Firefox 2.0 often hangs in Intel Mac OS X 10.4.7

Reporter

Description

•

18 years ago

User-Agent:       Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
Build Identifier: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6

Firefox 2.0 (beta 2 and RC1 candidate) often hangs in Intel Mac OS X 10.4.7.
The hang may happen after opening Firefox, open a external url, etc.

Reproducible: Sometimes

Steps to Reproduce:
1.Open Firefox.app, or set Firefox as the default browser, and open a url
2.
3.

Actual Results:  
Firefox hangs, the mouse icon keeps on the rolling ball, and use cmd+alt+esc will show Firefox "not response" after a few seconds

Expected Results:  
Firefox starts and show the homepage or the specified url correctly

Firefox 1.5 works well

Yuxuan Wang

Reporter

Comment 1

•

18 years ago

I created a new profile and it seems don't happen again. But after I copied the "signons.txt" file from my old profile to the new profile, it hangs again

Version: unspecified → 2.0 Branch

Nickolay_Ponomarev

Comment 2

•

18 years ago

Thanks for your report. Since we can't ask you to give us your passwords, could you maybe try to minimize the file to see if it's a particular entry causing the trouble? Also if possible, please attach a profile from Shark.app or Sampler.app to see if we can figure out where does Firefox hang.

Also, same signons.txt (with key3.db) works fine in Firefox 1.5, correct?

Keywords: hang

Yuxuan Wang

Reporter

Comment 3

•

18 years ago

maybe signons.txt isn't the problem. I've just copied signons.txt but without the key3.db, sorry for that.
But the new profile I created, after some normal use (just visit website, etc.) and empty signons.txt (but in the normal use I've stored some passwords) also hangs sometimes, but not as often as my old profile.
Firefox 1.5 never hangs on my machine with the same old profile

Nickolay_Ponomarev

Comment 4

•

18 years ago

Well if you could figure out what causes the hang and/or figure out what hangs (using the abovementioned tools), it would help.

Yuxuan Wang

Reporter

Comment 5

•

18 years ago

Attached file A Sampler.app trace — Details

I just open Firefox.app to display my homepage(about:mozilla) and firefox hangs.

Bill Northlich

Comment 6

•

18 years ago

I just sent this to the ff mozilla support group, fyi
---
ff 2.0 hangs a lot for me.  About 4 times a day.  I'm on macbook w/ 2 gig mem.  I keep a lot of tabs open (~20) accross crashes/hangs/reboots; else, I have noticed no pattern on this.

Mostly, when it hangs, I "force quit" ff, and it recovers nicely.

Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1) Gecko/20061003 Firefox/2.0

Roy Rosenzweig

Comment 7

•

18 years ago

my problem is pretty much identical to comment #6; only differences are macbook pro; 10.4.8 and 1.5 gb of memory; never had the problem on 1.5; i trashed everything and started over with no luck. i'm about to try that again, but i'm not confident it will do anything

Bill Northlich

Comment 8

•

18 years ago

Attached file shark time profile taken after ff hung — Details

here is a shark "timing profile" taken after a ff hang.  ff rc3.  Macbook 2mb.  Should I set some other options?  Seems I can get it to hang within a day...

Håkan Waara

Comment 9

•

18 years ago

1) In order for Shark output to be interesting (to non-mac devs), you need to export it as text. For it to be truly useful, a developer need to first mine the data because it usually contains lots of redundant info.

2) Sampler output also needs to be exported as text, if you wnat non-mac devs to look at it. Either Sampler.app or Shark.app can do this conversion, I think.

I will try to write more about on the developer wiki about this in the near future.

Yuxuan Wang

Reporter

Comment 10

•

18 years ago

Attached file The text version of my Sampler.app output — Details

This is the text version for attachment #241686 [details]

timeless

Comment 11

•

18 years ago

afaik the code in question comes from here: http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&root=/cvsroot&subdir=mozilla/js/src/xpconnect/src&command=DIFF_FRAMESET&file=xpccallcontext.cpp&rev2=1.1.2.5&rev1=1.1.2.4

That follows is a /slightly/ shortened view of the problem, i've used {x} to indicate the number of spaces i've removed.
Call graph:
{3} 285 Thread_5907
{11} 285 XRE_main
{15} 285 nsAppShell::Run()
{35} 285 nsTimerImpl::Fire()
{37} 285 nsGlobalWindow::TimerCallback(nsITimer*, void*)
{39} 285 nsGlobalWindow::RunTimeout(nsTimeout*)
{41} 285 nsJSContext::CallEventHandler(JSObject*, JSObject*, unsigned, long*, long*)
{69} 285 nsJSCID::GetService(nsISupports**)
{71} 285 nsComponentManagerImpl::GetService(nsID const&, nsID const&, void**)
{73} 285 nsComponentManagerImpl::CreateInstance(nsID const&, nsISupports*, nsID const&, void**)
{81} 285 nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS*, unsigned short, nsXPTMethodInfo const*, nsXPTCMiniVariant*)
{87} 285 js_InvokeConstructor
{97} 285 XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode)
{105} 285 nsXPCWrappedJSClass::GetRootJSObject(XPCCallContext&, JSObject*)
{107} 285 nsXPCWrappedJSClass::CallQueryInterfaceOnJSObject(XPCCallContext&, JSObject*, nsID const&)
{109} 285 AutoScriptEvaluate::~AutoScriptEvaluate [in-charge]()
{111} 285 nsJSContext::ScriptExecuted()
{113} 285 nsJSContext::ScriptEvaluated(int)
{115} 285 JS_GC
{117} 285 js_GC
{119} 285 PR_WaitCondVar

{3} 285 Thread_6003
{9} 285 nsHostResolver::ThreadFunc(void*)
{11} 285 nsHostResolver::OnLookupComplete(nsHostRecord*, unsigned, PRAddrInfo*)
{13} 285 nsDNSAsyncRequest::OnLookupComplete(nsHostResolver*, nsHostRecord*, unsigned)
{15} 285 nsCOMPtr_base::assign_with_AddRef(nsISupports*)
{17} 285 nsProxyEventObject::Release()
{21} 285 nsProxyEventObject::Release()
{27} 285 nsQueryInterface::operator()(nsID const&, void**) const
{29} 285 nsXPCWrappedJS::QueryInterface(nsID const&, void**)
{31} 285 nsXPCWrappedJSClass::DelegatedQueryInterface(nsXPCWrappedJS*, nsID const&, void**)
{33} 285 XPCCallContext::XPCCallContext[in-charge](XPCContext::LangType, JSContext*, JSObject*, JSObject*, long, unsigned, long*, long*)
{35} 285 XPCJSContextStack::GetSafeJSContext(JSContext**)
{37} 285 nsXPConnect::InitClasses(JSContext*, JSObject*)
{39} 285 XPCNativeWrapper::AttachNewConstructorObject(XPCCallContext&, JSObject*)
{41} 285 JS_InitClass
{49} 285 js_AtomizeString
{51} 285 js_Lock
{53} 285 PR_WaitCondVar

unless i'm missing something we're basically *not* grabbing a lock on a context just because we know no one else has it, which seems wrong/broken.

Assignee: nobody → dbradley

Component: General → XPConnect

Product: Firefox → Core

QA Contact: general → xpconnect

Version: 2.0 Branch → Trunk

Boris Zbarsky [:bzbarsky]

Comment 12

•

18 years ago

I should note that XPCJSContextStack::GetSafeJSContext starts a request before calling InitClasses.  So is the bug really in XPConnect or in JS?

In any case, enough people are seeing this, and we have confirmed deadlock traces, so confirming bug and maybe we should block 2.0 on this...

Status: UNCONFIRMED → NEW

Ever confirmed: true

Flags: blocking1.8.1?

Boris Zbarsky [:bzbarsky]

Comment 13

•

18 years ago

Given that this didn't happen with 1.5, according to comments, adding "regression" keyword.

Keywords: regression

timeless

Comment 14

•

18 years ago

are you configured to use proxy auto config? the pref name is network.proxy.autoconfig_url in case you need to check prefs.js manually.

unfortunately other possibilities include sessionstore and safebrowsing, if someone could explain how to use user.js to disable these, that'd be appreciated.

also, i presume you've tried running in safe mode?

Assignee: dbradley → general

Component: XPConnect → JavaScript Engine

QA Contact: xpconnect → general

Tony Chang (Google)

Comment 15

•

18 years ago

(In reply to comment #14)
> unfortunately other possibilities include sessionstore and safebrowsing, if
> someone could explain how to use user.js to disable these, that'd be
> appreciated.

Add:
user_pref("browser.safebrowsing.enabled", false);
In prefs.js to disable safebrowsing.

If it's hanging on startup, try adding that and seeing if it helps.

Brendan Eich [:brendan]

Assignee

Comment 16

•

18 years ago

Timeless, you wrote: "unless i'm missing something we're basically *not* grabbing a lock on a context just because we know no one else has it, which seems wrong/broken."  Presumably you are talking about jband's code to call JS_BeginRequest only if JS_GetContextThread returns non-zero.  That's old code dating from when requests were not used on the main thread, and it was thought that main-thread contexts had zero thread ids.  They don't.  The test of JS_GetContextThread should be removed, but this code isn't implicated in the stacks you show.

There must be another stack for there to be a deadlock.  The two stacks you show do not complete a cycle in the lock dependency graph.

Does anyone have full stack backtraces for a hang associated with this bug?  Does the shark data attached show thread stacks?

/be

Assignee: general → dbradley

Component: JavaScript Engine → XPConnect

QA Contact: general → xpconnect

Yuxuan Wang

Reporter

Comment 17

•

18 years ago

I've did it and it hangs again

(In reply to comment #15)
> (In reply to comment #14)
> > unfortunately other possibilities include sessionstore and safebrowsing, if
> > someone could explain how to use user.js to disable these, that'd be
> > appreciated.
> 
> Add:
> user_pref("browser.safebrowsing.enabled", false);
> In prefs.js to disable safebrowsing.
> 
> If it's hanging on startup, try adding that and seeing if it helps.
>

Dietrich Ayala (:dietrich)

Comment 18

•

18 years ago

Session restore can be disabled by setting browser.sessionstore.enabled = false.

Is there anything in particular I should look for in the session restore code that might cause this?

Yuxuan Wang

Reporter

Comment 19

•

18 years ago

Attached file A report generated by Shark.app — Details

I choose to reopen a closed tab and firefox hangs again, this is the Shark report.
I will disable sessionsaver and try if it hangs again

Yuxuan Wang

Reporter

Comment 20

•

18 years ago

Attached file I deadlock after I disabled sessionstore — Details

This is a report generated by Sampler.app

Brendan Eich [:brendan]

Assignee

Comment 21

•

18 years ago

This looks more like a stuck lock (missing unlock) than a deadlock.  Fun.  If someone can catch it in gdb and find me on irc.mozilla.org, I can help debug.  Otherwise we'll have to go looking for a needle in a haystack.

/be

Brendan Eich [:brendan]

Assignee

Comment 22

•

18 years ago

Moving back to JS engine based on latest data.  If this is a stuck JS lock then we could try to narrow down where the missing unlock might be based on novel code usage.  Does anti-phishing use new JS1.7 features?

/be

Assignee: dbradley → general

Component: XPConnect → JavaScript Engine

QA Contact: xpconnect → general

Marcia Knous [:marcia]

Comment 23

•

18 years ago

Just to get some additional data points, I checked the hendrix feedback to see if there is more widespread reports of hang problems on the Mac. Looks as if most all of the hang reports in hendrix are from Windows, and there were not that many reports that were specific to either RC2 or RC3.

Mike Schroepfer

Updated

•

18 years ago

Flags: blocking1.8.1? → blocking1.8.1.1?

Tony Chang (Google)

Comment 24

•

18 years ago

(In reply to comment #22)
> Moving back to JS engine based on latest data.  If this is a stuck JS lock then
> we could try to narrow down where the missing unlock might be based on novel
> code usage.  Does anti-phishing use new JS1.7 features?

Nope, based on code that works in FF 1.5.

Yuxuan Wang

Reporter

Comment 25

•

18 years ago

Attached file gdb attached, and execute "fin" command — Details

It hangs again and i use gdb to attach this process, execute the "fin" command.
It hangs on ReceiveNextEventCommon, but after a while, it exited from this function and didn't hangs anymore.
I don't know if it's not a real hang, or gdb changed the time-order so resolved the deadlock. Next time it hangs, i'll wait for some minutes to ensure it's a real hang, and then use gdb to attach.
Hope this helps.

Yuxuan Wang

Reporter

Comment 26

•

18 years ago

Another: Firefox hangs on start-up, here's the "bt" and "fin" command in gdb:
---
(gdb) bt
#0  0x90024427 in semaphore_wait_signal_trap ()
#1  0x90028414 in pthread_cond_wait ()
#2  0x00fbe931 in PR_WaitCondVar ()
#3  0x00fbeb91 in PR_Wait ()
#4  0x002ff17a in nsUrlClassifierDBService::EnsureThreadStarted ()
#5  0x002ff7f9 in nsUrlClassifierDBService::CancelStream ()
#6  0x00301b5f in TableUpdateListener::OnStopRequest ()
#7  0x005c96a6 in nsHttpChannel::OnStopRequest ()
#8  0x0034c7f8 in nsInputStreamPump::OnStateStop ()
#9  0x0034cd7d in nsInputStreamPump::OnInputStreamReady ()
#10 0x01872b55 in nsInputStreamReadyEvent::EventHandler ()
#11 0x01847585 in PL_HandleEvent ()
#12 0x0184783e in PL_ProcessPendingEvents ()
#13 0x0184895c in nsEventQueueImpl::ProcessPendingEvents ()
#14 0x0180e439 in NS_ShutdownXPCOM_P ()
#15 0x00003468 in ScopedXPCOMStartup::~ScopedXPCOMStartup ()
#16 0x00006171 in XRE_main ()
#17 0x00003258 in main ()
(gdb) fin
Run till exit from #0  0x90024427 in semaphore_wait_signal_trap ()
0x90028414 in pthread_cond_wait ()
(gdb) 
Run till exit from #0  0x90028414 in pthread_cond_wait ()
0x00fbe931 in PR_WaitCondVar ()
(gdb) 
Run till exit from #0  0x00fbe931 in PR_WaitCondVar ()
0x00fbeb91 in PR_Wait ()
(gdb) 
Run till exit from #0  0x00fbeb91 in PR_Wait ()
0x002ff17a in nsUrlClassifierDBService::EnsureThreadStarted ()
(gdb) 
Run till exit from #0  0x002ff17a in nsUrlClassifierDBService::EnsureThreadStarted ()
---
and this is the sampler.app report:
---
# Report 0 - Timed Samples for firefox-bin (process 7610)
Stacks at 2006-10-22 00:54:24 +0800
Samples (displayed/total): 560/560
Call graph:
    280 Thread_444f
      280 _pthread_body
        280 PR_Select
          280 nsThread::Main(void*)
            280 TimerThread::Run()
              280 PR_WaitCondVar
                280 semaphore_wait_signal_trap
                  280 semaphore_wait_signal_trap
    280 Thread_5a07
      280 start
        280 start
          280 main
            280 XRE_main
              280 ScopedXPCOMStartup::~ScopedXPCOMStartup [in-charge]()
                280 NS_ShutdownXPCOM_P
                  280 nsEventQueueImpl::ProcessPendingEvents()
                    280 PL_ProcessPendingEvents
                      280 PL_HandleEvent
                        280 nsInputStreamReadyEvent::EventHandler(PLEvent*)
                          280 nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream*)
                            280 nsInputStreamPump::OnStateStop()
                              280 nsHttpChannel::OnStopRequest(nsIRequest*, nsISupports*, unsigned)
                                280 TableUpdateListener::OnStopRequest(nsIRequest*, nsISupports*, unsigned)
                                  280 nsUrlClassifierDBService::CancelStream()
                                    280 nsUrlClassifierDBService::EnsureThreadStarted()
                                      280 PR_Wait
                                        280 PR_WaitCondVar
                                          280 semaphore_wait_signal_trap
                                            280 semaphore_wait_signal_trap

Total number in stack (recursive counted multiple, when >=5):

Sort by top of stack, same collapsed (when >= 5):
        semaphore_wait_signal_trap        560
---
hope this helps

Yuxuan Wang

Reporter

Comment 27

•

18 years ago

It hangs many times today, but nearly every time it hangs, I use gdb to attach it, and execute "fin", it will hang at the function ReceiveNextEventCommon and then get out of it and the hang just disappears.
There're 2 exceptions: 1 is comment #26 that hangs on start-up(the ReceiveNextEventCommon hangs are during surfing but not starting up), another is also during surfing:
(gdb) bt
#0  0x90024427 in semaphore_wait_signal_trap ()
#1  0x90028414 in pthread_cond_wait ()
#2  0x00fbe931 in PR_WaitCondVar ()
#3  0x00f0fdd6 in js_GC ()
#4  0x00ee42c6 in JS_GC ()
#5  0x0051e060 in nsJSContext::Notify ()
#6  0x0184a6ae in nsTimerImpl::Fire ()
#7  0x0184ae8f in handleTimerEvent ()
#8  0x01847585 in PL_HandleEvent ()
#9  0x0184783e in PL_ProcessPendingEvents ()
#10 0x90828379 in CFRunLoopRunSpecific ()
#11 0x90827eb5 in CFRunLoopRunInMode ()
#12 0x92dcdb90 in RunCurrentEventLoopInMode ()
#13 0x92dcd297 in ReceiveNextEventCommon ()
#14 0x92e15929 in _AcquireNextEvent ()
#15 0x92e15774 in RunApplicationEventLoop ()
#16 0x0023feb3 in nsAppShell::Run ()
#17 0x002d7f8a in nsAppStartup::Run ()
#18 0x000064f9 in XRE_main ()
#19 0x00003258 in main ()
(gdb) fin
Run till exit from #0  0x90024427 in semaphore_wait_signal_trap ()
0x90028414 in pthread_cond_wait ()
(gdb) 
Run till exit from #0  0x90028414 in pthread_cond_wait ()
0x00fbe931 in PR_WaitCondVar ()
(gdb) 
Run till exit from #0  0x00fbe931 in PR_WaitCondVar ()
0x00f0fdd6 in js_GC ()
(gdb) 
Run till exit from #0  0x00f0fdd6 in js_GC ()

Tony Chang (Google)

Comment 28

•

18 years ago

(In reply to comment #26)

Hmm, this looks like anti-phishing code.  Are you able to get this trace to trigger with anti-phishing off?  It shouldn't invoke TableUpdateListener unless it's on.

Brendan Eich [:brendan]

Assignee

Comment 29

•

18 years ago

Whew, no stuck JS lock (I searched the source and couldn't find any bad paths).  Moving to anti-phishing, cc'ing darin.

/be

Assignee: general → nobody

Component: JavaScript Engine → Phishing Protection

Product: Core → Firefox

QA Contact: general → phishing.protection

Tony Chang (Google)

Comment 30

•

18 years ago

(In reply to comment #28)
> (In reply to comment #26)
> 
> Hmm, this looks like anti-phishing code.  Are you able to get this trace to
> trigger with anti-phishing off?  It shouldn't invoke TableUpdateListener unless
> it's on.

Actually, it doesn't really matter, since it seems like having it on can cause a hang during startup.  Investigating an idea that might be causing this . . .

Yuxuan Wang

Reporter

Comment 31

•

18 years ago

I've disabled anti-phishing(in preferences->Security->Tell me blah blah...) but it also hangs.

Tony Chang (Google)

Comment 32

•

18 years ago

(In reply to comment #31)
> I've disabled anti-phishing(in preferences->Security->Tell me blah blah...) but
> it also hangs.

So I think the startup hang (comment #26) is due to anti-phishing (bug 355399).  The hangs while running seem unrelated, especially if you're not seeing it with anti-phishing off.

timeless

Comment 33

•

18 years ago

*** Bug 356721 has been marked as a duplicate of this bug. ***

Frank Markus

Comment 34

•

18 years ago

On Macbook Pro (Intel 2.16 Core Duo) running OS X 1.4.8, Firefox 2.0 frequently hangs when opening a new page (often from a hyperlink.)  There is no way to unfreeze the program other than to quit or force quit.  Sometimes, resume session re-freezes the program but not always.

Tony Chang (Google)

Comment 35

•

18 years ago

#0  0x9002449f in semaphore_wait_trap ()
#1  0x9000110f in pthread_mutex_lock ()
#2  0x00fbe46c in PR_Lock ()
#3  0x00fbeaf8 in PR_EnterMonitor ()
#4  0x0184e2bf in nsProxyEventObject::GetNewOrUsedProxy ()
#5  0x0184f0e4 in nsProxyObjectManager::GetProxyForObject ()
#6  0x0184ee31 in NS_GetProxyForObject ()
#7  0x002ffd56 in nsUrlClassifierDBService::Exists ()
#8  0x0185d98d in XPTC_InvokeByIndex ()
#9  0x0038ccfb in XPCWrappedNative::CallMethod ()
#10 0x0037ef93 in XPC_WN_CallMethod ()
#11 0x00f22443 in js_Invoke ()
#12 0x00f13316 in js_Interpret ()
#13 0x00f2299b in js_Invoke ()
#14 0x005e6a84 in nsXPCWrappedJSClass::CallMethod ()
#15 0x005e3dcf in nsXPCWrappedJS::CallMethod ()
#16 0x0185da46 in XPTC_InvokeByIndex ()
#17 0x0185dc60 in nsXPTCStubBase::Stub7 ()
#18 0x0185d98d in XPTC_InvokeByIndex ()
#19 0x0038ccfb in XPCWrappedNative::CallMethod ()
#20 0x0037ef93 in XPC_WN_CallMethod ()
#21 0x00f22443 in js_Invoke ()
#22 0x00f13316 in js_Interpret ()
#23 0x00f2299b in js_Invoke ()
#24 0x005e6a84 in nsXPCWrappedJSClass::CallMethod ()
#25 0x005e3dcf in nsXPCWrappedJS::CallMethod ()
#26 0x0185da46 in XPTC_InvokeByIndex ()
#27 0x0185dc92 in nsXPTCStubBase::Stub8 ()
#28 0x0185d98d in XPTC_InvokeByIndex ()
#29 0x0038ccfb in XPCWrappedNative::CallMethod ()
#30 0x0037ef93 in XPC_WN_CallMethod ()
#31 0x00f22443 in js_Invoke ()
#32 0x00f13316 in js_Interpret ()
#33 0x00f2299b in js_Invoke ()
#34 0x005e6a84 in nsXPCWrappedJSClass::CallMethod ()
#35 0x005e3dcf in nsXPCWrappedJS::CallMethod ()
#36 0x0185da46 in XPTC_InvokeByIndex ()
#37 0x0185db98 in nsXPTCStubBase::Stub3 ()
#38 0x002d5b8c in nsDocNavStartProgressListener::Observe ()
#39 0x0184a6ee in nsTimerImpl::Fire ()
#40 0x0184ae8f in handleTimerEvent ()
#41 0x01847585 in PL_HandleEvent ()
#42 0x0184783e in PL_ProcessPendingEvents ()
#43 0x90828379 in CFRunLoopRunSpecific ()
#44 0x90827eb5 in CFRunLoopRunInMode ()
#45 0x92dcdb90 in RunCurrentEventLoopInMode ()
#46 0x92dcd297 in ReceiveNextEventCommon ()
#47 0x92e15929 in _AcquireNextEvent ()
#48 0x92e15774 in RunApplicationEventLoop ()
#49 0x0023feb3 in nsAppShell::Run ()
#50 0x002d7f8a in nsAppStartup::Run ()
#51 0x000064f9 in XRE_main ()
#52 0x00003258 in main ()

So maybe something wrong with how we're using proxy events?

Håkan Waara

Comment 36

•

18 years ago

See also bug 343033 (Firefox trunk builds always hang on quit), where we also see the phishing service in the stack. 

Might it be the same dead-lock causing that bug?

timeless

Comment 37

•

18 years ago

iirc proxy events aren't safe before the thread manager rewrite, if 2.0 doesn't have that rewrite, then they aren't safe :).

can you walk the stack and figure out where the monitor/lock is that's already holding it?

Tony Chang (Google)

Comment 38

•

18 years ago

Attached file backtrace of all threads — Details

thread 9 + thread 1 seems suspicious

Brendan Eich [:brendan]

Assignee

Comment 39

•

18 years ago

(In reply to comment #38)
> Created an attachment (id=244130) [edit]
> backtrace of all threads
> 
> thread 9 + thread 1 seems suspicious

Thread 2 is stuck on the proxy event manager monitor too, but it looks like it's not holding anything.

Thread 9 has the proxy event manager monitor and seems to want the JS atom table lock.  Is it stuck waiting indefinitely, or if you finish in gdb after switching to thread 9, does the thread make forward progress and end up meta-stable (i.e., is this thread live-locking somehow)?

I can't see how the atom table lock is held by any thread shown.  What does 'info threads' disclose?  Any other threads not consecutively numbered, with a hole for threads that exited in the thread id space?

If you have this in gdb, find me on irc.

/be

Tony Chang (Google)

Comment 40

•

18 years ago

Attached file another backtrace (obsolete) — Details

Another gdb backtrace.  I have this held in gdb, so find me if you want more information (info threads only shows the 11 shown here and finish doesn't complete on thread 11).

Tony Chang (Google)

Comment 41

•

18 years ago

Attached file same as last bt with "info threads" included — Details

Attachment #244251 - Attachment is obsolete: true

timjohnson0

Comment 42

•

18 years ago

FWIW, I am able to consistently repro the hang when I turn on auto-proxy config (pointing to a local .pac file).  When I turn off the auto-proxy config, the hang goes away.  [MacBook Pro, 10.4.8, only on FF 2.0)

yaehbhi

Comment 43

•

18 years ago

In my case turning off auto-proxy and changing to manual doesn't help. FireFox still hangs...
I'm on a Mac Mini running 10.4.8

(In reply to comment #42)
> FWIW, I am able to consistently repro the hang when I turn on auto-proxy config
> (pointing to a local .pac file).  When I turn off the auto-proxy config, the
> hang goes away.  [MacBook Pro, 10.4.8, only on FF 2.0)
>

Yuxuan Wang

Reporter

Comment 44

•

18 years ago

Attached file A true deadlock in gdb — Details

This is no doubt a deadlock, the attachment is the threads and backtraces

Roy Rosenzweig

Comment 45

•

18 years ago

I did a total fresh install (deleting profile, etc) and I seem to be rid of the constant hangs that I had (on macbook pro running 10.4.8). I did notice that after i reinstalled the Google toolbar, I had the same problem again. So, I uninstalled that. Is anyone using the Google toolbar?

Yuxuan Wang

Reporter

Comment 46

•

18 years ago

(In reply to comment #45)
> I did a total fresh install (deleting profile, etc) and I seem to be rid of the
> constant hangs that I had (on macbook pro running 10.4.8). I did notice that
> after i reinstalled the Google toolbar, I had the same problem again. So, I
> uninstalled that. Is anyone using the Google toolbar?
> 

Yes I was using Google toolbar. Now I uninstalled it and seems OK now.
But in FX 1.5 with the same version of Google toolbar, FX never hangs.

Roy Rosenzweig

Comment 47

•

18 years ago

google toolbar worked fine for me in FX 1.5 as well.

Yuxuan Wang

Reporter

Comment 48

•

18 years ago

I've disabled Google Toolbar for more than 24-hours and Fx never hangs after that so far.

Bill Northlich

Comment 49

•

18 years ago

Ran ff 11/3 -> today (11/6) on Mac (Intel) without Google toolbar.  No hangs so far.

Frank Markus

Comment 50

•

18 years ago

I fear that this is nothing more than a "me too" posting but I have have yet to have a hang since I disabled the Google toolbar.  I did, however, send a message to Google support for the toolbar bringing this issue to their attention.  So far, all that I have heard back is an auto-reply email.  I hope that someone with more knowledge and/or better contacts can pursue this with them.

Tony Chang (Google)

Comment 51

•

18 years ago

Just to be clear, it sounds like everyone is saying that this only happens with google toolbar, which may mean it's a bug in toolbar, not firefox.  Does this hang happen to anyone not using google toolbar?

Jim

Comment 52

•

18 years ago

(In reply to comment #51)
> Just to be clear, it sounds like everyone is saying that this only happens with
> google toolbar, which may mean it's a bug in toolbar, not firefox.  Does this
> hang happen to anyone not using google toolbar?
> 

I thought I would try FF2 again after moving back to FF1.7.  FF2 would hang on startup nearly everytime I launched it.  I am and was NOT using Google Toolbar.  I installed using only approved Mozilla extensions and NO Google Toolbar.  I turned off Phishing, reran and FF2 immediately hung with the spinning beach ball.  I force quit and reran.  FF2 started up fine and had done so for the last day now.  I do get that pesky spinning beach ball on startup but it doesn't seem to hang anymore.  FF2 seems quite a bit slower than FF1.7 too.  I am using an Intel iMac, 2ghz, 512 mb ram, OS X 10.4.8.

Daniel Veditz [:dveditz]

Comment 53

•

18 years ago

Not enough progress to realistically block 1.8.1.1 unfortunately.

Flags: wanted1.8.1.x+

Flags: blocking1.8.1.1?

Flags: blocking1.8.1.1-

Tony Chang (Google)

Comment 54

•

18 years ago

(In reply to comment #53)
> Not enough progress to realistically block 1.8.1.1 unfortunately.

It seems that the problem is caused specifically by the pagerank feature in toolbar.  One possible work around is on bug 337492, but I agree that this shouldn't block 1.8.1.1.  We're also trying to work around the hang on the toolbar end.

Yuxuan Wang

Reporter

Comment 55

•

18 years ago

I've enabled google toolbar, and disabled PageRank, and fx hangs once again.
This time it hangs on js_GC

Brendan Eich [:brendan]

Assignee

Comment 56

•

18 years ago

(In reply to comment #55)
> I've enabled google toolbar, and disabled PageRank, and fx hangs once again.
> This time it hangs on js_GC

Please attach stack backtraces for all threads.

/be

Yuxuan Wang

Reporter

Comment 57

•

18 years ago

Attached file hangs on js_GC function — Details

This is the backtraces for all threads while a hang occurred in js_GC function

Michael Hovis

Comment 58

•

18 years ago

I am able to reproduce this on Windows XP SP2 on a core duo with a debug build of firefox 2.0, but ONLY when the google toolbar is installed.  I had a dump of this on Windows but unfortunately lost it when I swapped source and destination designations with a tar -czf command.  Sorry.

Brendan Eich [:brendan]

Assignee

Comment 59

•

18 years ago

I'd like to get to the bottom of this.  If it's *only* happening in Mactels, it could be a bug in how compare-and-swap is implemented by jslock.c.  Anyone who can reproduce under gdb, please ping me on IRC.  Tony did last time, we spent a while, but there's more we can try.  It may be worth adding a double-check patch before hand, if you can do your own build.  I'll try to come up with one.

/be

Brendan Eich [:brendan]

Assignee

Comment 60

•

18 years ago

(In reply to comment #58)
> I am able to reproduce this on Windows XP SP2 on a core duo with a debug build
> of firefox 2.0, but ONLY when the google toolbar is installed.  I had a dump of
> this on Windows but unfortunately lost it when I swapped source and destination
> designations with a tar -czf command.  Sorry.

Sorry, I misread this comment.  So if it happens on Windows too, there is probably a cross-platform bug to-do with thin locks (compare-and-swap-optimized locks).  I think Mike Moening saw this in his embedding, cured it via JS_USE_ONLY_NSPR_LOCKS. Cc'ing him.

/be

Michael Hovis

Comment 61

•

18 years ago

Building JS with JS_USE_ONLY_NSPR_LOCKS (or adding JS_NO_THIN_LOCKS = 1 to the js/src makefile) did it!  Hopefully a patched 2.0 will be available soon...?  In the meantime, I'll just use my freshly-built version.  Thanks.

(not reading, please use seth@sspitzer.org instead)

Comment 62

•

18 years ago

> Building JS with JS_USE_ONLY_NSPR_LOCKS (or adding JS_NO_THIN_LOCKS = 1 to the
> js/src makefile) did it!

Thanks for rebuilding and testing, michael!  Did you rebuild on Mac or Windows XP?

> Not enough progress to realistically block 1.8.1.1 unfortunately.

Re-asking for blocking 1.8.1.1, since brendan/michael have made some progress.

Flags: blocking1.8.1.1- → blocking1.8.1.1?

Whiteboard: [Fx 2.0.0.1][potential fix in hand]

Michael Hovis

Comment 63

•

18 years ago

Built on and run on XP, I had forgotten to add all the plugins in the first run and now that I have I'm now getting a crash in nsViewManager... seemingly when flash content is init'd.

Michael Hovis

Comment 64

•

18 years ago

(In reply to comment #63)
> Built on and run on XP, I had forgotten to add all the plugins in the first run
> and now that I have I'm now getting a crash in nsViewManager... seemingly when
> flash content is init'd.
> 

Rebuilt view and the layout manager, relinked browser and voila, no more crash.  This looks to be a good fix.

Jay Patel [:jay]

Comment 65

•

18 years ago

Not blocking for 1.8.1.1, but if we get some more traction on this (looks like Brendan and Tony are already investigating), we should try to get a patch together for 1.8.1.2.

Flags: blocking1.8.1.1? → blocking1.8.1.1-

Brendan Eich [:brendan]

Assignee

Comment 66

•

18 years ago

Attached patch hackaround — Details — Splinter Review

Ride-along hack-around.

/be

Assignee: nobody → brendan

Status: NEW → ASSIGNED

Attachment #246741 - Flags: review?(mrbkap)

Attachment #246741 - Flags: approval1.8.1.1?

Blake Kaplan (:mrbkap) (inactive)

Updated

•

18 years ago

Attachment #246741 - Flags: review?(mrbkap) → review+

Daniel Veditz [:dveditz]

Comment 67

•

18 years ago

Comment on attachment 246741 [details] [diff] [review]
hackaround

approved for 1.8 branch, a=dveditz

Attachment #246741 - Flags: approval1.8.1.1? → approval1.8.1.1+

Brendan Eich [:brendan]

Assignee

Comment 68

•

18 years ago

Hackaround checked into trunk and 1.8 branch:

Checking in jslock.h;
/cvsroot/mozilla/js/src/jslock.h,v  <--  jslock.h
new revision: 3.32; previous revision: 3.31
done

Checking in jslock.h;
/cvsroot/mozilla/js/src/jslock.h,v  <--  jslock.h
new revision: 3.27.22.2; previous revision: 3.27.22.1
done

May as well hit the 1.8.0 branch too.

/be

Brendan Eich [:brendan]

Assignee

Comment 69

•

18 years ago

Comment on attachment 246741 [details] [diff] [review]
hackaround

This is probably needed on the 1.8.0 branch for Firefox 1.5.0.x.  If anyone knows otherwise, please say so quickly.

/be

Attachment #246741 - Flags: approval1.8.0.9?

Daniel Veditz [:dveditz]

Comment 70

•

18 years ago

Comment on attachment 246741 [details] [diff] [review]
hackaround

approved for 1.8.0 branch, a=dveditz for drivers

Attachment #246741 - Flags: approval1.8.0.9? → approval1.8.0.9+

Brendan Eich [:brendan]

Assignee

Comment 71

•

18 years ago

(In reply to comment #64)
> (In reply to comment #63)
> > Built on and run on XP, I had forgotten to add all the plugins in the first run
> > and now that I have I'm now getting a crash in nsViewManager... seemingly when
> > flash content is init'd.
> > 
> 
> Rebuilt view and the layout manager, relinked browser and voila, no more crash.
>  This looks to be a good fix.

Michael, thanks for confirming -- would you please recite your PC's hardware details, particularly CPU type(s)?  I'd like to fix the compare-and-swap code, if it's sound but just has a bug (possibly in the asm).  If it's unsound, first time we've heard of it in a long time, and many, *many* 411/1-800-555-1212 VXML-based AVR calls (tellme.com uses this code, I believe).

/be

Keywords: fixed1.8.1.1

Michael Hovis

Comment 72

•

18 years ago

(In reply to comment #71)
> (In reply to comment #64)
> > (In reply to comment #63)
> > > Built on and run on XP, I had forgotten to add all the plugins in the first run
> > > and now that I have I'm now getting a crash in nsViewManager... seemingly when
> > > flash content is init'd.
> > > 
> > 
> > Rebuilt view and the layout manager, relinked browser and voila, no more crash.
> >  This looks to be a good fix.
> 
> Michael, thanks for confirming -- would you please recite your PC's hardware
> details, particularly CPU type(s)?  I'd like to fix the compare-and-swap code,
> if it's sound but just has a bug (possibly in the asm).  If it's unsound, first
> time we've heard of it in a long time, and many, *many* 411/1-800-555-1212
> VXML-based AVR calls (tellme.com uses this code, I believe).
> 
> /be
> 

Built an 'official' release build and things have been going great for the last week, here's the hardware/os:
OS Name	Microsoft Windows XP Professional
Version	5.1.2600 Service Pack 2 Build 2600
OS Manufacturer	Microsoft Corporation
System Name	***********
System Manufacturer	LENOVO
System Model	2613EAU
System Type	X86-based PC
(T2400 Intel CoreDuo)
Processor	x86 Family 6 Model 14 Stepping 8 GenuineIntel ~1828 Mhz
Processor	x86 Family 6 Model 14 Stepping 8 GenuineIntel ~1828 Mhz
BIOS Version/Date	LENOVO 79ETC1WW (2.01 ), 9/29/2006
SMBIOS Version	2.4
Windows Directory	C:\WINDOWS
System Directory	C:\WINDOWS\system32
Boot Device	\Device\HarddiskVolume1
Locale	United States
Hardware Abstraction Layer	Version = "5.1.2600.2180 (xpsp_sp2_rtm.040803-2158)"
User Name	*****************
Time Zone	Central Standard Time
Total Physical Memory	1,536.00 MB
Available Physical Memory	754.69 MB
Total Virtual Memory	2.00 GB
Available Virtual Memory	1.96 GB
Page File Space	3.35 GB
Page File	C:\pagefile.sys

Greg Yorke

Comment 73

•

18 years ago

This seems to be an Intel-specific problem. I upgraded from a G4 Powerbook to an Intel Core 2 Duo MacBook Pro yesterday. I used Apple's migration utility to copy everything over, including FF, plugins, and prefs, so the FF setup was identical. FF2 on the G4 never crashed; on the Intel it's frozen 5x today, always when opening a new window.

Hope this helps.

Brendan Eich [:brendan]

Assignee

Updated

•

18 years ago

Assignee: brendan → general

Status: ASSIGNED → NEW

Component: Phishing Protection → JavaScript Engine

Flags: review+

Product: Firefox → Core

QA Contact: phishing.protection → general

Target Milestone: --- → mozilla1.9alpha

Daniel Veditz [:dveditz]

Updated

•

18 years ago

Assignee: general → brendan

Reed Loden [:reed]

Updated

•

18 years ago

Status: NEW → ASSIGNED

Whiteboard: [Fx 2.0.0.1][potential fix in hand] → [Fx 2.0.0.1]

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 74

•

18 years ago

Defining JS_USE_ONLY_NSPR_LOCKS caused a slight jump in Ts on luna. Was this expected?

http://build-graphs.mozilla.org/graph/query.cgi?testname=startup&tbox=luna.mozilla.org&autoscale=1&days=7&avg=1&showpoint=2006:11:28:12:10:51,3487

Brian Crowder

Comment 75

•

18 years ago

Yeah, I think it's expected.  The thin locks are better optimized.  Hopefully another patch will come along to restore them to working properly and the preprocessor flag will go away again.

Daniel Veditz [:dveditz]

Comment 76

•

18 years ago

Comment on attachment 246741 [details] [diff] [review]
hackaround

Firefox 1.5.0.x reportedly works well so this isn't critical to get into 1.8.0.9. Time's up, push this request into the next release

Attachment #246741 - Flags: approval1.8.0.9+ → approval1.8.0.10?

Mike Moening

Comment 77

•

18 years ago

The sooner the better. Why wait?  I'm pretty sure I've see the same problems in the 1.5 version of the engine with heavily threaded apps on multi-CPU boxes.

Brendan Eich [:brendan]

Assignee

Comment 78

•

18 years ago

*** Bug 362877 has been marked as a duplicate of this bug. ***

(not reading, please use seth@sspitzer.org instead)

Comment 79

•

18 years ago

Attached file another sample from dan gillmor — Details

(not reading, please use seth@sspitzer.org instead)

Comment 80

•

18 years ago

Attached file one more from dan gillmor — Details

Brendan Eich [:brendan]

Assignee

Comment 81

•

18 years ago

Not critical, but we'd like to win back the lost performance by fixing how compare-and-swap is used.

/be

Severity: critical → major

Daniel Veditz [:dveditz]

Updated

•

18 years ago

Flags: blocking1.8.0.10?

Daniel Veditz [:dveditz]

Updated

•

18 years ago

Flags: blocking1.8.0.10? → blocking1.8.0.10+

Brendan Eich [:brendan]

Assignee

Comment 82

•

18 years ago

*** Bug 360167 has been marked as a duplicate of this bug. ***

Igor Balak

Comment 83

•

18 years ago

This bug still exist with Firefox 2.0.0.1 on Intel Mac OS X 10.4.8.

Brendan Eich [:brendan]

Assignee

Comment 84

•

18 years ago

(In reply to comment #83)
> This bug still exist with Firefox 2.0.0.1 on Intel Mac OS X 10.4.8.

Evidence?  Need thread stack backtraces to prove it's not a different hang.

/be

Jay Patel [:jay]

Comment 85

•

18 years ago

Igor:  Can you provide more details?  We would like to know if this is being seen by others... so any info to help us reproduce will be nice.  Thanks!

Igor Balak

Comment 86

•

18 years ago

I reproduced bug on library Intel Mac OS 10.4.8n system in Europe, but not able to run gdb thread stack trace. I was running Firefox 2.0.0.1 with Firetools Google toolbar 3.0 beta.
I will try to reproduce it again when I will be back and attached stack thread backtraces.

Jay Patel [:jay]

Comment 87

•

18 years ago

Comment on attachment 246741 [details] [diff] [review]
hackaround

Approved for 1.8.0 branch, a=jay for drivers.


Let's get this fix on the 1.8.0 branch since it appears to have fixed the major issue originally reported.

Igor is seeing a hang still, but we should probably log a new bug if evidence is found.

Attachment #246741 - Flags: approval1.8.0.10? → approval1.8.0.10+

Jay Patel [:jay]

Comment 88

•

18 years ago

brendan:  Can we get this landed on the 1.8.0 branch asap?  Want to get this for 1.5.0.10.  Thanks!

Brendan Eich [:brendan]

Assignee

Comment 89

•

18 years ago

Fixed on the 1.8.0 branch:

js/src/jslock.h rev 3.27

/be

Keywords: fixed1.8.0.10

(not reading, please use seth@sspitzer.org instead)

Comment 90

•

18 years ago

gen@mozilla.com has a contact who is also experience freeze.  Gen, can you check if he is using fx 2.0.0.1?

Tony Chung [:tchung]

Comment 91

•

17 years ago

is this fix a candidate for bob's javascript in-testsuite?   setting flag to ?

Flags: in-testsuite?

Bob Clary [:bc] (inactive)

Comment 92

•

17 years ago

(In reply to comment #91)

not really.

Flags: in-testsuite? → in-testsuite-

Tony Chung [:tchung]

Comment 93

•

17 years ago

hi Gen, are you still seeing this repro on your mac os 10.4.7 or 10.4.8?  can you try verifying it and seeing if the fix works for you?   

We in QA here haven been unable to reproduce it on 10.4.8, and dont have a machine running 10.4.7.

jrblier

Comment 94

•

17 years ago

Greetings,

Firefox did often freeze with all Firefox 2 alphas, betas and 2.0. However, this bug was resolved in 2.0.0.1 RCs and the Final release.

I'm running a MacBook Pro with 10.4.8 and I don't think the freezing is related specificly to 10.4.7.

I'm sure you can safely mark this bug as fixed.

jrblier

Comment 95

•

17 years ago

Well well well... I should have tested the new candidate build sooner.

Warning: This bug comes back loud and clear in 2.0.0.2 candidate build:
Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.2) Gecko/20070208 Firefox/2.0.0.2.

I had to backtrack to 2.0.0.1 loosing all my passwords in the process.

I'll try other 2.0.0.2 candidates hoping one of them will have this fixed.

Brendan Eich [:brendan]

Assignee

Comment 96

•

17 years ago

(In reply to comment #95)
> Well well well... I should have tested the new candidate build sooner.
> 
> Warning: This bug comes back loud and clear in 2.0.0.2 candidate build:
> Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.2) Gecko/20070208
> Firefox/2.0.0.2.

Nothing reverted the source code fix. See comment 83 through comment 87, and please report the hang (with any more info, esp. your addons inventory) in a new bug. As always with hangs, the only way to be sure is to catch a deadlock via thread stack backtraces, which won't be easy.

/be

jrblier

Comment 97

•

17 years ago

As I said, I've deleted 2.0.0.2 RC1 and reverted to 2.0.0.1.

I will wait for another 2.0.0.2 RC before trying again and creating a new bug with complete info, if necessary.

As for 2.0.0.1, I have 20 extensions enabled and I have not experienced a single hang since I'm using this version.

Mozilla/5.0 (Macintosh; U; Intel Mac OS X; fr; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1

I'm currently using a French 2.0.0.1 version, but I don't think it's language related.

Is there something specific I should look for in a newer 2.0.0.2 RC2?

Tony Chung [:tchung]

Comment 98

•

17 years ago

hi jacques,

Can you try loading a new profile with 2.0.0.2 RC1 on your mac osx 10.4.7, and repro it then?   That will help us take out the 20 extensions victim if that is the cause of the crash.  
To create multiple profiles, you will need to set MOZ_NO_REMOTE=1 in your mac environment.   Thanks, Tony

Brendan Eich [:brendan]

Assignee

Comment 99

•

17 years ago

Tony's comment is right on, and if you see no hangs on a 2.0.0.2 with a fresh profile, try adding the extensions you use. You could add the first ten, see whether you hang, and binary search those ten if so, else add the rest, confirm the hang, and try removing the last five, etc.

This really should go into a new bug, because the fix for this bug is still in place and there are lots of ways to hang Firefox. So please file one when you get a chance. Thanks,

/be

Gen Kanai [:gen]

Comment 100

•

17 years ago

Attached file Firefox 2 failure — Details

Another failure log.

Brendan Eich [:brendan]

Assignee

Comment 101

•

17 years ago

(In reply to comment #100)
> Created an attachment (id=254892) [details]
> Firefox 2 failure
> 
> Another failure log.

Hi Gen -- that has nothing to do with this bug.  Please file a new bug and get some layout folks (dbaron at least) to have a look. Thanks,

/be

Gen Kanai [:gen]

Comment 102

•

17 years ago

Brendan- thanks for the clarification.  I will do that.

jrblier

Comment 103

•

17 years ago

(In reply to comment #97)
> I will wait for another 2.0.0.2 RC before trying again and creating a new bug
> with complete info, if necessary.
> 

As mentionned, I've waited for 2.0.0.2 RC2. I have it in hand and I can report happily that RC2 works like a charm.

I can now confirm that both 2.0.0.1 and 2.0.0.2 versions for Mac Intel do not hang anymore. Can someone else verify this too?

BTW I was 21 extension loaded.

Cheers,

-J

jrblier

Comment 104

•

17 years ago

BTW I have 21 extensions loaded. [Sorry for the mistake]

- J

Tony Chung [:tchung]

Comment 105

•

17 years ago

hi Jacques, do you mind trying your same profile in 2.002 RC2 and rerun it back on 2.002 RC1?   I am curious why it would start working now since i dont think there has been any code change between these builds regarding this bug.  At least i dont think /be checked anything new in right?

Brendan Eich [:brendan]

Assignee

Comment 106

•

17 years ago

Nothing to do with JS lock optimizations changed since the hackaround fix for this bug went into the branches.  It's impossible to know what caused a hang, but if it happened in a branch build that had this bug's patch (attachment 246741 [details] [diff] [review]), then you should file a new bug. Adding comments here does not help get to the bottom of the (almost certainly) new and different problem.

If by some bad luck the patch was reverted, we'd end up back here when noticing. But I just updated my MOZILLA_1_8_BRANCH tree and still see

/* FIXME: bug 353962 hackaround */
#define JS_USE_ONLY_NSPR_LOCKS  1

in jslock.h.

/be

will-i-am

Comment 107

•

17 years ago

Hanging / crashing here too: see bug 368907 for trace etc. Sometimes up to 10x a day (*VERY annoying*). 

Maybe contact apple as it does send reports to apple, but does not send reports to mozilla?

will-i-am

Comment 108

•

17 years ago

Btw I started noticing the crashes around RC-1 or so... and it didn't go away with any new version. Using intel mac 10.4.9 now with 2.0.0.2.

Brendan Eich [:brendan]

Assignee

Comment 109

•

17 years ago

(In reply to comment #107)
> Hanging / crashing here too: see bug 368907 for trace etc. Sometimes up to 10x
> a day (*VERY annoying*). 

Thanks for reporting in bug 368907. Please don't bother commenting here. The stack trace there shows a bug completely unrelated to this one.

> Maybe contact apple as it does send reports to apple, but does not send reports
> to mozilla?

That's right -- IIRC we don't have talkback for mactel machines, but breakpad will support x86 macs.

/be

kevdig

Comment 110

•

17 years ago

(In reply to comment #73)
> This seems to be an Intel-specific problem. I upgraded from a G4 Powerbook to
> an Intel Core 2 Duo MacBook Pro yesterday. I used Apple's migration utility to
> copy everything over, including FF, plugins, and prefs, so the FF setup was
> identical. FF2 on the G4 never crashed; on the Intel it's frozen 5x today,
> always when opening a new window.
> 
> Hope this helps.
> 
The powerpc version of Mozilla/FF does not use a thin locks implementation in js. So if this is related to thin locks in js it will not show up on ppc.

I have spent the better part of today surfing with 2.0.0.1 with a ppc thin locks implementation and have had no problems. System:

PowerMac 8600, 750GX processor (250 MHz speed)
Linux 2.4.31
dialup connection

I'll bump the speed on the 750GX. Would it be of any help to see what happens on a GigE (7455 @ 1.1 GHz x2) and Linux 2.6.17.7? Or MacOSX Panther? What about x86 Linux?

Should I have seen some hangs? Has this been reproduced on anything other than one of those multi-core beasties?

At the risk of getting smacked, why did the "fix" disable thin locks for everything (i.e. Sparc, PPC (still waiting on 296878 or whatever it is), x86 Linux)?

Brendan Eich [:brendan]

Assignee

Comment 111

•

17 years ago

(In reply to comment #110)
> At the risk of getting smacked, why did the "fix" disable thin locks for
> everything (i.e. Sparc, PPC (still waiting on 296878 or whatever it is), x86
> Linux)?

The "fix" was explicitly labeled a hackaround, so let's call it that. This bug is still open.

Time was pressing and I wasn't willing to bet that the bug in jslock.c was really peculiar to x86 dual core machines. I suspect you are right that there is no such bug for PPC with a thinlock impl -- did you want to post a patch of the support you added?

/be

timeless

Comment 112

•

17 years ago

(In reply to comment #89)
> Fixed on the 1.8.0 branch:

js/src/jslock.h rev 3.27.26.1

kevdig

Comment 113

•

17 years ago

(In reply to comment #111)
 
> The "fix" was explicitly labeled a hackaround, so let's call it that. This bug
> is still open.
> 
Did not mean to step on any toes. I don't know what a hackaround is? I spent way longer than I should have trying to figure out why my compare and swap code was not showing in jslock.o.

> I suspect you are right that there
> is no such bug for PPC with a thinlock impl

Do you think it is possible that this only shows up with dohickeys (extensions & addons) that are only available with MacOS or Windows? (i.e. I don't use many extension and addons.)

Do you think it would be worthwhile to see if any of the MacOS users that see this problem might be able to try to reproduce it on a PPC MacOS box (especially a dual core G5)? Wouldn't that help suggest whether it is a problem with the compare and swap implementation on x86 or something with thinlocks and their use?

Wonder if it would be worth trying the 386 implementation? Assuming (mm? where is a spelling bee winner when you need one?) those instructions are even supported anymore?

> did you want to post a patch of
> the support you added?
> 
If I understand you correctly then it has been languishing in bug 296878 for 2 years.

> /be
>

Michael Moy

Comment 114

•

17 years ago

Mac OSX has a set of atomic compare and swap functions for 32-bits and 64-bits that probably implements what is desired here. There are variants for the normal
operations and there are also "barrier" operations.

From the Mac OSX doc:

These functions are thread and multiprocessor safe. For each function,
there is a version that does and anoother that does not incorporate a
memory barrier. Barriers strictly order memory access on a weakly-ordered weaklyordered
ordered architecture such as PPC. All loads and stores executed in
sequential program order before the barrier will complete before any load
or store executed after the barrier. On a uniprocessor, the barrier
operation is typically a nop. On a multiprocessor, the barrier can be
quite expensive.

Most code will want to use the barrier functions to insure that memory
shared between threads is properly synchronized. For example, if you
want to initialize a shared data structure and then atomically increment
a variable to indicate that the initialization is complete, then you MUST
use OSAtomicIncrement32Barrier() to ensure that the stores to your data
structure complete before the atomic add. Likewise, the consumer of that
data structure MUST use OSAtomicDecrement32Barrier(), in order to ensure
that their loads of the structure are not executed before the atomic
decrement. On the other hand, if you are simply incrementing a global
counter, then it is safe and potentially much faster to use OSAtomicIn-crement32(). OSAtomicIncrement32().
crement32(). If you are unsure which version to use, prefer the barrier
variants as they are safer.

http://developer.apple.com/documentation/Darwin/Reference/ManPages/man3/OSAtomicCompareAndSwap32.3.html

It sounds like the barrier form should be used for generic Mac OSX builds. These functions are callouts so there is the call overhead vs inline assembler but it may be better than whatever we're doing now.

Chad White

Comment 115

•

17 years ago

Here's another data point:  I turned off the phishing feature as well as the autodetect I had set up for my library's proxy server.  I immediately got less hanging up with FF 2.0.0.7.

I'm a novice just trying to figure out how to make FF work better on my Mac.  Thanks to everyone here for sharing your much greater expertise and helping me learn.

Chad White

Comment 116

•

17 years ago

Here's another data point:  I turned off the phishing feature as well as the autodetect I had set up for my library's proxy server.  I immediately got less hanging up with FF 2.0.0.7 on my MacBook (2 GHz Core 2 Duo; 1.6 MHz RAM).

I'm a novice just trying to figure out how to make FF work better on my Mac.  Thanks to everyone here for sharing your much greater expertise and helping me learn.

Brendan Eich [:brendan]

Assignee

Comment 121

•

17 years ago

(In reply to comment #114)
> Mac OSX has a set of atomic compare and swap functions for 32-bits and 64-bits
> that probably implements what is desired here. There are variants for the
> normal
> operations and there are also "barrier" operations.

Thanks for the pointers, but I'd like to know why our circa-1998 x86 asm coding in jslock.c, to implement compare-and-swap correctly on multiprocessors, is incorrect now on new Intel multicores.

Someone with detailed uarch knowledge could probably eyeball our code and say. I'm years out of date and pretty busy elsewhere, but I do want to "fix" this bug (should have filed a subordinate bug for the brute-force disabling change).

So, help diagnosing the actual problem in our jslock.c code on modern multicore x86 systems is appreciated.

/be

Keywords: helpwanted

Brendan Eich [:brendan]

Assignee

Comment 122

•

17 years ago

mmoy, could this bug depend on bug 331545 (if that bug were fixed, would it cure what ails us on Windows)? Hoping you'll take this bug, and let me know what you need to get that bug (331545) moving.

/be

Robert Sayre

Comment 123

•

17 years ago

Attached patch use OSAtomic.h — Details — Splinter Review

This doesn't turn these functions back on in the header file, since comment 121 isn't addressed. I would like to get this reenabled for the beta, since the JSON encoder/decoder can bump pthread_mutex_(un)lock to the top of a profile through js_AtomizeString. See bug 410890 comment 3 for data.

Attachment #296600 - Flags: review?(brendan)

Brendan Eich [:brendan]

Assignee

Comment 124

•

17 years ago

Comment on attachment 296600 [details] [diff] [review]
use OSAtomic.h

r/a=me if this inlines, otherwise reach for a macro. Thanks!

/be

Attachment #296600 - Flags: review?(brendan)

Attachment #296600 - Flags: review+

Attachment #296600 - Flags: approval1.9+

Brendan Eich [:brendan]

Assignee

Comment 125

•

17 years ago

From private communication with mmoy, I have hopes that the hangs here are due to a bug in early Core 2 Duo processors covered by published Intel errata -- that would be good to confirm. If so, do the OS CAS routines cope?

/be

Yuxuan Wang

Reporter

Comment 126

•

17 years ago

(In reply to comment #125)
> From private communication with mmoy, I have hopes that the hangs here are due
> to a bug in early Core 2 Duo processors covered by published Intel errata --
> that would be good to confirm. If so, do the OS CAS routines cope?
> 
> /be
> 

I'm using Core Duo but not Core 2 Duo, and I've experienced this bug

Bill Northlich

Comment 127

•

17 years ago

I also have a core duo and experience this a fair amount

Brendan Eich [:brendan]

Assignee

Comment 128

•

17 years ago

(In reply to comment #126)
> (In reply to comment #125)
> > From private communication with mmoy, I have hopes that the hangs here are due
> > to a bug in early Core 2 Duo processors covered by published Intel errata --
> > that would be good to confirm. If so, do the OS CAS routines cope?
> > 
> > /be
> > 
> 
> I'm using Core Duo but not Core 2 Duo, and I've experienced this bug
> 

(In reply to comment #127)
> I also have a core duo and experience this a fair amount
> 


No, you don't. I turned off compare-and-swap over a year ago. See comment 68. If you see hangs with fresh 2.0.0.x and 3 beta or nightly builds, you are seeing some other bug. Please try to get a stack (luser, is there a way to force a stack via breakpad?), file a new bug if you can get some information other than "hung". Do not bother this bug.

/be

Robert Sayre

Comment 129

•

17 years ago

(In reply to comment #124)
> (From update of attachment 296600 [details] [diff] [review])
> r/a=me if this inlines, otherwise reach for a macro. Thanks!

It inlines at -Os, and it's checked in.

Brendan Eich [:brendan]

Assignee

Comment 130

•

17 years ago

(In reply to comment #128)
> Please try to get a stack (luser, is there a way to
> force a stack via breakpad?)

Argh, I forgot to cc: Ted (AKA "luser" on IRC -- my calling his nick in the last comment could be misread as an abusive remark -- sorry about that!).

/be

(not currently active) Ted Mielczarek

Comment 131

•

17 years ago

You may be able to kill the process with the right signal, but I'm not terribly sure about that on Mac.

Mike Schroepfer

Updated

•

17 years ago

Blocks: 331545

Michael Moy

Comment 132

•

17 years ago

(In reply to comment #125)
> From private communication with mmoy, I have hopes that the hangs here are due
> to a bug in early Core 2 Duo processors covered by published Intel errata --
> that would be good to confirm. If so, do the OS CAS routines cope?

It would be good to identify users that can reproduce this problem and exactly
what hardware and operating system they use. If this is multicore cache invalidation problem or other synchronization problem (either in hardware or software), then setting processor affinity should be a workaround. If setting processor affinity does not solve the problem, then I would conclude that it isn't a multicore problem. On Windows, it is possible to set processor affinity for a running process using the task manager. The process is described at http://articles.techrepublic.com.com/5100-10877-6168870.html

It can be set permanently using the imagecfg.exe tool from Microsoft.

Mac/OSX doesn't have a mechanism to set processor affinity until Leopard and it's an API from what I read. I do not know if there's a tool that you can just run on a process to set affinity. The Apple CHUD tools have a component called CPU pref that will allow the user to turn off one core on the fly. The CHUD tools are located on the installation disks that come with Macs. I think that they can also be downloaded from the Apple Developer Center.

It would be interesting to see if this problem goes away with either processor affinity or with turning a core off on a system. The former is a reasonable though pain-in-the neck workaround while the latter really isn't.

Brendan Eich [:brendan]

Assignee

Comment 133

•

17 years ago

Again, the patch to turn off compare-and-swap landed over a year ago, so to test this bug you have to make your own build re-enabling it (comment out jslock.h line 190).

I have the CHUD CPU pref tool in my menubar and use it often, e.g., to get gdb to work better on firefox.

Comment 106 through comment 111 are worth re-reading, along with comment 39. We have good testimony that this bug only bit dual core Intel machines, and that its symptoms (not other hang bugs of course) went away when I turned off thin locks for all platforms on trunk and 1.8 branch.

Short of turning thin locks back on for Firefox 3 and making sure the hang gets reported unambiguously, I'm not flush with good ideas. If this is a known Intel bug I'd love to read the errata.

/be

Michael Moy

Comment 134

•

17 years ago

The one that looks suspicious to me is AH39 at http://download.intel.com/design/mobile/SPECUPDT/31407918.pdf

The Windows builds at Vector64.com use the intrinsic that translate to compare and swap. I can do a release build on Mac OSX with the old processing for those unable to do their own builds. The Mac OSX builds at Vector64.com use the OSX system call for interlock compare and exchange.

Michael Moy

Comment 135

•

17 years ago

I went back and reread the thread and it's clear that there were more Core Duo users with the problem than Core 2 Duo users. There is an errata that sounds
like it could cause this problem in Core Duo as well: AE44: Logical Processors
May Not Detect Write Back Memory Writes.

Multiprocessor systems may use polling of memory semaphores to sychronize
software activity. Because of this erratum, if a logical processor is
polling a WB memory location while it is being updated by another logical
processor, the update may not be detected. Implication: system may livelock
due to polling loop and undetected semaphore change.

Brendan Eich [:brendan]

Assignee

Updated

•

16 years ago

Blocks: 312238

Brendan Eich [:brendan]

Assignee

Comment 136

•

16 years ago

Attached patch the bug fix! — Details — Splinter Review

This bug was caused by the patch for bug 312238 -- early in the evolution of that bug's patch, a critical js_CompareAndSwap was removed (attachment 213358 [details]). Shame on me for not catching this during code review. This bug was found by Arjan Van De Ven of Intel, to whom I am profoundly grateful. A great day!

/be

Attachment #300723 - Flags: review+

Attachment #300723 - Flags: approval1.9b3?

Attachment #300723 - Flags: approval1.9+

Brendan Eich [:brendan]

Assignee

Updated

•

16 years ago

Flags: blocking1.9+

Mike Beltzner [:beltzner, not reading bugmail]

Comment 137

•

16 years ago

Comment on attachment 300723 [details] [diff] [review]
the bug fix!

a=beltzner

Attachment #300723 - Flags: approval1.9b3? → approval1.9b3+

Brendan Eich [:brendan]

Assignee

Comment 138

•

16 years ago

Fixed:

js/src/jslock.c 3.74
js/src/jslock.h 3.36

We should fix on the 1.8 branch, too.

/be

Status: ASSIGNED → RESOLVED

Closed: 16 years ago

Flags: blocking1.8.1.13?

Resolution: --- → FIXED

Brendan Eich [:brendan]

Assignee

Comment 139

•

16 years ago

Comment on attachment 300723 [details] [diff] [review]
the bug fix!

Patch applies cleanly to the 1.8 branch.

/be

Attachment #300723 - Flags: approval1.8.1.13?

Arjan van de Ven

Comment 140

•

16 years ago

For those interested in the detailed race scenario (and I hope that bugzilla leaves the ascii art in tact):


Scenario; two CPUs, CPU0 and CPU1. Two threads, with say a "me" of 4 and 6 respectively, 4 runs on CPU0, 6 on CPU1.
Ascii art time flow (wide lines, sorry about that)

Time   fl->owner 		CPU0 / thread 4		CPU1 / thread 6		Comment
0	 0                Entering js_Lock		..  				Start situation, lock is not held by anyone
1	 0			is_CompareAndSwap(0,4)	..				This succeeds, "4" gets the lock
2	 4			return from js_Lock	..				all is well, "4" has the lock
3	 4			<actual work>		Entering js_Lock		"6" now tries to get the lock
4	 4			<actual work>		is_CompareAndSwap(0,6)	Lock is contended, CAS fails; going to slowpath
5	 4			<actual work>		entering js_Enqueue	slowpath entered, taking global lock
6	 4			<actual work>		o = 4; n = 5		(lines 978 and 979 of jslock.c)
7	 4			entering js_Unlock	..				"4" is done and is going to unlock the lock
8	 4			if (owner == me) 		..				line 1030 of jslock.c... this returns true
9	 4			..				jsCompareAndSwap(4,5)	ln 980: CompareAndSwap done, wait flag set
10	 5			owner = 0			..				line 1031 of jslock.c: we set the lock to 0
11	 0			..				calling js_SuspendThread	Going to sleep until woken		
12	 0                return;			..				line 1032 of jslock.c: we're done 
13 to eternity		..				waiting for a waker that never comes since the lock is free and nobody
								knows or cares that anyone is waiting for the lock

Arjan van de Ven

Comment 141

•

16 years ago

Attached file Detailed description of the race — Details

Ok that didn't work very well; adding as attachement instead

Brendan Eich [:brendan]

Assignee

Comment 142

•

16 years ago

MikeM: you want the last patch in this bug.

Bob: we should make sure this gets into the js1.7 release (which is not out yet, IIRC -- please correct if wrong). If you are feeling generous, it could go into 1.6.x and probably help some folks, but it's up to you.

/be

Bob Clary [:bc] (inactive)

Comment 143

•

16 years ago

(In reply to comment #142)

1.7.0 has been released but we can do a 1.7.1 shortly. I'll add this to the 1.6.1 blocker list.

Blocks: 371572

Flags: blocking1.8.0.15?

kevdig

Comment 144

•

16 years ago

(In reply to comment #136)
> Created an attachment (id=300723) [details]
> the bug fix!
> 
> This bug was caused by the patch for bug 312238 -- early in the evolution of
> that bug's patch, a critical js_CompareAndSwap was removed (attachment 213358 [details]).
> Shame on me for not catching this during code review. This bug was found by
> Arjan Van De Ven of Intel, to whom I am profoundly grateful. A great day!
> 
> /be
> 

So this was not related at all to those duo beasties, right? I should (or could?) have bumped into this with any of my various x86 builds or the thin lock enabled ppc builds. All on Linux. Any thoughts on why this wasn't seen more widely (more cpus and/or OSes)? I just checked my js_Unlock() code and it has the bad code. Think my being on dial-up could have changed timing enough to prevent me from seeing this?

kevin

Brendan Eich [:brendan]

Assignee

Comment 145

•

16 years ago

(In reply to comment #144)
> So this was not related at all to those duo beasties, right? I should (or
> could?) have bumped into this with any of my various x86 builds

No, not if you did not re-enable thin locks by removing the #define from jslock.h.

> or the thin
> lock enabled ppc builds.

Yes, if you enabled thin locks after the patch for bug 312238 went in, you could see the bug's symptoms :-/.

> All on Linux. Any thoughts on why this wasn't seen
> more widely (more cpus and/or OSes)? I just checked my js_Unlock() code and it
> has the bad code. Think my being on dial-up could have changed timing enough to
> prevent me from seeing this?

Again, for over a year we have simply disabled thin locks to work around the undiagnosed bug.

/be
> 
> kevin
>

Brendan Eich [:brendan]

Assignee

Comment 146

•

16 years ago

If you really had thinlocks enabled in your ppc builds for a while, then I don't know why you didn't see hangs (or more hangs; perhaps a few went by without being analyzed?). The bug introduced by the patch for bug 312238 is a race condition, with unknown likelihood of bad outcomes.

Since we don't share JS objects cross-thread in Firefox, the only source of contention I know of is the JSRuntime's atom table lock (see comment 39). The anti-phishing feature uses cross-thread XPCOM proxying and JS on a background thread (again with no object sharing AFAIK). If you never used anti-phishing, had it disabled, then you probably wouldn't see the symptom.

/be

Arjan van de Ven

Comment 147

•

16 years ago

As to why Duo's saw this more often; that I can explain. It has to do with cpu caches and cross-cpu cache line bouncing. But first lets start with some lottery/statistics theory ;-)

This is a very tight race, it's on a few instructions in size.
Now, if you had a 1 CPU system, you can still hit this, due to the OS scheduler preempting you. The probability that this happens in the race window (2 instructions) in thread one is about one in 10.000.000 (assume 2Ghz cpu and 10msec timeslice). But for this race to happen, the 2nd thread also had to be preempted exactly in the right spot, so another one in 10.000.000 chance. (this is finger-in-the-air statistics, if you nitpick hard you may be able to argue a zero away or a zero extra, doesn't matter).
So the chance this happens on a UP system is about one in 100.000.000.000.000... or in other words, its more likely that you win the lottery two weeks in a row than hit this bug.

Now on a multi cpu system (or a dual core where each core has its own cache), the rules change a little. First of all, things run in parallel so no need to be preempted just right. Second, the timing rules for the race change. Because in the race scenario, you're bouncing cachelines back and forth between the 2 cpus, the window is suddenly dozens to hundreds of cycles (depending on various hardware things). It is still very unlikely that you'll see this on such a system, and that has to do with how cpu caches work. For the race to happen, cpu 0 has the cacheline with the lock in step 14, and needs to have it in step 16. While cpu 1 wants it for step 15 to happen. In a SMP situation, this effectively means that at the time of 15/16, the cache of cpu 0 has a request for the cacheline from cpu 1, but also a request from it's own cpu (I sweep these two together because the other-cpu request is a slower one). Now... caches are generally set up to give preferences to their own cpu in such cases (and the link to the own cpu is orders of magnitude faster anyway so this will happen naturally even without explicit choices)... which means that step 15 generally (but not always of course!) is unlikely to be timed just right for the race. In other words, the fundamental unfairness of the caches (bias to local) saves mozillas bacon here.

Now, enter Core2 DUO... here the 2 cores of the cpu share the cache... which means that the "bias to local" that saved mozilla above suddenly doesn't matter. The cache now has 2 requests, both from local cpus (and both requests are fast, not one slow as above), and who gets it is a coinflip, not a strong bias to cpu 0 -> race is suddenly practical to happen.
This isn't a cpu bug; in fact it's just caused because the cpus can do very fast cache sharing... so the speed of the cpu works against you in this race ;(

Daniel Veditz [:dveditz]

Comment 148

•

16 years ago

This bug is a mess. After fixing one problem a year ago, please file new bugs rather than resurrecting old ones. Our creaky branch-management flags and keywords simply can't handle it.

(In reply to comment #138)
> We should fix on the 1.8 branch, too.

Why? according to comment 136 this is caused by bug 312238 which does not appear to have been fixed on the 1.8 branch. Also, who wrote or reviewed the patch? From the bug it looks like a one-man affair which makes us nervous on the stable branch.

Brendan Eich [:brendan]

Assignee

Comment 149

•

16 years ago

(In reply to comment #148)
> This bug is a mess.

It's worse than you think -- the bug where js_Unlock used an unlocked compare and store instead of js_CompareAndSwap was injected into the 1.8 branch, with this checkin:

revision 3.55.20.1
date: 2006/07/07 02:12:02;  author: mrbkap%gmail.com;  state: Exp;  lines: +63 -30
JS1.7 landing. r=brendan a=schrep

I found this using cvs annotate -rMOZILLA_1_8_BRANCH jslock.c | less. You can confirm using cvs diff.

> After fixing one problem a year ago, please file new bugs
> rather than resurrecting old ones. Our creaky branch-management flags and
> keywords simply can't handle it.

You're right, this bug has been mismanaged a long time. Sorry about that -- I use it as a cautionary tale now, when others think about rolling multiple fixes in under one bug.

> (In reply to comment #138)
> > We should fix on the 1.8 branch, too.
> 
> Why? according to comment 136 this is caused by bug 312238 which does not
> appear to have been fixed on the 1.8 branch.

See above -- bugzilla is not the source of ultimate truth, the cvs repo is.

> Also, who wrote or reviewed the
> patch? From the bug it looks like a one-man affair which makes us nervous on
> the stable branch.

C'mon, that's silly. The patch undoes an incorrect change and the #define that worked around that chnage; it's well understood. No superstitious invocation of crowd wisdom, please.

/be

Daniel Veditz [:dveditz]

Comment 150

•

16 years ago

Comment on attachment 300723 [details] [diff] [review]
the bug fix!

(In reply to comment #149)
> C'mon, that's silly.

To be honest what makes me nervous is that appearing to play favorites for the MoCo CTO might breed disrespect for the tree rules. But on inspection this is a back-out to previously good code, and I can wave sr=dveditz around so I guess it isn't going to be a problem.

approved for 1.8.1.13, a=dveditz

Attachment #300723 - Flags: superreview+

Attachment #300723 - Flags: approval1.8.1.13?

Attachment #300723 - Flags: approval1.8.1.13+

Brendan Eich [:brendan]

Assignee

Comment 151

•

16 years ago

Dan: I think this is strange, and I'm saying so in public. No one invoked "MoCo" or my title (which reporters seem to forget, returning me to "Software Engineer", my starting job position out of grad school -- hey, I'm not proud -- it's what I put on my tax return).

The patch was from Arjan, not me -- he found the problem without knowing the bug history. I don't see why you should disrespect him to avoid "breed[ing] direspect" for tree rules that we are all following. Where is it written that branch fixes must be the product of a committee?

/be

Samuel Sidler (old account; do not CC)

Comment 152

•

16 years ago

(In reply to comment #151)
> Dan: I think this is strange, and I'm saying so in public. No one invoked
> "MoCo" or my title (which reporters seem to forget, returning me to "Software
> Engineer", my starting job position out of grad school -- hey, I'm not proud --
> it's what I put on my tax return).

I might have misunderstood, but I think Dan just meant that this was going outside of how we normally treat branch patches; that is, requiring a new bug be filed so we can track the progress better. Bugzilla wasn't made to handle multiple patches landing on multiple branches and multiple times.

Personally, my initial gut feeling was to not approve this patch in this bug but ask that a new bug be filed to help lessen the confusion. Just because we've done it previously in this bug doesn't mean we should keep doing behavior we know is wrong and hard to follow.

> The patch was from Arjan, not me -- he found the problem without knowing the
> bug history. I don't see why you should disrespect him to avoid "breed[ing]
> direspect" for tree rules that we are all following. Where is it written that
> branch fixes must be the product of a committee?

Again, it could just be the mess in this bug (which is why a new bug would be great in the future), but...

With apologies to Arjan, it's not clear in this bug that the patch was from him. I can track it back to bug 331545, but apparently it's not really a fix for that bug and was instead attached here (instead of a new bug getting filed). There was no comment saying that's where it came from though and, 147 comments in, it's hard to tell that's what happened.

The triage team has a lot of bugs to go through and we have to make quick judgement calls on both the bugs and their patches. Without clear explanations (and clear bugs!) it's hard to do that successfully. There's nothing "written that branch fixes must be the product of a committee" nor do I think that's what has happened here. But some form of driving has always (in my time here) occurred for stability/security branches and, again, it's a lot of bugs and a lot of work to make determines of risk vs reward.

Anyway, my two cents. Regardless, let's get the patch checked in. :)

Samuel Sidler (old account; do not CC)

Comment 153

•

16 years ago

(In reply to comment #152)
> I might have misunderstood, but I think Dan just meant that this was going
> outside of how we normally treat branch patches; that is, requiring a new bug
> be filed so we can track the progress better.

Excuse me... I meant to say... "outside of how we normally treat *multiple* branch patches in the same bug."

Brendan Eich [:brendan]

Assignee

Comment 154

•

16 years ago

(In reply to comment #152)
> (In reply to comment #151)
> > Dan: I think this is strange, and I'm saying so in public. No one invoked
> > "MoCo" or my title (which reporters seem to forget, returning me to "Software
> > Engineer", my starting job position out of grad school -- hey, I'm not proud --
> > it's what I put on my tax return).
> 
> I might have misunderstood, but I think Dan just meant that this was going
> outside of how we normally treat branch patches; that is, requiring a new bug
> be filed so we can track the progress better.

That's fine but it has *nothing* to do with the job title on my business card, and I never invoked that title. Hence my b.s. detector still going off.

> Bugzilla wasn't made to handle
> multiple patches landing on multiple branches and multiple times.

You're right, and I said so myself (comment 149). It's fine to ask for a new bug, although that is mostly pro-forma overhead without actual added value that traces the chain of cause and effect to the actual malfunction being fixed -- tracing that would take you back into the swamp of the original, overloaded regressing bug (bug 312238), which took a composite patch *only part of which contained the regression*; and then to the overloaded tracking bug (this bug).

Yes, it's confusing. No, adding another bug doesn't really help except by summarizing in a fresh page, instead of adding comment 154 and counting. If that kind of summary bug would help, I will file it. I hope to avoid it.

> Personally, my initial gut feeling was to not approve this patch in this bug
> but ask that a new bug be filed to help lessen the confusion.

Your feelings are sound, young Skywalker :-P.

> Just because
> we've done it previously in this bug doesn't mean we should keep doing behavior
> we know is wrong and hard to follow.

Agreed again (did I write something about using this bug as a cautionary tale? I don't need it to be any more cautionary ;-).

> > The patch was from Arjan, not me -- he found the problem without knowing the
> > bug history. I don't see why you should disrespect him to avoid "breed[ing]
> > direspect" for tree rules that we are all following. Where is it written that
> > branch fixes must be the product of a committee?
> 
> Again, it could just be the mess in this bug (which is why a new bug would be
> great in the future), but...
> 
> With apologies to Arjan, it's not clear in this bug that the patch was from
> him. I can track it back to bug 331545,

That bug 331545 has nothing to do with the problem Arjan fixed. No one is nominating that bug's patch for the branch.

> but apparently it's not really a fix
> for that bug and was instead attached here (instead of a new bug getting
> filed).

This bug was about a specific hang, which Arjan's patch precisely fixes. I do not agree that a new bug is needed to track that self-same hang.

The error here lay in patching this bug with the hackaround that #defined JS_USE_ONLY_NSPR_LOCKS (attachment 246741 [details] [diff] [review] "hackaround"), then leaving this bug open. That was the time to close this bug and file a followup bug, or (better) to file a dependent bug to hold the hackaround patch. That was my fault, and once again: mea culpa.

> There was no comment saying that's where it came from though and, 147
> comments in, it's hard to tell that's what happened.

Please re-read comment 136, which completely diagrams the chain of events including a link to the patch that introduced the hang.

> The triage team has a lot of bugs to go through and we have to make quick
> judgement calls on both the bugs and their patches.

I know, and again, my apologies. This *is* confusing, but comment 136 is the summary you want.

> Without clear explanations
> (and clear bugs!) it's hard to do that successfully. There's nothing "written
> that branch fixes must be the product of a committee" nor do I think that's
> what has happened here.

That's a slight exaggeration of what Dan seemed to be arguing for in comment 148 (by arguing against a "one-man affair"). Really, there have been three JS hackers involved in the history of this bug (Feng, Igor, myself), and then Arjan independently identifying the flaw in the code, followed by my mangled bug body forensics.

> But some form of driving has always (in my time here)
> occurred for stability/security branches and, again, it's a lot of bugs and a
> lot of work to make determines of risk vs reward.

If the fundamentals here, which involve hairy things like multicore race conditions, are not easy for anyone to grok, then there's not a lot that new bug window-dressing can do. I would be unhappy if triagers were judging surface and not substance.

I'd rather have an exchange that points out the messy etiology, even if it rehashes the bug mismanagement, than let either my bug management mistakes or (what look to me like) cargo cult or strangly Moco-inside-baseball theories ("CTO"? Why is that not an ad-hominem argument? I'll resign if someone is coloring their judgment of my work based on a damn job title) of 1.8 branch bug risk management slide.

/be

Brendan Eich [:brendan]

Assignee

Comment 155

•

16 years ago

Fixed on the 1.8 branch -- thanks for approval in the face of messy, overloaded bugs (one bug per patch, I've learned my lesson already).

js/src/jslock.c 3.55.20.4
js/src/jslock.h 3.27.22.3

/be

Keywords: helpwanted → fixed1.8.1.13

Daniel Veditz [:dveditz]

Comment 156

•

16 years ago

> The patch was from Arjan, not me

Thanks, that's the answer to my original question.

> tree rules that we are all following.

It was not clear that was the case, at a hurried glance it looked like a
self-review. I'm sorry we're not able to spend much time trying to understand
each bug, and this bug is already confusing.

Keywords: fixed1.8.1.13 → helpwanted

Reed Loden [:reed]

Updated

•

16 years ago

Keywords: helpwanted → fixed1.8.1.13

Christopher Aillon (sabbatical, not receiving bugmail)

Comment 157

•

16 years ago

Bob: which patch is still wanted for 1.8.0.15?  I see something got landed...  Are you willing to champion the rest in?

Brendan Eich [:brendan]

Assignee

Comment 158

•

16 years ago

Attached patch 1.8.0 branch patch — Details — Splinter Review

For some reason the only thing that landed on the 1.8.0 (Firefox 1.5) branch was the #define JS_USE_ONLY_NSPR_LOCKS 1 hackaround to jslock.h. js_Unlock in jslock.c on the 1.8.0 branch correctly uses js_CompareAndSwap to swap 0 for me in tl->owner. So this is all you need, caillon.

/be

Attachment #304263 - Flags: review+

Attachment #304263 - Flags: approval1.8.0.15?

Christopher Aillon (sabbatical, not receiving bugmail)

Comment 159

•

16 years ago

Comment on attachment 304263 [details] [diff] [review]
1.8.0 branch patch

Thanks for the quick followup, brendan!  a=caillon for 1.8.0.15

Attachment #304263 - Flags: approval1.8.0.15? → approval1.8.0.15+

Brendan Eich [:brendan]

Assignee

Comment 160

•

16 years ago

Fixed on the MOZILLA_1_8_0_BRANCH:

js/src/jslock.h 3.27.26.2

/be

Keywords: fixed1.8.0.10 → fixed1.8.0.15

Samuel Sidler (old account; do not CC)

Updated

•

16 years ago

Flags: blocking1.8.1.13? → blocking1.8.1.13+

Alexander Sack

Updated

•

16 years ago

Flags: blocking1.8.0.15? → blocking1.8.0.15-

Alexander Sack

Updated

•

16 years ago

Flags: blocking1.8.0.15- → blocking1.8.0.15+

A Sampler.app trace 18 years ago Yuxuan Wang 200.96 KB, application/octet-stream		Details
shark time profile taken after ff hung 18 years ago Bill Northlich 169.55 KB, application/octet-stream		Details
The text version of my Sampler.app output 18 years ago Yuxuan Wang 10.36 KB, text/plain		Details
A report generated by Shark.app 18 years ago Yuxuan Wang 745 bytes, text/plain		Details
I deadlock after I disabled sessionstore 18 years ago Yuxuan Wang 5.63 KB, text/plain		Details
gdb attached, and execute "fin" command 18 years ago Yuxuan Wang 2.63 KB, text/plain		Details
backtrace of all threads 18 years ago Tony Chang (Google) 7.11 KB, text/plain		Details
another backtrace 18 years ago Tony Chang (Google) 7.56 KB, text/plain		Details
same as last bt with "info threads" included 18 years ago Tony Chang (Google) 8.36 KB, text/plain		Details
A true deadlock in gdb 18 years ago Yuxuan Wang 7.99 KB, text/plain		Details
hangs on js_GC function 18 years ago Yuxuan Wang 9.69 KB, text/plain		Details
hackaround 18 years ago Brendan Eich [:brendan] 1.19 KB, patch	dveditz : approval1.8.1.1+ jay : approval1.8.0.10+	Details \| Diff \| Splinter Review
another sample from dan gillmor 18 years ago (not reading, please use seth@sspitzer.org instead) 8.91 KB, text/plain		Details
one more from dan gillmor 18 years ago (not reading, please use seth@sspitzer.org instead) 6.58 KB, text/plain		Details
Firefox 2 failure 17 years ago Gen Kanai [:gen] 39.35 KB, text/plain		Details
use OSAtomic.h 17 years ago Robert Sayre 1.61 KB, patch	brendan : review+ brendan : approval1.9+	Details \| Diff \| Splinter Review
the bug fix! 16 years ago Brendan Eich [:brendan] 2.20 KB, patch	brendan : review+ dveditz : superreview+ dveditz : approval1.8.1.13+ beltzner : approval1.9b3+ brendan : approval1.9+	Details \| Diff \| Splinter Review
Detailed description of the race 16 years ago Arjan van de Ven 1.40 KB, text/plain		Details
1.8.0 branch patch 16 years ago Brendan Eich [:brendan] 1.13 KB, patch	brendan : review+ caillon : approval1.8.0.next+	Details \| Diff \| Splinter Review