Bug 548796 - nsIWifiMonitor causes deadlocks in OS X 10.6.x
Status: RESOLVED FIXED
Product: Core
Classification: Components
Component: Widget: Cocoa
Version: unspecified
Platform: x86 Mac OS X
Importance: -- normal
Assigned To: Josh Aas
Reported: 2010-02-26 04:15 PST by electronic Max
Modified: 2010-03-22 17:34 PDT
beta1+
needed
.2-fixed
.9-fixed


Attachments
fix v1.0 (2.05 KB, patch)
2010-02-26 10:43 PST, Josh Aas
smichaud: review+
fix v1.1 (2.55 KB, patch)
2010-02-26 11:52 PST, Josh Aas
smichaud: review+
jduell.mcbugs: review+
mbeltzner: approval1.9.2.2+
1.9.1 branch fix (4.44 KB, patch)
2010-03-08 21:43 PST, Josh Aas
mbeltzner: approval1.9.1.9+

Description electronic Max 2010-02-26 04:15:09 PST
User-Agent:       Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6
Build Identifier: Firefox/3.5.5 and Firefox/3.6

This mysterious problem has been plaguing my plug-in on MacBooks and MacBook Pros ever since the release of Snow Leopard (OS X 10.6.0).

I use nsIWifiMonitor in my extension to allow the plug-in to identify when the user returns home. 

After an extended period of use, if and only if I am using WiFi with my extension enabled, the machine enters a state where any application attempting to open a new network socket deadlocks (beachballs) to a halt.  Force-Quit does not even work in this state (and I have to reboot by holding the power button).  This ONLY happens when Firefox is running and when my plug-in is loaded with Wifi scanning turned on.

The problem feels like a race condition, and thus is not directly (deterministically) reproducible using a sequence of steps.  

BUT, it occurs reliably, usually 3-5 times during a 12 hour period. I have isolated the fault to Firefox and the wifi monitor by carefully trying completely separate machines (including a brand new Macbook Pro without any preinstalled software from the store).

It seems "wrong" that a bug in Firefox could cause the whole machine to freeze.  Thus I think it's an interaction between Firefox and a bug in Snow Leopard that is violating process isolation.  It "feels" like Firefox (or something) is not successfully leaving a critical section of code pertaining to the network (sys call?), so that when other apps try to enter it, they queue up behind it, resulting in a deadlock.

To reproduce:
  Get a Macbook or Macbook Pro running 10.6.0 or newer
  Set up an extension with an nsIWifiMonitor scanning.  
  Use actively for a few hours.  

Please advise.

Reproducible: Sometimes

Steps to Reproduce:
I have detailed the steps t
Comment 1 Boris Zbarsky [:bz] (still a bit busy) 2010-02-26 07:13:47 PST
Sounds like something is leaking processes or sockets or some other such kernel resource...

The relevant 10.6 code seems to be http://mxr.mozilla.org/mozilla-central/source/netwerk/wifi/src/osx_corewlan.mm#60 for what it's worth.  My objc is not good enough to see if there's an obvious issue there.
Comment 2 Josh Aas 2010-02-26 10:34:16 PST
That code leaks just about everything, and apparently I reviewed it at one point. I can only assume I totally forgot to look at the file, ew.
Comment 3 Steven Michaud [:smichaud] (Retired) 2010-02-26 10:39:31 PST
Yes, it creates an autorelease pool without releasing it :-)
Comment 4 Josh Aas 2010-02-26 10:43:33 PST
Created attachment 429156
fix v1.0
Comment 5 Josh Aas 2010-02-26 10:44:57 PST
We don't release the autorelease pool or the bundle.
Comment 6 Josh Aas 2010-02-26 10:50:06 PST
I suggest we block on 1.9.1, 1.9.2, and 1.9.3. This is bad.
Comment 7 Josh Aas 2010-02-26 11:52:58 PST
Created attachment 429171
fix v1.1

This is a more paranoid patch which should release the pool even if the main corewlan code throws an exception.
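
The patch itself is not reproduced in this thread, so the following is only a minimal sketch of the idea, written in plain C++ rather than the Objective-C++ of the real file. All names here (`poolCreate`, `poolRelease`, `ScopedPool`, `doScan`, `scanLeaky`, `scanGuarded`) are hypothetical stand-ins: a counter plays the role of the autorelease pool so that a leak is visible, and `doScan` plays the role of the corewlan call that may throw.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for an autorelease pool: a counter of pools that
// have been created but not yet released.
static int g_livePools = 0;
void poolCreate()  { ++g_livePools; }
void poolRelease() { --g_livePools; }

// Hypothetical stand-in for the corewlan scan; it may throw.
void doScan(bool fail) {
    if (fail) throw std::runtime_error("scan failed");
}

// Leaky shape: a pool is created on every scan and never released, so
// each call leaves one more pool alive.
void scanLeaky() {
    poolCreate();
    doScan(false);
    // missing: poolRelease();
}

// Scope guard: the destructor releases the pool on normal return and
// also while the stack unwinds after an exception.
struct ScopedPool {
    ScopedPool()  { poolCreate(); }
    ~ScopedPool() { poolRelease(); }
    ScopedPool(const ScopedPool&) = delete;
    ScopedPool& operator=(const ScopedPool&) = delete;
};

// "Paranoid" shape: the pool is released whether or not the scan throws.
// Returns true if the scan completed, false if it threw.
bool scanGuarded(bool fail) {
    try {
        ScopedPool pool;
        doScan(fail);
        return true;
    } catch (const std::exception&) {
        return false;
    }
}
```

In this sketch every call to `scanLeaky` leaves one more pool outstanding, while `scanGuarded` leaves the count unchanged on both the success path and the exception path. The real fix presumably uses an Objective-C mechanism to the same effect; that detail is an assumption, not something stated here.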
Comment 8 Steven Michaud [:smichaud] (Retired) 2010-02-26 12:08:59 PST
Comment on attachment 429171
fix v1.1

Yes, this is better.
Comment 9 Josh Aas 2010-02-26 12:18:55 PST
pushed to mozilla-central

http://hg.mozilla.org/mozilla-central/rev/1e30b2e41326
Comment 10 Josh Aas 2010-02-26 12:20:15 PST
Reporter - thanks for the great bug report. I'm really glad we caught this.

Can you confirm that my patch here fixes the problem? The fix will be in tomorrow's trunk (Minefield) nightly build. Leaving this bug open until the reporter confirms the fix.
Comment 11 electronic Max 2010-02-26 12:26:59 PST
Wow, thanks for the speedy response and fix.  I will test it tomorrow (Saturday my time) and get back to you all.  Testing will probably take 24-48 hours since it normally takes a couple hours to reproduce.

Anyway, thanks everyone. I will be in touch shortly.
Comment 12 Daniel Veditz [:dveditz] 2010-02-26 13:21:05 PST
Not going to "block" on it, but do want the patch once you've proved it fixes the problem. Ask for branch approval when you're ready.
Comment 13 Jason Duell [:jduell] (needinfo me) 2010-02-26 19:51:33 PST
Comment on attachment 429171
fix v1.1

Necko review: this patch changes no network logic, just Objective-C memory management, so +r
Comment 14 electronic Max 2010-03-02 14:51:20 PST
I need a bit of clarification on which build(s) have this patch (sorry! I'm not an expert on your nightly-build process).

Minefield (3.7a2) for OS X seems to work without crashing.  So if this is the version that has the patches described in this thread, then it looks like it fixed it! 

Nightly releases of 3.6.2 (Namoroka for OS X) candidates seem to still exhibit the crashing bug (the last one I downloaded was from 4am March 2) and it's locked up my machine twice.  Does this tree not have the fix yet?

Please lmk.
Comment 15 Boris Zbarsky [:bz] (still a bit busy) 2010-03-02 15:06:42 PST
The 3.7a2 builds are the ones that have this patch, yes.  Resolving fixed based on comment 14.  Thanks for testing those builds!

The 3.6.x builds don't have this fix yet; the fix needs to be approved for that branch first.  Josh, do you want to ask for approvals?
Comment 16 Josh Aas 2010-03-05 14:36:48 PST
Reporter - can you post a stack trace or a crash report ID for a crash from this bug? I'd like to see what it looks like so I can see if it shows up in our crash reports database.
Comment 17 electronic Max 2010-03-05 14:54:25 PST
Hi Josh,

It is rare that it results in a Firefox crash; it usually results in a "beach-ball" (colourwheel) hang of the whole system requiring me to hard-restart (it does not even respond to force-quits!)

Do you know how I might get a stack trace in that condition? Is there a kernel-hotkey for example? (Would it even be relevant if I could get one?)

Thanks
Comment 18 Josh Aas 2010-03-05 18:35:47 PST
Yikes. Don't worry about the stack, thanks.
Comment 19 Mike Beltzner [:beltzner, not reading bugmail] 2010-03-08 10:08:19 PST
a=beltzner
Comment 20 Daniel Veditz [:dveditz] 2010-03-08 14:42:28 PST
(In reply to comment #6)
> I suggest we block on 1.9.1, 1.9.2, and 1.9.3. This is bad.

Did you mean to request approval1.9.1.9? on attachment 429171 as well? Or do you need a different 1.9.1 patch, or did this turn out to be less bad than you thought on that branch?

There are roughly 3x more 3.5.x users than 3.6.x users at the moment (of course that will change once we start serving the upgrade prompt, but 1.9.1 will still have a lot of users). If "this is bad", don't we want it fixed?
Comment 21 Josh Aas 2010-03-08 16:50:56 PST
pushed to mozilla-1.9.2

http://hg.mozilla.org/releases/mozilla-1.9.2
Comment 22 Josh Aas 2010-03-08 16:56:37 PST
I do want this fixed in 1.9.1 but we need a new patch there.
Comment 23 Josh Aas 2010-03-08 21:43:07 PST
Created attachment 431291
1.9.1 branch fix

Fix for 1.9.1 branch. Synced to mozilla-central version.
Comment 24 Mike Beltzner [:beltzner, not reading bugmail] 2010-03-10 12:34:15 PST
Comment on attachment 431291
1.9.1 branch fix

a1.9.1.9=beltzner, please land immediately
Comment 25 Josh Aas 2010-03-10 12:50:12 PST
pushed to mozilla-1.9.1:

http://hg.mozilla.org/releases/mozilla-1.9.1/rev/bccec907649a

and here is the correct link to the 1.9.2 commit:

http://hg.mozilla.org/releases/mozilla-1.9.2/rev/d4bc405a33b9
Comment 26 Al Billings [:abillings] 2010-03-22 14:11:01 PDT
(In reply to comment #14)
> I need a bit of clarification on which build(s) have this patch (sorry! I'm not
> an expert on your nightly-build process).
> 
> Minefield (3.7a2) for OS X seems to work without crashing.  So if this is the
> version that has the patches described in this thread, then it looks like it
> fixed it! 
> 
> Nightly releases of 3.6.2 (Namoroka for OS X) candidates seem to still exhibit
> the crashing bug (the last one I downloaded was from 4am March 2) and it's
> locked up my machine twice.  Does this tree not have the fix yet?
> 
> Please lmk.

Can you try reproducing the problem with our release candidates for Firefox 3.5.9 and 3.6.2? 

You can get the 3.6.2 beta at ftp://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/3.6.2-candidates/build3/mac/en-US/.

You can get the 3.5.9 beta at ftp://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/3.5.9-candidates/build1/mac/en-US/.
Comment 27 electronic Max 2010-03-22 17:34:58 PDT
Ok! I will test it using the above betas myself and will also get my Mac-equipped students at MIT to run the betas on their machines.  Since reproducing the bug takes time, I will get back to you in 48 or so hours.
