Closed Bug 614958 Opened 14 years ago Closed 14 years ago

Browser freezing on exit, requires forced quit after going offline/sleeping (event loop spinning forever)

Categories

(Firefox :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Firefox 4.0b8
Tracking Status
blocking2.0 --- beta8+

People

(Reporter: rcampbell, Assigned: ehsan.akhgari)

References

Details

(Keywords: dogfood, hang, regression)

Attachments

(11 files, 1 obsolete file)

Attached file stack
buildid: 20101126

See attached stack from force quit.
Keywords: hang
Still seeing this on wake-from-sleep. Requesting blocking.
blocking2.0: --- → ?
Version: unspecified → Trunk
I was running into this as well, and removing Minefield from Applications and re-installing seemed to correct things.
Rob: when you come back from sleep, are you told that Firefox is offline?
I've run into this for at least my past few updates on OS X, including for today's update.
Blocking b8 for investigation; this could affect the upgrade path for our users, which is pretty serious.
blocking2.0: ? → beta8+
I was optimistic. After using it a little, it still hangs. Maybe plugin container, because when I quit, all UI disappears, but the container process is still running:

kevin-needhams-imac-2:mozsrc kneedham$ ps -eaf |grep 63342
  501 63342     1   0   0:37.80 ??         3:35.40 /Applications/Minefield.app/Contents/MacOS/firefox-bin -foreground -foreground
  501 63779 63342   0   0:03.39 ??         0:15.15 /Applications/Minefield.app/Contents/MacOS/plugin-container.app/Contents/MacOS/plugin-container /Library/Internet Plug-Ins/Flash Player.plugin -omnijar /Applications/Minefield.app/Contents/MacOS/omni.jar 63342 gecko-crash-server-pipe.63342 org.mozilla.machname.1946510159 plugin
  501 64094 56040   0   0:00.00 ttys001    0:00.00 grep 63342
I've been getting this too on Linux64 when restarting to get my nightly updates.  The UI goes away, but firefox-bin and plugin-container are still running.
(In reply to comment #3)
> Rob: when you come back from sleep, are you told that Firefox is offline?

yes. Toggling the menu option doesn't change it.
Attached file linux64 stack
Comment #7 was me signed in under the wrong account.
Can't reproduce with today's nightly.  I just restarted a few times, both with and without suspending my laptop briefly.
blocking2.0: beta8+ → ?
catlee, did you mean to clear the blocking b8 flag?
blocking2.0: ? → beta8+
(In reply to comment #12)
> catlee, did you mean to clear the blocking b8 flag?

nope!
This is bug 614286, I think.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → DUPLICATE
Attached file apple report
This has happened to me sometimes, but I have attributed to the number of tabs (40) and content that I have open before I quit and those consume a lot of memory. I've seen this happen before, maybe for two months now.
I'm using a tinderbox build with bug 614286 landed:

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=e1983b9db75d&tochange=1b4041e3c177

Built from http://hg.mozilla.org/mozilla-central/rev/1b4041e3c177

I tested this particular bug by going to File -> Work Offline, and the File menu stayed blue.

Duping other bugs to this one as this has blocking beta8+.
Assignee: nobody → ehsan
Status: RESOLVED → REOPENED
Depends on: 614286
Keywords: dogfood, regression
Resolution: DUPLICATE → ---
Summary: Browser freezing on exit, requires forced quit → Browser freezing on exit, requires forced quit after going offline/sleeping (event loop spinning forever)
(In reply to comment #16)
> I'm using a tinderbox build with bug 614286 landed:
> 
> http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=e1983b9db75d&tochange=1b4041e3c177
> 
> Built from http://hg.mozilla.org/mozilla-central/rev/1b4041e3c177
> 
> I tested this particular bug by going to File -> Work Offline, and the File
> menu stayed blue.
> 
> Duping other bugs to this one as this has blocking beta8+.

I tried reproducing this several times without success?  Can you reproduce this?
I can't consistently reproduce this as per my comment from bug 613998 comment 3, but the way I test it is by just browsing normally and trying File -> Work Offline from time to time. If it hangs, then it's definitely bad, if not I browse some more and try again later.

20 ok? http://hg.mozilla.org/mozilla-central/rev/baa6cc2f72e4
21 bad http://hg.mozilla.org/mozilla-central/rev/be4b69a4babf

Build from 1121 definitely is bad, but because I don't have a deterministic
STR, 20 seems good so far.

Doing File -> Work Offline can trigger this as well as it eventually calls
SetOffline:

nsIOService::SetOffline(int)
 nsSocketTransportService::Shutdown()
  nsThread::Shutdown()
   NS_ProcessNextEvent_P(nsIThread*, int)
    nsThread::ProcessNextEvent(int, int*)
     nsEventQueue::GetEvent(int, nsIRunnable**)
      PR_Wait
       PR_WaitCondVar
        _pthread_cond_wait
         __semwait_signal

With that range, I've guessed at the TCP SYN change bug 592284. I'll re-download the build from november 20 to check again if I can hang.
So, reading comment 25, it seems that this doesn't have anything to do with bug 614286, right?  And since your STR is not the same as comment 0, I'm not sure if you can just use the blocking flag set on this bug for this other problem.

(FWIW, I haven't been able to reproduce this yet.)
OK, I just reproduced a case where the Firefox got stuck in offline mode, and when I attempted to quit, it just hung...
robcee, does this seem accurate? The STR in comment 0 is basically..

1) wake the browser from sleep
2) notice the browser is stuck offline
3) quit firefox and notice it hanging

I took a screenshot from several days ago when I hit this after putting my laptop to sleep:

https://bug613998.bugzilla.mozilla.org/attachment.cgi?id=492586

I'm guessing some offline notification is being sent to eventually call nsIOService::SetOffline(int) which is the same ending call stack for when you do File -> Work Offline. (I originally only hit this when putting my laptop to sleep, but to come up with a STR that didn't involve closing the whole laptop, I used File -> Work Offline.)

And yes, it doesn't seem to have anything to do with bug 614286 other than a bunch of duplicate stuck-offline bugs were marked potentially resolved by that when it initially landed.
Attached file bt from all threads
Mardak, did you also get stuck in offline mode?
Status: REOPENED → ASSIGNED
Yes, after waking from sleep or doing File -> Work Offline, toggling it back to not Work Offline, loading pages in tabs results in "you're working offline!" as well as gmail showing its error message saying it can't reach the servers.
This is probably the cause for bug 573259 failing a lot more often recently then. That is caused by switching back online not working straight away for some reason.
Perhaps not useful by itself as we need to compare it to the number of pushes on those days.. But here's the count of recent comments posted on which days in bug 573259:

   1 2010-11-02
   2 2010-11-04
   2 2010-11-05
   1 2010-11-09
  10 2010-11-10
   5 2010-11-11
   3 2010-11-12
   3 2010-11-13
   5 2010-11-14
   6 2010-11-15
   9 2010-11-16
   7 2010-11-17
   8 2010-11-18
   9 2010-11-19
   3 2010-11-21
   7 2010-11-22
  13 2010-11-23
   8 2010-11-24
   5 2010-11-25
   4 2010-11-26
   2 2010-11-28
  15 2010-11-29
  14 2010-11-30
  11 2010-12-01

In october, there were almost 1 report a day on average (~26 reports).
(In reply to comment #32)
> This is probably the cause for bug 573259 failing a lot more often recently
> then. That is caused by switching back online not working straight away for
> some reason.

What's special about that test that could be affected by this?
(In reply to comment #34)
> (In reply to comment #32)
> > This is probably the cause for bug 573259 failing a lot more often recently
> > then. That is caused by switching back online not working straight away for
> > some reason.
> 
> What's special about that test that could be affected by this?

The test immediately before (browser_offline.js) flips offline and back online again. I have seen browser_relative.js fail and when it does the browser shows the offline error page even though by then it should be online.
I just got hit by this again, I did see some errors in the error console, screenshot coming up, and did a mac analysis, which looks somewhat different than the others here, so I'm attaching it.

buildid 20101201030318, os/x 10.6.5
Comment on attachment 494658 [details]
Live sample on Mac, different than others posted

Whoops, just noticed my sample is the same/very similar to comment 25
Attachment #494658 - Attachment is obsolete: true
I guess that the two notification.xml errors are from me pressing the "sync again" button on the sync-failed notification bar, fwiw.
Ehsan - this looks like a beta 8 long pole - do you have a plan of attack or, even better, a best/worst-case estimate for a fix?
(In reply to comment #35)
> (In reply to comment #34)
> > (In reply to comment #32)
> > > This is probably the cause for bug 573259 failing a lot more often recently
> > > then. That is caused by switching back online not working straight away for
> > > some reason.
> > 
> > What's special about that test that could be affected by this?
> 
> The test immediately before (browser_offline.js) flips offline and back online
> again. I have seen browser_relative.js fail and when it does the browser shows
> the offline error page even though by then it should be online.

Oh, right, makes a lot of sense.

(In reply to comment #42)
> Ehsan - this looks like a beta 8 long pole - do you have a plan of attack or,
> even better, a best/worst-case estimate for a fix?

Not yet, but I have managed to reproduce it, and I'm investigating it heavily right now (got distracted by the tree brokenness yesterday -- sorry).
OK, I figured out a big part of the mystery.

Look at this stack (which is coming from my repeated clicks on File > Work Offline.

Frame 61 shows the SetOffline(PR_TRUE) call.  This function calls nsSocketTransportService::Shutdown which attempts to shut down the thread, which gets us on that event loop.  But before the shutdown is finished, we start processing normal user events, which takes us to frame 0 where SetOffline(PR_FALSE) is called again, which just exits quickly because it has re-entrancy checks.  This causes us to get stuck in offline mode, and will lead us to get trapped in the event loop initiated in frame 59 while shutting down, which is the root cause of this bug.
With the help of bent and sdwilsh, I understood what's happening here.  Here's the problematic thread:

(gdb) thread apply 80 bt

Thread 80 (process 75871):
#0  0x00007fff81080322 in semaphore_wait_signal_trap ()
#1  0x00007fff8108596d in pthread_mutex_lock ()
#2  0x0000000100b348e9 in PR_Lock (lock=0x1291dfc90) at /Users/ehsanakhgari/moz/mozilla-central/nsprpub/pr/src/pthreads/ptsynch.c:206
#3  0x00000001008ca171 in nsAutoLock::nsAutoLock (this=0x12a77ea60, aLock=0x1291dfc90, _notifier=@0x12a77ea90) at nsAutoLock.h:222
#4  0x0000000112a0b16a in nsSocketOutputStream::CloseWithStatus (this=0x12c3cad10, reason=2152136706) at /Users/ehsanakhgari/moz/mozilla-central/netwerk/base/src/nsSocketTransport2.cpp:654
#5  0x0000000112a0663f in nsSocketOutputStream::Close (this=0x12c3cad10) at /Users/ehsanakhgari/moz/mozilla-central/netwerk/base/src/nsSocketTransport2.cpp:540
#6  0x0000000112a075d9 in nsSocketOutputStream::Release (this=0x12c3cad10) at /Users/ehsanakhgari/moz/mozilla-central/netwerk/base/src/nsSocketTransport2.cpp:533
#7  0x00000001129dd6ce in nsCOMPtr<nsIAsyncOutputStream>::~nsCOMPtr (this=0x133d51980) at nsCOMPtr.h:533
#8  0x0000000112a9e89d in nsHttpConnection::~nsHttpConnection (this=0x133d518b0) at /Users/ehsanakhgari/moz/mozilla-central/netwerk/protocol/http/nsHttpConnection.cpp:118
#9  0x0000000112a9c3b7 in nsHttpConnection::Release (this=0x133d518b0) at /Users/ehsanakhgari/moz/mozilla-central/netwerk/protocol/http/nsHttpConnection.cpp:962
#10 0x00000001129dd979 in nsCOMPtr<nsITransportEventSink>::assign_assuming_AddRef (this=0x12c3cacd0, newPtr=0x0) at nsCOMPtr.h:518
#11 0x00000001129dda0e in nsCOMPtr<nsITransportEventSink>::assign_with_AddRef (this=0x12c3cacd0, rawPtr=0x0) at nsCOMPtr.h:1204
#12 0x00000001129ddf2b in nsCOMPtr<nsITransportEventSink>::operator= (this=0x12c3cacd0, rhs=0x0) at nsCOMPtr.h:663
#13 0x0000000112a0a668 in nsSocketTransport::OnSocketDetached (this=0x12c3cabb0, fd=0x11ebac6b0) at /Users/ehsanakhgari/moz/mozilla-central/netwerk/base/src/nsSocketTransport2.cpp:1630
#14 0x0000000112a0ebec in nsSocketTransportService::DetachSocket (this=0x101099e00, sock=0x101099e60) at /Users/ehsanakhgari/moz/mozilla-central/netwerk/base/src/nsSocketTransportService2.cpp:188
#15 0x0000000112a0ee70 in nsSocketTransportService::Run (this=0x101099e00) at /Users/ehsanakhgari/moz/mozilla-central/netwerk/base/src/nsSocketTransportService2.cpp:602
#16 0x00000001008ba96a in nsThread::ProcessNextEvent (this=0x130b43b30, mayWait=1, result=0x12a77ee3c) at /Users/ehsanakhgari/moz/mozilla-central/xpcom/threads/nsThread.cpp:626
#17 0x0000000100831b4e in NS_ProcessNextEvent_P (thread=0x130b43b30, mayWait=1) at nsThreadUtils.cpp:250
#18 0x00000001008bb3f4 in nsThread::ThreadFunc (arg=0x130b43b30) at /Users/ehsanakhgari/moz/mozilla-central/xpcom/threads/nsThread.cpp:278
#19 0x0000000100b3c157 in _pt_root (arg=0x12e90d170) at /Users/ehsanakhgari/moz/mozilla-central/nsprpub/pr/src/pthreads/ptthread.c:187
#20 0x00007fff810b9536 in _pthread_start ()
#21 0x00007fff810b93e9 in thread_start ()

It tries to lock mTransport->mLock in nsSocketOutputStream::CloseWithStatus, but that lock is already locked in nsSocketTransport::OnSocketDetached.
Potentially this was made worse by the 3x socket transport closes:
http://hg.mozilla.org/mozilla-central/rev/be4b69a4babf#l2.149

Each mSocketTransport->Close(reason) calls both mInput.CloseWithStatus(reason) and mOutput.CloseWithStatus(reason).
Depends on: 592284
Attached patch Patch (v1)Splinter Review
Attachment #494806 - Flags: review?(benjamin)
Comment on attachment 494806 [details] [diff] [review]
Patch (v1)

r=me for landing in b8 conditional on you getting a real necko peer to review this later.
Attachment #494806 - Flags: review?(benjamin) → review+
(In reply to comment #46)
> Potentially this was made worse by the 3x socket transport closes:
> http://hg.mozilla.org/mozilla-central/rev/be4b69a4babf#l2.149
> 
> Each mSocketTransport->Close(reason) calls both mInput.CloseWithStatus(reason)
> and mOutput.CloseWithStatus(reason).

It's possible, although maybe not as a direct result of that code, but as a result of us having to destroy three of these objects instead of one.
Comment on attachment 494806 [details] [diff] [review]
Patch (v1)

Asking bz for the real review, but I'll land it right now.
Attachment #494806 - Flags: review?(bzbarsky)
Comment on attachment 494806 [details] [diff] [review]
Patch (v1)

Clearing the request from bz as I now have a real review from a peer.
Attachment #494806 - Flags: review?(bzbarsky)
Blocks: 573259
fwiw the syn code won't actually give you 3x closes (some of those pointers are mutually exclusive).. the max is 2x and it realistically is less than that because a timer has to expire before we go past 1x..

you might consider if the patch queued in 613977 fixes this.. it's not clear to me that it would or would not. the patch is a follow on to the syn code which includes a problem where the socketservice global limit could be exceeded - I have noticed before that really odd lockups can occur when that happens. that patch is just an hour old and needs a review or two.
> Asking bz for the real review, but I'll land it right now.

Fwiw, I seem to recall someone asking me about hangs in this exact code a day or two back, and I told him to do just that for callbacks and event sink.  Should we do this for callbacks as well in this bug?

Also, does that just fix the hang on shutdown?  Or also the getting stuck in offline mode?
http://hg.mozilla.org/mozilla-central/rev/2c24ce211f2b

(In reply to comment #52)
> fwiw the syn code won't actually give you 3x closes (some of those pointers are
> mutually exclusive).. the max is 2x and it realistically is less than that
> because a timer has to expire before we go past 1x..
> 
> you might consider if the patch queued in 613977 fixes this.. it's not clear to
> me that it would or would not. the patch is a follow on to the syn code which
> includes a problem where the socketservice global limit could be exceeded - I
> have noticed before that really odd lockups can occur when that happens. that
> patch is just an hour old and needs a review or two.

I don't see why it that patch would fix this problem.  This is a deadlock situation that we can get into even without your tcp syn patch...

(In reply to comment #53)
> > Asking bz for the real review, but I'll land it right now.
> 
> Fwiw, I seem to recall someone asking me about hangs in this exact code a day
> or two back, and I told him to do just that for callbacks and event sink. 
> Should we do this for callbacks as well in this bug?

I was going to do this for callbacks as well, but it wasn't clear to me if that's needed.  Is it?

> Also, does that just fix the hang on shutdown?  Or also the getting stuck in
> offline mode?

Both, as they're the result of the same underlying problem.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago14 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 4.0b8
> Is it?

I don't know, but it's easy to do, and clearly safer, right?

> Both, as they're the result of the same underlying problem.

How does the stuck offline problem result from the Release() being under the lock?
(In reply to comment #55)
> > Is it?
> 
> I don't know, but it's easy to do, and clearly safer, right?

OK, I'll write a patch for that too.

> > Both, as they're the result of the same underlying problem.
> 
> How does the stuck offline problem result from the Release() being under the
> lock?

The original SetOffline call calls mSocketTransportService->Shutdown(), but that never returns, and spins its own event loop, so the second SetOffline call happens while nsIOService::mSettingOffline is true (because the original call hasn't returned yet), so it just returns without doing anything, which means that the offfline status, whatever it was, can't be changed any more.  Since the mSocketTransportService->Shutdown() call only happens when going offline, this can only cause you to get stuck in offline mode.
Attachment #494873 - Flags: review?(bzbarsky)
Comment on attachment 494873 [details] [diff] [review]
Part 2: protect mCallbacks as well

I'd appreciate if someone can land this for me if I'm not around when this is reviewed either by bz or biesi, as I may not be around tonight.
Attachment #494873 - Flags: review?(cbiesinger)
I took the hourly builds for a spin on Mac OS, trying out a few scenarios. Toggling offline/online browsing works fine every time. I tried turning wi-fi off, as well as having the computer go into sleep mode, and I was able to keep browsing and exit the application a couple of times with no hangs and normal exits.

But I tried a few more times and at least a couple of times the application hanged on exit (after several minutes of waiting for it to quit), and I also got a seemingly unrelated crash: bp-b1395957-554b-43a7-8652-3d3a32101202

I'm attaching another sample where the application wouldn't exit.
Comment on attachment 494873 [details] [diff] [review]
Part 2: protect mCallbacks as well

r=me

I don't think this needs additional review from biesi.
Attachment #494873 - Flags: review?(cbiesinger)
Attachment #494873 - Flags: review?(bzbarsky)
Attachment #494873 - Flags: review+
Juan, the crash links you posted don't seem relevant to this bug, but can you attach a sample of the hangs that you observed, please?
One sample is included in comment #59, but I can post a couple more if needed.
(In reply to comment #64)
> One sample is included in comment #59, but I can post a couple more if needed.

Hmm, yes, sorry, I somehow missed that.  So, from a quick look, it doesn't look very similar to this bug.  Do you have STRs for that (I assume not)?  If not, can you please file a new bug on it?  It seems to me that the PSM thread is failing to exit...
Ehsan, juan: there are already some bugs about that exact issue.  Bug 430260 and bug 468736.  In general, they all could be related or duplicates of bug 429304.
Depends on: 616520
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: