Closed Bug 1124880 Opened 5 years ago Closed 5 years ago

shutdownhang in mozilla::net::nsHttpConnectionMgr::Shutdown() on pending UDP socket close

Categories

(Core :: Networking: HTTP, defect, critical)

36 Branch
x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla39
Tracking Status
firefox36 + wontfix
firefox37 + wontfix
firefox38 + fixed
firefox39 + fixed
relnote-firefox --- 36+

People

(Reporter: kairo, Assigned: mayhemer)

References

Details

(Keywords: crash, topcrash-win)

Crash Data

Attachments

(4 files, 6 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-35103a28-d12f-4999-9066-4f6ed2150122.
=============================================================

This is probably what a good part of bug 1103833 is mutating to after the fix of bug 1104317 - at least from the looks of early 36.0b2 data where this constitutes >1% of all crashes.

Find stats and more of those reports at https://crash-stats.mozilla.com/report/list?signature=shutdownhang+%7C+WaitForSingleObjectEx+%7C+WaitForSingleObject+%7C+PR_Wait+%7C+nsThread%3A%3AProcessNextEvent%28bool%2C+bool%2A%29+%7C+NS_ProcessNextEvent%28nsIThread%2A%2C+bool%29+%7C+mozilla%3A%3Anet%3A%3AnsHttpConnectionMgr%3A%3AShutdown%28%29
Important crash, tracking.
This is (for me) exactly the same crash, this time on 38.0 a1
https://crash-stats.mozilla.com/report/index/6ac6325d-a96b-424d-b994-3e14b2150126
but the signature is a little bit different to this
https://crash-stats.mozilla.com/report/index/6b24ad47-c392-406b-ad63-6bf7c2150126
and so this crash on 38.0 a1 is not linked to an 'Related Bug' like this one here
David, could you help on this? thanks
Flags: needinfo?(dmajor)
Patrick, any idea why nsHttpConnectionMgr::Shutdown is taking so long? (For context: bug 1038342 added a watchdog that turns hung shutdowns into crashes)

Here's the stack of the main thread:

WaitForSingleObject
PR_Wait
nsThread::ProcessNextEvent(bool, bool*)
NS_ProcessNextEvent(nsIThread*, bool)
mozilla::net::nsHttpConnectionMgr::Shutdown()
mozilla::net::nsHttpHandler::Observe(nsISupports*, char const*, wchar_t const*)
nsObserverList::NotifyObservers(nsISupports*, char const*, wchar_t const*)
nsObserverService::NotifyObservers(nsISupports*, char const*, wchar_t const*)
nsObserverService::Release()
ScopedXPCOMStartup::~ScopedXPCOMStartup()
XREMain::XRE_main(int, char** const, nsXREAppData const*)
XRE_main
Flags: needinfo?(dmajor) → needinfo?(mcmanus)
its pretty generic.. it just means the socket thread isn't completing.

maybe honza has more insight
Flags: needinfo?(mcmanus) → needinfo?(honzab.moz)
The report this bug was for filed for:
Thread 4 @kernel32.dll!CloseHandleImplementation|ntdll.dll!NtClose
Thread 5 @ntdll.dll!ZwRemoveIoCompletion
Thread 9 (sts) @ws2_32.dll!closesocket|ntdll.dll!NtClose
Thread 35 (dns) @ws2_32.dll!closesocket|ntdll.dll!NtClose

Arbitrary reports from the list: 

https://crash-stats.mozilla.com/report/index/ade2c3b2-0397-4061-878b-7414b2150122
Thread 4 (sts) @ws2_32.dll!connect|ntdll.dll!ZwDeviceIoControlFile
Thread 1 (google breakpad)  @ntdll.dll!NtClose

https://crash-stats.mozilla.com/report/index/8cdee535-2dad-4261-bb34-869072150122
Thread 2 @ntdll.dll!NtRemoveIoCompletion
Thread 4 (sts) @ntdll.dll!NtClose

https://crash-stats.mozilla.com/report/index/4234021a-bc59-4757-aab6-4a8952150122
Thread 4 (sts) @ntdll.dll!NtClose
Thread 36 @ntdll.dll!NtRemoveIoCompletion


There is usually NtClose on the stack.


Seems like a windows bug.
Flags: needinfo?(honzab.moz)
Also, in cases we are stuck on closing a socket in sts it is always a UDP socket.

There are reports with no stuck (being inside poll) on sts too, tho: 
https://crash-stats.mozilla.com/report/index/60d37d17-db0c-4db0-bbe3-ab84b2150122
https://crash-stats.mozilla.com/report/index/d1c60009-08a3-4bac-91cb-9fae72150123
Honza, any idea if we could workaround this windows bug? thanks
Flags: needinfo?(honzab.moz)
Tentatively assigning to me, this needs some deep investigation.  No idea for a workaround, no idea of the cause at all.
Assignee: nobody → honzab.moz
Flags: needinfo?(honzab.moz)
Also note that the comments in the URL given in comment #0 sound like those are really troubled installations, sounds like we would not only run into hangs on shutdown but all over the place.
Summary: crash in shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::net::nsHttpConnectionMgr::Shutdown() → shutdownhang in mozilla::net::nsHttpConnectionMgr::Shutdown()
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::net::nsHttpConnectionMgr::Shutdown()] → [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::net::nsHttpConnectionMgr::Shutdown()] [@ shutdownhang | WaitForSingleObjectEx | WaitForSi…
Crash Signature: , bool) | mozilla::net::nsHttpConnectionMgr::Shutdown()] [@ shutdownhang | WaitForSingleObjectEx | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread* → , bool) | mozilla::net::nsHttpConnectionMgr::Shutdown()] [@ shutdownhang | WaitForSingleObjectEx | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*
I don't see that bug fixed for 36. Wontfix but tracking it for the next releases.
Release Note Request (optional, but appreciated)
[Why is this notable]: Users will see crash notifications more often.
[Suggested wording]: Firefox hangs on shutdown are now reported as crashes, in order to help Mozilla diagnose and fix the specific issues during shutdown. 
[Links (documentation, blog post, etc)]: (not sure. maybe bug 1038342, bug 1104317. a blog post might help explain this.)
relnote-firefox: --- → ?
Added to the release notes. Thanks Liz!
Crash Signature: , bool) | mozilla::net::nsHttpConnectionMgr::Shutdown()] → , bool) | mozilla::net::nsHttpConnectionMgr::Shutdown()] [@ shutdownhang | ntdll.dll@0x3c6bc]
Version: Trunk → 36 Branch
Any idea to hang out (kill) in the Windows TM permanent task Firefox.exe or Nightly.exe which remains after this crash?
The slow UDP close (most common cause of this hang) is also known to chromium guys [1], but there is no solution right now on their side (AFAIK, no deeper check on that).

According [2] closesocket tries to send unsent data and when network has changed (cable unplugged, wifi changed) it might take up to 5 secs.  The blog post claims this is by default since Vista, but I can see reports from version 5.1.2600 (windows XP) tho.  Anyway, one of the reports says Fx crashes when network is switched.

According our shutdown telemetry [3] seems like there is some distribution of slow shutdowns already.  It may be that when some amount of UDP sockets piles up, we are slower and go over the 60s shutdown watchdog limit.


My vote is to force abortive close on UDP sockets, first just on Win, according [4]:

  "If the l_onoff member of the linger structure is nonzero and l_linger member is zero, closesocket is not blocked even if queued data has not yet been sent or acknowledged. This is called a hard or abortive close, because the socket's virtual circuit is reset immediately, and any unsent data is lost."



[1] https://code.google.com/p/chromium/issues/detail?id=165382#c81
[2] http://blogs.msdn.com/b/winsdk/archive/2013/10/09/udp-closesocket-takes-upto-5-seconds-to-return-in-disconnect-remote-host-down-scenario-due-to-pending-data-to-send.aspx
[3] http://telemetry.mozilla.org/#filter=beta%2F36%2FSHUTDOWN_PHASE_DURATION_TICKS_PROFILE_CHANGE_TEARDOWN%2Fsaved_session%2FFirefox&aggregates=Submissions&evoOver=Builds&locked=true&sanitize=true&renderhistogram=Table
[4] https://msdn.microsoft.com/en-us/library/windows/desktop/ms737582%28v=vs.85%29.aspx
Status: NEW → ASSIGNED
SO_LINGER is not purposed for UDP sockets, at least not on Windows.
One way to at least mitigate this would be to create a separate thread for *each* PR_Close() call on each UDP socket.  Then we just need to find a spot to join all the treads during shutdown.  Sounds a bit crazy but I can imagine this be upliftable.
Attached patch v1 (obsolete) — Splinter Review
I so far haven't found any general way to avoid blocks on the socket thread.  I believe select() call may also be slow from time to time (few reports are inside select).

This should mitigate the udp close delay, if it can be avoided by parallelization of close() calls on respective sockets.

Roughly tested.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d9d59454a466
Attachment #8566308 - Flags: review?(mcmanus)
Comment on attachment 8566308 [details] [diff] [review]
v1

Review of attachment 8566308 [details] [diff] [review]:
-----------------------------------------------------------------

the try build is red.

You've done a nice job documenting that this is a known issue and parallelism seems the right mitigation. I wonder if we should be using the existing threadpool mechanism here - maybe unlimited isn't the right thing either.

did you see the one blocked in connect? I wonder if we should be testing sockets at connect time to see if they are blocking or not.

I think we should do something like this, but only expect it to be a partial mitigation. I think it should get a lot of pre-release air time - implications of spawning a lot of threads might take a while to see.

::: netwerk/base/nsUDPSocket.cpp
@@ +114,5 @@
> +  void AddObserver();
> +  void JoinAndRemove();
> +
> +  PRFileDesc *mFd;
> +  PRThread* mThread;

PRThread *mThread (throughout)

@@ +156,5 @@
> +  // Released after the thread is done.
> +  mSelf = this;
> +  mThread = PR_CreateThread(PR_USER_THREAD, ThreadFunc, this,
> +                            PR_PRIORITY_NORMAL, PR_GLOBAL_THREAD,
> +                            PR_JOINABLE_THREAD, 128 * 1024);

that stack seems huge.. and given that we might have a lot in parallel and VM space is an issue on 32 bit platforms...

@@ +165,5 @@
> +
> +  // Observer service must be worked with on the main thread only.
> +  // This method is called usually on the socket thread.
> +  nsCOMPtr<nsIRunnable> event = NS_NewRunnableMethod(
> +    this, &nsUDPSocketCloseThread::AddObserver);

can this deadlock if the main thread is trying to shut down the connection manager?
Attachment #8566308 - Flags: review?(mcmanus) → feedback+
(In reply to Patrick McManus [:mcmanus] from comment #19)
> Comment on attachment 8566308 [details] [diff] [review]
> v1
> 
> Review of attachment 8566308 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> the try build is red.
> 
> You've done a nice job documenting that this is a known issue and
> parallelism seems the right mitigation. I wonder if we should be using the
> existing threadpool mechanism here - maybe unlimited isn't the right thing
> either.

I don't think there will be that much sockets to close, they are also closing during browser lifetime.  We can have our favorite telemetry (I love it :)) to check how far with the numbers we go (can go in with this patch in)

> 
> did you see the one blocked in connect? I wonder if we should be testing
> sockets at connect time to see if they are blocking or not.

I've seen blocks in connect(), select() and some others I think.  I believe there is a general problem with winsock, but google is silent about it.

> 
> I think we should do something like this, but only expect it to be a partial
> mitigation. I think it should get a lot of pre-release air time -
> implications of spawning a lot of threads might take a while to see.

Yes, I am afraid we have to.

> 
> ::: netwerk/base/nsUDPSocket.cpp
> @@ +114,5 @@
> > +  void AddObserver();
> > +  void JoinAndRemove();
> > +
> > +  PRFileDesc *mFd;
> > +  PRThread* mThread;
> 
> PRThread *mThread (throughout)

I never know..

> 
> @@ +156,5 @@
> > +  // Released after the thread is done.
> > +  mSelf = this;
> > +  mThread = PR_CreateThread(PR_USER_THREAD, ThreadFunc, this,
> > +                            PR_PRIORITY_NORMAL, PR_GLOBAL_THREAD,
> > +                            PR_JOINABLE_THREAD, 128 * 1024);
> 
> that stack seems huge.. and given that we might have a lot in parallel and
> VM space is an issue on 32 bit platforms...

Yep, something to research tho.  Nightly is our friend here probably.

> 
> @@ +165,5 @@
> > +
> > +  // Observer service must be worked with on the main thread only.
> > +  // This method is called usually on the socket thread.
> > +  nsCOMPtr<nsIRunnable> event = NS_NewRunnableMethod(
> > +    this, &nsUDPSocketCloseThread::AddObserver);
> 
> can this deadlock if the main thread is trying to shut down the connection
> manager?

No, we spin the MT loop when shutting down.  Also, this just adds an observer, nothing else.
> > VM space is an issue on 32 bit platforms...
> Yep, something to research tho.  Nightly is our friend here probably.

FWIW we don't see a huge amount of address space exhaustion on Nightly these days. (I would guess that 64-bit builds are a big reason, and the daily restarts probably help too)
Attached patch v1.1 (obsolete) — Splinter Review
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1c41b1761691

Now includes telemetry.
Attachment #8566308 - Attachment is obsolete: true
Attachment #8567077 - Flags: review?(mcmanus)
And attempt to fix linux-static-analyzes build fail:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a9546d95b460
Few failures (comment 23):
- netwerk/test/unit/test_udp_multicast.js, socket is expected to be closed on return of nsUDPSocket::Close() ; fixed as: no use of the thread in this method
- Assertion failure: sActiveThreadsCount > 0, on x64 - sActiveThreadsCount is 32bit, hence unaligned ; fixed as: using uintptr_t
- [1], Main app process: killed by out-of-range signal, number 117 + some UDP socket warning in the console ; probably coming from dom\network\tests\test_udpsocket.html and is probably related to the first error in this list


[1] https://treeherder.mozilla.org/logviewer.html#?job_id=5074251&repo=try
Just for your information.
I will open a new bug for the test_udp_multicast.js failure. The patch from bug 1131557 makes it fail reliably on 32bit linux, but the patch is not the cause. Something strange is going on, i am still not sure what is the problem. I am working on it...
Attachment #8567077 - Flags: review?(mcmanus) → review+
Depends on: 1135405
(In reply to Honza Bambas (:mayhemer) from comment #25)
> Few failures (comment 23):
> - netwerk/test/unit/test_udp_multicast.js, socket is expected to be closed
> on return of nsUDPSocket::Close() ; fixed as: no use of the thread in this
> method
> - [1], Main app process: killed by out-of-range signal, number 117 + some
> UDP socket warning in the console ; probably coming from
> dom\network\tests\test_udpsocket.html and is probably related to the first
> error in this list

Looking more at the code, it seems like calling PR_Close directly from ::Close() is nice but won't fix these problems.  I believe there is a listener on the socket class at that time and we go to the socket thread with OnMsgClose that ends up in OnSocketDetached.

So, Dragana, I'll wait for your patch of the failing test first.
Attachment #8567077 - Flags: review+
(In reply to Honza Bambas (:mayhemer) from comment #25)

> - [1], Main app process: killed by out-of-range signal, number 117 + some
> UDP socket warning in the console ; probably coming from
> dom\network\tests\test_udpsocket.html and is probably related to the first
> error in this list
> 
> 
> [1] https://treeherder.mozilla.org/logviewer.html#?job_id=5074251&repo=try

the udp socket warning can be cause by
https://hg.mozilla.org/integration/fx-team/annotate/37e5f630049b/toolkit/modules/secondscreen/SimpleServiceDiscovery.jsm#l163

SimpleServiceDiscovery.jsm does not shut down at all. This service was causing bug 1085266 (comment 13 explains it a bit)
I believe that the following three crash reports are all instances of this bug:

bp-163de6ab-bc47-4e3d-90d5-f91992150304	3/4/2015	16:53
bp-5a081db2-ffde-4d6c-857f-3924c2150304	3/4/2015	10:02
bp-311f753d-21d3-4164-83e8-fe9c92150303	3/3/2015	11:17

The attached debug log is according to the instructions at http://blogs.technet.com/b/markrussinovich/archive/2005/08/17/unkillable-processes.aspx.  (Apologies for the initial flailing.)

I've been hitting this bug a couple of times a day for the last few days; let me know if I can gather any more information to help!  :)
Duplicate of this bug: 1138762
so has this been introduced by bug 1054959?
I looked for a preference to disable device discovery by UDP (I'm assuming that's the part of that bug you're suspecting), but couldn't find anything.  (I searched for "device", "video", and "discover" in about:config.)  If you can point me to it, I'd be happy to turn it off and report back after a week or so.  It's also not obvious to me whether that feature exists in the build I'm using or was backed out; I'm on:

> Build identifier: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0
This corresponds to https://crash-stats.mozilla.com/report/index/2b690fb1-b9a5-4ce5-9c7f-c64192150309.  It happened after being in a 3.5-hour meeting, so on wireless and with a few suspend-resumes thrown in.  But Firefox was shut down during most of the time (I'm almost positive it was shut down during all of the suspend/resume cycles).

In general I've been able to reduce the frequency of this issue by exiting FF whenever I would normally alt-tab away from it, only leaving it active when I'm directly using it.
I dug around a bit in the code in bug 1054959, and found that SimpleServiceDiscovery.jsm uses UDP port 1900.  I created a Windows Firewall rule to block outbound connections on that remote port, and haven't seen this bug since then.  I never would've thought of that -- thank you philipp!
Blocks: 1054959
My problem with Firefox not connecting and not restarting is now gone. I had some trouble with the motherboard on my computer. I reflashed it, as well it will now only work with one stick of ram, but at least I have no problem with Firefox. The motherboard uses the Intel G33/G31 chipset. I don't know if it had gone bad somehow. I did have some trouble with Window Media Player. Hope this may help.
(In reply to Matt McHenry from comment #35)
> Created attachment 8577239 [details]
I am unable to see the properties window (digging into firewall outbound rule,but not able to get the properties window of firefox)
if you able to gives the steps,then it would be great for me.
> Possible work-around: Windows Firewall rule to block SSDP UDP connections
> 
> I dug around a bit in the code in bug 1054959, and found that
> SimpleServiceDiscovery.jsm uses UDP port 1900.  I created a Windows Firewall
> rule to block outbound connections on that remote port, and haven't seen
> this bug since then.  I never would've thought of that -- thank you philipp!
(In reply to Matt McHenry from comment #35)
> Created attachment 8577239 [details]
> Possible work-around: Windows Firewall rule to block SSDP UDP connections
> 
> I dug around a bit in the code in bug 1054959, and found that
> SimpleServiceDiscovery.jsm uses UDP port 1900.  I created a Windows Firewall
> rule to block outbound connections on that remote port, and haven't seen
> this bug since then.  I never would've thought of that -- thank you philipp!

If this is correct, this bug should be fixed in 37+ as tab casting will be disabled by default via bug 1142521. We should be able to tell by the crash volume in 37 Beta 6.
(In reply to Chandan Kumar Baba from comment #37)
> (In reply to Matt McHenry from comment #35)
> > Possible work-around: Windows Firewall rule to block SSDP UDP connections
> 
> I am unable to see the properties window (digging into firewall outbound
> rule,but not able to get the properties window of firefox)
> if you able to gives the steps,then it would be great for me.

At this point it sounds like you might be better off grabbing a FF 37 beta 6+ build.  But if you still want to create a firewall rule, here's how:

Start -> Windows Firewall
right-click on Outbound Rules in the left-hand pane and select "New Rule ..."
select the "Port" radio button and click next
select the "UDP" radio button, enter 1900 in the "Specific remote ports" text box, and click next
select the "block the connection" radio button and click next
select all three of the "Domain", "Private" and "Public" check boxes and click next
name your rule, paste the URL of this bug into the description, and click finish

find your new rule in the list and double-click to open it
on the "Programs and Services" tab, select the "This program" radio button
click the browse button and locate firefox.exe
Duplicate of this bug: 1143277
I still see this crash in 37.0b6 so looks like bug 1142521 did not resolve this issue. We're now too late in 37 to take a fix so I've marked 37 as wontfix.
Attachment #8581758 - Attachment filename: 1124880-new-thread.patch → v1.2
See Also: → 1143866
Comment on attachment 8581758 [details] [diff] [review]
v1.2

Patrick, I've added telemetry and few other cosmetic changes.  It's not for carrying r+, so asking again for a full r.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a800b82c5725
Attachment #8581758 - Flags: review?(mcmanus)
Attachment #8581758 - Attachment description: 1124880-new-thread.patch → v1.2
Attachment #8581758 - Attachment filename: v1.2 → 1124880-new-thread.patch
Attached patch v1.3 (obsolete) — Splinter Review
Fixed (finally!) the thread count assertion failure.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f5743ef11575
Attachment #8582498 - Flags: review?(mcmanus)
Attachment #8581758 - Attachment is obsolete: true
Attachment #8581758 - Flags: review?(mcmanus)
Comment on attachment 8582498 [details] [diff] [review]
v1.3

Review of attachment 8582498 [details] [diff] [review]:
-----------------------------------------------------------------

if the problem persists because a single thread an block too long I guess we can move to a timeout and kill the thread (and leak the socket) approach, right?

::: netwerk/base/nsUDPSocket.cpp
@@ +137,5 @@
> +  static bool sPastShutdown;
> +
> +  // Active threads (roughly) counter, modified only on the main thread
> +  // and used only for telemetry reports.
> +  static uintptr_t sActiveThreadsCount;

not really a review comment, but why uintptr_t if the var never holds a ptr? it seems accumulate just wants uint32_t
Attachment #8582498 - Flags: review?(mcmanus) → review+
(In reply to Patrick McManus [:mcmanus] from comment #45)
> Comment on attachment 8582498 [details] [diff] [review]
> v1.3
> 
> Review of attachment 8582498 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> if the problem persists because a single thread an block too long I guess we
> can move to a timeout and kill the thread (and leak the socket) approach,
> right?

Yes, probably have it win specific.  TerminateThread would be the way.

> 
> ::: netwerk/base/nsUDPSocket.cpp
> @@ +137,5 @@
> > +  static bool sPastShutdown;
> > +
> > +  // Active threads (roughly) counter, modified only on the main thread
> > +  // and used only for telemetry reports.
> > +  static uintptr_t sActiveThreadsCount;
> 
> not really a review comment, but why uintptr_t if the var never holds a ptr?
> it seems accumulate just wants uint32_t

My poor attempt for alignment enforcement.  uint32_t it is.
Attached patch v1.3.1 (obsolete) — Splinter Review
Attachment #8582498 - Attachment is obsolete: true
Attachment #8583239 - Flags: review+
Don't ever use TerminateThread. If the thread is in any critical section (including the allocator) the whole process will deadlock.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #48)
> Don't ever use TerminateThread. If the thread is in any critical section
> (including the allocator) the whole process will deadlock.

And I think that is our case, I believe NtClose keeps one.  So, we should slowly turn to MS for help here...
Keywords: checkin-needed
Attached patch v1.3.2 (obsolete) — Splinter Review
Attachment #8583239 - Attachment is obsolete: true
Attachment #8584524 - Flags: review+
Flags: needinfo?(honzab.moz)
Keywords: checkin-needed
Something's seriously wrong with that attachment.
Keywords: checkin-needed
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #53)
> Something's seriously wrong with that attachment.

Hmm.. how that happened...  Re-attaching, sorry.
Attached patch v1.3.2Splinter Review
Attachment #8584524 - Attachment is obsolete: true
Attachment #8584632 - Flags: review+
Seems ok now.
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/88b311e87678
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
FWIW, there is a number of bugs saying "Firefox can't connect, hangs on shutdown, cannot be killed" recently, and at least one crash id linking the hang to the stack here.

bug 1125686 (I CCed everyone there), bug 1137994, bug 1105577, bug 1142981, bug 1141627, bug 1143969, bug 646189 comment 13.

...and a forum thread saying this is new in 36 http://forums.mozillazine.org/viewtopic.php?f=38&t=2916383
Honza, can we have an uplift to beta? Thanks
Flags: needinfo?(honzab.moz)
1. 

Some of the reports in comment 60 show we are stuck in PR_Poll.  This could be an effect of a known bug with signaling through the pollable event.  It may be that there is infinite timeout put to PR_Poll.

We could modify nsSocketTransportService::PollTimeout() to return no more then few seconds timeout (5/7/10 or so) to let the STS thread loop occasionally.  With Dragana's patch for event processing we might fix this bug.


2. 

https://crash-stats.mozilla.com/report/index/5ba8f826-a4b2-4131-a63c-e4b112150404#allthreads, thread 11 (socket thread) shows a dialog via nsAutodial::DialDefault:

RasDialDlgW
nsAutodial::DialDefault(wchar_t const*)
nsNativeConnectionHelper::OnConnectionFailed(wchar_t const*)
nsSocketTransport::RecoverFromError()
mozilla::net::nsHttpConnectionMgr::nsHalfOpenSocket::OnTransportStatus(nsITransport*, nsresult, __int64, __int64)


3.

Some reports show we are stuck in connect(), few of them show we are stuck in close() (nsNSSSocketInfo::CloseSocketAndDestroy).

4.

https://crash-stats.mozilla.com/report/index/5acfd3bf-631b-46fd-b67b-731e42150407#allthreads, thread 9 
https://crash-stats.mozilla.com/report/index/73b21efe-1098-4285-b57d-771342150402#allthreads, thread 8

NtWaitForSingleObject
ntdll.dll@0x1e1d6
RtlpEnterCriticalSectionContended
RtlpEnterCriticalSectionContended
PR_Lock
mozilla::net::nsWSAdmissionManager::GetSessionCount(int&)
mozilla::net::WebSocketChannel::StopSession(nsresult)
mozilla::net::WebSocketChannel::ReleaseSession()
mozilla::net::WebSocketChannel::ProcessInput(unsigned char*, unsigned int)
mozilla::net::WebSocketChannel::OnInputStreamReady(nsIAsyncInputStream*)
nsInputStreamReadyEvent::Run()
nsThread::ProcessNextEvent(bool, bool*)
NS_ProcessNextEvent(nsIThread*, bool)
nsSocketTransportService::Run()


5.

https://crash-stats.mozilla.com/report/index/27ae8618-8676-436d-8056-c6db72150406#allthreads, thread 5
https://crash-stats.mozilla.com/report/index/bb6d4bd5-75b5-475d-a926-4207d2150404#allthreads, thread 4

PR_Lock
nsSocketTransportService::Dispatch(nsIRunnable*, unsigned int)
  http://hg.mozilla.org/releases/mozilla-aurora/annotate/239508ee646f/netwerk/base/nsSocketTransportService2.cpp#l128
nsSocketTransportService::Run()
  http://hg.mozilla.org/releases/mozilla-aurora/annotate/239508ee646f/netwerk/base/nsSocketTransportService2.cpp#l815
nsThread::ProcessNextEvent(bool, bool*)
NS_ProcessNextEvent(nsIThread*, bool)


I'll file bugs.
mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*)
MessageLoop::RunHandler()
MessageLoop::Run()
nsThread::ThreadFunc(void*)
_PR_NativeRunThread
pr_root
_callthreadstartex
Flags: needinfo?(honzab.moz)
Depends on: 1152040
Depends on: 1152041
Depends on: 1152046
Depends on: 1152047
Depends on: 1152048
(In reply to Sylvestre Ledru [:sylvestre] from comment #61)
> Honza, can we have an uplift to beta? Thanks

I can, what is the deadline for it? (kinda busy now)
Honzam, today please (or tomorrow morning if you can't). Stability is not good in 38, any fix is good to take! Thanks
Flags: needinfo?(honzab.moz)
There needs to be uplifted also patch from bug 1135405 then, AFAIK.
Comment on attachment 8584632 [details] [diff] [review]
v1.3.2

applies clearly on Beta but not tested on try.

LAND AFTER BUG 1135405!

Approval Request Comment
[Feature/regressing bug #]: none
[User impact if declined]: shutdown hang, inability to browse
[Describe test coverage new/current, TreeHerder]: landed on m-c on 2015-03-29, currently on m-a
[Risks and why]: this not a simple patch and carries some risk, tho should be small.  This creates new threads, there is no limit for it.
[String/UUID change made/needed]: none
Flags: needinfo?(honzab.moz)
Attachment #8584632 - Flags: approval-mozilla-beta?
Comment on attachment 8584632 [details] [diff] [review]
v1.3.2

Shutdown hangs have been an important issue lately. I am taking the risk as we are early in the beta cycle and we can always backout this change if it introduces more regressions that it fixes.
Should be in 38 beta 3

Thanks Honza!
Attachment #8584632 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Robert, you mentioned during that the last channel meeting that we still have crashes caused by this bug. Do you confirm?
Flags: needinfo?(kairo)
Yes, it should have been fixed in 38.0b3 but it's still at #9 with 1.4% of all crashes in 38.0b4. I'm reopening.
Status: RESOLVED → REOPENED
Flags: needinfo?(kairo)
Resolution: FIXED → ---
kairo, I filed new bugs I can see stacks for, see blocks.  This one is fixed.
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
Summary: shutdownhang in mozilla::net::nsHttpConnectionMgr::Shutdown() → shutdownhang in mozilla::net::nsHttpConnectionMgr::Shutdown() on pending UDP socket close
If you want an umbrella bug for this issue, please open a new one and move the blocking bugs there (and make this one blocking it too, left closed).  Thanks.
New bugs are fine as long as they have the right entries in crash signatures and the crash keywords set. Thanks for looking into this!
Honza, this bug is still presented during our channel meeting and I would like to see it fixed for 38.
I looked at the "block" and it seems unrelated (send video to device). Are you sure about the bug number? Thanks
Flags: needinfo?(honzab.moz)
Keywords: topcrash-win
(In reply to Sylvestre Ledru [:sylvestre] from comment #74)
> Honza, this bug is still presented during our channel meeting and I would
> like to see it fixed for 38.
> I looked at the "block" and it seems unrelated (send video to device). Are
> you sure about the bug number? Thanks

It probably means someone believes this bug is caused by that one (is its regression).  What do you want me to do?

The patch here is on beta (38).  Other bugs with different cause but identical symptoms are tracked.  I don't have time to work on them right now tho.
Flags: needinfo?(honzab.moz)
It means that this is the only bug on file that is connected with the signature (see comment #73), and the signature is still happening in large enough volume to be a concern.

Is there any bug filed that should be about the majority of the issues left with this signature?
Flags: needinfo?(honzab.moz)
What about all of the open "Depends on" bugs?
Flags: needinfo?(honzab.moz)
(In reply to Honza Bambas (:mayhemer) from comment #77)
> What about all of the open "Depends on" bugs?

None of them have the crash signature and crash keyword set, so they do not appear on crash-stats for the signature.
(In reply to Honza Bambas (:mayhemer) from comment #75)
> The patch here is on beta (38).  Other bugs with different cause but
> identical symptoms are tracked.  I don't have time to work on them right now
> tho.

Also none of them are tracked for any release via tracking flags. So the meaning of "tracked" is an issue as well. ;-)
Blocks: 1158189
I cloned this to bug 1158189.  Please set the flags, I don't have perms.
No longer blocks: 1054959
No longer depends on: 1152041, 1152046, 1152047, 1152048, 1135405, 1152040
You need to log in before you can comment on or make changes to this bug.