Closed Bug 698882 Opened 14 years ago Closed 9 years ago

Deadlock in nsSocketTransportService [tbird] PR_SetPollableEvent

Categories

(Core :: Networking, defect)

x86
All
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
firefox47 --- affected
firefox48 --- fixed

People

(Reporter: Irving, Assigned: mcmanus)

References

(Blocks 1 open bug)

Details

(Keywords: hang, Whiteboard: [necko-active])

Attachments

(5 files, 11 obsolete files)

1007 bytes, application/octet-stream
Details
27.08 KB, application/octet-stream
Details
30.98 KB, patch
dragana
: review+
Details | Diff | Splinter Review
31.10 KB, patch
Details | Diff | Splinter Review
30.95 KB, patch
Details | Diff | Splinter Review
Caught a difficult-to-reproduce Thunderbird hang in nsSocketTransportService. Here are the backtraces of blocked threads; most of the interesting info is in the second trace (the necko thread) I was running under Xcode / GDB when this locked up, so I spent a pile of time poking around and trying to understand what was up. Unfortunately after a while Xcode crashed (yay! blunt tools!) so I no longer have access to the state as of the lock up. Next time I'll try to make a core dump... As far as I could tell, there was an issue between the PollableEvent loopback pipe and the nsSocketTransportService - nsSocketTransportService::OnDispatchedEvent was blocked trying to write to the pipe, which implies that the pipe was full, but the poll performed up the stack in DoPollIteration() did not show the other end of the pipe as readable. Main UI thread: #2 0x000000010008b7dd in PR_Lock at /Users/ireid/tbird/comm-central/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:206 #3 0x0000000102df3d16 in mozilla::Mutex::Lock() at /Users/ireid/tbird/objdir-comm-central-permissions/mozilla/xpcom/build/BlockingResourceBase.cpp:261 #4 0x0000000101250c84 in mozilla::MutexAutoLock::MutexAutoLock(mozilla::Mutex&, mozilla::GuardObjectNotifier const&) () #5 0x00000001012bea2f in nsSocketTransportService::GetThreadSafely() () #6 0x00000001012beb68 in nsSocketTransportService::Dispatch(nsIRunnable*, unsigned int) () #7 0x0000000102e3d46e in nsAStreamCopier::PostContinuationEvent_Locked() () #8 0x0000000102e3dd65 in nsAStreamCopier::PostContinuationEvent() () #9 0x0000000102e3deac in nsAStreamCopier::Start(nsIInputStream*, nsIOutputStream*, nsIEventTarget*, void (*)(void*, unsigned int), void*, unsigned int, bool, bool) () #10 0x0000000102e3d19f in NS_AsyncCopy(nsIInputStream*, nsIOutputStream*, nsIEventTarget*, nsAsyncCopyMode, unsigned int, void (*)(void*, unsigned int), void*, bool, bool, nsISupports**) () #11 0x00000001012badff in nsSocketTransport::OpenInputStream(unsigned int, unsigned int, unsigned int, nsIInputStream**) () #12 0x0000000102a3c9d2 in nsImapProtocol::SetupWithUrl(nsIURI*, nsISupports*) () #13 0x0000000102a3d921 in nsImapProtocol::LoadImapUrl(nsIURI*, nsISupports*) () #14 0x00000001029e9361 in nsImapIncomingServer::GetImapConnectionAndLoadUrl(nsIEventTarget*, nsIImapUrl*, nsISupports*) () #15 0x0000000102a58971 in nsImapService::GetImapConnectionAndLoadUrl(nsIEventTarget*, nsIImapUrl*, nsISupports*, nsIURI**) () #16 0x0000000102a667f6 in nsImapService::SelectFolder(nsIEventTarget*, nsIMsgFolder*, nsIUrlListener*, nsIMsgWindow*, nsIURI**) () #17 0x0000000102a0ff1a in nsImapMailFolder::UpdateFolderWithListener(nsIMsgWindow*, nsIUrlListener*) () #18 0x00000001029ebc25 in nsImapMailFolder::UpdateFolder(nsIMsgWindow*) () the "necko" thread #1 0x0000000100096145 in poll () #2 0x000000010008d7b0 in pt_poll_now () #3 0x000000010008db6a in pt_Continue () #4 0x000000010008e96e in pt_Write () #5 0x000000010006b422 in PR_Write () #6 0x000000010006ef02 in PR_SetPollableEvent () #7 0x00000001012bec62 in nsSocketTransportService::OnDispatchedEvent(nsIThreadInternal*) at /Users/ireid/tbird/comm-central/mozilla/netwerk/base/src/nsSocketTransportService2.cpp:588 #8 0x0000000102e61360 in nsThread::PutEvent(nsIRunnable*) at /Users/ireid/tbird/comm-central/mozilla/xpcom/threads/nsThread.cpp:397 #9 0x0000000102e62830 in nsThread::Dispatch(nsIRunnable*, unsigned int) at /Users/ireid/tbird/comm-central/mozilla/xpcom/threads/nsThread.cpp:435 #10 0x00000001012bebe0 in nsSocketTransportService::Dispatch(nsIRunnable*, unsigned int) at /Users/ireid/tbird/comm-central/mozilla/netwerk/base/src/nsSocketTransportService2.cpp:140 #11 0x0000000102e3d46e in nsAStreamCopier::PostContinuationEvent_Locked() at /Users/ireid/tbird/comm-central/mozilla/xpcom/io/nsStreamUtils.cpp:467 #12 0x0000000102e3dd65 in nsAStreamCopier::PostContinuationEvent() at /Users/ireid/tbird/comm-central/mozilla/xpcom/io/nsStreamUtils.cpp:458 #13 0x0000000102e3dd9b in nsAStreamCopier::OnOutputStreamReady(nsIAsyncOutputStream*) at /Users/ireid/tbird/comm-central/mozilla/xpcom/io/nsStreamUtils.cpp:428 #14 0x00000001012b8525 in nsSocketOutputStream::OnSocketReady(unsigned int) at /Users/ireid/tbird/comm-central/mozilla/netwerk/base/src/nsSocketTransport2.cpp:514 #15 0x00000001012b894d in nsSocketTransport::OnSocketReady(PRFileDesc*, short) at /Users/ireid/tbird/comm-central/mozilla/netwerk/base/src/nsSocketTransport2.cpp:1531 #16 0x00000001012c035c in nsSocketTransportService::DoPollIteration(bool) () #17 0x00000001012c068a in nsSocketTransportService::Run() () #18 0x0000000102e6188e in nsThread::ProcessNextEvent(bool, bool*) () #19 0x0000000102defe26 in NS_ProcessNextEvent_P(nsIThread*, bool) () #20 0x0000000102e623c5 in nsThread::ThreadFunc(void*) at /Users/ireid/tbird/comm-central/mozilla/xpcom/threads/nsThread.cpp:272 IMAP thread for host I was copying to #1 0x00007fff870cb881 in _pthread_cond_wait () #2 0x000000010008c094 in PR_WaitCondVar () #3 0x000000010008c813 in PR_Wait () #4 0x0000000102df3137 in mozilla::ReentrantMonitor::Wait(unsigned int) at /Users/ireid/tbird/objdir-comm-central-permissions/mozilla/xpcom/build/BlockingResourceBase.cpp:346 #5 0x0000000101361e5e in mozilla::ReentrantMonitorAutoEnter::Wait(unsigned int) () #6 0x0000000102e3a09f in nsPipeInputStream::Wait() at /Users/ireid/tbird/comm-central/mozilla/xpcom/io/nsPipe3.cpp:653 #7 0x0000000102e3b697 in nsPipeInputStream::ReadSegments(unsigned int (*)(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*), void*, unsigned int, unsigned int*) () #8 0x0000000102e38f91 in nsPipeInputStream::Read(char*, unsigned int, unsigned int*) () #9 0x0000000102779837 in nsMsgLineStreamBuffer::ReadNextLine(nsIInputStream*, unsigned int&, bool&, unsigned int*, bool) () #10 0x0000000102a2cb26 in nsImapProtocol::CreateNewLineFromSocket() () #11 0x0000000102a3fa2c in nsImapProtocol::EstablishServerConnection() () #12 0x0000000102a44a7b in nsImapProtocol::ProcessCurrentURL() () #13 0x0000000102a35a06 in nsImapProtocol::ImapThreadMainLoop() () #14 0x0000000102a3d189 in nsImapProtocol::Run() () #15 0x0000000102e6188e in nsThread::ProcessNextEvent(bool, bool*) () #16 0x0000000102defe26 in NS_ProcessNextEvent_P(nsIThread*, bool) () #17 0x0000000102e623c5 in nsThread::ThreadFunc(void*) () #18 0x000000010009304b in _pt_root at /Users/ireid/tbird/comm-central/mozilla/nsprpub/pr/src/pthreads/ptthread.c:187
Thanks a lot for taking the time to investigate this. Is this a hang you've seen frequently, or you just saw once?
(In reply to Jason Duell (:jduell) from comment #1) > Thanks a lot for taking the time to investigate this. > > Is this a hang you've seen frequently, or you just saw once? I've only caught this exact backtrace once, but :bienvenu has hung Thunderbird reliably in a few other ways; I'm not sure if his hangs have a thread stopped in exactly the same place or not.
Blocks: 713253
Sid0 is now hitting this pretty reliably on Windows. Confirmed it's the same problem by looking at stack traces.
OS: Mac OS X → All
Please check on bug 711787. It (or the cause of it) might be related if this is something new.
Firefox is seeing quite a few crashes in PR_SetPollableEvent(); worth keeping an eye on these in case they are related: https://bugzilla.mozilla.org/show_bug.cgi?id=709847 https://bugzilla.mozilla.org/show_bug.cgi?id=662330 Also, Bug 649323 appears to be another dupe. It's older, anyone else have a preference as to which direction we set the duplication?
See Also: → 709847, 662330
Blocks: 672913, 649323
Severity: major → critical
Keywords: hang
Blocks: 535070
We may want to make the pollable event's socket or pipe buffer be a bit larger and in PR_WaitForPollableEvent read more then just 1024 bytes.
There are a number of ways we could make the problem more survivable, but I think they'd just end up covering over a more important underlying bug. As far as I can tell, we're already getting into a situation where there are 16k more notifications posted to the network thread than it has been able to read; since we *should* be reading from the notification socket once every time we wake up from select() I'm concerned that there's a nasty loop happening somewhere below the socket thread's main loop, or a path through the main loop that doesn't get to the place where we handle the notification socket.
If we are inside the event handling loop, we may not need to call on PR_SetPollableEvent. If we are inside that loop [1], then we will always loop while events are in the queue. When events are detected prior to call to PR_Poll, we exit immediately from PR_Poll regardless pollable event has been set or not. Though, simpler option would be to prevent duplicate calls to PR_SetPollableEvent between calls to PR_WaitForPollableEvent (may need one more lock enter near PR_WaitForPollableEvent, but precise atomicity here is not that important). [1] http://hg.mozilla.org/mozilla-central/annotate/8ea5c983743f/netwerk/base/src/nsSocketTransportService2.cpp#l635
I agree with both of those points; we don't need to write a byte into the pollable event pipe if we're already in the networking thread, because we'll always process any pending events before we do a blocking poll. We also don't need more than one byte in the pipe for each time around the networking main loop, because we'll process all events no matter how many times PR_SetPollableEvent() is called. However, I don't feel comfortable with fixing the problem in either of these ways until I understand *why* so many bytes are being written into the event pipe during a single pass through the main networking loop. I'm nervous that there is some underlying problem that will be even harder to isolate and fix if we make the pollable event mechanism less sensitive. And if we do find the reason that the pipe is getting clogged, we may not need to make the pollable event more complicated after all.
Blocks: 726432
Summary: Deadlock in nsSocketTransportService → Deadlock in nsSocketTransportService [tbird]
Whiteboard: [necko-would-take]
Summary: Deadlock in nsSocketTransportService [tbird] → Deadlock in nsSocketTransportService [tbird] PR_SetPollableEvent
Whiteboard: [necko-would-take] → [necko-backlog]
Depends on: 662330
Attached patch newpollable (obsolete) — Splinter Review
what do you guys think of replacing mThreadEvent with this? We could easily merge it with Daniel's work. Basically its a necko level implementation of a similar thing to pr_pollable event.. but 1] it fixes the socket thread deadlock where the socket thread blocks on trying to add to the event queue (and the socket thread is the only thing that clears it).. also optimizes away the pipe nonsense in that case 2] never puts more than 1 event in the queue.. so other threads can't block waiting for the socket thread to service it PR_FileDesc2NativeHandle 3] uses non blocking fd's.. so even if we screw up the logic, it still won't block 4] lets us do better error checking and evolve the implementation faster here than having it tied to nspr.. in general less nspr is good.
Attachment #8718619 - Flags: feedback?(dd.mozilla)
Attachment #8718619 - Flags: feedback?(daniel)
Attachment #8718619 - Attachment is patch: true
Attachment #8718619 - Attachment mime type: application/mbox → text/plain
(that version should lose the mlock from the new class - locking is handled in nsSocketTransportService and has to be because of the occasional del/new-replace behavior)
Comment on attachment 8718619 [details] [diff] [review] newpollable Review of attachment 8718619 [details] [diff] [review]: ----------------------------------------------------------------- I'm strongly in favor of this! It will make the system less likely to fill the buffer and block when things go wrong. I'm also thinking we could add a counter in the Signal() method for debug builds at least, and we could use that as a "high water mark" and possibly trigger a warning at some level. "X signals sent without the event having been polled".
Attachment #8718619 - Flags: feedback?(daniel) → feedback+
Comment on attachment 8718619 [details] [diff] [review] newpollable Review of attachment 8718619 [details] [diff] [review]: ----------------------------------------------------------------- Yes, we should do this.
Attachment #8718619 - Flags: feedback?(dd.mozilla)
Attachment #8718619 - Flags: feedback?
Attachment #8718619 - Flags: feedback+
Comment on attachment 8718619 [details] [diff] [review] newpollable Review of attachment 8718619 [details] [diff] [review]: ----------------------------------------------------------------- ::: netwerk/base/PollableEvent.cpp @@ +97,5 @@ > + if (!mWriteFD) { > + SOCKET_LOG(("PollableEventSignal Failed on no FD\n")); > + return false; > + } > + if(PR_GetCurrentThread() == gSocketThread) { if( ::: netwerk/base/nsSocketTransportService2.cpp @@ +1131,5 @@ > DetachSocket(mActiveList, &mActiveList[i]); > } > > if (n != 0 && mPollList[0].out_flags == PR_POLL_READ) { > + DebugMutexAutoLock lock(mLock); designing it that we don't need this lock being entered every time would be nice I could see the pollable event be a direct member of nsSocketTransportService and just use its internal lock for this "reset" case
I actually have a lot of thoughts on locking :) First, reads to writes exist as a 1:N ratio (because we read out N events in one call.) So the write path is more important. In m-c all writes are actually double locked. Once by nsSocketTransportService::mLock (protecting the swap of a new mThreadEvent) and once at the system level in write(). The new code improves that signficantly: * a subset of writes now take 0 locks because they are invoked on the socket thread. Anecdotally that's 20+% of them from a smoke test. * Another subset of writes now take just 1 lock (the STS mutex) because they can skip the pipe write() if mSignaled is already set. Anecdotally that's 50+% of them. So that's a reduction of ~45% locks on the write path. Now the read path, as noted, picks up a lock it didn't have before (except on case of failure, which we will ignore). This is because it needs to create a sequence point for the write path so that it can atomically consider whether or not to skip the write() if mSignaled is already set - it allows us to skip a lot of IO operations. So the write path is roughly halved and the read path is roughly doubled.. but we call read less than write, so that's a win I think :) I'm not sure what inlining the pollableevent into STS does to make this better.. I think the same amount of locking still has to go on.
Assignee: nobody → mcmanus
Whiteboard: [necko-backlog] → [necko-active]
Attached patch mozilla::net::PollableEvent (obsolete) — Splinter Review
Attachment #8718994 - Flags: review?(honzab.moz)
Comment on attachment 8718994 [details] [diff] [review] mozilla::net::PollableEvent Review of attachment 8718994 [details] [diff] [review]: ----------------------------------------------------------------- double up the review here
Attachment #8718994 - Flags: review?(dd.mozilla)
updated patch coming with explanation of change.. but just in case you've been looking at the old one, this is the interdiff diff --git a/netwerk/base/nsSocketTransportService2.cpp b/netwerk/base/nsSocketTransportService2.cpp index 5af9887..2d7e33d 100644 --- a/netwerk/base/nsSocketTransportService2.cpp +++ b/netwerk/base/nsSocketTransportService2.cpp @@ -428,43 +428,46 @@ nsSocketTransportService::PollTimeout() SOCKET_LOG(("poll timeout: none\n")); return NS_SOCKET_POLL_TIMEOUT; } SOCKET_LOG(("poll timeout: %lu\n", minR)); return PR_SecondsToInterval(minR); } int32_t -nsSocketTransportService::Poll(bool wait, uint32_t *interval, +nsSocketTransportService::Poll(uint32_t *interval, TimeDuration *pollDuration) { PRPollDesc *pollList; uint32_t pollCount; PRIntervalTime pollTimeout; *pollDuration = 0; + // If there are pending events for this thread then + // DoPollIteration() should service the network without blocking. + bool pendingEvents = false; + mThread->HasPendingEvents(&pendingEvents); + if (mPollList[0].fd) { mPollList[0].out_flags = 0; pollList = mPollList; pollCount = mActiveCount + 1; - pollTimeout = PollTimeout(); + pollTimeout = pendingEvents ? PR_INTERVAL_NO_WAIT : PollTimeout(); } else { // no pollable event, so busy wait... pollCount = mActiveCount; if (pollCount) pollList = &mPollList[1]; else pollList = nullptr; - pollTimeout = PR_MillisecondsToInterval(25); + pollTimeout = + pendingEvents ? PR_INTERVAL_NO_WAIT : PR_MillisecondsToInterval(25); } - if (!wait) - pollTimeout = PR_INTERVAL_NO_WAIT; - PRIntervalTime ts = PR_IntervalNow(); TimeStamp pollStart; if (mTelemetryEnabledPref) { pollStart = TimeStamp::NowLoRes(); } SOCKET_LOG((" timeout = %i milliseconds\n", @@ -756,16 +759,25 @@ nsSocketTransportService::SetAutodialEnabled(bool value) { mAutodialEnabled = value; return NS_OK; } NS_IMETHODIMP nsSocketTransportService::OnDispatchedEvent(nsIThreadInternal *thread) { + if(PR_GetCurrentThread() == gSocketThread) { + // this check is redundant to one done inside ::Signal(), but + // we can do it here and skip obtaining the lock - given that + // this is a relatively common occurance its worth the + // redundant code + SOCKET_LOG(("OnDispatchedEvent Same Thread Skip Signal\n")); + return NS_OK; + } + DebugMutexAutoLock lock(mLock); if (mPollableEvent) { mPollableEvent->Signal(); } return NS_OK; } NS_IMETHODIMP @@ -843,50 +855,43 @@ nsSocketTransportService::Run() // If there is too many pending events queued, we will run some poll() // between them and the following variable is cumulative time spent // blocking in poll(). TimeDuration pollDuration; for (;;) { bool pendingEvents = false; - thread->HasPendingEvents(&pendingEvents); numberOfPendingEvents = 0; numberOfPendingEventsLastCycle = 0; if (mTelemetryEnabledPref) { startOfCycleForLastCycleCalc = TimeStamp::NowLoRes(); startOfNextIteration = TimeStamp::NowLoRes(); } pollDuration = 0; do { if (mTelemetryEnabledPref) { pollCycleStart = TimeStamp::NowLoRes(); } - // If there are pending events for this thread then - // DoPollIteration() should service the network without blocking. - DoPollIteration(!pendingEvents, &singlePollDuration); + DoPollIteration(&singlePollDuration); if (mTelemetryEnabledPref && !pollCycleStart.IsNull()) { Telemetry::Accumulate(Telemetry::STS_POLL_BLOCK_TIME, singlePollDuration.ToMilliseconds()); Telemetry::AccumulateTimeDelta( Telemetry::STS_POLL_CYCLE, pollCycleStart + singlePollDuration, TimeStamp::NowLoRes()); pollDuration += singlePollDuration; } - // If nothing was pending before the poll, it might be now - if (!pendingEvents) { - thread->HasPendingEvents(&pendingEvents); - } - + thread->HasPendingEvents(&pendingEvents); if (pendingEvents) { if (!mServingPendingQueue) { nsresult rv = Dispatch(NS_NewRunnableMethod(this, &nsSocketTransportService::MarkTheLastElementOfPendingQueue), nsIEventTarget::DISPATCH_NORMAL); if (NS_FAILED(rv)) { NS_WARNING("Could not dispatch a new event on the " "socket thread."); @@ -1002,22 +1007,21 @@ nsSocketTransportService::Reset(bool aGuardLocals) DetachSocketWithGuard(aGuardLocals, mActiveList, i); } for (i = mIdleCount - 1; i >= 0; --i) { DetachSocketWithGuard(aGuardLocals, mIdleList, i); } } nsresult -nsSocketTransportService::DoPollIteration(bool wait, TimeDuration *pollDuration) +nsSocketTransportService::DoPollIteration(TimeDuration *pollDuration) { - SOCKET_LOG(("STS poll iter [%d]\n", wait)); + SOCKET_LOG(("STS poll iter\n")); int32_t i, count; - // // poll loop // // walk active list backwards to see if any sockets should actually be // idle, then walk the idle list backwards to see if any idle sockets // should become active. take care to check only idle sockets that // were idle to begin with ;-) // @@ -1059,28 +1063,26 @@ nsSocketTransportService::DoPollIteration(bool wait, TimeDuration *pollDuration) #if defined(XP_WIN) // 30 active connections is the historic limit before firefox 7's 256. A few // windows systems have troubles with the higher limit, so actively probe a // limit the first time we exceed 30. if ((mActiveCount > 30) && !mProbedMaxCount) ProbeMaxCount(); #endif - // Measures seconds spent while blocked on PR_Poll uint32_t pollInterval; - int32_t n = 0; #if !defined(MOZILLA_XPCOMRT_API) if (!gIOService->IsNetTearingDown()) { // Let's not do polling during shutdown. - n = Poll(wait, &pollInterval, pollDuration); + n = Poll(&pollInterval, pollDuration); } #else - n = Poll(wait, &pollInterval, pollDuration); + n = Poll(&pollInterval, pollDuration); #endif // defined(MOZILLA_XPCOMRT_API) if (n < 0) { SOCKET_LOG((" PR_Poll error [%d] os error [%d]\n", PR_GetError(), PR_GetOSError())); } else { // diff --git a/netwerk/base/nsSocketTransportService2.h b/netwerk/base/nsSocketTransportService2.h index b27e034..a393188 100644 --- a/netwerk/base/nsSocketTransportService2.h +++ b/netwerk/base/nsSocketTransportService2.h @@ -194,21 +194,19 @@ private: // // first element of the poll list is mThreadEvent (or null if the pollable // event cannot be created). //------------------------------------------------------------------------- PRPollDesc *mPollList; /* mListSize + 1 entries */ PRIntervalTime PollTimeout(); // computes ideal poll timeout - nsresult DoPollIteration(bool wait, - mozilla::TimeDuration *pollDuration); + nsresult DoPollIteration(mozilla::TimeDuration *pollDuration); // perfoms a single poll iteration - int32_t Poll(bool wait, - uint32_t *interval, + int32_t Poll(uint32_t *interval, mozilla::TimeDuration *pollDuration); // calls PR_Poll. the out param // interval indicates the poll // duration in seconds. // pollDuration is used only for // telemetry //-------------------------------------------------------------------------
Attached patch mozilla::net::PollableEvent (obsolete) — Splinter Review
Attachment #8719034 - Flags: review?(honzab.moz)
Attachment #8719034 - Flags: review?(dd.mozilla)
Attachment #8718994 - Attachment is obsolete: true
Attachment #8718994 - Flags: review?(honzab.moz)
Attachment #8718994 - Flags: review?(dd.mozilla)
basically "hasPendingEvents" was determined on the socket thread and then passed to dopolliteration which passed it to poll which used it to set the timeout to pr_poll(). however, dopolliteration can call DetachSocket() which can have the side effect of getting ::OnDispatchedEvent called.. which meant that the hasPendingEvents value previously calculated was now inaccurate. so I stopped passing that value along the stack and just calculated it in poll() close to where it is used. I also got rid of a redundant call to PollTimeout() [which separately is an absurd legacy function we could improve] in the fairly common case where hasPendingEvents is true and the PollTimeout() result was just overwritten. I also duplicated the "same-thread?" check in ::Signal outside of the function in its most common path - OnDispatchedEvent simply because it can be done without the lock and is a common enough occurrance to warrant the code duplication for optimization.
Attached patch mozilla::net::PollableEvent (obsolete) — Splinter Review
Attachment #8719163 - Flags: review?(honzab.moz)
Attachment #8719163 - Flags: review?(dd.mozilla)
Attachment #8719034 - Attachment is obsolete: true
Attachment #8719034 - Flags: review?(honzab.moz)
Attachment #8719034 - Flags: review?(dd.mozilla)
(In reply to Patrick McManus [:mcmanus] from comment #25) > Created attachment 8719163 [details] [diff] [review] > mozilla::net::PollableEvent just nullcheck mthread on shutdown diff --git a/netwerk/base/nsSocketTransportService2.cpp b/netwerk/base/nsSocketTransportService2.cpp --- a/netwerk/base/nsSocketTransportService2.cpp +++ b/netwerk/base/nsSocketTransportService2.cpp @@ -439,17 +439,21 @@ nsSocketTransportService::Poll(uint32_t PRPollDesc *pollList; uint32_t pollCount; PRIntervalTime pollTimeout; *pollDuration = 0; // If there are pending events for this thread then // DoPollIteration() should service the network without blocking. bool pendingEvents = false; - mThread->HasPendingEvents(&pendingEvents); + if (mThread) { + mThread->HasPendingEvents(&pendingEvents); + } else { // shutdown awaits! + pendingEvents = true; + } if (mPollList[0].fd) { mPollList[0].out_flags = 0; pollList = mPollList; pollCount = mActiveCount + 1; pollTimeout = pendingEvents ? PR_INTERVAL_NO_WAIT : PollTimeout(); } else {
Comment on attachment 8719163 [details] [diff] [review] mozilla::net::PollableEvent Review of attachment 8719163 [details] [diff] [review]: ----------------------------------------------------------------- I am giving it a r+ because the first comment can, should be done in a separate bug. ::: netwerk/base/PollableEvent.cpp @@ +109,5 @@ > + if (status != 1) { > + NS_WARNING("PollableEventSignal Failed\n"); > + SOCKET_LOG(("PollableEventSignal Failed\n")); > + } > + return (status == 1); This function can be used to discover that we are hanging in a poll and do some repairs. Maybe closing of mReadFD will cause the hanging thread to get out of poll? I know that serving onsocketready can take longer time, but we can try to repair the pipe if the signal() call, that have set mSignaled(so the first one in a cycle) was something like 500ms ago. Also if we fail in Signal() we can try to close mReadFD, or in other way cause an error on mReadFD, maybe that will cause the poll to return. All of this only on windows :) ::: netwerk/base/nsSocketTransportService2.cpp @@ +131,5 @@ > nsSocketTransportService::~nsSocketTransportService() > { > NS_ASSERTION(NS_IsMainThread(), "wrong thread"); > NS_ASSERTION(!mInitialized, "not shutdown properly"); > Remove blank space when you are already here. @@ +445,5 @@ > + // DoPollIteration() should service the network without blocking. > + bool pendingEvents = false; > + if (mThread) { > + mThread->HasPendingEvents(&pendingEvents); > + } else { // shutdown awaits! Can you explain this.
Attachment #8719163 - Flags: review?(dd.mozilla) → review+
Comment on attachment 8719163 [details] [diff] [review] mozilla::net::PollableEvent Review of attachment 8719163 [details] [diff] [review]: ----------------------------------------------------------------- The mMaxTimePerPollIter thing is IMO wrong (now I'm not complaining about this patch). It could happen that a previously scheduled event may be executed later than a next socket handler is executed. I mean: expected order: - dispatch(a) - a->Run() - OnSocketReady but we may do: - dispatch(a) - OnSocketReady() -> I assume and behave as if |a| ran (but actually didn't) - a->Run() I don't know if there is a code that might suffer from this reordering, but I think it would be good to avoid it anyway. I tho still smell a dead fish here.. the code is pretty much overcomplicated and something can be overlooked. ::: netwerk/base/PollableEvent.cpp @@ +75,5 @@ > + > +PollableEvent::~PollableEvent() > +{ > + if (mWriteFD) { > + PR_Close(mWriteFD); to prevent being blocked, maybe read first from the read side? easily done with Clear() call. would use of shutdown() lower any possibility of being blocked here? @@ +96,5 @@ > + if (!mWriteFD) { > + SOCKET_LOG(("PollableEventSignal Failed on no FD\n")); > + return false; > + } > + if(PR_GetCurrentThread() == gSocketThread) { if( @@ +107,5 @@ > + mSignaled = true; > + int32_t status = PR_Write(mWriteFD, "M", 1); > + if (status != 1) { > + NS_WARNING("PollableEventSignal Failed\n"); > + SOCKET_LOG(("PollableEventSignal Failed\n")); "PollableEvent::Signal" ? when people are going to search for it in the code... applies to other logs as well @@ +115,5 @@ > + > +bool > +PollableEvent::Clear() > +{ > + // necessary because of the "dont signal on socket thread optimization " at the end @@ +121,5 @@ > + > + SOCKET_LOG(("PollableEventClear\n")); > + mSignaled = false; > + char buf[2048]; > + int32_t status = PR_Read(mReadFD, buf, 2048); Maybe do a non-null check on mReadFD just for sure? @@ +133,5 @@ > + } > + if (status > 1) { > + MOZ_ASSERT(false); > + SOCKET_LOG(("PollableEventClear Unexpected events\n")); > + return true; maybe call Clear() recursively (maybe with a limit) here to wipe the whole queue? I know this must not happen, but anyway. up to you. ::: netwerk/base/nsSocketTransportService2.cpp @@ +453,5 @@ > if (mPollList[0].fd) { > mPollList[0].out_flags = 0; > pollList = mPollList; > pollCount = mActiveCount + 1; > + pollTimeout = pendingEvents ? PR_INTERVAL_NO_WAIT : PollTimeout(); this is a very good change! before that we could get stuck: - there is no active socket - we loop through the pending events, but not make it on mMaxTimePerPollIter time - there are events pending left - we go to PR_Poll with an arbitrary timeout value - we don't wake up sooner until someone posts something @@ +767,5 @@ > > NS_IMETHODIMP > nsSocketTransportService::OnDispatchedEvent(nsIThreadInternal *thread) > { > + if(PR_GetCurrentThread() == gSocketThread) { if( @@ -1064,5 @@ > // limit the first time we exceed 30. > if ((mActiveCount > 30) && !mProbedMaxCount) > ProbeMaxCount(); > #endif > - patrick please don't remove blank lines this way. well put blank lines make code much more easily readable. @@ +1154,2 @@ > } > + if (!mPollableEvent) { maybe just: mPollableEvent.reset(new PollableEvent()); if (!mPollableEvent->Valid()) { mPollableEvent = nullptr; NS_WARN... etc? }
Attachment #8719163 - Flags: review?(honzab.moz)
Attachment #8719163 - Flags: review?(dd.mozilla)
Attachment #8719163 - Flags: review+
Attachment #8719163 - Flags: review?(dd.mozilla) → review+
(In reply to Honza Bambas (:mayhemer) from comment #29) > Comment on attachment 8719163 [details] [diff] [review] > mozilla::net::PollableEvent > > Review of attachment 8719163 [details] [diff] [review]: > ----------------------------------------------------------------- > > The mMaxTimePerPollIter thing is IMO wrong (now I'm not complaining about > this patch). It could happen that a previously scheduled event may be > executed later than a next socket handler is executed. I mean: > > expected order: > - dispatch(a) > - a->Run() > - OnSocketReady > > but we may do: > - dispatch(a) > - OnSocketReady() -> I assume and behave as if |a| ran (but actually didn't) > - a->Run() > > > I don't know if there is a code that might suffer from this reordering, but > I think it would be good to avoid it anyway. > > > I tho still smell a dead fish here.. the code is pretty much > overcomplicated and something can be overlooked. This use to be even worse. We use to do only one event per cycle so it was always: - dispatch(a) - OnSocketReady() - a->Run() adding mMaxTimePerPollIter make it do more than one event. > ::: netwerk/base/nsSocketTransportService2.cpp > @@ +453,5 @@ > > if (mPollList[0].fd) { > > mPollList[0].out_flags = 0; > > pollList = mPollList; > > pollCount = mActiveCount + 1; > > + pollTimeout = pendingEvents ? PR_INTERVAL_NO_WAIT : PollTimeout(); > > this is a very good change! before that we could get stuck: > > - there is no active socket > - we loop through the pending events, but not make it on mMaxTimePerPollIter > time > - there are events pending left > - we go to PR_Poll with an arbitrary timeout value > - we don't wake up sooner until someone posts something > No, we were always checking if there is events in queue before calling poll: http://mxr.mozilla.org/mozilla-central/source/netwerk/base/nsSocketTransportService2.cpp#912 and http://mxr.mozilla.org/mozilla-central/source/netwerk/base/nsSocketTransportService2.cpp#883
(In reply to Dragana Damjanovic [:dragana] from comment #30) > (In reply to Honza Bambas (:mayhemer) from comment #29) > > Comment on attachment 8719163 [details] [diff] [review] > > mozilla::net::PollableEvent > > > > Review of attachment 8719163 [details] [diff] [review]: > > ----------------------------------------------------------------- > > > > The mMaxTimePerPollIter thing is IMO wrong (now I'm not complaining about > > this patch). It could happen that a previously scheduled event may be > > executed later than a next socket handler is executed. I mean: > > > > expected order: > > - dispatch(a) > > - a->Run() > > - OnSocketReady > > > > but we may do: > > - dispatch(a) > > - OnSocketReady() -> I assume and behave as if |a| ran (but actually didn't) > > - a->Run() > > > > > > I don't know if there is a code that might suffer from this reordering, but > > I think it would be good to avoid it anyway. > > > > > > I tho still smell a dead fish here.. the code is pretty much > > overcomplicated and something can be overlooked. > > This use to be even worse. We use to do only one event per cycle so it was > always: > - dispatch(a) > - OnSocketReady() > - a->Run() > > adding mMaxTimePerPollIter make it do more than one event. So, what is the status with this patch? > > > > ::: netwerk/base/nsSocketTransportService2.cpp > > @@ +453,5 @@ > > > if (mPollList[0].fd) { > > > mPollList[0].out_flags = 0; > > > pollList = mPollList; > > > pollCount = mActiveCount + 1; > > > + pollTimeout = pendingEvents ? PR_INTERVAL_NO_WAIT : PollTimeout(); > > > > this is a very good change! before that we could get stuck: > > > > - there is no active socket > > - we loop through the pending events, but not make it on mMaxTimePerPollIter > > time > > - there are events pending left > > - we go to PR_Poll with an arbitrary timeout value > > - we don't wake up sooner until someone posts something > > > > No, we were always checking if there is events in queue before calling poll: > http://mxr.mozilla.org/mozilla-central/source/netwerk/base/ > nsSocketTransportService2.cpp#912 > and > http://mxr.mozilla.org/mozilla-central/source/netwerk/base/ > nsSocketTransportService2.cpp#883 Maybe I'm wrong on this point. Any event dispatch sets the pollable event. So PR_Poll should have exit immediately.
Attachment #8718619 - Attachment is obsolete: true
Blocks: 662330
No longer depends on: 662330
Comment on attachment 8719163 [details] [diff] [review] mozilla::net::PollableEvent Review of attachment 8719163 [details] [diff] [review]: ----------------------------------------------------------------- ::: netwerk/base/PollableEvent.cpp @@ +109,5 @@ > + if (status != 1) { > + NS_WARNING("PollableEventSignal Failed\n"); > + SOCKET_LOG(("PollableEventSignal Failed\n")); > + } > + return (status == 1); I agree - other bug. I think we should try and land the more extreme "OMG this is broken" fixes in daniel's patch.. ::: netwerk/base/nsSocketTransportService2.cpp @@ +445,5 @@ > + // DoPollIteration() should service the network without blocking. > + bool pendingEvents = false; > + if (mThread) { > + mThread->HasPendingEvents(&pendingEvents); > + } else { // shutdown awaits! ha - good comment. It was obviously null here: http://archive.mozilla.org/pub/firefox/try-builds/mcmanus@ducksong.com-fbcbc97040eed0a7c139bfe9a7454b044612083b/try-macosx64/try_yosemite_r7_test-xpcshell-bm107-tests1-macosx-build120.txt.gz from the try run in comment 22 (which didn't have this check) but looking at it harder, this is actually at startup - a totally different race. Thanks! mthread is done locked on the main thread after init.. so the thread can already be running and not know mthread. I've changed this usage of it to be a raw pointer (accessible only to socket thread) set in the run method. @@ +1143,2 @@ > // acknowledge pollable event (wait should not block) > + if (!mPollableEvent || !mPollableEvent->Clear()) { I updated this to mPollableEvent && !pe->clear().. fyi
Comment on attachment 8719163 [details] [diff] [review] mozilla::net::PollableEvent Review of attachment 8719163 [details] [diff] [review]: ----------------------------------------------------------------- ::: netwerk/base/nsSocketTransportService2.cpp @@ +832,1 @@ > mPollList[0].in_flags = PR_POLL_READ; forgot one important thing: can you also add PR_POLL_EXCEPT here?
Attached patch mozilla::net::PollableEvent (obsolete) — Splinter Review
Attachment #8719163 - Attachment is obsolete: true
Attachment #8720332 - Flags: review+
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Blocks: 1249711
May crash https://crash-stats.mozilla.com/report/index/03bfefe5-c623-4744-a992-b296d2160221 be triggered by this commit? Crashed 100% on startup.
https://hg.mozilla.org/integration/mozilla-inbound/rev/6183d3a293a98ef1313fd57bbbdafcee0783b29f Bug 698882 - backout threadevent changes due to crashes with McAfee Firefox Hook r=backout
(In reply to Alexander Usyskin from comment #43) > May crash > https://crash-stats.mozilla.com/report/index/03bfefe5-c623-4744-a992- > b296d2160221 be triggered by this commit? Crashed 100% on startup. yeah it looks like gcagff.dll (McAfee Firefox Hook) is crashing with this patch. lovely what those damn LSPs do. But its just on ctor of the object so it shouldn't be hard to figure out. I backed it out and will test and fix with that software if I can install it. Thanks for reporting it.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [necko-active] → [necko-active][leave-open]
(In reply to Patrick McManus [:mcmanus] from comment #45) > (In reply to Alexander Usyskin from comment #43) > > May crash > > https://crash-stats.mozilla.com/report/index/03bfefe5-c623-4744-a992- > > b296d2160221 be triggered by this commit? Crashed 100% on startup. > > yeah it looks like gcagff.dll (McAfee Firefox Hook) is crashing with this > patch. lovely what those damn LSPs do. But its just on ctor of the object so > it shouldn't be hard to figure out. > > I backed it out and will test and fix with that software if I can install it. > > Thanks for reporting it. afaict fgcagff.dll is the mcafee data loss prevention module.. which doesn't seem to be a consumer grade item that I can easily download and test with. If I make a few guesses at how to resolve this, would you Alexander be able to test the builds?
Flags: needinfo?(sanniu)
Depends on: 1250979
https://bugzilla.mozilla.org/show_bug.cgi?id=1250979 has info on a talos test regression that needs to be sorted out before relanding
(In reply to Patrick McManus [:mcmanus] from comment #48) > https://bugzilla.mozilla.org/show_bug.cgi?id=1250979 has info on a talos > test regression that needs to be sorted out before relanding That was wrong on my part. The "regression" was on 6183d3a293a9 which was the backout of this patch (due to crashes). So the patch is actually a performance win (as expected). good news. I do like this patch.
(In reply to Patrick McManus [:mcmanus] from comment #47) > (In reply to Patrick McManus [:mcmanus] from comment #45) > > (In reply to Alexander Usyskin from comment #43) > > > May crash > > > https://crash-stats.mozilla.com/report/index/03bfefe5-c623-4744-a992- > > > b296d2160221 be triggered by this commit? Crashed 100% on startup. > > > > yeah it looks like gcagff.dll (McAfee Firefox Hook) is crashing with this > > patch. lovely what those damn LSPs do. But its just on ctor of the object so > > it shouldn't be hard to figure out. > > > > I backed it out and will test and fix with that software if I can install it. > > > > Thanks for reporting it. > > afaict fgcagff.dll is the mcafee data loss prevention module.. which doesn't > seem to be a consumer grade item that I can easily download and test with. > > If I make a few guesses at how to resolve this, would you Alexander be able > to test the builds? Yes, I would be glad to help. I can test you engineering builds.
Flags: needinfo?(sanniu)
(oh, and those builds just started.. so if you're too quick to download them you might need to wait a few hours for them to be complete)
Flags: needinfo?(sanniu)
thanks. any links from about:crashes ?
pr_getnameforidentity (and much of nspr) does some definitely non-threadsafe lazy initialization behavior.. but the xpcom startup code happens to run this at a safe single threaded early time (PR_Getenv() triggers the code).. so one more reason to be suspicious of nspr, but unlikely the issue here.
pr_getnameforidentity is part of nspr, but is not used by gecko - it is used by this dll (and is what crashes). it is definitely not threadsafe with respect to PR_GetUniqueIdentity() which is widely used in gecko, and is used on several threads. so that's bad. But I'm not sure why this patch would expose that problem.
Hi Alexander, thanks for your help here. I have a few more builds for you to try - I'm sorry I don't have a better way to test this out. I'm not sure if the answer is a race condition inside NSPR or a side effect of how the library is implemented. Although I'm confident I've identified a race condition, your crash is very consistent and that makes it seem like it probably isn't the underlying cause. So I have 6 builds, can you download them and try them out? https://treeherder.mozilla.org/#/jobs?repo=try&revision=d0341e69020e https://treeherder.mozilla.org/#/jobs?repo=try&revision=780745ff0752 https://treeherder.mozilla.org/#/jobs?repo=try&revision=a5e540f3069a https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa7132a84883 https://treeherder.mozilla.org/#/jobs?repo=try&revision=844426095377 https://treeherder.mozilla.org/#/jobs?repo=try&revision=d87341ca8d00
Flags: needinfo?(sanniu)
(In reply to Patrick McManus [:mcmanus] from comment #71) > Hi Alexander, thanks for your help here. I have a few more builds for you to > try - I'm sorry I don't have a better way to test this out. I'm not sure if > the answer is a race condition inside NSPR or a side effect of how the > library is implemented. Although I'm confident I've identified a race > condition, your crash is very consistent and that makes it seem like it > probably isn't the underlying cause. > > So I have 6 builds, can you download them and try them out? > > https://treeherder.mozilla.org/#/jobs?repo=try&revision=d0341e69020e crash https://crash-stats.mozilla.com/report/index/bp-a1f9c531-d78a-43c2-9515-1bca92160316 https://crash-stats.mozilla.com/report/index/bp-bfc9868e-b934-4ae6-a06e-c91a72160316 > https://treeherder.mozilla.org/#/jobs?repo=try&revision=780745ff0752 ok > https://treeherder.mozilla.org/#/jobs?repo=try&revision=a5e540f3069a ok > https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa7132a84883 crash https://crash-stats.mozilla.com/report/index/bp-1f32cf24-9e06-4ba7-992b-21cbb2160316 > https://treeherder.mozilla.org/#/jobs?repo=try&revision=844426095377 crash https://crash-stats.mozilla.com/report/index/bp-283e5a8a-3290-4af5-abbf-186c32160316 > https://treeherder.mozilla.org/#/jobs?repo=try&revision=d87341ca8d00 crash https://crash-stats.mozilla.com/report/index/bp-8dc95941-e4a1-4f72-afd5-6f7b02160316 Second build and all subsequent triggered "unknown app" win8.1 protection window
Flags: needinfo?(sanniu)
Blocks: 1158189
Alexander, huge help here. Thank you. McAfee seems to be hardcoded to expect an NSPR layer on top of the read side of the localhost pair (as it happens to be in released code). We can give it one fairly painlessly, which seems to be the path of least resistance. (that's what worked in builds 2 and 3).
Attached patch mozilla::net::PollableEvent (obsolete) — Splinter Review
Attachment #8731262 - Flags: review?(dd.mozilla)
Attachment #8720332 - Attachment is obsolete: true
Comment on attachment 8731262 [details] [diff] [review] mozilla::net::PollableEvent Review of attachment 8731262 [details] [diff] [review]: ----------------------------------------------------------------- The primary changes you will find here 1] pollableEvent ctor moved to socket thread. No reason to do that stuff during module construction on the main thread. Also makes the new lazy init threadsafe to do 2] an empty nspr layer that mirrors the one that nspr pollable event was using. It seems we have an LSP hardcoded to believe that is there. Its ugly but I think performance wise its a lateral move.. the same thing happens on current releases and is dwarfed by the IO event anyhow (plus we call this path less which is a benefit of the patch overall). I only put it on the socket and not the pipe. 3] mem leak trackers for PollableEvent
Comment on attachment 8731262 [details] [diff] [review] mozilla::net::PollableEvent Review of attachment 8731262 [details] [diff] [review]: ----------------------------------------------------------------- oh, it also makes the rwin change that should satisfy the "sg tcp optimizer" case. Can we verify that?
(In reply to Patrick McManus [:mcmanus] from comment #77) > Comment on attachment 8731262 [details] [diff] [review] > mozilla::net::PollableEvent > > Review of attachment 8731262 [details] [diff] [review]: > ----------------------------------------------------------------- > > oh, it also makes the rwin change that should satisfy the "sg tcp optimizer" > case. Can we verify that? I will test it.
comment 79 brings a localhost/pair implementation into PollableEvent (instead of using PR_NewTCPSocketPair()) so that the recv window can be set before connection, required by the "sg tcp optimizer" bug. let's see how try feels about that before asking for review :)
See Also: → 1257250
Attached file old.pcap
this is firefox release.
Attached file new.pcap
This is the last try build comment 79. As soon as you change rwin auto tuning is turned off for the socket. So the "sg tcp optimier" bug is fixed.
Attached patch mozilla::net::PollableEvent (obsolete) — Splinter Review
Attachment #8731392 - Flags: review?(dd.mozilla)
Attachment #8731262 - Attachment is obsolete: true
Attachment #8731262 - Flags: review?(dd.mozilla)
Thanks so much for those tests! One thing worries me.. in the new.cap the rwin for the receiver seems to be perpetually shrinking with each ack. As if firefox is not reading bytes into the application layer. but I've played around with this on linux and see the same thing and then it sort of warps around and catches up after a few thousand. I've added logging and (on linux - when socket is forced) the reads and writes are balanced. It is probably the auto-tuning algorithm that I'm seeing.. still.. But just to be sure, can you run a test on windows on the "sg tcp optimizer" host that moves around 64K events or whatever you need to be sure we won't exhaust that window? how did you capture localhost traffic on windows?
Attached patch mozilla::net::PollableEvent (obsolete) — Splinter Review
Attachment #8731408 - Flags: review?(dd.mozilla)
Attachment #8731392 - Attachment is obsolete: true
Attachment #8731392 - Flags: review?(dd.mozilla)
alright, I'll stop fussing with it now :) the comment 86 patch contains logs for the read/write return values so we can see if for some reason they don't balance.. and it also skips creating the pollable event if we're on the child process (as this will create a unused localhost pair in the child.. which we must be doing with the classic mThreadEvent :()
(In reply to Patrick McManus [:mcmanus] from comment #84) > Thanks so much for those tests! One thing worries me.. > > in the new.cap the rwin for the receiver seems to be perpetually shrinking > with each ack. As if firefox is not reading bytes into the application layer. > > but I've played around with this on linux and see the same thing and then it > sort of warps around and catches up after a few thousand. I've added logging > and (on linux - when socket is forced) the reads and writes are balanced. It > is probably the auto-tuning algorithm that I'm seeing.. still.. > > But just to be sure, can you run a test on windows on the "sg tcp optimizer" > host that moves around 64K events or whatever you need to be sure we won't > exhaust that window? how did you capture localhost traffic on windows? I will test it tomorrow. RawCap can capture localhost on windows.
Attachment #8731417 - Flags: review?(dd.mozilla)
Attachment #8731408 - Attachment is obsolete: true
Attachment #8731408 - Flags: review?(dd.mozilla)
re comment 89 - I put pollableEvent back on the child process because RTC needs the full sockettransport stack there
(In reply to Patrick McManus [:mcmanus] from comment #84) > Thanks so much for those tests! One thing worries me.. > > in the new.cap the rwin for the receiver seems to be perpetually shrinking > with each ack. As if firefox is not reading bytes into the application layer. > > but I've played around with this on linux and see the same thing and then it > sort of warps around and catches up after a few thousand. I've added logging > and (on linux - when socket is forced) the reads and writes are balanced. It > is probably the auto-tuning algorithm that I'm seeing.. still.. > > But just to be sure, can you run a test on windows on the "sg tcp optimizer" > host that moves around 64K events or whatever you need to be sure we won't > exhaust that window? how did you capture localhost traffic on windows? I have tested this and it always decrease until 64076 an than increase back to 65535. It is the TCP implementation. (I have capture around 330000 packet - around 160000 writes)
Blocks: 1248358
Attachment #8731417 - Flags: review?(dd.mozilla) → review+
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/01c747d7ca4f for leaking on OS X 10.6, https://treeherder.mozilla.org/logviewer.html#?job_id=24198571&repo=mozilla-inbound 10.6 debug being best known for slow, I wouldn't be surprised if it also winds up leaking on WinXP, but that hasn't finished and run tests yet, thanks to a multi-hour build outage.
Or, how about a much stranger Windows result, 100% https://treeherder.mozilla.org/logviewer.html#?job_id=24199956&repo=mozilla-inbound in xperf.
(In reply to Phil Ringnalda (:philor) from comment #95) > Or, how about a much stranger Windows result, 100% > https://treeherder.mozilla.org/logviewer.html#?job_id=24199956&repo=mozilla- > inbound in xperf. talos bug! We've zero'd out two counters :) That's good news and more or less expected. before this patch: 07:39 INFO - COUNTER 'nonmain_normal_netio': [186735738.0] 20:07:39 INFO - COUNTER 'mainthread_writebytes': [[32, 'C:\\ProgramData\\NVIDIA Corporation\\Drs\\nvAppTimestamps'], [782, 'C:\\Users\\cltbld\\AppData\\Local\\Temp\\tmpjxtq4x\\profile\\compatibility.ini']] 20:07:39 INFO - COUNTER 'main_startup_fileio': [42398765.0] 20:07:39 INFO - COUNTER 'main_normal_netio': [3674613.0] 20:07:39 INFO - COUNTER 'nonmain_startup_fileio': [309099.0] 20:07:39 INFO - COUNTER 'main_startup_netio': [13.0] 20:07:39 INFO - COUNTER 'nonmain_normal_fileio': [476665801.0] 20:07:39 INFO - COUNTER 'main_normal_fileio': [51818957.0] after this patch: 20:06:38 INFO - COUNTER 'nonmain_normal_netio': [187687767.0] 20:06:38 INFO - COUNTER 'mainthread_writebytes': [[782, 'C:\\Users\\cltbld\\AppData\\Local\\Temp\\tmp151bji\\profile\\compatibility.ini'], [32, 'C:\\ProgramData\\NVIDIA Corporation\\Drs\\nvAppTimestamps']] 20:06:38 INFO - COUNTER 'main_startup_fileio': [42402881.0] 20:06:38 INFO - COUNTER 'main_normal_netio': [] 20:06:38 INFO - COUNTER 'nonmain_startup_fileio': [309099.0] 20:06:38 INFO - COUNTER 'main_startup_netio': [] 20:06:38 INFO - COUNTER 'nonmain_normal_fileio': [479767805.0] 20:06:38 INFO - COUNTER 'main_normal_fileio': [51912637.0]
Depends on: 1258414
(In reply to Phil Ringnalda (:philor) from comment #94) > Backed out in > https://hg.mozilla.org/integration/mozilla-inbound/rev/01c747d7ca4f for > leaking on OS X 10.6, > https://treeherder.mozilla.org/logviewer.html#?job_id=24198571&repo=mozilla- > inbound > > 10.6 debug being best known for slow, I wouldn't be surprised if it also > winds up leaking on WinXP, but that hasn't finished and run tests yet, > thanks to a multi-hour build outage. fwiw - that's bug 1256022 (now with a new object added to the tree of leaked stuff.)
Depends on: 1256022
comment 95 is bug 1258414 - independent, and resolved. comment 98 is bug 1256022 - independent, and will be pushed together with this. push comment 101 (that was backed out in comment 102) was because the pollable event listen socket was made non blocking. This is the socket that is actually closed once the client/server pair has been established. https://treeherder.mozilla.org/#/jobs?repo=try&revision=bf5375f8cbe5&selectedJob=18612843 looks good. The only blemish is the frequent occurance of 1222277 - which is occurring frequently without the patch as well. let's go for it.
Whiteboard: [necko-active][leave-open] → [necko-active]
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Target Milestone: mozilla47 → mozilla48
I am seeing a large increase in TSan-reported data races on xpcshell tests within the past week. I suspect this to be the cause. It appears to be racing on nsSocketTransportService::mPollableEvent. It is first read at nsSocketTransportService2.cpp:631 (nsSocketTransportService::SetOffline) whilst mLock is held, but is later written at nsSocketTransportService2.cpp:803 (nsSocketTransportService::Run) with no lock held. 631 if (mPollableEvent) { 632 mPollableEvent->Signal(); 633 } 803 mPollableEvent.reset(new PollableEvent());
Patrick, comment 106?
Flags: needinfo?(mcmanus)
yes, the 803 bit should hold a lock. I'll file a followon - thanks. This should only be during startup fwiw.
Flags: needinfo?(mcmanus)
Depends on: 1260764
(In reply to Patrick McManus [:mcmanus] from comment #108) > yes, the 803 bit should hold a lock. I'll file a followon - thanks. This > should only be during startup fwiw. https://bugzilla.mozilla.org/show_bug.cgi?id=1260764
We knew that this patch will help in some cases of shutdown hang. Looking at the crash stats: https://crash-stats.mozilla.com/signature/?date=%3E2016-01-01&version=48.0a1&signature=shutdownhang+%7C+WaitForSingleObjectEx+%7C+WaitForSingleObject+%7C+PR_WaitCondVar+%7C+nsThread%3A%3AProcessNextEvent+%7C+NS_ProcessNextEvent+%7C+mozilla%3A%3Anet%3A%3AnsHttpConnectionMgr%3A%3AShutdown&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#reports Nightly does not have a lot of shutdown hangs, but this is interesting: There are 19 hangs and they are all with builds before 2016-03-24. This patch landed on 2016-03-28 and it is probably the reason for this improvement. Nothing else landed in that period in necko that could improved the stats. We could probably uplift this to aurora, but I will write all pros and risks as well.
Attached patch bug_698882_aurora.patch (obsolete) — Splinter Review
This patch is the merge of this bug and bug 1260764. It is also rebased for aurora.
Attached patch bug_698882_aurora_reverse.patch (obsolete) — Splinter Review
This is reverse patch for backing out bug_698882_aurora.patch
Attached patch bug_698882_aurora_reverse.patch (obsolete) — Splinter Review
comment updated.
Attachment #8740489 - Attachment is obsolete: true
Comment on attachment 8740487 [details] [diff] [review] bug_698882_aurora.patch Approval Request Comment [Feature/regressing bug #]: Comment #110 explains an interesting finding. Most probably it was this patch that improved shutdown hangs stats. [User impact if declined]: Eliminates some shutdown hangs (not guaranty that will eliminate all of them) [Describe test coverage new/current, TreeHerder]: it is in nightly since 28.3.2016 and it is heavily used code. 1 - data implies this is good even though not directly targeted at this problem 2 - we know of several other bugs that it fixes 3 - startup perf improvement 4 - had a good run on nightly without problems and is heavily used. [Risks and why]: 1 - mid sized patch 2 - lots of "surface area" 3 - highly dependent on state and calling pattern - makes it hard to test 4 - shown to have unexpected interactions with LSP, intermittent orange, etc.. as second order effects. On the other hand it may have cleaned up some of those as well. (shutdown hangs?) 5 - no str so no verification possible other than crash-stats data [String/UUID change made/needed]:none This patch contains change from this bug and bug 1260764. Bug 1258414 needs to be uplifted as well, but that is just a 2 line fix for a test. There is also a backout patch attached to this bug.
Attachment #8740487 - Flags: approval-mozilla-aurora?
Hi Jason, Patrick, Honza: Given the medium risk associated with the patch(es) here, I would like to get your opinion on whether this is worth uplifting to Aurora 47 or not? The data from Nightly looks promising but I haven't looked at anything new that has cropped up since this landed. Appreciate your help here.
Flags: needinfo?(odvarko)
Flags: needinfo?(mcmanus)
Flags: needinfo?(jduell.mcbugs)
Flags: needinfo?(odvarko) → needinfo?(honzab.moz)
I don't think I have enough information to tell for sure here. The patch definitely has benefits according performance and stability, getting proven. But I think Patrick has move info on about wider negative impact, if any.
Flags: needinfo?(honzab.moz)
(In reply to Ritu Kothari (:ritu) from comment #115) > Hi Jason, Patrick, Honza: Given the medium risk associated with the > patch(es) here, I would like to get your opinion on whether this is worth > uplifting to Aurora 47 or not? The data from Nightly looks promising but I > haven't looked at anything new that has cropped up since this landed. > Appreciate your help here. Hi Ritu, As the author here, I can say this patch is great :) Uplift depends largely on how badly you want to quickly address the shutdown hang in bug 1158189 which has been described as the #3 top crasher. We believe this patch largely fixes that problem. Unfortunately, without a STR for that bug we just focused on addressing a whole class of things that looked related to stablity - and this patch was one of those initiatives. I'm not exactly certain why it has the outsized positive impact on 1158189 (lacking an STR for that) but I'm not really surprised as there are other well defined scenarios that I can explain why it fixes.. 1158189 is probably some more complicated manifestation of those. Shutdown hang data can be notoriously hard to interpret. So I recommended to dragana that we go with the uplift for one channel. There is a small chance that this actually isn't the reason those crashes went away, but it seems very unlikely given the data. As for stability risks from the patch I'm not particularly worried about the patch logic itself - that's pretty well understood. Its more a matter that it changes/improves some things at a low level and we've seen a couple interactions from that.. e.g. one windows binary LSP was making completely undocumented assumptions about our internal ABI from the old method.. and a couple tests that have always been occasionally orange got very orange when the patch first landed due to timing changes (which helped us find those unrelated bugs that were always in those tests). This patch is always going to be at some risk for new interop challenges as it goes through beta.. I'm not sure waiting an extra 6 weeks to get it there is going to tell us anything extra - so I would land it given the stability it brings. The perf win is a bonus.
Flags: needinfo?(mcmanus)
(In reply to Patrick McManus [:mcmanus] from comment #117) > (In reply to Ritu Kothari (:ritu) from comment #115) > > Hi Jason, Patrick, Honza: Given the medium risk associated with the > > patch(es) here, I would like to get your opinion on whether this is worth > > uplifting to Aurora 47 or not? The data from Nightly looks promising but I > > haven't looked at anything new that has cropped up since this landed. > > Appreciate your help here. > > Hi Ritu, > > As the author here, I can say this patch is great :) Uplift depends largely > on how badly you want to quickly address the shutdown hang in bug 1158189 > which has been described as the #3 top crasher. > > We believe this patch largely fixes that problem. Unfortunately, without a > STR for that bug we just focused on addressing a whole class of things that > looked related to stablity - and this patch was one of those initiatives. > I'm not exactly certain why it has the outsized positive impact on 1158189 > (lacking an STR for that) but I'm not really surprised as there are other > well defined scenarios that I can explain why it fixes.. 1158189 is probably > some more complicated manifestation of those. Shutdown hang data can be > notoriously hard to interpret. > > So I recommended to dragana that we go with the uplift for one channel. > There is a small chance that this actually isn't the reason those crashes > went away, but it seems very unlikely given the data. > > As for stability risks from the patch I'm not particularly worried about the > patch logic itself - that's pretty well understood. Its more a matter that > it changes/improves some things at a low level and we've seen a couple > interactions from that.. e.g. one windows binary LSP was making completely > undocumented assumptions about our internal ABI from the old method.. and a > couple tests that have always been occasionally orange got very orange when > the patch first landed due to timing changes (which helped us find those > unrelated bugs that were always in those tests). > > This patch is always going to be at some risk for new interop challenges as > it goes through beta.. I'm not sure waiting an extra 6 weeks to get it there > is going to tell us anything extra - so I would land it given the stability > it brings. The perf win is a bonus. Thanks for the detailed write up (even though you could have left it at "I am the author and the patch is great!" ;). I think shutdown hangs have been a cause for concern for a few releases now so this fix seems timely. I think we will need to take a risk at some point and if the data from Nightly looks promising let's uplift to Aurora now before 47 goes to Beta. I hope you guys will be watching any new issues that crop up closely and worst case we will be able to cleanly back out this patch from 47.
Comment on attachment 8740487 [details] [diff] [review] bug_698882_aurora.patch Nightly data suggests this patch has eliminated (reduced) shutdown hangs of some signatures, let's uplift to Aurora47.
Attachment #8740487 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
has problems to apply to uplift : merging netwerk/base/moz.build merging netwerk/base/nsSocketTransportService2.cpp merging netwerk/base/nsSocketTransportService2.h merging netwerk/standalone/moz.build warning: conflicts while merging netwerk/base/nsSocketTransportService2.h! (edit, then use 'hg resolve --mark') abort: unresolved conflicts, can't continue (use hg resolve and hg graft --continue)
Flags: needinfo?(mcmanus)
Flags: needinfo?(mcmanus)
Flags: needinfo?(jduell.mcbugs)
Flags: needinfo?(dd.mozilla)
(In reply to Carsten Book [:Tomcat] from comment #120) > has problems to apply to uplift : > > merging netwerk/base/moz.build > merging netwerk/base/nsSocketTransportService2.cpp > merging netwerk/base/nsSocketTransportService2.h > merging netwerk/standalone/moz.build > warning: conflicts while merging netwerk/base/nsSocketTransportService2.h! > (edit, then use 'hg resolve --mark') > abort: unresolved conflicts, can't continue > (use hg resolve and hg graft --continue) I tried to apply it locally on aurora and I do not see any conflicts. I will try later...
Flags: needinfo?(dd.mozilla)
(In reply to Carsten Book [:Tomcat] from comment #123) > sorry had to back this out , seems this caused test failures like > https://treeherder.mozilla.org/logviewer.html#?job_id=2420245&repo=mozilla- > aurora > I am looking into this one. > and > https://treeherder.mozilla.org/logviewer.html#?job_id=2420627&repo=mozilla- > aurora this is bug 1258414 needs to be uplifted too :)
Flags: needinfo?(mcmanus)
I was investigating https://treeherder.mozilla.org/logviewer.html#?job_id=2420245&repo=mozilla-aurora but I could not reproduce it at all and it showed up once. On the other hand, bug 1152522 shows up 5 times: https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=efa0fb87773ed6e6850dcaf794d53651429decca&selectedJob=2422119 I am not sure why it is so often on aurora. I manage to reproduce it 1-2 times locally on on mac, but without logging :( At the first glance it looks like the test is racy. The server is not close completely before it is started again on the same port.
(In reply to Dragana Damjanovic [:dragana] from comment #125) > I was investigating > https://treeherder.mozilla.org/logviewer.html#?job_id=2420245&repo=mozilla- > aurora > but I could not reproduce it at all and it showed up once. > > On the other hand, bug 1152522 shows up 5 times: > https://treeherder.mozilla.org/#/jobs?repo=mozilla- > aurora&revision=efa0fb87773ed6e6850dcaf794d53651429decca&selectedJob=2422119 > > I am not sure why it is so often on aurora. > I manage to reproduce it 1-2 times locally on on mac, but without logging :( > At the first glance it looks like the test is racy. The server is not close > completely before it is started again on the same port. probly a dup of 1256022 which had to be fixed with this bug? (unrelated afaict)
(In reply to Patrick McManus [:mcmanus] from comment #126) > (In reply to Dragana Damjanovic [:dragana] from comment #125) > > I was investigating > > https://treeherder.mozilla.org/logviewer.html#?job_id=2420245&repo=mozilla- > > aurora > > but I could not reproduce it at all and it showed up once. > > > > On the other hand, bug 1152522 shows up 5 times: > > https://treeherder.mozilla.org/#/jobs?repo=mozilla- > > aurora&revision=efa0fb87773ed6e6850dcaf794d53651429decca&selectedJob=2422119 > > > > I am not sure why it is so often on aurora. > > I manage to reproduce it 1-2 times locally on on mac, but without logging :( > > At the first glance it looks like the test is racy. The server is not close > > completely before it is started again on the same port. > > probly a dup of 1256022 which had to be fixed with this bug? (unrelated > afaict) yes this is bug 1256022.
Bug 1256022 and bug 1258414 need uplift to aurora as well.
Backed out in https://hg.mozilla.org/releases/mozilla-aurora/rev/08791c3d6f713f4032f704e44bae22353bbdae77 A thing which we misstarred as being bug 1152522 the last time you landed on aurora (and which happened quite a bit when you first landed on m-i on March 20th, but has only happened a few times since you relanded on trunk), is happening about 50% of the time on all three flavors on Windows on m-a, https://treeherder.mozilla.org/logviewer.html#?job_id=2447128&repo=mozilla-aurora (I picked that one out because I like how it also has the one-off "test_register_5xxCode_http2.js | Test timed out" of comment 123 back again), and what's going to be far worse for you to figure out is Windows mochitest-4, which either hits the useless and frustrating https://treeherder.mozilla.org/logviewer.html#?job_id=2448025&repo=mozilla-aurora which doesn't actually say that test_xhr_system.html crashed, but instead says that test_xhr_system.html is the last test to run in that chunk and that shutdown was slow enough that RunWatchdog got pissed off and crashed the browser, or, https://treeherder.mozilla.org/logviewer.html#?job_id=2450895&repo=mozilla-aurora with test_xultree_animation.xhtml | application crashed [@ mozilla::image::SourceBuffer::Complete(nsresult)] after Assertion failure: false (MOZ_ASSERT_UNREACHABLE: Called Complete more than once), at c:/builds/moz2_slave/m-aurora-w32-d-000000000000000/build/src/image/SourceBuffer.cpp:400
But at least you don't have to worry about the xpcshell failures in security/manager/ssl/tests/unit/test_signed_dir.js, which were just a timebomb as some cert expired.
Depends on: 1267003
ok, I notice that PollableEvent fails to make a tcp socket pair. And I found out the the patch attached to this bug differs from the change made to the m-c (I have not notice that before because there is no comment about this). There is only one line diff and it is very plausible that that is a cause. In comment #132 is a try run with the same patch as the patch pushed to the m-c. Patrick, what was the reason to make this one-line change: 84c86 < + --- > + PR_SetSocketOption(listener, &noblockOpt);
Flags: needinfo?(mcmanus)
The original patch attached to the bug (attachment mozilla::net::PollableEvent) differed from the patch pushed to m-c, therefore the previous aurora patch (made from the attached patch) was making problems with some tests. This patch has the same changes as patch pushed to m-c (plus the changes from bug 1260764) Approval Request Comment [Feature/regressing bug #]: [User impact if declined]: [Describe test coverage new/current, TreeHerder]: [Risks and why]: [String/UUID change made/needed]:
Attachment #8740487 - Attachment is obsolete: true
Attachment #8744807 - Flags: approval-mozilla-aurora?
(In reply to Dragana Damjanovic [:dragana] from comment #133) > ok, I notice that PollableEvent fails to make a tcp socket pair. And I found > out the the patch attached to this bug differs from the change made to the > m-c (I have not notice that before because there is no comment about this). > There is only one line diff and it is very plausible that that is a cause. > In comment #132 is a try run with the same patch as the patch pushed to the > m-c. > Patrick, what was the reason to make this one-line change: > > 84c86 > < + > --- > > + PR_SetSocketOption(listener, &noblockOpt); See Comment 103. On slow platforms the accept() was being called before the connect() had been processed on the kernel side completely and when it was NB it just didn't wait for it.. the only thing the listener is used for is setting up this localhost pair, so blocking is the semantic we want
Flags: needinfo?(mcmanus)
Comment on attachment 8744807 [details] [diff] [review] bug_698882_aurora.patch Assuming this needs to be uplifted to Fx47, I am changing the approval flag from Aurora47 to Beta47 (m-a -> m-b merge is complete).
Attachment #8744807 - Flags: approval-mozilla-aurora? → approval-mozilla-beta+
Hey Wes, is this a patch that still needs to be uplifted to Fx47? I am not entirely sure but I know you can definitely help figure it out. Thanks a lot!
Flags: needinfo?(wkocher)
This is not on beta ff47 :)
(In reply to Dragana Damjanovic [:dragana] from comment #139) > This is not on beta ff47 :) This has status-firefox47, are we missing a patch. Looking at beta 47 build 1 this is currently the #2 top crasher. https://crash-stats.mozilla.com/topcrashers/?product=Firefox&version=47.0b1&days=7
(In reply to Jim Mathies [:jimm] from comment #140) > (In reply to Dragana Damjanovic [:dragana] from comment #139) > > This is not on beta ff47 :) > > This has status-firefox47, are we missing a patch. Looking at beta 47 build > 1 this is currently the #2 top crasher. > > https://crash-stats.mozilla.com/topcrashers/?product=Firefox&version=47. > 0b1&days=7 This i not in ff47 :) still needs uplift. (I cannot push it still do not have writes)
Let's backout this from beta. We have some hangs. I do not know what went wrong.(It looks reproducible, so I was hoping that someone will come forward fit a repo but till now no luck) Maybe something went wrong in this uplift and I hope 48 will be just fine in beta. Wes, may i ask you to back this out from beta only? Thanks.
Flags: needinfo?(wkocher)
Backed out in https://hg.mozilla.org/releases/mozilla-beta/rev/25e9aa3f174f Setting the status back to affected. If we're not going to try again for 47, feel free to set it to wontfix.
Flags: needinfo?(wkocher)
when this landed we had an improvement in the startup network io measured in xperf: https://treeherder.mozilla.org/perf.html#/alerts?id=1099
Comment on attachment 8744807 [details] [diff] [review] bug_698882_aurora.patch Reverting the flag as this was backed out of Beta47.
Attachment #8744807 - Flags: approval-mozilla-beta+ → approval-mozilla-beta-
hi, in quite early data for 48.0b1 it looks as if the crashes with shutdownhang | ... | mozilla::net::nsHttpConnectionMgr::Shutdown (bug 1158189) are rocketing up again unfortunately.
(In reply to [:philipp] from comment #148) > hi, in quite early data for 48.0b1 it looks as if the crashes with > shutdownhang | ... | mozilla::net::nsHttpConnectionMgr::Shutdown (bug > 1158189) are rocketing up again unfortunately. I was afraid of that :) Thanks Philipp!
I have open a new bug 1279163.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: