Closed
Bug 698882
Opened 14 years ago
Closed 9 years ago
Deadlock in nsSocketTransportService [tbird] PR_SetPollableEvent
Categories
(Core :: Networking, defect)
Tracking
()
RESOLVED
FIXED
mozilla48
People
(Reporter: Irving, Assigned: mcmanus)
References
(Blocks 1 open bug)
Details
(Keywords: hang, Whiteboard: [necko-active])
Attachments
(5 files, 11 obsolete files)
1007 bytes,
application/octet-stream
|
Details | |
27.08 KB,
application/octet-stream
|
Details | |
30.98 KB,
patch
|
dragana
:
review+
|
Details | Diff | Splinter Review |
31.10 KB,
patch
|
ritu
:
approval-mozilla-beta-
|
Details | Diff | Splinter Review |
30.95 KB,
patch
|
Details | Diff | Splinter Review |
Caught a difficult-to-reproduce Thunderbird hang in nsSocketTransportService. Here are the backtraces of blocked threads; most of the interesting info is in the second trace (the necko thread)
I was running under Xcode / GDB when this locked up, so I spent a pile of time poking around and trying to understand what was up. Unfortunately after a while Xcode crashed (yay! blunt tools!) so I no longer have access to the state as of the lock up. Next time I'll try to make a core dump...
As far as I could tell, there was an issue between the PollableEvent loopback pipe and the nsSocketTransportService - nsSocketTransportService::OnDispatchedEvent was blocked trying to write to the pipe, which implies that the pipe was full, but the poll performed up the stack in DoPollIteration() did not show the other end of the pipe as readable.
Main UI thread:
#2 0x000000010008b7dd in PR_Lock at /Users/ireid/tbird/comm-central/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:206
#3 0x0000000102df3d16 in mozilla::Mutex::Lock() at /Users/ireid/tbird/objdir-comm-central-permissions/mozilla/xpcom/build/BlockingResourceBase.cpp:261
#4 0x0000000101250c84 in mozilla::MutexAutoLock::MutexAutoLock(mozilla::Mutex&, mozilla::GuardObjectNotifier const&) ()
#5 0x00000001012bea2f in nsSocketTransportService::GetThreadSafely() ()
#6 0x00000001012beb68 in nsSocketTransportService::Dispatch(nsIRunnable*, unsigned int) ()
#7 0x0000000102e3d46e in nsAStreamCopier::PostContinuationEvent_Locked() ()
#8 0x0000000102e3dd65 in nsAStreamCopier::PostContinuationEvent() ()
#9 0x0000000102e3deac in nsAStreamCopier::Start(nsIInputStream*, nsIOutputStream*, nsIEventTarget*, void (*)(void*, unsigned int), void*, unsigned int, bool, bool) ()
#10 0x0000000102e3d19f in NS_AsyncCopy(nsIInputStream*, nsIOutputStream*, nsIEventTarget*, nsAsyncCopyMode, unsigned int, void (*)(void*, unsigned int), void*, bool, bool, nsISupports**) ()
#11 0x00000001012badff in nsSocketTransport::OpenInputStream(unsigned int, unsigned int, unsigned int, nsIInputStream**) ()
#12 0x0000000102a3c9d2 in nsImapProtocol::SetupWithUrl(nsIURI*, nsISupports*) ()
#13 0x0000000102a3d921 in nsImapProtocol::LoadImapUrl(nsIURI*, nsISupports*) ()
#14 0x00000001029e9361 in nsImapIncomingServer::GetImapConnectionAndLoadUrl(nsIEventTarget*, nsIImapUrl*, nsISupports*) ()
#15 0x0000000102a58971 in nsImapService::GetImapConnectionAndLoadUrl(nsIEventTarget*, nsIImapUrl*, nsISupports*, nsIURI**) ()
#16 0x0000000102a667f6 in nsImapService::SelectFolder(nsIEventTarget*, nsIMsgFolder*, nsIUrlListener*, nsIMsgWindow*, nsIURI**) ()
#17 0x0000000102a0ff1a in nsImapMailFolder::UpdateFolderWithListener(nsIMsgWindow*, nsIUrlListener*) ()
#18 0x00000001029ebc25 in nsImapMailFolder::UpdateFolder(nsIMsgWindow*) ()
the "necko" thread
#1 0x0000000100096145 in poll ()
#2 0x000000010008d7b0 in pt_poll_now ()
#3 0x000000010008db6a in pt_Continue ()
#4 0x000000010008e96e in pt_Write ()
#5 0x000000010006b422 in PR_Write ()
#6 0x000000010006ef02 in PR_SetPollableEvent ()
#7 0x00000001012bec62 in nsSocketTransportService::OnDispatchedEvent(nsIThreadInternal*) at /Users/ireid/tbird/comm-central/mozilla/netwerk/base/src/nsSocketTransportService2.cpp:588
#8 0x0000000102e61360 in nsThread::PutEvent(nsIRunnable*) at /Users/ireid/tbird/comm-central/mozilla/xpcom/threads/nsThread.cpp:397
#9 0x0000000102e62830 in nsThread::Dispatch(nsIRunnable*, unsigned int) at /Users/ireid/tbird/comm-central/mozilla/xpcom/threads/nsThread.cpp:435
#10 0x00000001012bebe0 in nsSocketTransportService::Dispatch(nsIRunnable*, unsigned int) at /Users/ireid/tbird/comm-central/mozilla/netwerk/base/src/nsSocketTransportService2.cpp:140
#11 0x0000000102e3d46e in nsAStreamCopier::PostContinuationEvent_Locked() at /Users/ireid/tbird/comm-central/mozilla/xpcom/io/nsStreamUtils.cpp:467
#12 0x0000000102e3dd65 in nsAStreamCopier::PostContinuationEvent() at /Users/ireid/tbird/comm-central/mozilla/xpcom/io/nsStreamUtils.cpp:458
#13 0x0000000102e3dd9b in nsAStreamCopier::OnOutputStreamReady(nsIAsyncOutputStream*) at /Users/ireid/tbird/comm-central/mozilla/xpcom/io/nsStreamUtils.cpp:428
#14 0x00000001012b8525 in nsSocketOutputStream::OnSocketReady(unsigned int) at /Users/ireid/tbird/comm-central/mozilla/netwerk/base/src/nsSocketTransport2.cpp:514
#15 0x00000001012b894d in nsSocketTransport::OnSocketReady(PRFileDesc*, short) at /Users/ireid/tbird/comm-central/mozilla/netwerk/base/src/nsSocketTransport2.cpp:1531
#16 0x00000001012c035c in nsSocketTransportService::DoPollIteration(bool) ()
#17 0x00000001012c068a in nsSocketTransportService::Run() ()
#18 0x0000000102e6188e in nsThread::ProcessNextEvent(bool, bool*) ()
#19 0x0000000102defe26 in NS_ProcessNextEvent_P(nsIThread*, bool) ()
#20 0x0000000102e623c5 in nsThread::ThreadFunc(void*) at /Users/ireid/tbird/comm-central/mozilla/xpcom/threads/nsThread.cpp:272
IMAP thread for host I was copying to
#1 0x00007fff870cb881 in _pthread_cond_wait ()
#2 0x000000010008c094 in PR_WaitCondVar ()
#3 0x000000010008c813 in PR_Wait ()
#4 0x0000000102df3137 in mozilla::ReentrantMonitor::Wait(unsigned int) at /Users/ireid/tbird/objdir-comm-central-permissions/mozilla/xpcom/build/BlockingResourceBase.cpp:346
#5 0x0000000101361e5e in mozilla::ReentrantMonitorAutoEnter::Wait(unsigned int) ()
#6 0x0000000102e3a09f in nsPipeInputStream::Wait() at /Users/ireid/tbird/comm-central/mozilla/xpcom/io/nsPipe3.cpp:653
#7 0x0000000102e3b697 in nsPipeInputStream::ReadSegments(unsigned int (*)(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*), void*, unsigned int, unsigned int*) ()
#8 0x0000000102e38f91 in nsPipeInputStream::Read(char*, unsigned int, unsigned int*) ()
#9 0x0000000102779837 in nsMsgLineStreamBuffer::ReadNextLine(nsIInputStream*, unsigned int&, bool&, unsigned int*, bool) ()
#10 0x0000000102a2cb26 in nsImapProtocol::CreateNewLineFromSocket() ()
#11 0x0000000102a3fa2c in nsImapProtocol::EstablishServerConnection() ()
#12 0x0000000102a44a7b in nsImapProtocol::ProcessCurrentURL() ()
#13 0x0000000102a35a06 in nsImapProtocol::ImapThreadMainLoop() ()
#14 0x0000000102a3d189 in nsImapProtocol::Run() ()
#15 0x0000000102e6188e in nsThread::ProcessNextEvent(bool, bool*) ()
#16 0x0000000102defe26 in NS_ProcessNextEvent_P(nsIThread*, bool) ()
#17 0x0000000102e623c5 in nsThread::ThreadFunc(void*) ()
#18 0x000000010009304b in _pt_root at /Users/ireid/tbird/comm-central/mozilla/nsprpub/pr/src/pthreads/ptthread.c:187
Comment 1•14 years ago
|
||
Thanks a lot for taking the time to investigate this.
Is this a hang you've seen frequently, or you just saw once?
Reporter | ||
Comment 2•14 years ago
|
||
(In reply to Jason Duell (:jduell) from comment #1)
> Thanks a lot for taking the time to investigate this.
>
> Is this a hang you've seen frequently, or you just saw once?
I've only caught this exact backtrace once, but :bienvenu has hung Thunderbird reliably in a few other ways; I'm not sure if his hangs have a thread stopped in exactly the same place or not.
Reporter | ||
Comment 3•14 years ago
|
||
Sid0 is now hitting this pretty reliably on Windows. Confirmed it's the same problem by looking at stack traces.
OS: Mac OS X → All
![]() |
||
Comment 4•14 years ago
|
||
Please check on bug 711787. It (or the cause of it) might be related if this is something new.
Reporter | ||
Comment 6•13 years ago
|
||
Firefox is seeing quite a few crashes in PR_SetPollableEvent(); worth keeping an eye on these in case they are related:
https://bugzilla.mozilla.org/show_bug.cgi?id=709847
https://bugzilla.mozilla.org/show_bug.cgi?id=662330
Also, Bug 649323 appears to be another dupe. It's older, anyone else have a preference as to which direction we set the duplication?
Updated•13 years ago
|
![]() |
||
Comment 7•13 years ago
|
||
We may want to make the pollable event's socket or pipe buffer be a bit larger and in PR_WaitForPollableEvent read more then just 1024 bytes.
Reporter | ||
Comment 8•13 years ago
|
||
There are a number of ways we could make the problem more survivable, but I think they'd just end up covering over a more important underlying bug. As far as I can tell, we're already getting into a situation where there are 16k more notifications posted to the network thread than it has been able to read; since we *should* be reading from the notification socket once every time we wake up from select() I'm concerned that there's a nasty loop happening somewhere below the socket thread's main loop, or a path through the main loop that doesn't get to the place where we handle the notification socket.
![]() |
||
Comment 9•13 years ago
|
||
If we are inside the event handling loop, we may not need to call on PR_SetPollableEvent. If we are inside that loop [1], then we will always loop while events are in the queue. When events are detected prior to call to PR_Poll, we exit immediately from PR_Poll regardless pollable event has been set or not.
Though, simpler option would be to prevent duplicate calls to PR_SetPollableEvent between calls to PR_WaitForPollableEvent (may need one more lock enter near PR_WaitForPollableEvent, but precise atomicity here is not that important).
[1] http://hg.mozilla.org/mozilla-central/annotate/8ea5c983743f/netwerk/base/src/nsSocketTransportService2.cpp#l635
Reporter | ||
Comment 10•13 years ago
|
||
I agree with both of those points; we don't need to write a byte into the pollable event pipe if we're already in the networking thread, because we'll always process any pending events before we do a blocking poll. We also don't need more than one byte in the pipe for each time around the networking main loop, because we'll process all events no matter how many times PR_SetPollableEvent() is called.
However, I don't feel comfortable with fixing the problem in either of these ways until I understand *why* so many bytes are being written into the event pipe during a single pass through the main networking loop. I'm nervous that there is some underlying problem that will be even harder to isolate and fix if we make the pollable event mechanism less sensitive.
And if we do find the reason that the pipe is getting clogged, we may not need to make the pollable event more complicated after all.
Assignee | ||
Updated•9 years ago
|
Summary: Deadlock in nsSocketTransportService → Deadlock in nsSocketTransportService [tbird]
Whiteboard: [necko-would-take]
Assignee | ||
Updated•9 years ago
|
Summary: Deadlock in nsSocketTransportService [tbird] → Deadlock in nsSocketTransportService [tbird] PR_SetPollableEvent
Whiteboard: [necko-would-take] → [necko-backlog]
Assignee | ||
Comment 12•9 years ago
|
||
what do you guys think of replacing mThreadEvent with this? We could easily merge it with Daniel's work.
Basically its a necko level implementation of a similar thing to pr_pollable event.. but
1] it fixes the socket thread deadlock where the socket thread blocks on trying to add to the event queue (and the socket thread is the only thing that clears it).. also optimizes away the pipe nonsense in that case
2] never puts more than 1 event in the queue.. so other threads can't block waiting for the socket thread to service it
PR_FileDesc2NativeHandle
3] uses non blocking fd's.. so even if we screw up the logic, it still won't block
4] lets us do better error checking and evolve the implementation faster here than having it tied to nspr.. in general less nspr is good.
Attachment #8718619 -
Flags: feedback?(dd.mozilla)
Attachment #8718619 -
Flags: feedback?(daniel)
Assignee | ||
Updated•9 years ago
|
Attachment #8718619 -
Attachment is patch: true
Attachment #8718619 -
Attachment mime type: application/mbox → text/plain
Assignee | ||
Comment 13•9 years ago
|
||
(that version should lose the mlock from the new class - locking is handled in nsSocketTransportService and has to be because of the occasional del/new-replace behavior)
Comment 14•9 years ago
|
||
Comment on attachment 8718619 [details] [diff] [review]
newpollable
Review of attachment 8718619 [details] [diff] [review]:
-----------------------------------------------------------------
I'm strongly in favor of this! It will make the system less likely to fill the buffer and block when things go wrong.
I'm also thinking we could add a counter in the Signal() method for debug builds at least, and we could use that as a "high water mark" and possibly trigger a warning at some level. "X signals sent without the event having been polled".
Attachment #8718619 -
Flags: feedback?(daniel) → feedback+
Comment 15•9 years ago
|
||
Comment on attachment 8718619 [details] [diff] [review]
newpollable
Review of attachment 8718619 [details] [diff] [review]:
-----------------------------------------------------------------
Yes, we should do this.
Attachment #8718619 -
Flags: feedback?(dd.mozilla)
Attachment #8718619 -
Flags: feedback?
Attachment #8718619 -
Flags: feedback+
Updated•9 years ago
|
Attachment #8718619 -
Flags: feedback?
![]() |
||
Comment 16•9 years ago
|
||
Comment on attachment 8718619 [details] [diff] [review]
newpollable
Review of attachment 8718619 [details] [diff] [review]:
-----------------------------------------------------------------
::: netwerk/base/PollableEvent.cpp
@@ +97,5 @@
> + if (!mWriteFD) {
> + SOCKET_LOG(("PollableEventSignal Failed on no FD\n"));
> + return false;
> + }
> + if(PR_GetCurrentThread() == gSocketThread) {
if(
::: netwerk/base/nsSocketTransportService2.cpp
@@ +1131,5 @@
> DetachSocket(mActiveList, &mActiveList[i]);
> }
>
> if (n != 0 && mPollList[0].out_flags == PR_POLL_READ) {
> + DebugMutexAutoLock lock(mLock);
designing it that we don't need this lock being entered every time would be nice
I could see the pollable event be a direct member of nsSocketTransportService and just use its internal lock for this "reset" case
Assignee | ||
Comment 17•9 years ago
|
||
I actually have a lot of thoughts on locking :)
First, reads to writes exist as a 1:N ratio (because we read out N events in one call.) So the write path is more important.
In m-c all writes are actually double locked. Once by nsSocketTransportService::mLock (protecting the swap of a new mThreadEvent) and once at the system level in write(). The new code improves that signficantly:
* a subset of writes now take 0 locks because they are invoked on the socket thread. Anecdotally that's 20+% of them from a smoke test.
* Another subset of writes now take just 1 lock (the STS mutex) because they can skip the pipe write() if mSignaled is already set. Anecdotally that's 50+% of them.
So that's a reduction of ~45% locks on the write path.
Now the read path, as noted, picks up a lock it didn't have before (except on case of failure, which we will ignore). This is because it needs to create a sequence point for the write path so that it can atomically consider whether or not to skip the write() if mSignaled is already set - it allows us to skip a lot of IO operations.
So the write path is roughly halved and the read path is roughly doubled.. but we call read less than write, so that's a win I think :)
I'm not sure what inlining the pollableevent into STS does to make this better.. I think the same amount of locking still has to go on.
Assignee: nobody → mcmanus
Whiteboard: [necko-backlog] → [necko-active]
Assignee | ||
Comment 18•9 years ago
|
||
Assignee | ||
Comment 19•9 years ago
|
||
Attachment #8718994 -
Flags: review?(honzab.moz)
Assignee | ||
Comment 20•9 years ago
|
||
Comment on attachment 8718994 [details] [diff] [review]
mozilla::net::PollableEvent
Review of attachment 8718994 [details] [diff] [review]:
-----------------------------------------------------------------
double up the review here
Attachment #8718994 -
Flags: review?(dd.mozilla)
Assignee | ||
Comment 21•9 years ago
|
||
updated patch coming with explanation of change.. but just in case you've been looking at the old one, this is the interdiff
diff --git a/netwerk/base/nsSocketTransportService2.cpp b/netwerk/base/nsSocketTransportService2.cpp
index 5af9887..2d7e33d 100644
--- a/netwerk/base/nsSocketTransportService2.cpp
+++ b/netwerk/base/nsSocketTransportService2.cpp
@@ -428,43 +428,46 @@ nsSocketTransportService::PollTimeout()
SOCKET_LOG(("poll timeout: none\n"));
return NS_SOCKET_POLL_TIMEOUT;
}
SOCKET_LOG(("poll timeout: %lu\n", minR));
return PR_SecondsToInterval(minR);
}
int32_t
-nsSocketTransportService::Poll(bool wait, uint32_t *interval,
+nsSocketTransportService::Poll(uint32_t *interval,
TimeDuration *pollDuration)
{
PRPollDesc *pollList;
uint32_t pollCount;
PRIntervalTime pollTimeout;
*pollDuration = 0;
+ // If there are pending events for this thread then
+ // DoPollIteration() should service the network without blocking.
+ bool pendingEvents = false;
+ mThread->HasPendingEvents(&pendingEvents);
+
if (mPollList[0].fd) {
mPollList[0].out_flags = 0;
pollList = mPollList;
pollCount = mActiveCount + 1;
- pollTimeout = PollTimeout();
+ pollTimeout = pendingEvents ? PR_INTERVAL_NO_WAIT : PollTimeout();
}
else {
// no pollable event, so busy wait...
pollCount = mActiveCount;
if (pollCount)
pollList = &mPollList[1];
else
pollList = nullptr;
- pollTimeout = PR_MillisecondsToInterval(25);
+ pollTimeout =
+ pendingEvents ? PR_INTERVAL_NO_WAIT : PR_MillisecondsToInterval(25);
}
- if (!wait)
- pollTimeout = PR_INTERVAL_NO_WAIT;
-
PRIntervalTime ts = PR_IntervalNow();
TimeStamp pollStart;
if (mTelemetryEnabledPref) {
pollStart = TimeStamp::NowLoRes();
}
SOCKET_LOG((" timeout = %i milliseconds\n",
@@ -756,16 +759,25 @@ nsSocketTransportService::SetAutodialEnabled(bool value)
{
mAutodialEnabled = value;
return NS_OK;
}
NS_IMETHODIMP
nsSocketTransportService::OnDispatchedEvent(nsIThreadInternal *thread)
{
+ if(PR_GetCurrentThread() == gSocketThread) {
+ // this check is redundant to one done inside ::Signal(), but
+ // we can do it here and skip obtaining the lock - given that
+ // this is a relatively common occurance its worth the
+ // redundant code
+ SOCKET_LOG(("OnDispatchedEvent Same Thread Skip Signal\n"));
+ return NS_OK;
+ }
+
DebugMutexAutoLock lock(mLock);
if (mPollableEvent) {
mPollableEvent->Signal();
}
return NS_OK;
}
NS_IMETHODIMP
@@ -843,50 +855,43 @@ nsSocketTransportService::Run()
// If there is too many pending events queued, we will run some poll()
// between them and the following variable is cumulative time spent
// blocking in poll().
TimeDuration pollDuration;
for (;;) {
bool pendingEvents = false;
- thread->HasPendingEvents(&pendingEvents);
numberOfPendingEvents = 0;
numberOfPendingEventsLastCycle = 0;
if (mTelemetryEnabledPref) {
startOfCycleForLastCycleCalc = TimeStamp::NowLoRes();
startOfNextIteration = TimeStamp::NowLoRes();
}
pollDuration = 0;
do {
if (mTelemetryEnabledPref) {
pollCycleStart = TimeStamp::NowLoRes();
}
- // If there are pending events for this thread then
- // DoPollIteration() should service the network without blocking.
- DoPollIteration(!pendingEvents, &singlePollDuration);
+ DoPollIteration(&singlePollDuration);
if (mTelemetryEnabledPref && !pollCycleStart.IsNull()) {
Telemetry::Accumulate(Telemetry::STS_POLL_BLOCK_TIME,
singlePollDuration.ToMilliseconds());
Telemetry::AccumulateTimeDelta(
Telemetry::STS_POLL_CYCLE,
pollCycleStart + singlePollDuration,
TimeStamp::NowLoRes());
pollDuration += singlePollDuration;
}
- // If nothing was pending before the poll, it might be now
- if (!pendingEvents) {
- thread->HasPendingEvents(&pendingEvents);
- }
-
+ thread->HasPendingEvents(&pendingEvents);
if (pendingEvents) {
if (!mServingPendingQueue) {
nsresult rv = Dispatch(NS_NewRunnableMethod(this,
&nsSocketTransportService::MarkTheLastElementOfPendingQueue),
nsIEventTarget::DISPATCH_NORMAL);
if (NS_FAILED(rv)) {
NS_WARNING("Could not dispatch a new event on the "
"socket thread.");
@@ -1002,22 +1007,21 @@ nsSocketTransportService::Reset(bool aGuardLocals)
DetachSocketWithGuard(aGuardLocals, mActiveList, i);
}
for (i = mIdleCount - 1; i >= 0; --i) {
DetachSocketWithGuard(aGuardLocals, mIdleList, i);
}
}
nsresult
-nsSocketTransportService::DoPollIteration(bool wait, TimeDuration *pollDuration)
+nsSocketTransportService::DoPollIteration(TimeDuration *pollDuration)
{
- SOCKET_LOG(("STS poll iter [%d]\n", wait));
+ SOCKET_LOG(("STS poll iter\n"));
int32_t i, count;
-
//
// poll loop
//
// walk active list backwards to see if any sockets should actually be
// idle, then walk the idle list backwards to see if any idle sockets
// should become active. take care to check only idle sockets that
// were idle to begin with ;-)
//
@@ -1059,28 +1063,26 @@ nsSocketTransportService::DoPollIteration(bool wait, TimeDuration *pollDuration)
#if defined(XP_WIN)
// 30 active connections is the historic limit before firefox 7's 256. A few
// windows systems have troubles with the higher limit, so actively probe a
// limit the first time we exceed 30.
if ((mActiveCount > 30) && !mProbedMaxCount)
ProbeMaxCount();
#endif
-
// Measures seconds spent while blocked on PR_Poll
uint32_t pollInterval;
-
int32_t n = 0;
#if !defined(MOZILLA_XPCOMRT_API)
if (!gIOService->IsNetTearingDown()) {
// Let's not do polling during shutdown.
- n = Poll(wait, &pollInterval, pollDuration);
+ n = Poll(&pollInterval, pollDuration);
}
#else
- n = Poll(wait, &pollInterval, pollDuration);
+ n = Poll(&pollInterval, pollDuration);
#endif // defined(MOZILLA_XPCOMRT_API)
if (n < 0) {
SOCKET_LOG((" PR_Poll error [%d] os error [%d]\n", PR_GetError(),
PR_GetOSError()));
}
else {
//
diff --git a/netwerk/base/nsSocketTransportService2.h b/netwerk/base/nsSocketTransportService2.h
index b27e034..a393188 100644
--- a/netwerk/base/nsSocketTransportService2.h
+++ b/netwerk/base/nsSocketTransportService2.h
@@ -194,21 +194,19 @@ private:
//
// first element of the poll list is mThreadEvent (or null if the pollable
// event cannot be created).
//-------------------------------------------------------------------------
PRPollDesc *mPollList; /* mListSize + 1 entries */
PRIntervalTime PollTimeout(); // computes ideal poll timeout
- nsresult DoPollIteration(bool wait,
- mozilla::TimeDuration *pollDuration);
+ nsresult DoPollIteration(mozilla::TimeDuration *pollDuration);
// perfoms a single poll iteration
- int32_t Poll(bool wait,
- uint32_t *interval,
+ int32_t Poll(uint32_t *interval,
mozilla::TimeDuration *pollDuration);
// calls PR_Poll. the out param
// interval indicates the poll
// duration in seconds.
// pollDuration is used only for
// telemetry
//-------------------------------------------------------------------------
Assignee | ||
Comment 22•9 years ago
|
||
Assignee | ||
Comment 23•9 years ago
|
||
Attachment #8719034 -
Flags: review?(honzab.moz)
Attachment #8719034 -
Flags: review?(dd.mozilla)
Assignee | ||
Updated•9 years ago
|
Attachment #8718994 -
Attachment is obsolete: true
Attachment #8718994 -
Flags: review?(honzab.moz)
Attachment #8718994 -
Flags: review?(dd.mozilla)
Assignee | ||
Comment 24•9 years ago
|
||
basically "hasPendingEvents" was determined on the socket thread and then passed to dopolliteration which passed it to poll which used it to set the timeout to pr_poll().
however, dopolliteration can call DetachSocket() which can have the side effect of getting ::OnDispatchedEvent called.. which meant that the hasPendingEvents value previously calculated was now inaccurate.
so I stopped passing that value along the stack and just calculated it in poll() close to where it is used. I also got rid of a redundant call to PollTimeout() [which separately is an absurd legacy function we could improve] in the fairly common case where hasPendingEvents is true and the PollTimeout() result was just overwritten.
I also duplicated the "same-thread?" check in ::Signal outside of the function in its most common path - OnDispatchedEvent simply because it can be done without the lock and is a common enough occurrance to warrant the code duplication for optimization.
Assignee | ||
Comment 25•9 years ago
|
||
Attachment #8719163 -
Flags: review?(honzab.moz)
Attachment #8719163 -
Flags: review?(dd.mozilla)
Assignee | ||
Updated•9 years ago
|
Attachment #8719034 -
Attachment is obsolete: true
Attachment #8719034 -
Flags: review?(honzab.moz)
Attachment #8719034 -
Flags: review?(dd.mozilla)
Assignee | ||
Comment 26•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #25)
> Created attachment 8719163 [details] [diff] [review]
> mozilla::net::PollableEvent
just nullcheck mthread on shutdown
diff --git a/netwerk/base/nsSocketTransportService2.cpp b/netwerk/base/nsSocketTransportService2.cpp
--- a/netwerk/base/nsSocketTransportService2.cpp
+++ b/netwerk/base/nsSocketTransportService2.cpp
@@ -439,17 +439,21 @@ nsSocketTransportService::Poll(uint32_t
PRPollDesc *pollList;
uint32_t pollCount;
PRIntervalTime pollTimeout;
*pollDuration = 0;
// If there are pending events for this thread then
// DoPollIteration() should service the network without blocking.
bool pendingEvents = false;
- mThread->HasPendingEvents(&pendingEvents);
+ if (mThread) {
+ mThread->HasPendingEvents(&pendingEvents);
+ } else { // shutdown awaits!
+ pendingEvents = true;
+ }
if (mPollList[0].fd) {
mPollList[0].out_flags = 0;
pollList = mPollList;
pollCount = mActiveCount + 1;
pollTimeout = pendingEvents ? PR_INTERVAL_NO_WAIT : PollTimeout();
}
else {
Assignee | ||
Comment 27•9 years ago
|
||
Comment 28•9 years ago
|
||
Comment on attachment 8719163 [details] [diff] [review]
mozilla::net::PollableEvent
Review of attachment 8719163 [details] [diff] [review]:
-----------------------------------------------------------------
I am giving it a r+ because the first comment can, should be done in a separate bug.
::: netwerk/base/PollableEvent.cpp
@@ +109,5 @@
> + if (status != 1) {
> + NS_WARNING("PollableEventSignal Failed\n");
> + SOCKET_LOG(("PollableEventSignal Failed\n"));
> + }
> + return (status == 1);
This function can be used to discover that we are hanging in a poll and do some repairs.
Maybe closing of mReadFD will cause the hanging thread to get out of poll?
I know that serving onsocketready can take longer time, but we can try to repair the pipe if the signal() call, that have set mSignaled(so the first one in a cycle) was something like 500ms ago.
Also if we fail in Signal() we can try to close mReadFD, or in other way cause an error on mReadFD, maybe that will cause the poll to return.
All of this only on windows :)
::: netwerk/base/nsSocketTransportService2.cpp
@@ +131,5 @@
> nsSocketTransportService::~nsSocketTransportService()
> {
> NS_ASSERTION(NS_IsMainThread(), "wrong thread");
> NS_ASSERTION(!mInitialized, "not shutdown properly");
>
Remove blank space when you are already here.
@@ +445,5 @@
> + // DoPollIteration() should service the network without blocking.
> + bool pendingEvents = false;
> + if (mThread) {
> + mThread->HasPendingEvents(&pendingEvents);
> + } else { // shutdown awaits!
Can you explain this.
Attachment #8719163 -
Flags: review?(dd.mozilla) → review+
![]() |
||
Comment 29•9 years ago
|
||
Comment on attachment 8719163 [details] [diff] [review]
mozilla::net::PollableEvent
Review of attachment 8719163 [details] [diff] [review]:
-----------------------------------------------------------------
The mMaxTimePerPollIter thing is IMO wrong (now I'm not complaining about this patch). It could happen that a previously scheduled event may be executed later than a next socket handler is executed. I mean:
expected order:
- dispatch(a)
- a->Run()
- OnSocketReady
but we may do:
- dispatch(a)
- OnSocketReady() -> I assume and behave as if |a| ran (but actually didn't)
- a->Run()
I don't know if there is a code that might suffer from this reordering, but I think it would be good to avoid it anyway.
I tho still smell a dead fish here.. the code is pretty much overcomplicated and something can be overlooked.
::: netwerk/base/PollableEvent.cpp
@@ +75,5 @@
> +
> +PollableEvent::~PollableEvent()
> +{
> + if (mWriteFD) {
> + PR_Close(mWriteFD);
to prevent being blocked, maybe read first from the read side? easily done with Clear() call.
would use of shutdown() lower any possibility of being blocked here?
@@ +96,5 @@
> + if (!mWriteFD) {
> + SOCKET_LOG(("PollableEventSignal Failed on no FD\n"));
> + return false;
> + }
> + if(PR_GetCurrentThread() == gSocketThread) {
if(
@@ +107,5 @@
> + mSignaled = true;
> + int32_t status = PR_Write(mWriteFD, "M", 1);
> + if (status != 1) {
> + NS_WARNING("PollableEventSignal Failed\n");
> + SOCKET_LOG(("PollableEventSignal Failed\n"));
"PollableEvent::Signal" ? when people are going to search for it in the code... applies to other logs as well
@@ +115,5 @@
> +
> +bool
> +PollableEvent::Clear()
> +{
> + // necessary because of the "dont signal on socket thread optimization
" at the end
@@ +121,5 @@
> +
> + SOCKET_LOG(("PollableEventClear\n"));
> + mSignaled = false;
> + char buf[2048];
> + int32_t status = PR_Read(mReadFD, buf, 2048);
Maybe do a non-null check on mReadFD just for sure?
@@ +133,5 @@
> + }
> + if (status > 1) {
> + MOZ_ASSERT(false);
> + SOCKET_LOG(("PollableEventClear Unexpected events\n"));
> + return true;
maybe call Clear() recursively (maybe with a limit) here to wipe the whole queue? I know this must not happen, but anyway. up to you.
::: netwerk/base/nsSocketTransportService2.cpp
@@ +453,5 @@
> if (mPollList[0].fd) {
> mPollList[0].out_flags = 0;
> pollList = mPollList;
> pollCount = mActiveCount + 1;
> + pollTimeout = pendingEvents ? PR_INTERVAL_NO_WAIT : PollTimeout();
this is a very good change! before that we could get stuck:
- there is no active socket
- we loop through the pending events, but not make it on mMaxTimePerPollIter time
- there are events pending left
- we go to PR_Poll with an arbitrary timeout value
- we don't wake up sooner until someone posts something
@@ +767,5 @@
>
> NS_IMETHODIMP
> nsSocketTransportService::OnDispatchedEvent(nsIThreadInternal *thread)
> {
> + if(PR_GetCurrentThread() == gSocketThread) {
if(
@@ -1064,5 @@
> // limit the first time we exceed 30.
> if ((mActiveCount > 30) && !mProbedMaxCount)
> ProbeMaxCount();
> #endif
> -
patrick please don't remove blank lines this way. well put blank lines make code much more easily readable.
@@ +1154,2 @@
> }
> + if (!mPollableEvent) {
maybe just:
mPollableEvent.reset(new PollableEvent());
if (!mPollableEvent->Valid()) {
mPollableEvent = nullptr;
NS_WARN... etc?
}
Attachment #8719163 -
Flags: review?(honzab.moz)
Attachment #8719163 -
Flags: review?(dd.mozilla)
Attachment #8719163 -
Flags: review+
![]() |
||
Updated•9 years ago
|
Attachment #8719163 -
Flags: review?(dd.mozilla) → review+
Comment 30•9 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #29)
> Comment on attachment 8719163 [details] [diff] [review]
> mozilla::net::PollableEvent
>
> Review of attachment 8719163 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> The mMaxTimePerPollIter thing is IMO wrong (now I'm not complaining about
> this patch). It could happen that a previously scheduled event may be
> executed later than a next socket handler is executed. I mean:
>
> expected order:
> - dispatch(a)
> - a->Run()
> - OnSocketReady
>
> but we may do:
> - dispatch(a)
> - OnSocketReady() -> I assume and behave as if |a| ran (but actually didn't)
> - a->Run()
>
>
> I don't know if there is a code that might suffer from this reordering, but
> I think it would be good to avoid it anyway.
>
>
> I tho still smell a dead fish here.. the code is pretty much
> overcomplicated and something can be overlooked.
This use to be even worse. We use to do only one event per cycle so it was always:
- dispatch(a)
- OnSocketReady()
- a->Run()
adding mMaxTimePerPollIter make it do more than one event.
> ::: netwerk/base/nsSocketTransportService2.cpp
> @@ +453,5 @@
> > if (mPollList[0].fd) {
> > mPollList[0].out_flags = 0;
> > pollList = mPollList;
> > pollCount = mActiveCount + 1;
> > + pollTimeout = pendingEvents ? PR_INTERVAL_NO_WAIT : PollTimeout();
>
> this is a very good change! before that we could get stuck:
>
> - there is no active socket
> - we loop through the pending events, but not make it on mMaxTimePerPollIter
> time
> - there are events pending left
> - we go to PR_Poll with an arbitrary timeout value
> - we don't wake up sooner until someone posts something
>
No, we were always checking if there is events in queue before calling poll:
http://mxr.mozilla.org/mozilla-central/source/netwerk/base/nsSocketTransportService2.cpp#912
and http://mxr.mozilla.org/mozilla-central/source/netwerk/base/nsSocketTransportService2.cpp#883
![]() |
||
Comment 31•9 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #30)
> (In reply to Honza Bambas (:mayhemer) from comment #29)
> > Comment on attachment 8719163 [details] [diff] [review]
> > mozilla::net::PollableEvent
> >
> > Review of attachment 8719163 [details] [diff] [review]:
> > -----------------------------------------------------------------
> >
> > The mMaxTimePerPollIter thing is IMO wrong (now I'm not complaining about
> > this patch). It could happen that a previously scheduled event may be
> > executed later than a next socket handler is executed. I mean:
> >
> > expected order:
> > - dispatch(a)
> > - a->Run()
> > - OnSocketReady
> >
> > but we may do:
> > - dispatch(a)
> > - OnSocketReady() -> I assume and behave as if |a| ran (but actually didn't)
> > - a->Run()
> >
> >
> > I don't know if there is a code that might suffer from this reordering, but
> > I think it would be good to avoid it anyway.
> >
> >
> > I tho still smell a dead fish here.. the code is pretty much
> > overcomplicated and something can be overlooked.
>
> This use to be even worse. We use to do only one event per cycle so it was
> always:
> - dispatch(a)
> - OnSocketReady()
> - a->Run()
>
> adding mMaxTimePerPollIter make it do more than one event.
So, what is the status with this patch?
>
>
> > ::: netwerk/base/nsSocketTransportService2.cpp
> > @@ +453,5 @@
> > > if (mPollList[0].fd) {
> > > mPollList[0].out_flags = 0;
> > > pollList = mPollList;
> > > pollCount = mActiveCount + 1;
> > > + pollTimeout = pendingEvents ? PR_INTERVAL_NO_WAIT : PollTimeout();
> >
> > this is a very good change! before that we could get stuck:
> >
> > - there is no active socket
> > - we loop through the pending events, but not make it on mMaxTimePerPollIter
> > time
> > - there are events pending left
> > - we go to PR_Poll with an arbitrary timeout value
> > - we don't wake up sooner until someone posts something
> >
>
> No, we were always checking if there is events in queue before calling poll:
> http://mxr.mozilla.org/mozilla-central/source/netwerk/base/
> nsSocketTransportService2.cpp#912
> and
> http://mxr.mozilla.org/mozilla-central/source/netwerk/base/
> nsSocketTransportService2.cpp#883
Maybe I'm wrong on this point. Any event dispatch sets the pollable event. So PR_Poll should have exit immediately.
Assignee | ||
Updated•9 years ago
|
Attachment #8718619 -
Attachment is obsolete: true
Assignee | ||
Updated•9 years ago
|
Assignee | ||
Comment 32•9 years ago
|
||
Comment on attachment 8719163 [details] [diff] [review]
mozilla::net::PollableEvent
Review of attachment 8719163 [details] [diff] [review]:
-----------------------------------------------------------------
::: netwerk/base/PollableEvent.cpp
@@ +109,5 @@
> + if (status != 1) {
> + NS_WARNING("PollableEventSignal Failed\n");
> + SOCKET_LOG(("PollableEventSignal Failed\n"));
> + }
> + return (status == 1);
I agree - other bug. I think we should try and land the more extreme "OMG this is broken" fixes in daniel's patch..
::: netwerk/base/nsSocketTransportService2.cpp
@@ +445,5 @@
> + // DoPollIteration() should service the network without blocking.
> + bool pendingEvents = false;
> + if (mThread) {
> + mThread->HasPendingEvents(&pendingEvents);
> + } else { // shutdown awaits!
ha - good comment. It was obviously null here:
http://archive.mozilla.org/pub/firefox/try-builds/mcmanus@ducksong.com-fbcbc97040eed0a7c139bfe9a7454b044612083b/try-macosx64/try_yosemite_r7_test-xpcshell-bm107-tests1-macosx-build120.txt.gz
from the try run in comment 22 (which didn't have this check)
but looking at it harder, this is actually at startup - a totally different race. Thanks! mthread is done locked on the main thread after init.. so the thread can already be running and not know mthread. I've changed this usage of it to be a raw pointer (accessible only to socket thread) set in the run method.
@@ +1143,2 @@
> // acknowledge pollable event (wait should not block)
> + if (!mPollableEvent || !mPollableEvent->Clear()) {
I updated this to mPollableEvent && !pe->clear().. fyi
Assignee | ||
Comment 33•9 years ago
|
||
Assignee | ||
Comment 34•9 years ago
|
||
Assignee | ||
Comment 35•9 years ago
|
||
![]() |
||
Comment 36•9 years ago
|
||
Comment on attachment 8719163 [details] [diff] [review]
mozilla::net::PollableEvent
Review of attachment 8719163 [details] [diff] [review]:
-----------------------------------------------------------------
::: netwerk/base/nsSocketTransportService2.cpp
@@ +832,1 @@
> mPollList[0].in_flags = PR_POLL_READ;
forgot one important thing: can you also add PR_POLL_EXCEPT here?
Assignee | ||
Comment 37•9 years ago
|
||
Assignee | ||
Comment 38•9 years ago
|
||
Assignee | ||
Updated•9 years ago
|
Attachment #8719163 -
Attachment is obsolete: true
Assignee | ||
Updated•9 years ago
|
Attachment #8720332 -
Flags: review+
Assignee | ||
Comment 39•9 years ago
|
||
Assignee | ||
Comment 40•9 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/d6005f001576ee85c0b7db83a902282094ce2717
Bug 698882 - mozilla::net::PollableEvent r=dragana r=mayhemer
Comment 41•9 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 9 years ago
status-firefox47:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Comment 43•9 years ago
|
||
May crash https://crash-stats.mozilla.com/report/index/03bfefe5-c623-4744-a992-b296d2160221 be triggered by this commit? Crashed 100% on startup.
Assignee | ||
Comment 44•9 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/6183d3a293a98ef1313fd57bbbdafcee0783b29f
Bug 698882 - backout threadevent changes due to crashes with McAfee Firefox Hook r=backout
Assignee | ||
Comment 45•9 years ago
|
||
(In reply to Alexander Usyskin from comment #43)
> May crash
> https://crash-stats.mozilla.com/report/index/03bfefe5-c623-4744-a992-
> b296d2160221 be triggered by this commit? Crashed 100% on startup.
yeah it looks like gcagff.dll (McAfee Firefox Hook) is crashing with this patch. lovely what those damn LSPs do. But its just on ctor of the object so it shouldn't be hard to figure out.
I backed it out and will test and fix with that software if I can install it.
Thanks for reporting it.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [necko-active] → [necko-active][leave-open]
Comment 46•9 years ago
|
||
bugherder |
Assignee | ||
Comment 47•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #45)
> (In reply to Alexander Usyskin from comment #43)
> > May crash
> > https://crash-stats.mozilla.com/report/index/03bfefe5-c623-4744-a992-
> > b296d2160221 be triggered by this commit? Crashed 100% on startup.
>
> yeah it looks like gcagff.dll (McAfee Firefox Hook) is crashing with this
> patch. lovely what those damn LSPs do. But its just on ctor of the object so
> it shouldn't be hard to figure out.
>
> I backed it out and will test and fix with that software if I can install it.
>
> Thanks for reporting it.
afaict fgcagff.dll is the mcafee data loss prevention module.. which doesn't seem to be a consumer grade item that I can easily download and test with.
If I make a few guesses at how to resolve this, would you Alexander be able to test the builds?
Flags: needinfo?(sanniu)
Assignee | ||
Comment 48•9 years ago
|
||
https://bugzilla.mozilla.org/show_bug.cgi?id=1250979 has info on a talos test regression that needs to be sorted out before relanding
Assignee | ||
Comment 49•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #48)
> https://bugzilla.mozilla.org/show_bug.cgi?id=1250979 has info on a talos
> test regression that needs to be sorted out before relanding
That was wrong on my part.
The "regression" was on 6183d3a293a9 which was the backout of this patch (due to crashes). So the patch is actually a performance win (as expected). good news. I do like this patch.
Comment 50•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #47)
> (In reply to Patrick McManus [:mcmanus] from comment #45)
> > (In reply to Alexander Usyskin from comment #43)
> > > May crash
> > > https://crash-stats.mozilla.com/report/index/03bfefe5-c623-4744-a992-
> > > b296d2160221 be triggered by this commit? Crashed 100% on startup.
> >
> > yeah it looks like gcagff.dll (McAfee Firefox Hook) is crashing with this
> > patch. lovely what those damn LSPs do. But its just on ctor of the object so
> > it shouldn't be hard to figure out.
> >
> > I backed it out and will test and fix with that software if I can install it.
> >
> > Thanks for reporting it.
>
> afaict fgcagff.dll is the mcafee data loss prevention module.. which doesn't
> seem to be a consumer grade item that I can easily download and test with.
>
> If I make a few guesses at how to resolve this, would you Alexander be able
> to test the builds?
Yes, I would be glad to help. I can test you engineering builds.
Flags: needinfo?(sanniu)
Assignee | ||
Comment 51•9 years ago
|
||
Assignee | ||
Comment 52•9 years ago
|
||
Assignee | ||
Comment 53•9 years ago
|
||
Assignee | ||
Comment 54•9 years ago
|
||
Assignee | ||
Comment 55•9 years ago
|
||
Assignee | ||
Comment 56•9 years ago
|
||
Assignee | ||
Comment 57•9 years ago
|
||
Assignee | ||
Comment 58•9 years ago
|
||
OK Alexander, here we go!
I have a series of builds.. they build on each other.. so the idea is just to find the first one that fails for you by crashing on startup.. after that there is no point in continuing.
They start by including just the very basic functionality with all of the config on it stripped away.. so its very possible that firefox won't work well (or conceivably at all) with some of them.. that's ok - what we're looking for is the crash
https://archive.mozilla.org/pub/firefox/try-builds/mcmanus@ducksong.com-9ac39d4b5f15be95d8e8a6e02af7bd0a8697db23/
https://archive.mozilla.org/pub/firefox/try-builds/mcmanus@ducksong.com-feb17c7cc4057bc457fb8efbdfb4aed2d8ab5215/
https://archive.mozilla.org/pub/firefox/try-builds/mcmanus@ducksong.com-51c5d580f4284ac764f00b9074a5ba96e9baa95a/
https://archive.mozilla.org/pub/firefox/try-builds/mcmanus@ducksong.com-64b0bb93538147439b8aebf7e2f5230c28c2a943/
https://archive.mozilla.org/pub/firefox/try-builds/mcmanus@ducksong.com-99f8b58ad30c1df29b2574495ea6352c74dfb661/
https://archive.mozilla.org/pub/firefox/try-builds/mcmanus@ducksong.com-aa3ff98df8596c805b36c20f262a62f97e38feaf/
https://archive.mozilla.org/pub/firefox/try-builds/mcmanus@ducksong.com-5259a7398eb8f39862ec0a80abef38cb313a73d6/
Thanks!
Assignee | ||
Comment 59•9 years ago
|
||
(oh, and those builds just started.. so if you're too quick to download them you might need to wait a few hours for them to be complete)
Flags: needinfo?(sanniu)
Comment 60•9 years ago
|
||
Tried:
https://archive.mozilla.org/pub/firefox/try-builds/mcmanus@ducksong.com-9ac39d4b5f15be95d8e8a6e02af7bd0a8697db23/
https://archive.mozilla.org/pub/firefox/try-builds/mcmanus@ducksong.com-feb17c7cc4057bc457fb8efbdfb4aed2d8ab5215/
https://archive.mozilla.org/pub/firefox/try-builds/mcmanus@ducksong.com-5259a7398eb8f39862ec0a80abef38cb313a73d6/
all three crashed
Flags: needinfo?(sanniu)
Assignee | ||
Comment 61•9 years ago
|
||
thanks. any links from about:crashes ?
Comment 62•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #61)
> thanks. any links from about:crashes ?
Re-run with crash reports:
https://crash-stats.mozilla.com/report/index/bp-5fef5c6e-89a5-47b7-a1b0-501ee2160307
https://crash-stats.mozilla.com/report/index/bp-b7bf959e-f771-4624-8744-c84e22160307
https://crash-stats.mozilla.com/report/index/bp-5fef5c6e-89a5-47b7-a1b0-501ee2160307
Assignee | ||
Comment 63•9 years ago
|
||
pr_getnameforidentity (and much of nspr) does some definitely non-threadsafe lazy initialization behavior.. but the xpcom startup code happens to run this at a safe single threaded early time (PR_Getenv() triggers the code).. so one more reason to be suspicious of nspr, but unlikely the issue here.
Assignee | ||
Comment 64•9 years ago
|
||
Assignee | ||
Comment 65•9 years ago
|
||
pr_getnameforidentity is part of nspr, but is not used by gecko - it is used by this dll (and is what crashes).
it is definitely not threadsafe with respect to PR_GetUniqueIdentity() which is widely used in gecko, and is used on several threads.
so that's bad. But I'm not sure why this patch would expose that problem.
Assignee | ||
Comment 66•9 years ago
|
||
Assignee | ||
Comment 67•9 years ago
|
||
Assignee | ||
Comment 68•9 years ago
|
||
Assignee | ||
Comment 69•9 years ago
|
||
Assignee | ||
Comment 70•9 years ago
|
||
Assignee | ||
Comment 71•9 years ago
|
||
Hi Alexander, thanks for your help here. I have a few more builds for you to try - I'm sorry I don't have a better way to test this out. I'm not sure if the answer is a race condition inside NSPR or a side effect of how the library is implemented. Although I'm confident I've identified a race condition, your crash is very consistent and that makes it seem like it probably isn't the underlying cause.
So I have 6 builds, can you download them and try them out?
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d0341e69020e
https://treeherder.mozilla.org/#/jobs?repo=try&revision=780745ff0752
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a5e540f3069a
https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa7132a84883
https://treeherder.mozilla.org/#/jobs?repo=try&revision=844426095377
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d87341ca8d00
Flags: needinfo?(sanniu)
Comment 72•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #71)
> Hi Alexander, thanks for your help here. I have a few more builds for you to
> try - I'm sorry I don't have a better way to test this out. I'm not sure if
> the answer is a race condition inside NSPR or a side effect of how the
> library is implemented. Although I'm confident I've identified a race
> condition, your crash is very consistent and that makes it seem like it
> probably isn't the underlying cause.
>
> So I have 6 builds, can you download them and try them out?
>
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=d0341e69020e
crash
https://crash-stats.mozilla.com/report/index/bp-a1f9c531-d78a-43c2-9515-1bca92160316
https://crash-stats.mozilla.com/report/index/bp-bfc9868e-b934-4ae6-a06e-c91a72160316
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=780745ff0752
ok
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=a5e540f3069a
ok
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa7132a84883
crash
https://crash-stats.mozilla.com/report/index/bp-1f32cf24-9e06-4ba7-992b-21cbb2160316
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=844426095377
crash
https://crash-stats.mozilla.com/report/index/bp-283e5a8a-3290-4af5-abbf-186c32160316
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=d87341ca8d00
crash
https://crash-stats.mozilla.com/report/index/bp-8dc95941-e4a1-4f72-afd5-6f7b02160316
Second build and all subsequent triggered "unknown app" win8.1 protection window
Flags: needinfo?(sanniu)
Assignee | ||
Comment 73•9 years ago
|
||
Alexander, huge help here. Thank you. McAfee seems to be hardcoded to expect an NSPR layer on top of the read side of the localhost pair (as it happens to be in released code). We can give it one fairly painlessly, which seems to be the path of least resistance. (that's what worked in builds 2 and 3).
Assignee | ||
Comment 74•9 years ago
|
||
Assignee | ||
Comment 75•9 years ago
|
||
Attachment #8731262 -
Flags: review?(dd.mozilla)
Assignee | ||
Updated•9 years ago
|
Attachment #8720332 -
Attachment is obsolete: true
Assignee | ||
Comment 76•9 years ago
|
||
Comment on attachment 8731262 [details] [diff] [review]
mozilla::net::PollableEvent
Review of attachment 8731262 [details] [diff] [review]:
-----------------------------------------------------------------
The primary changes you will find here
1] pollableEvent ctor moved to socket thread. No reason to do that stuff during module construction on the main thread. Also makes the new lazy init threadsafe to do
2] an empty nspr layer that mirrors the one that nspr pollable event was using. It seems we have an LSP hardcoded to believe that is there. Its ugly but I think performance wise its a lateral move.. the same thing happens on current releases and is dwarfed by the IO event anyhow (plus we call this path less which is a benefit of the patch overall). I only put it on the socket and not the pipe.
3] mem leak trackers for PollableEvent
Assignee | ||
Comment 77•9 years ago
|
||
Comment on attachment 8731262 [details] [diff] [review]
mozilla::net::PollableEvent
Review of attachment 8731262 [details] [diff] [review]:
-----------------------------------------------------------------
oh, it also makes the rwin change that should satisfy the "sg tcp optimizer" case. Can we verify that?
Comment 78•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #77)
> Comment on attachment 8731262 [details] [diff] [review]
> mozilla::net::PollableEvent
>
> Review of attachment 8731262 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> oh, it also makes the rwin change that should satisfy the "sg tcp optimizer"
> case. Can we verify that?
I will test it.
Assignee | ||
Comment 79•9 years ago
|
||
Assignee | ||
Comment 80•9 years ago
|
||
comment 79 brings a localhost/pair implementation into PollableEvent (instead of using PR_NewTCPSocketPair()) so that the recv window can be set before connection, required by the "sg tcp optimizer" bug. let's see how try feels about that before asking for review :)
Comment 81•9 years ago
|
||
this is firefox release.
Comment 82•9 years ago
|
||
This is the last try build comment 79.
As soon as you change rwin auto tuning is turned off for the socket.
So the "sg tcp optimier" bug is fixed.
Assignee | ||
Comment 83•9 years ago
|
||
Attachment #8731392 -
Flags: review?(dd.mozilla)
Assignee | ||
Updated•9 years ago
|
Attachment #8731262 -
Attachment is obsolete: true
Attachment #8731262 -
Flags: review?(dd.mozilla)
Assignee | ||
Comment 84•9 years ago
|
||
Thanks so much for those tests! One thing worries me..
in the new.cap the rwin for the receiver seems to be perpetually shrinking with each ack. As if firefox is not reading bytes into the application layer.
but I've played around with this on linux and see the same thing and then it sort of warps around and catches up after a few thousand. I've added logging and (on linux - when socket is forced) the reads and writes are balanced. It is probably the auto-tuning algorithm that I'm seeing.. still..
But just to be sure, can you run a test on windows on the "sg tcp optimizer" host that moves around 64K events or whatever you need to be sure we won't exhaust that window? how did you capture localhost traffic on windows?
Assignee | ||
Comment 85•9 years ago
|
||
Assignee | ||
Comment 86•9 years ago
|
||
Attachment #8731408 -
Flags: review?(dd.mozilla)
Assignee | ||
Updated•9 years ago
|
Attachment #8731392 -
Attachment is obsolete: true
Attachment #8731392 -
Flags: review?(dd.mozilla)
Assignee | ||
Comment 87•9 years ago
|
||
alright, I'll stop fussing with it now :)
the comment 86 patch contains logs for the read/write return values so we can see if for some reason they don't balance.. and it also skips creating the pollable event if we're on the child process (as this will create a unused localhost pair in the child.. which we must be doing with the classic mThreadEvent :()
Comment 88•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #84)
> Thanks so much for those tests! One thing worries me..
>
> in the new.cap the rwin for the receiver seems to be perpetually shrinking
> with each ack. As if firefox is not reading bytes into the application layer.
>
> but I've played around with this on linux and see the same thing and then it
> sort of warps around and catches up after a few thousand. I've added logging
> and (on linux - when socket is forced) the reads and writes are balanced. It
> is probably the auto-tuning algorithm that I'm seeing.. still..
>
> But just to be sure, can you run a test on windows on the "sg tcp optimizer"
> host that moves around 64K events or whatever you need to be sure we won't
> exhaust that window? how did you capture localhost traffic on windows?
I will test it tomorrow.
RawCap can capture localhost on windows.
Assignee | ||
Comment 89•9 years ago
|
||
Attachment #8731417 -
Flags: review?(dd.mozilla)
Assignee | ||
Updated•9 years ago
|
Attachment #8731408 -
Attachment is obsolete: true
Attachment #8731408 -
Flags: review?(dd.mozilla)
Assignee | ||
Comment 90•9 years ago
|
||
re comment 89 - I put pollableEvent back on the child process because RTC needs the full sockettransport stack there
Assignee | ||
Comment 91•9 years ago
|
||
Comment 92•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #84)
> Thanks so much for those tests! One thing worries me..
>
> in the new.cap the rwin for the receiver seems to be perpetually shrinking
> with each ack. As if firefox is not reading bytes into the application layer.
>
> but I've played around with this on linux and see the same thing and then it
> sort of warps around and catches up after a few thousand. I've added logging
> and (on linux - when socket is forced) the reads and writes are balanced. It
> is probably the auto-tuning algorithm that I'm seeing.. still..
>
> But just to be sure, can you run a test on windows on the "sg tcp optimizer"
> host that moves around 64K events or whatever you need to be sure we won't
> exhaust that window? how did you capture localhost traffic on windows?
I have tested this and it always decrease until 64076 an than increase back to 65535. It is the TCP implementation. (I have capture around 330000 packet - around 160000 writes)
Updated•9 years ago
|
Attachment #8731417 -
Flags: review?(dd.mozilla) → review+
Assignee | ||
Comment 93•9 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/06e41d2080a2a0736c320aaa7f8319e51515ce85
Bug 698882 - mozilla::net::PollableEvent r=dragana r=mayhemer
Comment 94•9 years ago
|
||
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/01c747d7ca4f for leaking on OS X 10.6, https://treeherder.mozilla.org/logviewer.html#?job_id=24198571&repo=mozilla-inbound
10.6 debug being best known for slow, I wouldn't be surprised if it also winds up leaking on WinXP, but that hasn't finished and run tests yet, thanks to a multi-hour build outage.
Comment 95•9 years ago
|
||
Or, how about a much stranger Windows result, 100% https://treeherder.mozilla.org/logviewer.html#?job_id=24199956&repo=mozilla-inbound in xperf.
Assignee | ||
Comment 96•9 years ago
|
||
(In reply to Phil Ringnalda (:philor) from comment #95)
> Or, how about a much stranger Windows result, 100%
> https://treeherder.mozilla.org/logviewer.html#?job_id=24199956&repo=mozilla-
> inbound in xperf.
talos bug! We've zero'd out two counters :) That's good news and more or less expected.
before this patch:
07:39 INFO - COUNTER 'nonmain_normal_netio': [186735738.0]
20:07:39 INFO - COUNTER 'mainthread_writebytes': [[32, 'C:\\ProgramData\\NVIDIA Corporation\\Drs\\nvAppTimestamps'], [782, 'C:\\Users\\cltbld\\AppData\\Local\\Temp\\tmpjxtq4x\\profile\\compatibility.ini']]
20:07:39 INFO - COUNTER 'main_startup_fileio': [42398765.0]
20:07:39 INFO - COUNTER 'main_normal_netio': [3674613.0]
20:07:39 INFO - COUNTER 'nonmain_startup_fileio': [309099.0]
20:07:39 INFO - COUNTER 'main_startup_netio': [13.0]
20:07:39 INFO - COUNTER 'nonmain_normal_fileio': [476665801.0]
20:07:39 INFO - COUNTER 'main_normal_fileio': [51818957.0]
after this patch:
20:06:38 INFO - COUNTER 'nonmain_normal_netio': [187687767.0]
20:06:38 INFO - COUNTER 'mainthread_writebytes': [[782, 'C:\\Users\\cltbld\\AppData\\Local\\Temp\\tmp151bji\\profile\\compatibility.ini'], [32, 'C:\\ProgramData\\NVIDIA Corporation\\Drs\\nvAppTimestamps']]
20:06:38 INFO - COUNTER 'main_startup_fileio': [42402881.0]
20:06:38 INFO - COUNTER 'main_normal_netio': []
20:06:38 INFO - COUNTER 'nonmain_startup_fileio': [309099.0]
20:06:38 INFO - COUNTER 'main_startup_netio': []
20:06:38 INFO - COUNTER 'nonmain_normal_fileio': [479767805.0]
20:06:38 INFO - COUNTER 'main_normal_fileio': [51912637.0]
Assignee | ||
Comment 97•9 years ago
|
||
Assignee | ||
Comment 98•9 years ago
|
||
(In reply to Phil Ringnalda (:philor) from comment #94)
> Backed out in
> https://hg.mozilla.org/integration/mozilla-inbound/rev/01c747d7ca4f for
> leaking on OS X 10.6,
> https://treeherder.mozilla.org/logviewer.html#?job_id=24198571&repo=mozilla-
> inbound
>
> 10.6 debug being best known for slow, I wouldn't be surprised if it also
> winds up leaking on WinXP, but that hasn't finished and run tests yet,
> thanks to a multi-hour build outage.
fwiw - that's bug 1256022 (now with a new object added to the tree of leaked stuff.)
Depends on: 1256022
Assignee | ||
Comment 99•9 years ago
|
||
Assignee | ||
Comment 100•9 years ago
|
||
Assignee | ||
Comment 101•9 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/34046c232ee1f7324ac586cf0b72374764d8f513
Bug 698882 - mozilla::net::PollableEvent r=dragana r=mayhemer
Assignee | ||
Comment 102•9 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/bbc95a357cfdfd7226945d817a88d21675b1824f
Bug 698882 - backout 34046c232ee1 r=backout
Assignee | ||
Comment 103•9 years ago
|
||
comment 95 is bug 1258414 - independent, and resolved.
comment 98 is bug 1256022 - independent, and will be pushed together with this.
push comment 101 (that was backed out in comment 102) was because the pollable event listen socket was made non blocking. This is the socket that is actually closed once the client/server pair has been established.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=bf5375f8cbe5&selectedJob=18612843 looks good. The only blemish is the frequent occurance of 1222277 - which is occurring frequently without the patch as well.
let's go for it.
Assignee | ||
Comment 104•9 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/7c4176c6d8801496127e4643324ef8cf555b38cb
Bug 698882 - mozilla::net::PollableEvent r=dragana r=mayhemer
Assignee | ||
Updated•9 years ago
|
Whiteboard: [necko-active][leave-open] → [necko-active]
Comment 105•9 years ago
|
||
bugherder |
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
status-firefox48:
--- → fixed
Resolution: --- → FIXED
Target Milestone: mozilla47 → mozilla48
Comment 106•9 years ago
|
||
I am seeing a large increase in TSan-reported data races on xpcshell tests
within the past week. I suspect this to be the cause. It appears to be
racing on nsSocketTransportService::mPollableEvent. It is first read at
nsSocketTransportService2.cpp:631 (nsSocketTransportService::SetOffline)
whilst mLock is held, but is later written at nsSocketTransportService2.cpp:803
(nsSocketTransportService::Run) with no lock held.
631 if (mPollableEvent) {
632 mPollableEvent->Signal();
633 }
803 mPollableEvent.reset(new PollableEvent());
Assignee | ||
Comment 108•9 years ago
|
||
yes, the 803 bit should hold a lock. I'll file a followon - thanks. This should only be during startup fwiw.
Flags: needinfo?(mcmanus)
Assignee | ||
Comment 109•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #108)
> yes, the 803 bit should hold a lock. I'll file a followon - thanks. This
> should only be during startup fwiw.
https://bugzilla.mozilla.org/show_bug.cgi?id=1260764
Comment 110•9 years ago
|
||
We knew that this patch will help in some cases of shutdown hang. Looking at the crash stats:
https://crash-stats.mozilla.com/signature/?date=%3E2016-01-01&version=48.0a1&signature=shutdownhang+%7C+WaitForSingleObjectEx+%7C+WaitForSingleObject+%7C+PR_WaitCondVar+%7C+nsThread%3A%3AProcessNextEvent+%7C+NS_ProcessNextEvent+%7C+mozilla%3A%3Anet%3A%3AnsHttpConnectionMgr%3A%3AShutdown&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#reports
Nightly does not have a lot of shutdown hangs, but this is interesting:
There are 19 hangs and they are all with builds before 2016-03-24.
This patch landed on 2016-03-28 and it is probably the reason for this improvement. Nothing else landed in that period in necko that could improved the stats.
We could probably uplift this to aurora, but I will write all pros and risks as well.
Comment 111•9 years ago
|
||
This patch is the merge of this bug and bug 1260764.
It is also rebased for aurora.
Comment 112•9 years ago
|
||
This is reverse patch for backing out bug_698882_aurora.patch
Comment 114•9 years ago
|
||
Comment on attachment 8740487 [details] [diff] [review]
bug_698882_aurora.patch
Approval Request Comment
[Feature/regressing bug #]: Comment #110 explains an interesting finding. Most probably it was this patch that improved shutdown hangs stats.
[User impact if declined]: Eliminates some shutdown hangs (not guaranty that will eliminate all of them)
[Describe test coverage new/current, TreeHerder]: it is in nightly since 28.3.2016 and it is heavily used code.
1 - data implies this is good even though not directly targeted at this problem
2 - we know of several other bugs that it fixes
3 - startup perf improvement
4 - had a good run on nightly without problems and is heavily used.
[Risks and why]:
1 - mid sized patch
2 - lots of "surface area"
3 - highly dependent on state and calling pattern - makes it hard to test
4 - shown to have unexpected interactions with LSP, intermittent orange, etc.. as second order effects. On the other hand it may have cleaned up some of those as well. (shutdown hangs?)
5 - no str so no verification possible other than crash-stats data
[String/UUID change made/needed]:none
This patch contains change from this bug and bug 1260764.
Bug 1258414 needs to be uplifted as well, but that is just a 2 line fix for a test.
There is also a backout patch attached to this bug.
Attachment #8740487 -
Flags: approval-mozilla-aurora?
Updated•9 years ago
|
Comment 115•9 years ago
|
||
Hi Jason, Patrick, Honza: Given the medium risk associated with the patch(es) here, I would like to get your opinion on whether this is worth uplifting to Aurora 47 or not? The data from Nightly looks promising but I haven't looked at anything new that has cropped up since this landed. Appreciate your help here.
Flags: needinfo?(odvarko)
Flags: needinfo?(mcmanus)
Flags: needinfo?(jduell.mcbugs)
![]() |
||
Updated•9 years ago
|
Flags: needinfo?(odvarko) → needinfo?(honzab.moz)
![]() |
||
Comment 116•9 years ago
|
||
I don't think I have enough information to tell for sure here. The patch definitely has benefits according performance and stability, getting proven. But I think Patrick has move info on about wider negative impact, if any.
Flags: needinfo?(honzab.moz)
Assignee | ||
Comment 117•9 years ago
|
||
(In reply to Ritu Kothari (:ritu) from comment #115)
> Hi Jason, Patrick, Honza: Given the medium risk associated with the
> patch(es) here, I would like to get your opinion on whether this is worth
> uplifting to Aurora 47 or not? The data from Nightly looks promising but I
> haven't looked at anything new that has cropped up since this landed.
> Appreciate your help here.
Hi Ritu,
As the author here, I can say this patch is great :) Uplift depends largely on how badly you want to quickly address the shutdown hang in bug 1158189 which has been described as the #3 top crasher.
We believe this patch largely fixes that problem. Unfortunately, without a STR for that bug we just focused on addressing a whole class of things that looked related to stablity - and this patch was one of those initiatives. I'm not exactly certain why it has the outsized positive impact on 1158189 (lacking an STR for that) but I'm not really surprised as there are other well defined scenarios that I can explain why it fixes.. 1158189 is probably some more complicated manifestation of those. Shutdown hang data can be notoriously hard to interpret.
So I recommended to dragana that we go with the uplift for one channel. There is a small chance that this actually isn't the reason those crashes went away, but it seems very unlikely given the data.
As for stability risks from the patch I'm not particularly worried about the patch logic itself - that's pretty well understood. Its more a matter that it changes/improves some things at a low level and we've seen a couple interactions from that.. e.g. one windows binary LSP was making completely undocumented assumptions about our internal ABI from the old method.. and a couple tests that have always been occasionally orange got very orange when the patch first landed due to timing changes (which helped us find those unrelated bugs that were always in those tests).
This patch is always going to be at some risk for new interop challenges as it goes through beta.. I'm not sure waiting an extra 6 weeks to get it there is going to tell us anything extra - so I would land it given the stability it brings. The perf win is a bonus.
Flags: needinfo?(mcmanus)
Comment 118•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #117)
> (In reply to Ritu Kothari (:ritu) from comment #115)
> > Hi Jason, Patrick, Honza: Given the medium risk associated with the
> > patch(es) here, I would like to get your opinion on whether this is worth
> > uplifting to Aurora 47 or not? The data from Nightly looks promising but I
> > haven't looked at anything new that has cropped up since this landed.
> > Appreciate your help here.
>
> Hi Ritu,
>
> As the author here, I can say this patch is great :) Uplift depends largely
> on how badly you want to quickly address the shutdown hang in bug 1158189
> which has been described as the #3 top crasher.
>
> We believe this patch largely fixes that problem. Unfortunately, without a
> STR for that bug we just focused on addressing a whole class of things that
> looked related to stablity - and this patch was one of those initiatives.
> I'm not exactly certain why it has the outsized positive impact on 1158189
> (lacking an STR for that) but I'm not really surprised as there are other
> well defined scenarios that I can explain why it fixes.. 1158189 is probably
> some more complicated manifestation of those. Shutdown hang data can be
> notoriously hard to interpret.
>
> So I recommended to dragana that we go with the uplift for one channel.
> There is a small chance that this actually isn't the reason those crashes
> went away, but it seems very unlikely given the data.
>
> As for stability risks from the patch I'm not particularly worried about the
> patch logic itself - that's pretty well understood. Its more a matter that
> it changes/improves some things at a low level and we've seen a couple
> interactions from that.. e.g. one windows binary LSP was making completely
> undocumented assumptions about our internal ABI from the old method.. and a
> couple tests that have always been occasionally orange got very orange when
> the patch first landed due to timing changes (which helped us find those
> unrelated bugs that were always in those tests).
>
> This patch is always going to be at some risk for new interop challenges as
> it goes through beta.. I'm not sure waiting an extra 6 weeks to get it there
> is going to tell us anything extra - so I would land it given the stability
> it brings. The perf win is a bonus.
Thanks for the detailed write up (even though you could have left it at "I am the author and the patch is great!" ;). I think shutdown hangs have been a cause for concern for a few releases now so this fix seems timely. I think we will need to take a risk at some point and if the data from Nightly looks promising let's uplift to Aurora now before 47 goes to Beta.
I hope you guys will be watching any new issues that crop up closely and worst case we will be able to cleanly back out this patch from 47.
Comment 119•9 years ago
|
||
Comment on attachment 8740487 [details] [diff] [review]
bug_698882_aurora.patch
Nightly data suggests this patch has eliminated (reduced) shutdown hangs of some signatures, let's uplift to Aurora47.
Attachment #8740487 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 120•9 years ago
|
||
has problems to apply to uplift :
merging netwerk/base/moz.build
merging netwerk/base/nsSocketTransportService2.cpp
merging netwerk/base/nsSocketTransportService2.h
merging netwerk/standalone/moz.build
warning: conflicts while merging netwerk/base/nsSocketTransportService2.h! (edit, then use 'hg resolve --mark')
abort: unresolved conflicts, can't continue
(use hg resolve and hg graft --continue)
Flags: needinfo?(mcmanus)
Assignee | ||
Updated•9 years ago
|
Flags: needinfo?(mcmanus)
Flags: needinfo?(jduell.mcbugs)
Flags: needinfo?(dd.mozilla)
Comment 121•9 years ago
|
||
(In reply to Carsten Book [:Tomcat] from comment #120)
> has problems to apply to uplift :
>
> merging netwerk/base/moz.build
> merging netwerk/base/nsSocketTransportService2.cpp
> merging netwerk/base/nsSocketTransportService2.h
> merging netwerk/standalone/moz.build
> warning: conflicts while merging netwerk/base/nsSocketTransportService2.h!
> (edit, then use 'hg resolve --mark')
> abort: unresolved conflicts, can't continue
> (use hg resolve and hg graft --continue)
I tried to apply it locally on aurora and I do not see any conflicts.
I will try later...
Flags: needinfo?(dd.mozilla)
Comment 122•9 years ago
|
||
https://hg.mozilla.org/releases/mozilla-aurora/rev/efa0fb87773e sorry my bad before with the conflict
Comment 123•9 years ago
|
||
sorry had to back this out , seems this caused test failures like https://treeherder.mozilla.org/logviewer.html#?job_id=2420245&repo=mozilla-aurora
and https://treeherder.mozilla.org/logviewer.html#?job_id=2420627&repo=mozilla-aurora
Flags: needinfo?(mcmanus)
Comment 124•9 years ago
|
||
(In reply to Carsten Book [:Tomcat] from comment #123)
> sorry had to back this out , seems this caused test failures like
> https://treeherder.mozilla.org/logviewer.html#?job_id=2420245&repo=mozilla-
> aurora
>
I am looking into this one.
> and
> https://treeherder.mozilla.org/logviewer.html#?job_id=2420627&repo=mozilla-
> aurora
this is bug 1258414 needs to be uplifted too :)
Assignee | ||
Updated•9 years ago
|
Flags: needinfo?(mcmanus)
Comment 125•9 years ago
|
||
I was investigating https://treeherder.mozilla.org/logviewer.html#?job_id=2420245&repo=mozilla-aurora
but I could not reproduce it at all and it showed up once.
On the other hand, bug 1152522 shows up 5 times:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=efa0fb87773ed6e6850dcaf794d53651429decca&selectedJob=2422119
I am not sure why it is so often on aurora.
I manage to reproduce it 1-2 times locally on on mac, but without logging :(
At the first glance it looks like the test is racy. The server is not close completely before it is started again on the same port.
Assignee | ||
Comment 126•9 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #125)
> I was investigating
> https://treeherder.mozilla.org/logviewer.html#?job_id=2420245&repo=mozilla-
> aurora
> but I could not reproduce it at all and it showed up once.
>
> On the other hand, bug 1152522 shows up 5 times:
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> aurora&revision=efa0fb87773ed6e6850dcaf794d53651429decca&selectedJob=2422119
>
> I am not sure why it is so often on aurora.
> I manage to reproduce it 1-2 times locally on on mac, but without logging :(
> At the first glance it looks like the test is racy. The server is not close
> completely before it is started again on the same port.
probly a dup of 1256022 which had to be fixed with this bug? (unrelated afaict)
Comment 127•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #126)
> (In reply to Dragana Damjanovic [:dragana] from comment #125)
> > I was investigating
> > https://treeherder.mozilla.org/logviewer.html#?job_id=2420245&repo=mozilla-
> > aurora
> > but I could not reproduce it at all and it showed up once.
> >
> > On the other hand, bug 1152522 shows up 5 times:
> > https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> > aurora&revision=efa0fb87773ed6e6850dcaf794d53651429decca&selectedJob=2422119
> >
> > I am not sure why it is so often on aurora.
> > I manage to reproduce it 1-2 times locally on on mac, but without logging :(
> > At the first glance it looks like the test is racy. The server is not close
> > completely before it is started again on the same port.
>
> probly a dup of 1256022 which had to be fixed with this bug? (unrelated
> afaict)
yes this is bug 1256022.
Comment 128•9 years ago
|
||
Bug 1256022 and bug 1258414 need uplift to aurora as well.
Comment 129•9 years ago
|
||
bugherder uplift |
Comment 130•9 years ago
|
||
Backed out in https://hg.mozilla.org/releases/mozilla-aurora/rev/08791c3d6f713f4032f704e44bae22353bbdae77
A thing which we misstarred as being bug 1152522 the last time you landed on aurora (and which happened quite a bit when you first landed on m-i on March 20th, but has only happened a few times since you relanded on trunk), is happening about 50% of the time on all three flavors on Windows on m-a, https://treeherder.mozilla.org/logviewer.html#?job_id=2447128&repo=mozilla-aurora (I picked that one out because I like how it also has the one-off "test_register_5xxCode_http2.js | Test timed out" of comment 123 back again), and what's going to be far worse for you to figure out is Windows mochitest-4, which either hits the useless and frustrating https://treeherder.mozilla.org/logviewer.html#?job_id=2448025&repo=mozilla-aurora which doesn't actually say that test_xhr_system.html crashed, but instead says that test_xhr_system.html is the last test to run in that chunk and that shutdown was slow enough that RunWatchdog got pissed off and crashed the browser, or, https://treeherder.mozilla.org/logviewer.html#?job_id=2450895&repo=mozilla-aurora with test_xultree_animation.xhtml | application crashed [@ mozilla::image::SourceBuffer::Complete(nsresult)] after Assertion failure: false (MOZ_ASSERT_UNREACHABLE: Called Complete more than once), at c:/builds/moz2_slave/m-aurora-w32-d-000000000000000/build/src/image/SourceBuffer.cpp:400
Comment 131•9 years ago
|
||
But at least you don't have to worry about the xpcshell failures in security/manager/ssl/tests/unit/test_signed_dir.js, which were just a timebomb as some cert expired.
Comment 132•9 years ago
|
||
Comment 133•9 years ago
|
||
ok, I notice that PollableEvent fails to make a tcp socket pair. And I found out the the patch attached to this bug differs from the change made to the m-c (I have not notice that before because there is no comment about this). There is only one line diff and it is very plausible that that is a cause. In comment #132 is a try run with the same patch as the patch pushed to the m-c.
Patrick, what was the reason to make this one-line change:
84c86
< +
---
> + PR_SetSocketOption(listener, &noblockOpt);
Flags: needinfo?(mcmanus)
Comment 134•9 years ago
|
||
The original patch attached to the bug (attachment mozilla::net::PollableEvent) differed from the patch pushed to m-c, therefore the previous aurora patch (made from the attached patch) was making problems with some tests.
This patch has the same changes as patch pushed to m-c (plus the changes from bug 1260764)
Approval Request Comment
[Feature/regressing bug #]:
[User impact if declined]:
[Describe test coverage new/current, TreeHerder]:
[Risks and why]:
[String/UUID change made/needed]:
Attachment #8740487 -
Attachment is obsolete: true
Attachment #8744807 -
Flags: approval-mozilla-aurora?
Comment 135•9 years ago
|
||
Attachment #8740497 -
Attachment is obsolete: true
Assignee | ||
Comment 136•9 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #133)
> ok, I notice that PollableEvent fails to make a tcp socket pair. And I found
> out the the patch attached to this bug differs from the change made to the
> m-c (I have not notice that before because there is no comment about this).
> There is only one line diff and it is very plausible that that is a cause.
> In comment #132 is a try run with the same patch as the patch pushed to the
> m-c.
> Patrick, what was the reason to make this one-line change:
>
> 84c86
> < +
> ---
> > + PR_SetSocketOption(listener, &noblockOpt);
See Comment 103.
On slow platforms the accept() was being called before the connect() had been processed on the kernel side completely and when it was NB it just didn't wait for it.. the only thing the listener is used for is setting up this localhost pair, so blocking is the semantic we want
Flags: needinfo?(mcmanus)
Comment 137•9 years ago
|
||
Comment on attachment 8744807 [details] [diff] [review]
bug_698882_aurora.patch
Assuming this needs to be uplifted to Fx47, I am changing the approval flag from Aurora47 to Beta47 (m-a -> m-b merge is complete).
Attachment #8744807 -
Flags: approval-mozilla-aurora? → approval-mozilla-beta+
Comment 138•9 years ago
|
||
Hey Wes, is this a patch that still needs to be uplifted to Fx47? I am not entirely sure but I know you can definitely help figure it out. Thanks a lot!
Flags: needinfo?(wkocher)
Comment 139•9 years ago
|
||
This is not on beta ff47 :)
![]() |
||
Comment 140•9 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #139)
> This is not on beta ff47 :)
This has status-firefox47, are we missing a patch. Looking at beta 47 build 1 this is currently the #2 top crasher.
https://crash-stats.mozilla.com/topcrashers/?product=Firefox&version=47.0b1&days=7
Comment 141•9 years ago
|
||
(In reply to Jim Mathies [:jimm] from comment #140)
> (In reply to Dragana Damjanovic [:dragana] from comment #139)
> > This is not on beta ff47 :)
>
> This has status-firefox47, are we missing a patch. Looking at beta 47 build
> 1 this is currently the #2 top crasher.
>
> https://crash-stats.mozilla.com/topcrashers/?product=Firefox&version=47.
> 0b1&days=7
This i not in ff47 :) still needs uplift. (I cannot push it still do not have writes)
Flags: needinfo?(wkocher)
Comment 143•9 years ago
|
||
Let's backout this from beta. We have some hangs. I do not know what went wrong.(It looks reproducible, so I was hoping that someone will come forward fit a repo but till now no luck)
Maybe something went wrong in this uplift and I hope 48 will be just fine in beta.
Wes, may i ask you to back this out from beta only?
Thanks.
Flags: needinfo?(wkocher)
Backed out in https://hg.mozilla.org/releases/mozilla-beta/rev/25e9aa3f174f
Setting the status back to affected. If we're not going to try again for 47, feel free to set it to wontfix.
Flags: needinfo?(wkocher)
Comment 145•9 years ago
|
||
when this landed we had an improvement in the startup network io measured in xperf:
https://treeherder.mozilla.org/perf.html#/alerts?id=1099
Comment 146•9 years ago
|
||
Comment on attachment 8744807 [details] [diff] [review]
bug_698882_aurora.patch
Reverting the flag as this was backed out of Beta47.
Attachment #8744807 -
Flags: approval-mozilla-beta+ → approval-mozilla-beta-
Comment 147•9 years ago
|
||
and with a backout we lose the improvement:
https://treeherder.mozilla.org/perf.html#/alerts?id=1128
Comment 148•9 years ago
|
||
hi, in quite early data for 48.0b1 it looks as if the crashes with shutdownhang | ... | mozilla::net::nsHttpConnectionMgr::Shutdown (bug 1158189) are rocketing up again unfortunately.
Comment 149•9 years ago
|
||
(In reply to [:philipp] from comment #148)
> hi, in quite early data for 48.0b1 it looks as if the crashes with
> shutdownhang | ... | mozilla::net::nsHttpConnectionMgr::Shutdown (bug
> 1158189) are rocketing up again unfortunately.
I was afraid of that :)
Thanks Philipp!
Comment 150•9 years ago
|
||
I have open a new bug 1279163.
You need to log in
before you can comment on or make changes to this bug.
Description
•