Closed Bug 658580 Opened 13 years ago Closed 13 years ago

IsAlive() is more cpu intensive with pr_poll than with recv(MSG_PEEK) (SSL)

Categories: Core :: Security: PSM, defect
Platform: x86_64 Linux
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla7
Tracking Status:
firefox6 + fixed
firefox7 --- fixed

People

(Reporter: mcmanus, Assigned: mcmanus)

References

Details

(Whiteboard: [concerns about add-on compat][qa-])

Attachments

(5 files)

In bug 654201, instances of SSL sockets which have not yet passed application-layer data have their IsAlive() status tested with pr_poll() instead of recv(MSG_PEEK). The poll is called with both READ and EXCEPT because of a workaround for bug 658138.

That resulted in a CPU use increase on just the Windows tp4 test. See https://bugzilla.mozilla.org/show_bug.cgi?id=654201#c52 .. the elapsed time of the test did not change - just the load.

We should figure out why. I hypothesize that recv() advanced a state machine that the poll() did not; both do some fairly expensive crypto op, but the poll() version just keeps redoing it until the state advances.
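For readers without the patch context, the two liveness probes look roughly like this - a minimal sketch of the NSPR calls involved, not the exact nsSocketTransport2.cpp code, and the helper names are made up:

    #include "prio.h"
    #include "prerror.h"

    // Old-style probe: peek one byte without consuming it. Data waiting,
    // or "would block", means the connection is still alive.
    static PRBool IsAliveViaPeek(PRFileDesc *fd)
    {
        char c;
        PRInt32 rv = PR_Recv(fd, &c, 1, PR_MSG_PEEK, PR_INTERVAL_NO_WAIT);
        return (rv > 0) ||
               (rv < 0 && PR_GetError() == PR_WOULD_BLOCK_ERROR);
    }

    // Bug 654201-style probe for SSL sockets that haven't carried
    // application data yet: ask poll() about the fd instead of driving the
    // SSL layer through recv(). READ is included alongside EXCEPT as the
    // bug 658138 workaround mentioned above.
    static PRBool IsAliveViaPoll(PRFileDesc *fd)
    {
        PRPollDesc pd;
        pd.fd = fd;
        pd.in_flags = PR_POLL_READ | PR_POLL_EXCEPT;
        pd.out_flags = 0;
        PRInt32 rv = PR_Poll(&pd, 1, PR_INTERVAL_NO_WAIT);
        // Treat error/hangup/invalid-fd indications as "dead".
        return !(rv > 0 &&
                 (pd.out_flags & (PR_POLL_ERR | PR_POLL_NVAL | PR_POLL_HUP)));
    }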
Depends on: 654201
Blocks: 658799
No longer depends on: 654201
Why would we only do this extra crypto op on Windows XP? My hypothesis, based on a quick look at the relevant code, is that we're busy-waiting in the poll case in a way that we don't do in the recv(MSG_PEEK) case.
(In reply to comment #1)
> Why would we only do this extra crypto op on Windows XP? My hypothesis,
> based on a quick look at the relevant code, is that we're busy-waiting in
> the poll case in a way that we don't do in the recv(MSG_PEEK) case.

which code? I'm pretty sure the connection mgr or the socket transport code isn't doing that... but inside/below pr_poll I'm totally unclear.
(In reply to comment #2)
> (In reply to comment #1)
> > Why would we only do this extra crypto op on Windows XP?

"why only windows" is an interesting question no matter what the root cause is..

 I don't have a good theory on that. any ideas? Maybe we're repeating something that is inherently more expensive on windows? (did I read that certificate ops are externally managed there? I'm totally guessing out of school now.. it is both xp and win7 fwiw.)
(In reply to comment #3)
> I don't have a good theory on that. any ideas? Maybe we're repeating
> something that is inherently more expensive on windows? (did I read that
> certificate ops are externally managed there? I'm totally guessing out of
> school now.. it is both xp and win7 fwiw.)

From the Talos docs: "% CPU - This metric is collected on windows only." So, we have the numbers for one platform but that doesn't mean other platforms aren't affected.

If we fix bug 658138, will this regression be fixed? If so, could you please comment in bug 658138 about which test can be used to observe the problem reliably?
(In reply to comment #4)

> From the Talos docs: "% CPU - This metric is collected on windows only."

whoa. did everybody but me know that? :) thanks.

> 
> If we fix bug 658138, will this regression be fixed? 

Unknown. I don't have any reason to think so, but it's plausible. (That READ flag doesn't bother me in theory - it just should be unnecessary, and the reviewer found it confused the code, so asked for a bug to be filed.)

> If so, could you please
> comment in bug 658138 about which test can be used to observe the problem
> reliably?

I think we can proceed on this issue without that info, as there isn't a strong suggestion that they are linked -- but you're right that I ought to provide it!

I lost that info because the READ actually seemed like the right thing to do at the time (so I moved on with my life and code) until later when the reviewer convinced me it was really a grimy workaround that needed a bug attached. 

I'll start figuring it out asap.
(In reply to comment #1)
> My hypothesis,
> based on a quick look at the relevant code, is that we're busy-waiting in
> the poll case in a way that we don't do in the recv(MSG_PEEK) case.

I put code in to just count the number of times nsSocketTransport::IsAlive() is called, both with and without the patch from 654201, and then ran Windows tp4 on try. I did this to see if the change makes us somehow spin on IsAlive() more than in the past.

Without the patch, Win7 measured 41,920 PEEKs and XP measured 40,256.

With the patch, Win7 measured 41,632 PEEKs and XP measured 40,416. Here's the weird bit - neither one recorded even 32 polls (my cutoff for reporting). I did reproduce the change in CPU usage. I'm going to rerun the test with the threshold set at 1 instead of 32, but it does make me look at the problem a little differently.
OK, this is getting more confusing instead of less.

If I revert the entire IsAlive() patch from moz-central at da44483257cf (the tip from a couple of days ago, which I will use for all these measurements) I get a 'good' tp4 CPU baseline of 58.7 on Win7 and 51.7 on XP (in all cases averaging 3 samples).

For comparison, with the entire patch applied, CPU on Win7 is 64.6 and on XP 57.0.

The bug report (from moz-central a couple of weeks ago when this landed) measured Win7 changing from 60.2 to 65.1 and XP changing from 52.1 to 56.4 - so I can capture about the same thing.

If I leave the IsAlive() patch installed, but change the IsAlive() implementation to ignore the passive parameter (i.e. s/if (aPassive)/if (0 && aPassive)/) and always do PEEK, I get more or less the baseline numbers: Win7 59.3, XP 52.6.

If I instrument the poll vs. the peek branches of IsAlive(), I consistently get ~40K peeks and extremely few poll()s - typically about 18 on XP and 1 or even 0 on Win7. Yes, 0 one time. (Which, btw, isn't really that surprising given the test case.) But that's hard to match up with the change in CPU, given that the poll case seems to be causing the problem. It should be noted that my runs that counted peek vs. poll were never the same runs that measured CPU.

I considered that it was somehow a matter of code generation - the if(0) gave the compiler a chance to just remove the poll branch altogether. I ran another test where the test for aPassive remained intact (as on the trunk), but I hard-coded the caller in nsHttpConnection to aPassive=false... that yielded 60.0 on Win7, 52.5 on XP - almost the same as the if(0) experiment.

I'm left with the conclusion that the poll() runs very infrequently but has a significant impact on the overall CPU usage, so it must be, at least under some circumstance, the cause. Yet this is weird.
Attached file gdb log
Attached is just a log of stepping through the PR_poll code path. This is actually not from tp4 - it is from just setting a breakpoint and opening https://www.mozilla.org/ .. running tp4 locally is eluding me.
It seems possible that tp4 is triggering this in nsSSLThread.cpp

              // Unfortunately we do not have a pollable event
              // that we could use to wake up the caller, as soon
              // as the previously requested I/O operation has completed.
              // Therefore we must use a kind of busy wait,
              // we want the caller to check again, whenever any
              // activity is detected on the associated socket.
              // Unfortunately this could mean, the caller will detect
              // activity very often, until we are finally done with
              // the previously requested action and are able to
              // return the buffered result.
              // As our real I/O activity is happening on the other thread
              // let's sleep some cycles, in order to not waste all CPU
              // resources.
              // But let's make sure we do not hold our shared mutex
              // while waiting, so let's leave this block first.

              want_sleep_and_wakeup_on_any_socket_activity = PR_TRUE;
              break;

I haven't quite figured out what it means, but it seems likely that _pr_poll_with_poll() in ptio.c does something suboptimal with it.
Here's a theory: want_sleep results in calls to PR_Sleep()

  if (want_sleep_and_wakeup_on_any_socket_activity)
  {
    // This is where we wait for any socket activity,
    // because we do not have a pollable event.
    // XXX Will this really cause us to wake up
    //     whatever happens?

    PR_Sleep( PR_MillisecondsToInterval(1) );
    return PR_POLL_READ | PR_POLL_WRITE | PR_POLL_EXCEPT;
  }

It certainly seems possible that a short sleep like that on some platforms turns into a busy wait.. it also seems quite arbitrary, as well as kind of pointless from the perspective of the IsAlive() stack (we'd like it to return asap).

I'll turn that sleep from 1ms into a yield and we'll at least find out what try thinks that does to the tp4 result.
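For clarity, the experiment is just this swap (a sketch; PR_Sleep with PR_INTERVAL_NO_WAIT is documented to simply yield the processor):

    // Current code in nsSSLThread.cpp:
    PR_Sleep( PR_MillisecondsToInterval(1) );

    // Experiment: give up the time slice instead of asking for a 1ms sleep,
    // which some platforms may round up or turn into something expensive.
    PR_Sleep( PR_INTERVAL_NO_WAIT );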
(In reply to comment #9)
> It seems possible that tp4 is triggering this in nsSSLThread.cpp
[..]
>               want_sleep_and_wakeup_on_any_socket_activity = PR_TRUE;

Through instrumentation I have determined this code is _not_ being executed, so it is not the source of this bug.

dang.
I decided to try to tackle this head-on with a profiler.

I set up standalone Talos, obtained tp4 to run on my Linux development machine, eventually got that going, and tried to reproduce the report using oprofile.

The report measured CPU utilization on Windows, and Talos does not calculate CPU for non-Windows platforms, but I was hoping the effect would be reproducible anyway.

Thus far I cannot make it happen. Using CLK_UNHALTED events as a stand-in for CPU used, I get very similar numbers of samples both with and without the IsAlive() patch running tp4. The profiles look similar as well.

I'll attach a couple of the profiles. I did multiple runs and they are fairly consistent.
Attached file pre-patch profile 1
As an experiment, I tried out a changeset where we don't call Poll() at all - instead I added an interface to nsISSLSocketControl which lets me test (socketInfo->isPK11LoggedOut() || socketInfo->isAlreadyShutDown()) from nsSocketTransport2.cpp without going through poll.

http://hg.mozilla.org/try/rev/d0edeeb59dd8

That still harmed tp4 Windows CPU to the same degree. Gah.

Those two functions are nothing but boolean accessors, so they aren't the direct cause.

It seems that *not* calling PEEK is having some kind of negative effect which bears investigation.
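Roughly, the experimental IsAlive() check in that changeset amounts to this (a sketch only; the exact getter names generated from the nsISSLSocketControl additions are an assumption - the try link above has the real thing):

    // nsSocketTransport2.cpp sketch: skip both PEEK and Poll() for SSL
    // sockets and just ask PSM about its own state.
    nsCOMPtr<nsISSLSocketControl> sslControl = do_QueryInterface(mSecInfo);
    if (sslControl) {
        PRBool loggedOut = PR_FALSE, shutDown = PR_FALSE;
        sslControl->IsPK11LoggedOut(&loggedOut);    // assumed getter name
        sslControl->IsAlreadyShutDown(&shutDown);   // assumed getter name
        *result = !(loggedOut || shutDown);
        return NS_OK;
    }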
An update: I'm now taking a slightly different approach here. 

I now believe that the patch for 654201, which causes this issue, is not the optimal way to fix that bug. In 654201 I reached the conclusion that calling MSG_PEEK on an SSL socket that hadn't completed its handshake could cause the socket thread to dangerously call DoPollIteration, and I worked around it by replacing the PEEK with a call to Poll(). I remained nervous about other instances of the recursion, but I fixed the PEEK issue which is the one I had introduced.

I now know that there is code in nsSocketTransportService2.cpp (line 541) in OnProcessNextEvent that is supposed to prevent DoPollIteration from being called recursively at any time, and that code was not working (see bug 661995). Rather than spot-fixing the one case as was done in 654201, the right thing to do is to fix the semaphore in OnProcessNextEvent (which can be done by just implementing the counter locally instead of relying on nsIThread); then we don't have to worry about any other recursive possibilities (and there are many). As a side effect we can back out the change from PEEK to Poll() which caused the tp4 problem, because that change is now moot.

This strategy almost works. It fixes tp4 and it also fixes the crashes associated with 654201. Unfortunately, about 1 time in 20, mochitest-other debug on OS X (the suite that failed most often with 654201) fails with timeouts in the add-on tests - the same tests that most often failed with 654201, though back then they generally crashed, and the crashes have gone away.

Essentially what I have done is fix an existing lock that wasn't working. It is possible there is existing code that is deadlocking now that wasn't deadlocking before, due to the broken lock.
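For concreteness, "implementing the counter locally" means something like the following (a sketch with an illustrative member name, not the actual patch; the approach in comment 23 below ultimately supersedes it):

    // nsSocketTransportService2.cpp sketch: only the outermost
    // OnProcessNextEvent call on the socket thread may poll.
    NS_IMETHODIMP
    nsSocketTransportService::OnProcessNextEvent(nsIThreadInternal *thread,
                                                 PRBool mayWait, PRUint32 depth)
    {
        if (mNestedEventDepth++ == 0)   // mNestedEventDepth is illustrative
            DoPollIteration(mayWait);   // only block in poll if we may wait
        return NS_OK;
    }

    NS_IMETHODIMP
    nsSocketTransportService::AfterProcessNextEvent(nsIThreadInternal *thread,
                                                    PRUint32 depth)
    {
        --mNestedEventDepth;
        return NS_OK;
    }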
Can anyone with SSL experience think of code that would

a] synchronously proxy a call to another thread (e.g. GetPreviousCert() to the main thread, or perhaps something to the PSM thread?)

and b] have that call on the other thread require network access of any sort (even DNS) in order to resolve the call?

In the past this always should have deadlocked, but it probably wouldn't have without the fix described in comment 16.
The guilty stack!

Thread 4
 0  libSystem.B.dylib + 0x506fa
    eip = 0x924aa6fa   esp = 0xb028787c   ebp = 0xb0287a88   ebx = 0x0006fe87
    esi = 0x00000000   edi = 0x03fbbce6   eax = 0x0014005d   ecx = 0xb028787c
    edx = 0x924aa6fa   efl = 0x00000286
    Found by: given as instruction pointer in context
 1  libnspr4.dylib!_pr_poll_with_poll [ptio.c:8790375bec2a : 3951 + 0x18]
    eip = 0x0006b572   esp = 0xb0287a90   ebp = 0xb0287cf8
    Found by: previous frame's frame pointer
 2  libnspr4.dylib!PR_Poll [ptio.c:8790375bec2a : 4353 + 0x18]
    eip = 0x0006b79f   esp = 0xb0287d00   ebp = 0xb0287d18   ebx = 0x0294ef80
    esi = 0x00440268
    Found by: call frame info
 3  XUL!nsSocketTransportService::Poll [nsSocketTransportService2.cpp:8790375bec2a : 425 + 0x18]
    eip = 0x0294f04a   esp = 0xb0287d20   ebp = 0xb0287d58   ebx = 0x0294ef80
    esi = 0x00440268
    Found by: call frame info
 4  XUL!nsSocketTransportService::DoPollIteration [nsSocketTransportService2.cpp:8790375bec2a : 742 + 0x18]
    eip = 0x0294fded   esp = 0xb0287d60   ebp = 0xb0287dc8   ebx = 0x0294faca
    esi = 0x00440268
    Found by: call frame info
 5  XUL!nsSocketTransportService::OnProcessNextEvent [nsSocketTransportService2.cpp:8790375bec2a : 619 + 0x12]
    eip = 0x02950129   esp = 0xb0287dd0   ebp = 0xb0287e08   ebx = 0x03fbac98
    esi = 0x00440268   edi = 0x03fbbce6
    Found by: call frame info
 6  XUL!nsThread::ProcessNextEvent [nsThread.cpp:8790375bec2a : 582 + 0x61]
    eip = 0x03fbae2b   esp = 0xb0287e10   ebp = 0xb0287eb8   ebx = 0x03fbac98
    esi = 0x00440268   edi = 0x03fbbce6
    Found by: call frame info
 7  XUL!NS_ProcessNextEvent_P [nsThreadUtils.cpp:8790375bec2a : 245 + 0x1f]
    eip = 0x03f423e0   esp = 0xb0287ec0   ebp = 0xb0287f08   ebx = 0x03f4236a
    esi = 0x0041af00   edi = 0x03fbbce6
    Found by: call frame info
 8  XUL!nsProxyEventObject::CallMethod [nsProxyEventObject.cpp:8790375bec2a : 260 + 0x12]
    eip = 0x03fc8756   esp = 0xb0287f10   ebp = 0xb0287f98   ebx = 0x03fc83b4
    esi = 0x0041af00   edi = 0x03fbbce6
    Found by: call frame info
 9  XUL!PrepareAndDispatch [xptcstubs_unixish_x86.cpp:8790375bec2a : 93 + 0x31]
    eip = 0x03fe34f2   esp = 0xb0287fa0   ebp = 0xb02880a8   ebx = 0x03fe3289
    esi = 0x00000003   edi = 0x03fc83a6
    Found by: call frame info
10  XUL!nsXPTCStubBase::Stub3 [xptcstubsdef.inc : 1 + 0x25]
    eip = 0x03fde035   esp = 0xb02880b0   ebp = 0xb02880f8   ebx = 0x03fde00e
    esi = 0xb0288194   edi = 0xb02880c0
    Found by: call frame info
11  XUL!CallGetInterface<nsIInterfaceRequestor, nsISecureBrowserUI> [nsIInterfaceRequestorUtils.h : 55 + 0x22]
    eip = 0x03a879b6   esp = 0xb0288100   ebp = 0xb0288128   ebx = 0x03a87922
    esi = 0xb0288194   edi = 0x03fcb38c
    Found by: call frame info
12  XUL!nsNSSSocketInfo::EnsureDocShellDependentStuffKnown [nsNSSIOLayer.cpp:8790375bec2a : 385 + 0x1b]
    eip = 0x03a81ae0   esp = 0xb0288130   ebp = 0xb02881d8   ebx = 0x03a819f9
    esi = 0xb0288194   edi = 0x03fcb38c
    Found by: call frame info
13  XUL!nsNSSSocketInfo::GetPreviousCert [nsNSSIOLayer.cpp:8790375bec2a : 849 + 0xa]
    eip = 0x03a81f7a   esp = 0xb02881e0   ebp = 0xb0288228   ebx = 0x03a81f24
    esi = 0x01ee9890   edi = 0x03fcb38c
    Found by: call frame info
14  XUL!HandshakeCallback [nsNSSCallbacks.cpp:8790375bec2a : 922 + 0x31]
    eip = 0x03a6c2a1   esp = 0xb0288230   ebp = 0xb0288578   ebx = 0x03a6bcd5
    esi = 0x01ee9890   edi = 0x03fcb38c
    Found by: call frame info
15  libssl3.dylib!ssl3_HandleFinished [ssl3con.c : 8501 + 0x1f]
    eip = 0x001f1956   esp = 0xb0288580   ebp = 0xb02885d8   ebx = 0x001f1180
    esi = 0x00000000   edi = 0x00000001
    Found by: call frame info
16  libssl3.dylib!ssl3_HandleHandshakeMessage [ssl3con.c : 8657 + 0x1f]
    eip = 0x001f20c6   esp = 0xb02885e0   ebp = 0xb0288658   ebx = 0x001f1975
    esi = 0x00000301   edi = 0x00000001
    Found by: call frame info
17  libssl3.dylib!ssl3_HandleHandshake [ssl3con.c : 8725 + 0x20]
    eip = 0x001f22f2   esp = 0xb0288660   ebp = 0xb0288698   ebx = 0x001f2116
    esi = 0x00000301   edi = 0x00000001
    Found by: call frame info
18  libssl3.dylib!ssl3_HandleRecord [ssl3con.c : 9064 + 0x11]
    eip = 0x001f2e59   esp = 0xb02886a0   ebp = 0xb0288798   ebx = 0x001f2568
    esi = 0x00000301   edi = 0x00000001
    Found by: call frame info
19  libssl3.dylib!ssl3_GatherCompleteHandshake [ssl3gthr.c : 209 + 0x1d]
    eip = 0x001f3fae   esp = 0xb02887a0   ebp = 0xb02887e8   ebx = 0x001f3ed1
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
20  libssl3.dylib!ssl_GatherRecord1stHandshake [sslcon.c : 1258 + 0x12]
    eip = 0x001f6c55   esp = 0xb02887f0   ebp = 0xb0288838   ebx = 0x001f6bd0
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
21  libssl3.dylib!ssl_Do1stHandshake [sslsecur.c : 151 + 0x10]
    eip = 0x00202293   esp = 0xb0288840   ebp = 0xb0288868   ebx = 0x0020204e
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
22  libssl3.dylib!ssl_SecureRecv [sslsecur.c : 1150 + 0xa]
    eip = 0x002041b5   esp = 0xb0288870   ebp = 0xb02888c8   ebx = 0x00204034
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
23  libssl3.dylib!ssl_Recv [sslsock.c : 1593 + 0x25]
    eip = 0x0020bd1d   esp = 0xb02888d0   ebp = 0xb0288908   ebx = 0x0020bc66
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
24  XUL!nsSSLThread::requestRecvMsgPeek [nsSSLThread.cpp:8790375bec2a : 208 + 0x2b]
    eip = 0x03a63c37   esp = 0xb0288910   ebp = 0xb0288968   ebx = 0x03a639da
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
25  XUL!PSMRecv [nsNSSIOLayer.cpp:8790375bec2a : 2182 + 0x26]
    eip = 0x03a7fba4   esp = 0xb0288970   ebp = 0xb02889b8   ebx = 0x03a7fafa
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
26  libnspr4.dylib!PR_Recv [priometh.c:8790375bec2a : 220 + 0x2b]
    eip = 0x0004846c   esp = 0xb02889c0   ebp = 0xb02889e8   ebx = 0x029e4855
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
27  XUL!nsSocketTransport::IsAlive [nsSocketTransport2.cpp:8790375bec2a : 1853 + 0x29]
    eip = 0x02948902   esp = 0xb02889f0   ebp = 0xb0288a58   ebx = 0x029e4855
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
28  XUL!nsHttpConnection::IsAlive [nsHttpConnection.cpp:8790375bec2a : 280 + 0x22]
    eip = 0x029e3bcb   esp = 0xb0288a60   ebp = 0xb0288a98   ebx = 0x029e4855
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
29  XUL!nsHttpConnection::CanReuse [nsHttpConnection.cpp:8790375bec2a : 245 + 0x3b]
    eip = 0x029e4892   esp = 0xb0288aa0   ebp = 0xb0288ad8   ebx = 0x029e4855
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
30  XUL!nsHttpConnectionMgr::PruneDeadConnectionsCB [nsHttpConnectionMgr.cpp:8790375bec2a : 467 + 0xa]
    eip = 0x029ec70b   esp = 0xb0288ae0   ebp = 0xb0288b58   ebx = 0x029ec672
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
31  XUL!hashEnumerate [nsHashtable.cpp:8790375bec2a : 130 + 0x23]
    eip = 0x03f612c3   esp = 0xb0288b60   ebp = 0xb0288b98   ebx = 0x03f37ef8
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
32  XUL!PL_DHashTableEnumerate [pldhash.c:8790375bec2a : 754 + 0x22]
    eip = 0x03f37fd2   esp = 0xb0288ba0   ebp = 0xb0288c18   ebx = 0x03f37ef8
    esi = 0x400c7560   edi = 0x0000000c
    Found by: call frame info
33  XUL!nsHashtable::Enumerate [nsHashtable.cpp:8790375bec2a : 318 + 0x1e]
    eip = 0x03f61823   esp = 0xb0288c20   ebp = 0xb0288c48   ebx = 0x03f617da
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
34  XUL!nsHttpConnectionMgr::OnMsgPruneDeadConnections [nsHttpConnectionMgr.cpp:8790375bec2a : 1069 + 0x1e]
    eip = 0x029e9adf   esp = 0xb0288c50   ebp = 0xb0288c68   ebx = 0x029e9a82
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
35  XUL!nsHttpConnectionMgr::nsConnEvent::Run [nsHttpConnectionMgr.h:8790375bec2a : 306 + 0x60]
    eip = 0x029ecf2c   esp = 0xb0288c70   ebp = 0xb0288c98   ebx = 0x03fbac98
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
36  XUL!nsThread::ProcessNextEvent [nsThread.cpp:8790375bec2a : 618 + 0x18]
    eip = 0x03fbb016   esp = 0xb0288ca0   ebp = 0xb0288d48   ebx = 0x03fbac98
    esi = 0x00440268   edi = 0x00000000
    Found by: call frame info
37  XUL!NS_ProcessPendingEvents_P [nsThreadUtils.cpp:8790375bec2a : 195 + 0x20]
    eip = 0x03f42516   esp = 0xb0288d50   ebp = 0xb0288d98   ebx = 0x03f42490
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
38  XUL!nsSocketTransportService::Run [nsSocketTransportService2.cpp:8790375bec2a : 653 + 0x12]
    eip = 0x0294f899   esp = 0xb0288da0   ebp = 0xb0288de8   ebx = 0x0294f7ed
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
39  XUL!nsThread::ProcessNextEvent [nsThread.cpp:8790375bec2a : 618 + 0x18]
    eip = 0x03fbb016   esp = 0xb0288df0   ebp = 0xb0288e98   ebx = 0x03fbac98
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
40  XUL!NS_ProcessNextEvent_P [nsThreadUtils.cpp:8790375bec2a : 245 + 0x1f]
    eip = 0x03f423e0   esp = 0xb0288ea0   ebp = 0xb0288ee8   ebx = 0x03f4236a
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
41  XUL!nsThread::ThreadFunc [nsThread.cpp:8790375bec2a : 273 + 0x12]
    eip = 0x03fbb9e9   esp = 0xb0288ef0   ebp = 0xb0288f48   ebx = 0x03fbb8ef
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
42  libnspr4.dylib!_pt_root [ptthread.c:8790375bec2a : 187 + 0x10]
    eip = 0x0006cca9   esp = 0xb0288f50   ebp = 0xb0288f78   ebx = 0x0006cb93
    esi = 0xb0289000   edi = 0x00000000
    Found by: call frame info
43  libSystem.B.dylib + 0x32154
    eip = 0x9248c155   esp = 0xb0288f80   ebp = 0xb0288fc8   ebx = 0x9248c028
    esi = 0xb0289000   edi = 0x00000000
    Found by: call frame info
44  libSystem.B.dylib + 0x32011
    eip = 0x9248c012   esp = 0xb0288fd0   ebp = 0xb0288fec
    Found by: previous frame's frame pointer
When poll() is called it will block the socket thread until either network activity happens on one of the fds it is watching or the timeout expires.

The socket thread uses a number of different algorithms for determining the timeout, sometimes even a timeout of 0. But often the timeout is huge - and any events posted to the socket thread (e.g. adding new HTTP transactions or canceling stuck ones!) during that time will wait to be executed.

Below is just a random sample from my logs - you'll see that the timeout is often 64K seconds (most of a day). Under normal circumstances some kind of network activity happens and the poll() returns long before the timeout expires, but that isn't always going to be the case.

I believe that this timeout value is derived from the socket-timeout values at which we would report an error in OnSocketReady().. that's a fine upper bound, but we need something (potentially) smaller so as not to block event processing.

Currently, when there is no fd to check, we block for 25ms. We could make that the upper bound. (Separately, we should also be able to run other events when there is no fd to check.)

Some logs:

    timeout = 65535000 milliseconds
    ...returned after 4 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 65535000 milliseconds
    ...returned after 4 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 65535000 milliseconds
    ...returned after 3 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 65535000 milliseconds
    ...returned after 4 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 65535000 milliseconds
    ...returned after 4 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 65535000 milliseconds
    ...returned after 8 milliseconds
(In reply to comment #19)
> when poll() is called it will block the socket thread until either network
> activity happens in one of the fd's it is watching or the timeout expires.

> the socket thread uses a number of different algorithms for determining the
> timeout, sometimes even a timeout of 0. But often the timeout is huge - and
> any events posted to the socket thread (e.g. adding new http transactions or
> canceling stuck ones!) during that time will wait to be executed. 

I misled myself - to correct my comment: what I wrote above is true, except that events posted to the socket thread run through dispatch(), which eventually calls PR_SetPollableEvent(), which in turn pokes a pipe (or something else platform-dependent) in that set of polled fds, waking up the poll even when it was given a large timeout. Given that, the large timeout is desirable.
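For anyone following along, that wakeup path is the standard NSPR pollable-event pattern, roughly like this (a sketch, not the actual nsSocketTransportService code):

    #include "prio.h"

    // Sketch: how a pollable event lets a long poll() be interrupted.
    void PollWithWakeup(PRFileDesc *socketFd, PRFileDesc *wakeupEvent)
    {
        PRPollDesc pds[2];
        pds[0].fd = wakeupEvent;              // created with PR_NewPollableEvent()
        pds[0].in_flags = PR_POLL_READ;
        pds[0].out_flags = 0;
        pds[1].fd = socketFd;                 // a real socket being watched
        pds[1].in_flags = PR_POLL_READ | PR_POLL_EXCEPT;
        pds[1].out_flags = 0;

        // The socket thread can block with a huge timeout here, because...
        PR_Poll(pds, 2, PR_SecondsToInterval(65535));

        // ...a dispatch() from any other thread calls PR_SetPollableEvent(),
        // which makes pds[0] readable and wakes the poll immediately.
        if (pds[0].out_flags & PR_POLL_READ)
            PR_WaitForPollableEvent(wakeupEvent);   // consume the wakeup
    }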
Please post the patch you are using to get those timeout values.

Are we running into the want_sleep_and_wakeup_on_any_socket_activity case?
(In reply to comment #21)
> Please post the patch you are using to get those timeout values.

those are existing LOG statements in nsSocketTransportService2.cpp, but note that in comment 20 I came to the conclusion they aren't a problem.

> 
> Are we running into the want_sleep_and_wakeup_on_any_socket_activity case?

No.. I put a crash statement in there and did not hit it.

I think the right way to fix this is comment 16 - and I had a very good night with tryserver confirming that both from a safety and efficiency point of view, so I'm ready to post the patches today.
I'd like to fix this tp4 issue with 2 patches:

1] a clean revert of the IsAlive() patch from 654201 - that will fix the cpu on tp4

2] a new patch which changes the way nsSocketTransportService::Run dispatches the poll event. Based on advice from Ben Turner, we won't run complicated stacks out of OnProcessNextEvent. Instead, DoPollIteration is called directly out of Run() and interleaved with calls to ProcessNextEvent() (see the sketch below). This ensures we won't ever find DoPollIteration anywhere but at the (almost) bottom of the stack - which is what was intended. Comment 18 shows a scenario where it is insufficient to simply repair the semaphore in OnProcessNextEvent().

The test that generally showed failures for 654201 (or for a simple repair of the semaphore) is OS X debug mochitest-other, and I have a dozen green runs of that on tryserver with this patch. CPU is back to baseline levels, and load time looks fine.
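The shape of part 2, pieced together from this description and the fragments quoted in the review below (a sketch only, not the exact patch):

    // nsSocketTransportService2.cpp sketch: poll from Run() itself, so
    // DoPollIteration can only ever appear at the (almost) bottom of the
    // socket thread's stack, interleaved with normal event processing.
    NS_IMETHODIMP
    nsSocketTransportService::Run()
    {
        nsIThread *thread = NS_GetCurrentThread();

        while (!mShuttingDown) {
            PRBool pendingEvents = PR_FALSE;
            thread->HasPendingEvents(&pendingEvents);

            do {
                // Don't block in poll when events are waiting to run.
                DoPollIteration(!pendingEvents);

                pendingEvents = PR_FALSE;
                thread->HasPendingEvents(&pendingEvents);
                if (pendingEvents) {
                    NS_ProcessNextEvent(thread);
                    pendingEvents = PR_FALSE;
                    thread->HasPendingEvents(&pendingEvents);
                }
            } while (pendingEvents);
        }
        return NS_OK;
    }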
Attachment #538497 - Flags: review?(honzab.moz)
Attachment #538498 - Flags: review?(honzab.moz)
Attachment #538498 - Flags: feedback?(cbiesinger)
Comment on attachment 538497 [details] [diff] [review]
part 1 revert 654201 isalive v1

Review of attachment 538497 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #538497 - Flags: review?(honzab.moz) → review+
Comment on attachment 538498 [details] [diff] [review]
part 2 ensure DoPollIteration on bottom of stack v1

Review of attachment 538498 [details] [diff] [review]:
-----------------------------------------------------------------

So we don't lose the link here: this is trying to solve an issue from bug 654201 comment 8.

I personally don't know of any current functionality that would be triggered on the socket thread, issue a synchronous network request, and wait for it to finish, blocking the socket thread that way.

r=honzab.

::: netwerk/base/src/nsSocketTransportService2.cpp
@@ +625,5 @@
> +        PRBool pendingEvents = PR_FALSE;
> +        thread->HasPendingEvents(&pendingEvents);
> +
> +        do {
> +            DoPollIteration(!pendingEvents);

Maybe add a comment before this call, like 

// Don't block on poll when messages are pending

...or something better phrased, to make it clearer what the argument of DoPollIteration does.

@@ +636,5 @@
> +                NS_ProcessNextEvent(thread);
> +                pendingEvents = PR_FALSE;
> +                thread->HasPendingEvents(&pendingEvents);
> +            }
> +

Probably remove this blank new-line.
Attachment #538498 - Flags: review?(honzab.moz) → review+
Thanks for the review. This is almost certainly the right thing to do, though on principle I hate to monkey with anything that low in the stack.

I'll put it on m-c for some airtime.
Whiteboard: [inbound]
Pushed:
http://hg.mozilla.org/mozilla-central/rev/3933c5e402da
http://hg.mozilla.org/mozilla-central/rev/d527ceffd806
Assignee: nobody → mcmanus
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [inbound]
Target Milestone: --- → mozilla7
Version: unspecified → Trunk
Is this something we need to take on beta? See bug 654201 comment 63. Requesting tracking to sort that out...
Whiteboard: [concerns about add-on compat]
The API is the same on {5, aurora, nightly}, and different on beta.

This patch has not caused any trouble in 32 days on nightly/aurora and applies cleanly to beta, so taking it on beta to prevent needless API churn seems good.
Attachment #538497 - Flags: approval-mozilla-beta?
Attachment #538498 - Flags: feedback?(cbiesinger) → approval-mozilla-beta?
Yes, having to special-case Firefox 6 for this API call is a major inconvenience for add-on developers and for us who handle compatibility bumps and communication.
Attachment #538497 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment on attachment 538498 [details] [diff] [review]
part 2 ensure DoPollIteration on bottom of stack v1

a=LegNeato for releases/mozilla-beta
Attachment #538498 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Blocks: 654201
Blocks: 658503
Hi.
Can you please tell me how I can verify this bug? Some STR or a test case.
Thanks.
(In reply to Vlad from comment #35)
> Hi.
> Can you please tell me how I can verify this bug? Some STR or a test case.
> Thanks.



As noted in the description, it fixes this regression:
https://bugzilla.mozilla.org/show_bug.cgi?id=654201#c52
qa-: fix verification not needed
Whiteboard: [concerns about add-on compat] → [concerns about add-on compat][qa-]