Closed Bug 658580 Opened 13 years ago Closed 13 years ago

IsAlive() is more cpu intensive with pr_poll than with recv(MSG_PEEK) (SSL)

Categories: Core :: Security: PSM, defect
Platform: x86_64 Linux
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla7
Tracking Status:
firefox6 + fixed
firefox7 --- fixed

People

(Reporter: mcmanus, Assigned: mcmanus)

References

Details

(Whiteboard: [concerns about add-on compat][qa-])

Attachments

(5 files)

In bug 654201, instances of SSL sockets which have not yet passed application-layer data have their IsAlive() status tested with pr_poll() instead of recv(MSG_PEEK). The poll is called with both READ and EXCEPT because of a workaround for bug 658138.

That resulted in a CPU use increase on just the Windows tp4 test. See https://bugzilla.mozilla.org/show_bug.cgi?id=654201#c52 .. the elapsed time of the test did not change - just the load.

We should figure out why. I hypothesize that recv() advanced a state machine that the poll() did not; both do some fairly expensive crypto op, but the poll() version just keeps redoing it until the state advances.
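For readers without the patch context, the two liveness probes look roughly like this - a minimal sketch of the NSPR calls involved, not the exact nsSocketTransport2.cpp code, and the helper names are made up:

    #include "prio.h"
    #include "prerror.h"

    // Old-style probe: peek one byte without consuming it. Data waiting,
    // or "would block", means the connection is still alive.
    static PRBool IsAliveViaPeek(PRFileDesc *fd)
    {
        char c;
        PRInt32 rv = PR_Recv(fd, &c, 1, PR_MSG_PEEK, PR_INTERVAL_NO_WAIT);
        return (rv > 0) ||
               (rv < 0 && PR_GetError() == PR_WOULD_BLOCK_ERROR);
    }

    // Bug 654201-style probe for SSL sockets that haven't carried
    // application data yet: ask poll() about the fd instead of driving the
    // SSL layer through recv(). READ is included alongside EXCEPT as the
    // bug 658138 workaround mentioned above.
    static PRBool IsAliveViaPoll(PRFileDesc *fd)
    {
        PRPollDesc pd;
        pd.fd = fd;
        pd.in_flags = PR_POLL_READ | PR_POLL_EXCEPT;
        pd.out_flags = 0;
        PRInt32 rv = PR_Poll(&pd, 1, PR_INTERVAL_NO_WAIT);
        // Treat error/hangup/invalid-fd indications as "dead".
        return !(rv > 0 &&
                 (pd.out_flags & (PR_POLL_ERR | PR_POLL_NVAL | PR_POLL_HUP)));
    }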
Depends on: 654201
Blocks: 658799
No longer depends on: 654201
Why would we only do this extra crypto op on Windows XP? My hypothesis, based on a quick look at the relevant code, is that we're busy-waiting in the poll case in a way that we don't do in the recv(MSG_PEEK) case.
(In reply to comment #1)
> Why would we only do this extra crypto op on Windows XP? My hypothesis,
> based on a quick look at the relevant code, is that we're busy-waiting in
> the poll case in a way that we don't do in the recv(MSG_PEEK) case.

which code? I'm pretty sure the connection mgr or the socket transport code isn't doing that... but inside/below pr_poll I'm totally unclear.
(In reply to comment #2)
> (In reply to comment #1)
> > Why would we only do this extra crypto op on Windows XP?

"why only windows" is an interesting question no matter what the root cause is..

 I don't have a good theory on that. any ideas? Maybe we're repeating something that is inherently more expensive on windows? (did I read that certificate ops are externally managed there? I'm totally guessing out of school now.. it is both xp and win7 fwiw.)
(In reply to comment #3)
> I don't have a good theory on that. any ideas? Maybe we're repeating
> something that is inherently more expensive on windows? (did I read that
> certificate ops are externally managed there? I'm totally guessing out of
> school now.. it is both xp and win7 fwiw.)

From the Talos docs: "% CPU - This metric is collected on windows only." So, we have the numbers for one platform but that doesn't mean other platforms aren't affected.

If we fix bug 658138, will this regression be fixed? If so, could you please comment in bug 658138 about which test can be used to observe the problem reliably?
(In reply to comment #4)

> From the Talos docs: "% CPU - This metric is collected on windows only."

whoa. did everybody but me know that? :) thanks.

> 
> If we fix bug 658138, will this regression be fixed? 

Unknown. I don't have any reason to think so, but it's plausible. (That READ flag doesn't bother me in theory - it just should be unnecessary, and the reviewer found it confused the code, so asked for a bug to be filed.)

> If so, could you please
> comment in bug 658138 about which test can be used to observe the problem
> reliably?

I think we can proceed on this issue without that info, as there isn't a strong suggestion that they are linked -- but you're right that I ought to provide it!

I lost that info because the READ actually seemed like the right thing to do at the time (so I moved on with my life and code) until later when the reviewer convinced me it was really a grimy workaround that needed a bug attached. 

I'll start figuring it out asap.
(In reply to comment #1)
> My hypothesis,
> based on a quick look at the relevant code, is that we're busy-waiting in
> the poll case in a way that we don't do in the recv(MSG_PEEK) case.

I put code in to just count the number of times nsSocketTransport::IsAlive() is called, both with and without the patch from 654201, and then ran Windows tp4 on try. I did this to see if the change makes us somehow spin on IsAlive() more than in the past.

Without the patch, Win7 measured 41,920 PEEKs and XP measured 40,256.

With the patch, Win7 measured 41,632 PEEKs and XP measured 40,416. Here's the weird bit - neither one recorded even 32 polls (my cutoff for reporting). I did reproduce the change in CPU usage. I'm going to rerun the test with the threshold set at 1 instead of 32, but it does make me look at the problem a little differently.
OK, this is getting more confusing instead of less.

If I revert the entire IsAlive() patch from moz-central at da44483257cf (the tip from a couple of days ago, which I will use for all these measurements) I get a 'good' tp4 CPU baseline of 58.7 on Win7 and 51.7 on XP (in all cases averaging 3 samples).

For comparison, with the entire patch applied, CPU on Win7 is 64.6 and on XP 57.0.

The bug report (from moz-central a couple of weeks ago when this landed) measured Win7 changing from 60.2 to 65.1 and XP changing from 52.1 to 56.4 - so I can capture about the same thing.

If I leave the IsAlive() patch installed, but change the IsAlive() implementation to ignore the passive parameter (i.e. s/if (aPassive)/if (0 && aPassive)/) and always do PEEK, I get more or less the baseline numbers: Win7 59.3, XP 52.6.

If I instrument the poll vs. the peek branches of IsAlive(), I consistently get ~40K peeks and extremely few poll()s - typically about 18 on XP and 1 or even 0 on Win7. Yes, 0 one time. (Which, btw, isn't really that surprising given the test case.) But that's hard to match up with the change in CPU, given that the poll case seems to be causing the problem. It should be noted that my runs that counted peek vs. poll were never the same runs that measured CPU.

I considered that it was somehow a matter of code generation - the if(0) gave the compiler a chance to just remove the poll branch altogether. I ran another test where the test for aPassive remained intact (as on the trunk), but I hard-coded the caller in nsHttpConnection to aPassive=false... that yielded 60.0 on Win7, 52.5 on XP - almost the same as the if(0) experiment.

I'm left with the conclusion that the poll() runs very infrequently but has a significant impact on the overall CPU usage, so it must be, at least under some circumstance, the cause. Yet this is weird.
Attached file gdb log
Attached is just a log of stepping through the PR_poll code path. This is actually not from tp4 - it is from just setting a breakpoint and opening https://www.mozilla.org/ .. running tp4 locally is eluding me.
It seems possible that tp4 is triggering this in nsSSLThread.cpp

              // Unfortunately we do not have a pollable event
              // that we could use to wake up the caller, as soon
              // as the previously requested I/O operation has completed.
              // Therefore we must use a kind of busy wait,
              // we want the caller to check again, whenever any
              // activity is detected on the associated socket.
              // Unfortunately this could mean, the caller will detect
              // activity very often, until we are finally done with
              // the previously requested action and are able to
              // return the buffered result.
              // As our real I/O activity is happening on the other thread
              // let's sleep some cycles, in order to not waste all CPU
              // resources.
              // But let's make sure we do not hold our shared mutex
              // while waiting, so let's leave this block first.

              want_sleep_and_wakeup_on_any_socket_activity = PR_TRUE;
              break;

I haven't quite figured out what it means, but it seems likely that _pr_poll_with_poll() in ptio.c does something suboptimal with it.
Here's a theory: want_sleep results in calls to PR_Sleep()

  if (want_sleep_and_wakeup_on_any_socket_activity)
  {
    // This is where we wait for any socket activity,
    // because we do not have a pollable event.
    // XXX Will this really cause us to wake up
    //     whatever happens?

    PR_Sleep( PR_MillisecondsToInterval(1) );
    return PR_POLL_READ | PR_POLL_WRITE | PR_POLL_EXCEPT;
  }

It certainly seems possible that a short sleep like that on some platforms turns into a busy wait.. it also seems quite arbitrary, as well as kind of pointless from the perspective of the IsAlive() stack (we'd like it to return asap).

I'll turn that sleep from 1ms into a yield and we'll at least find out what try thinks that does to the tp4 result.
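For clarity, the experiment is just this swap (a sketch; PR_Sleep with PR_INTERVAL_NO_WAIT is documented to simply yield the processor):

    // Current code in nsSSLThread.cpp:
    PR_Sleep( PR_MillisecondsToInterval(1) );

    // Experiment: give up the time slice instead of asking for a 1ms sleep,
    // which some platforms may round up or turn into something expensive.
    PR_Sleep( PR_INTERVAL_NO_WAIT );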
(In reply to comment #9)
> It seems possible that tp4 is triggering this in nsSSLThread.cpp
[..]
>               want_sleep_and_wakeup_on_any_socket_activity = PR_TRUE;

Through instrumentation I have determined this code is _not_ being executed, so it is not the source of this bug.

dang.
I decided to try to tackle this head-on with a profiler.

I set up standalone Talos, obtained tp4 to run on my Linux development machine, eventually got that going, and tried to reproduce the report using oprofile.

The report measured CPU utilization on Windows, and Talos does not calculate CPU for non-Windows platforms, but I was hoping the effect would be reproducible anyway.

Thus far I cannot make it happen. Using CLK_UNHALTED events as a stand-in for CPU used, I get very similar numbers of samples both with and without the IsAlive() patch running tp4. The profiles look similar as well.

I'll attach a couple of the profiles. I did multiple runs and they are fairly consistent.
Attached file pre-patch profile 1
As an experiment, I tried out a changeset where we don't call Poll() at all - instead I added an interface to nsISSLSocketControl which lets me test (socketInfo->isPK11LoggedOut() || socketInfo->isAlreadyShutDown()) from nsSocketTransport2.cpp without going through poll.

http://hg.mozilla.org/try/rev/d0edeeb59dd8

That still harmed tp4 Windows CPU to the same degree. Gah.

Those two functions are nothing but boolean accessors, so they aren't the direct cause.

It seems that *not* calling PEEK is having some kind of negative effect which bears investigation.
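Roughly, the experimental IsAlive() check in that changeset amounts to this (a sketch only; the exact getter names generated from the nsISSLSocketControl additions are an assumption - the try link above has the real thing):

    // nsSocketTransport2.cpp sketch: skip both PEEK and Poll() for SSL
    // sockets and just ask PSM about its own state.
    nsCOMPtr<nsISSLSocketControl> sslControl = do_QueryInterface(mSecInfo);
    if (sslControl) {
        PRBool loggedOut = PR_FALSE, shutDown = PR_FALSE;
        sslControl->IsPK11LoggedOut(&loggedOut);    // assumed getter name
        sslControl->IsAlreadyShutDown(&shutDown);   // assumed getter name
        *result = !(loggedOut || shutDown);
        return NS_OK;
    }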
An update: I'm now taking a slightly different approach here. 

I now believe that the patch for 654201, which causes this issue, is not the optimal way to fix that bug. In 654201 I reached the conclusion that calling MSG_PEEK on an SSL socket that hadn't completed its handshake could cause the socket thread to dangerously call DoPollIteration, and I worked around it by replacing the PEEK with a call to Poll(). I remained nervous about other instances of the recursion, but I fixed the PEEK issue which is the one I had introduced.

I now know that there is code in nsSocketTransportService2.cpp (line 541) in OnProcessNextEvent that is supposed to prevent DoPollIteration from being called recursively at any time, and that code was not working (see bug 661995). Rather than spot-fixing the one case as was done in 654201, the right thing to do is to fix the semaphore in OnProcessNextEvent (which can be done by just implementing the counter locally instead of relying on nsIThread); then we don't have to worry about any other recursive possibilities (and there are many). As a side effect we can back out the change from PEEK to Poll() which caused the tp4 problem, because that change is now moot.

This strategy almost works. It fixes tp4 and it also fixes the crashes associated with 654201. Unfortunately, about 1 time in 20, mochitest-other debug on OS X (the suite that failed most often with 654201) fails with timeouts in the add-on tests - the same tests that most often failed with 654201, though back then they generally crashed, and the crashes have gone away.

Essentially what I have done is fix an existing lock that wasn't working. It is possible there is existing code that is deadlocking now that wasn't deadlocking before, due to the broken lock.
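For concreteness, "implementing the counter locally" means something like the following (a sketch with an illustrative member name, not the actual patch; the approach in comment 23 below ultimately supersedes it):

    // nsSocketTransportService2.cpp sketch: only the outermost
    // OnProcessNextEvent call on the socket thread may poll.
    NS_IMETHODIMP
    nsSocketTransportService::OnProcessNextEvent(nsIThreadInternal *thread,
                                                 PRBool mayWait, PRUint32 depth)
    {
        if (mNestedEventDepth++ == 0)   // mNestedEventDepth is illustrative
            DoPollIteration(mayWait);   // only block in poll if we may wait
        return NS_OK;
    }

    NS_IMETHODIMP
    nsSocketTransportService::AfterProcessNextEvent(nsIThreadInternal *thread,
                                                    PRUint32 depth)
    {
        --mNestedEventDepth;
        return NS_OK;
    }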
Can anyone with SSL experience think of code that would

a] synchronously proxy a call to another thread (e.g. GetPreviousCert() to the main thread, or perhaps something to the PSM thread?)

and b] have that call on the other thread require network access of any sort (even DNS) in order to resolve the call?

In the past this always should have deadlocked, but it probably wouldn't have without the fix described in comment 16.
The guilty stack!

Thread 4
 0  libSystem.B.dylib + 0x506fa
    eip = 0x924aa6fa   esp = 0xb028787c   ebp = 0xb0287a88   ebx = 0x0006fe87
    esi = 0x00000000   edi = 0x03fbbce6   eax = 0x0014005d   ecx = 0xb028787c
    edx = 0x924aa6fa   efl = 0x00000286
    Found by: given as instruction pointer in context
 1  libnspr4.dylib!_pr_poll_with_poll [ptio.c:8790375bec2a : 3951 + 0x18]
    eip = 0x0006b572   esp = 0xb0287a90   ebp = 0xb0287cf8
    Found by: previous frame's frame pointer
 2  libnspr4.dylib!PR_Poll [ptio.c:8790375bec2a : 4353 + 0x18]
    eip = 0x0006b79f   esp = 0xb0287d00   ebp = 0xb0287d18   ebx = 0x0294ef80
    esi = 0x00440268
    Found by: call frame info
 3  XUL!nsSocketTransportService::Poll [nsSocketTransportService2.cpp:8790375bec2a : 425 + 0x18]
    eip = 0x0294f04a   esp = 0xb0287d20   ebp = 0xb0287d58   ebx = 0x0294ef80
    esi = 0x00440268
    Found by: call frame info
 4  XUL!nsSocketTransportService::DoPollIteration [nsSocketTransportService2.cpp:8790375bec2a : 742 + 0x18]
    eip = 0x0294fded   esp = 0xb0287d60   ebp = 0xb0287dc8   ebx = 0x0294faca
    esi = 0x00440268
    Found by: call frame info
 5  XUL!nsSocketTransportService::OnProcessNextEvent [nsSocketTransportService2.cpp:8790375bec2a : 619 + 0x12]
    eip = 0x02950129   esp = 0xb0287dd0   ebp = 0xb0287e08   ebx = 0x03fbac98
    esi = 0x00440268   edi = 0x03fbbce6
    Found by: call frame info
 6  XUL!nsThread::ProcessNextEvent [nsThread.cpp:8790375bec2a : 582 + 0x61]
    eip = 0x03fbae2b   esp = 0xb0287e10   ebp = 0xb0287eb8   ebx = 0x03fbac98
    esi = 0x00440268   edi = 0x03fbbce6
    Found by: call frame info
 7  XUL!NS_ProcessNextEvent_P [nsThreadUtils.cpp:8790375bec2a : 245 + 0x1f]
    eip = 0x03f423e0   esp = 0xb0287ec0   ebp = 0xb0287f08   ebx = 0x03f4236a
    esi = 0x0041af00   edi = 0x03fbbce6
    Found by: call frame info
 8  XUL!nsProxyEventObject::CallMethod [nsProxyEventObject.cpp:8790375bec2a : 260 + 0x12]
    eip = 0x03fc8756   esp = 0xb0287f10   ebp = 0xb0287f98   ebx = 0x03fc83b4
    esi = 0x0041af00   edi = 0x03fbbce6
    Found by: call frame info
 9  XUL!PrepareAndDispatch [xptcstubs_unixish_x86.cpp:8790375bec2a : 93 + 0x31]
    eip = 0x03fe34f2   esp = 0xb0287fa0   ebp = 0xb02880a8   ebx = 0x03fe3289
    esi = 0x00000003   edi = 0x03fc83a6
    Found by: call frame info
10  XUL!nsXPTCStubBase::Stub3 [xptcstubsdef.inc : 1 + 0x25]
    eip = 0x03fde035   esp = 0xb02880b0   ebp = 0xb02880f8   ebx = 0x03fde00e
    esi = 0xb0288194   edi = 0xb02880c0
    Found by: call frame info
11  XUL!CallGetInterface<nsIInterfaceRequestor, nsISecureBrowserUI> [nsIInterfaceRequestorUtils.h : 55 + 0x22]
    eip = 0x03a879b6   esp = 0xb0288100   ebp = 0xb0288128   ebx = 0x03a87922
    esi = 0xb0288194   edi = 0x03fcb38c
    Found by: call frame info
12  XUL!nsNSSSocketInfo::EnsureDocShellDependentStuffKnown [nsNSSIOLayer.cpp:8790375bec2a : 385 + 0x1b]
    eip = 0x03a81ae0   esp = 0xb0288130   ebp = 0xb02881d8   ebx = 0x03a819f9
    esi = 0xb0288194   edi = 0x03fcb38c
    Found by: call frame info
13  XUL!nsNSSSocketInfo::GetPreviousCert [nsNSSIOLayer.cpp:8790375bec2a : 849 + 0xa]
    eip = 0x03a81f7a   esp = 0xb02881e0   ebp = 0xb0288228   ebx = 0x03a81f24
    esi = 0x01ee9890   edi = 0x03fcb38c
    Found by: call frame info
14  XUL!HandshakeCallback [nsNSSCallbacks.cpp:8790375bec2a : 922 + 0x31]
    eip = 0x03a6c2a1   esp = 0xb0288230   ebp = 0xb0288578   ebx = 0x03a6bcd5
    esi = 0x01ee9890   edi = 0x03fcb38c
    Found by: call frame info
15  libssl3.dylib!ssl3_HandleFinished [ssl3con.c : 8501 + 0x1f]
    eip = 0x001f1956   esp = 0xb0288580   ebp = 0xb02885d8   ebx = 0x001f1180
    esi = 0x00000000   edi = 0x00000001
    Found by: call frame info
16  libssl3.dylib!ssl3_HandleHandshakeMessage [ssl3con.c : 8657 + 0x1f]
    eip = 0x001f20c6   esp = 0xb02885e0   ebp = 0xb0288658   ebx = 0x001f1975
    esi = 0x00000301   edi = 0x00000001
    Found by: call frame info
17  libssl3.dylib!ssl3_HandleHandshake [ssl3con.c : 8725 + 0x20]
    eip = 0x001f22f2   esp = 0xb0288660   ebp = 0xb0288698   ebx = 0x001f2116
    esi = 0x00000301   edi = 0x00000001
    Found by: call frame info
18  libssl3.dylib!ssl3_HandleRecord [ssl3con.c : 9064 + 0x11]
    eip = 0x001f2e59   esp = 0xb02886a0   ebp = 0xb0288798   ebx = 0x001f2568
    esi = 0x00000301   edi = 0x00000001
    Found by: call frame info
19  libssl3.dylib!ssl3_GatherCompleteHandshake [ssl3gthr.c : 209 + 0x1d]
    eip = 0x001f3fae   esp = 0xb02887a0   ebp = 0xb02887e8   ebx = 0x001f3ed1
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
20  libssl3.dylib!ssl_GatherRecord1stHandshake [sslcon.c : 1258 + 0x12]
    eip = 0x001f6c55   esp = 0xb02887f0   ebp = 0xb0288838   ebx = 0x001f6bd0
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
21  libssl3.dylib!ssl_Do1stHandshake [sslsecur.c : 151 + 0x10]
    eip = 0x00202293   esp = 0xb0288840   ebp = 0xb0288868   ebx = 0x0020204e
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
22  libssl3.dylib!ssl_SecureRecv [sslsecur.c : 1150 + 0xa]
    eip = 0x002041b5   esp = 0xb0288870   ebp = 0xb02888c8   ebx = 0x00204034
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
23  libssl3.dylib!ssl_Recv [sslsock.c : 1593 + 0x25]
    eip = 0x0020bd1d   esp = 0xb02888d0   ebp = 0xb0288908   ebx = 0x0020bc66
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
24  XUL!nsSSLThread::requestRecvMsgPeek [nsSSLThread.cpp:8790375bec2a : 208 + 0x2b]
    eip = 0x03a63c37   esp = 0xb0288910   ebp = 0xb0288968   ebx = 0x03a639da
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
25  XUL!PSMRecv [nsNSSIOLayer.cpp:8790375bec2a : 2182 + 0x26]
    eip = 0x03a7fba4   esp = 0xb0288970   ebp = 0xb02889b8   ebx = 0x03a7fafa
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
26  libnspr4.dylib!PR_Recv [priometh.c:8790375bec2a : 220 + 0x2b]
    eip = 0x0004846c   esp = 0xb02889c0   ebp = 0xb02889e8   ebx = 0x029e4855
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
27  XUL!nsSocketTransport::IsAlive [nsSocketTransport2.cpp:8790375bec2a : 1853 + 0x29]
    eip = 0x02948902   esp = 0xb02889f0   ebp = 0xb0288a58   ebx = 0x029e4855
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
28  XUL!nsHttpConnection::IsAlive [nsHttpConnection.cpp:8790375bec2a : 280 + 0x22]
    eip = 0x029e3bcb   esp = 0xb0288a60   ebp = 0xb0288a98   ebx = 0x029e4855
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
29  XUL!nsHttpConnection::CanReuse [nsHttpConnection.cpp:8790375bec2a : 245 + 0x3b]
    eip = 0x029e4892   esp = 0xb0288aa0   ebp = 0xb0288ad8   ebx = 0x029e4855
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
30  XUL!nsHttpConnectionMgr::PruneDeadConnectionsCB [nsHttpConnectionMgr.cpp:8790375bec2a : 467 + 0xa]
    eip = 0x029ec70b   esp = 0xb0288ae0   ebp = 0xb0288b58   ebx = 0x029ec672
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
31  XUL!hashEnumerate [nsHashtable.cpp:8790375bec2a : 130 + 0x23]
    eip = 0x03f612c3   esp = 0xb0288b60   ebp = 0xb0288b98   ebx = 0x03f37ef8
    esi = 0x029ec666   edi = 0x0000000c
    Found by: call frame info
32  XUL!PL_DHashTableEnumerate [pldhash.c:8790375bec2a : 754 + 0x22]
    eip = 0x03f37fd2   esp = 0xb0288ba0   ebp = 0xb0288c18   ebx = 0x03f37ef8
    esi = 0x400c7560   edi = 0x0000000c
    Found by: call frame info
33  XUL!nsHashtable::Enumerate [nsHashtable.cpp:8790375bec2a : 318 + 0x1e]
    eip = 0x03f61823   esp = 0xb0288c20   ebp = 0xb0288c48   ebx = 0x03f617da
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
34  XUL!nsHttpConnectionMgr::OnMsgPruneDeadConnections [nsHttpConnectionMgr.cpp:8790375bec2a : 1069 + 0x1e]
    eip = 0x029e9adf   esp = 0xb0288c50   ebp = 0xb0288c68   ebx = 0x029e9a82
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
35  XUL!nsHttpConnectionMgr::nsConnEvent::Run [nsHttpConnectionMgr.h:8790375bec2a : 306 + 0x60]
    eip = 0x029ecf2c   esp = 0xb0288c70   ebp = 0xb0288c98   ebx = 0x03fbac98
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
36  XUL!nsThread::ProcessNextEvent [nsThread.cpp:8790375bec2a : 618 + 0x18]
    eip = 0x03fbb016   esp = 0xb0288ca0   ebp = 0xb0288d48   ebx = 0x03fbac98
    esi = 0x00440268   edi = 0x00000000
    Found by: call frame info
37  XUL!NS_ProcessPendingEvents_P [nsThreadUtils.cpp:8790375bec2a : 195 + 0x20]
    eip = 0x03f42516   esp = 0xb0288d50   ebp = 0xb0288d98   ebx = 0x03f42490
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
38  XUL!nsSocketTransportService::Run [nsSocketTransportService2.cpp:8790375bec2a : 653 + 0x12]
    eip = 0x0294f899   esp = 0xb0288da0   ebp = 0xb0288de8   ebx = 0x0294f7ed
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
39  XUL!nsThread::ProcessNextEvent [nsThread.cpp:8790375bec2a : 618 + 0x18]
    eip = 0x03fbb016   esp = 0xb0288df0   ebp = 0xb0288e98   ebx = 0x03fbac98
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
40  XUL!NS_ProcessNextEvent_P [nsThreadUtils.cpp:8790375bec2a : 245 + 0x1f]
    eip = 0x03f423e0   esp = 0xb0288ea0   ebp = 0xb0288ee8   ebx = 0x03f4236a
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
41  XUL!nsThread::ThreadFunc [nsThread.cpp:8790375bec2a : 273 + 0x12]
    eip = 0x03fbb9e9   esp = 0xb0288ef0   ebp = 0xb0288f48   ebx = 0x03fbb8ef
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
42  libnspr4.dylib!_pt_root [ptthread.c:8790375bec2a : 187 + 0x10]
    eip = 0x0006cca9   esp = 0xb0288f50   ebp = 0xb0288f78   ebx = 0x0006cb93
    esi = 0xb0289000   edi = 0x00000000
    Found by: call frame info
43  libSystem.B.dylib + 0x32154
    eip = 0x9248c155   esp = 0xb0288f80   ebp = 0xb0288fc8   ebx = 0x9248c028
    esi = 0xb0289000   edi = 0x00000000
    Found by: call frame info
44  libSystem.B.dylib + 0x32011
    eip = 0x9248c012   esp = 0xb0288fd0   ebp = 0xb0288fec
    Found by: previous frame's frame pointer
When poll() is called it will block the socket thread until either network activity happens on one of the fds it is watching or the timeout expires.

The socket thread uses a number of different algorithms for determining the timeout, sometimes even a timeout of 0. But often the timeout is huge - and any events posted to the socket thread (e.g. adding new HTTP transactions or canceling stuck ones!) during that time will wait to be executed.

Below is just a random sample from my logs - you'll see that the timeout is often 64K seconds (most of a day). Under normal circumstances some kind of network activity happens and the poll() returns long before the timeout expires, but that isn't always going to be the case.

I believe that this timeout value is derived from the socket-timeout values at which we would report an error in OnSocketReady().. that's a fine upper bound, but we need something (potentially) smaller so as not to block event processing.

Currently, when there is no fd to check, we block for 25ms. We could make that the upper bound. (Separately, we should also be able to run other events when there is no fd to check.)

Some logs:

    timeout = 65535000 milliseconds
    ...returned after 4 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 65535000 milliseconds
    ...returned after 4 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 65535000 milliseconds
    ...returned after 3 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 65535000 milliseconds
    ...returned after 4 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 65535000 milliseconds
    ...returned after 4 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 0 milliseconds
    ...returned after 0 milliseconds
    timeout = 65535000 milliseconds
    ...returned after 8 milliseconds
(In reply to comment #19)
> when poll() is called it will block the socket thread until either network
> activity happens in one of the fd's it is watching or the timeout expires.

> the socket thread uses a number of different algorithms for determining the
> timeout, sometimes even a timeout of 0. But often the timeout is huge - and
> any events posted to the socket thread (e.g. adding new http transactions or
> canceling stuck ones!) during that time will wait to be executed. 

I misled myself - to correct my comment: what I wrote above is true, except that events posted to the socket thread run through dispatch(), which eventually calls PR_SetPollableEvent(), which in turn pokes a pipe (or something else platform-dependent) in that set of polled fds, waking up the poll even when it was given a large timeout. Given that, the large timeout is desirable.
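For anyone following along, that wakeup path is the standard NSPR pollable-event pattern, roughly like this (a sketch, not the actual nsSocketTransportService code):

    #include "prio.h"

    // Sketch: how a pollable event lets a long poll() be interrupted.
    void PollWithWakeup(PRFileDesc *socketFd, PRFileDesc *wakeupEvent)
    {
        PRPollDesc pds[2];
        pds[0].fd = wakeupEvent;              // created with PR_NewPollableEvent()
        pds[0].in_flags = PR_POLL_READ;
        pds[0].out_flags = 0;
        pds[1].fd = socketFd;                 // a real socket being watched
        pds[1].in_flags = PR_POLL_READ | PR_POLL_EXCEPT;
        pds[1].out_flags = 0;

        // The socket thread can block with a huge timeout here, because...
        PR_Poll(pds, 2, PR_SecondsToInterval(65535));

        // ...a dispatch() from any other thread calls PR_SetPollableEvent(),
        // which makes pds[0] readable and wakes the poll immediately.
        if (pds[0].out_flags & PR_POLL_READ)
            PR_WaitForPollableEvent(wakeupEvent);   // consume the wakeup
    }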
Please post the patch you are using to get those timeout values.

Are we running into the want_sleep_and_wakeup_on_any_socket_activity case?
(In reply to comment #21)
> Please post the patch you are using to get those timeout values.

those are existing LOG statements in nsSocketTransportService2.cpp, but note that in comment 20 I came to the conclusion they aren't a problem.

> 
> Are we running into the want_sleep_and_wakeup_on_any_socket_activity case?

No.. I put a crash statement in there and did not hit it.

I think the right way to fix this is comment 16 - and I had a very good night with tryserver confirming that both from a safety and efficiency point of view, so I'm ready to post the patches today.
I'd like to fix this tp4 issue with 2 patches:

1] a clean revert of the IsAlive() patch from 654201 - that will fix the cpu on tp4

2] a new patch which changes the way nsSocketTransportService::Run dispatches the poll event. Based on advice from Ben Turner, we won't run complicated stacks out of OnProcessNextEvent. Instead, DoPollIteration is called directly out of Run() and interleaved with calls to ProcessNextEvent() (see the sketch below). This ensures we won't ever find DoPollIteration anywhere but at the (almost) bottom of the stack - which is what was intended. Comment 18 shows a scenario where it is insufficient to simply repair the semaphore in OnProcessNextEvent().

The test that generally showed failures for 654201 (or for a simple repair of the semaphore) is OS X debug mochitest-other, and I have a dozen green runs of that on tryserver with this patch. CPU is back to baseline levels, and load time looks fine.
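The shape of part 2, pieced together from this description and the fragments quoted in the review below (a sketch only, not the exact patch):

    // nsSocketTransportService2.cpp sketch: poll from Run() itself, so
    // DoPollIteration can only ever appear at the (almost) bottom of the
    // socket thread's stack, interleaved with normal event processing.
    NS_IMETHODIMP
    nsSocketTransportService::Run()
    {
        nsIThread *thread = NS_GetCurrentThread();

        while (!mShuttingDown) {
            PRBool pendingEvents = PR_FALSE;
            thread->HasPendingEvents(&pendingEvents);

            do {
                // Don't block in poll when events are waiting to run.
                DoPollIteration(!pendingEvents);

                pendingEvents = PR_FALSE;
                thread->HasPendingEvents(&pendingEvents);
                if (pendingEvents) {
                    NS_ProcessNextEvent(thread);
                    pendingEvents = PR_FALSE;
                    thread->HasPendingEvents(&pendingEvents);
                }
            } while (pendingEvents);
        }
        return NS_OK;
    }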
Attachment #538497 - Flags: review?(honzab.moz)
Attachment #538498 - Flags: review?(honzab.moz)
Attachment #538498 - Flags: feedback?(cbiesinger)
Comment on attachment 538497 [details] [diff] [review]
part 1 revert 654201 isalive v1

Review of attachment 538497 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #538497 - Flags: review?(honzab.moz) → review+
Comment on attachment 538498 [details] [diff] [review]
part 2 ensure DoPollIteration on bottom of stack v1

Review of attachment 538498 [details] [diff] [review]:
-----------------------------------------------------------------

So we don't lose the link here: this is trying to solve an issue from bug 654201 comment 8.

I personally don't know of any current functionality that would be triggered on the socket thread, issue a synchronous network request, and wait for it to finish, blocking the socket thread that way.

r=honzab.

::: netwerk/base/src/nsSocketTransportService2.cpp
@@ +625,5 @@
> +        PRBool pendingEvents = PR_FALSE;
> +        thread->HasPendingEvents(&pendingEvents);
> +
> +        do {
> +            DoPollIteration(!pendingEvents);

Maybe add a comment before this call, like 

// Don't block on poll when messages are pending

...or something better phrased, to make it clearer what the argument of DoPollIteration does.

@@ +636,5 @@
> +                NS_ProcessNextEvent(thread);
> +                pendingEvents = PR_FALSE;
> +                thread->HasPendingEvents(&pendingEvents);
> +            }
> +

Probably remove this blank new-line.
Attachment #538498 - Flags: review?(honzab.moz) → review+
Thanks for the review. This is almost certainly the right thing to do, though on principle I hate to monkey with anything that low in the stack.

I'll put it on m-c for some airtime.
Whiteboard: [inbound]
Pushed:
http://hg.mozilla.org/mozilla-central/rev/3933c5e402da
http://hg.mozilla.org/mozilla-central/rev/d527ceffd806
Assignee: nobody → mcmanus
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [inbound]
Target Milestone: --- → mozilla7
Version: unspecified → Trunk
Is this something we need to take on beta? See bug 654201 comment 63. Requesting tracking to sort that out...
Whiteboard: [concerns about add-on compat]
The API is the same on {5, aurora, nightly}, and different on beta.

This patch has not caused any trouble in 32 days on nightly/aurora and applies cleanly to beta, so taking it on beta to prevent needless API churn seems good.
Attachment #538497 - Flags: approval-mozilla-beta?
Attachment #538498 - Flags: feedback?(cbiesinger) → approval-mozilla-beta?
Yes, having to special-case Firefox 6 for this API call is a major inconvenience for add-on developers and for us who handle compatibility bumps and communication.
Attachment #538497 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment on attachment 538498 [details] [diff] [review]
part 2 ensure DoPollIteration on bottom of stack v1

a=LegNeato for releases/mozilla-beta
Attachment #538498 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Blocks: 654201
Blocks: 658503
Hi.
Can you please tell me how I can verify this bug? Some STR or a test case.
Thanks.
(In reply to Vlad from comment #35)
> Hi.
> Can you please tell me how I can verify this bug? Some STR or a test case.
> Thanks.



As noted in the description, it fixes this regression:
https://bugzilla.mozilla.org/show_bug.cgi?id=654201#c52
qa-: fix verification not needed
Whiteboard: [concerns about add-on compat] → [concerns about add-on compat][qa-]