Closed
Bug 1119160
Opened 9 years ago
Closed 3 years ago
[Mac 10.9] Socket transport service thread pegs the CPU spinning when system error happened in Poll
Categories
(Core :: Networking, defect, P3)
Tracking
()
RESOLVED
INCOMPLETE
People
(Reporter: xzxing, Unassigned, NeedInfo)
Details
(Keywords: hang, perf, Whiteboard: [necko-backlog])
Attachments
(7 files, 1 obsolete file)
User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 Build ID: 20141126041045 Steps to reproduce: We use XULRunnerV10.0.6esr as embedded browser, and recently found a severe CPU Spinning higher than 100% issue on Mac 10.9 which will cause the whole networking blocked, no websites can be opened any more. From several sample process file of the spinning process, we found that when CPU spinning happened, nsSocketTransportService::DoPollIteration() -> PR_Poll() always met the error from libsystem_c.dylib.. : | + 9 PR_Poll (in libnspr4.dylib) + 482 [0x1180bcf22] + ! : | + ! 9 __error (in libsystem_c.dylib) + 1,0,... [0x7fff98ab5365,0x7fff98ab5364,...] In order to try to reproduce the issue, I have modified "netwerk/base/src/nsSocketTransportService2.cpp/nsSocketTransportService::DoPollIteration(bool wait)" function as below to force "Poll(wait, &pollInterval);" to return n < 0, then the issue can be reproduced all the time. nsresult nsSocketTransportService::DoPollIteration(bool wait) { ... int32_t n = Poll(wait, &pollInterval); // xzxing n = -1; // end of xzxing if (n < 0) { SOCKET_LOG((" PR_Poll error [%d]\n", PR_GetError())); } else { ... } return NS_OK; } Actual results: It seems like when error happens in Poll(), there was no error handle at all in nsSocketTransportService::DoPollIteration() or in nsSocketTransportService::Run() which called DoPollIteration(), as a result, if there is some system error happened, Poll() will return error everytime to cause nsSocketTransportService thread pegs the CPU spinning.. I have verified in the latest Firefox 34.05 build, and the same issue should also exist. Expected results: The expect behavior: there should be some error handle mechanism when poll return errors, instead of keeping nsSocketTransportService thread pegs the CPU spinning.. I have attached the snapshot of process sample files when CPU spinning happend, and the screenshot of the busy thread taking by Instrument's Time profile for your reference. I noticed that the related code in nsSocketTransportService2.cpp had been existed in there for releases without any bug report, and in fact, it is hard to reproduce the issue since I am not sure when the system error happened when doing Poll/select, but when it happened, the CPU keep spike and spinning, and the whole network service blocked which only close the entire application can resolve.
Reporter | ||
Updated•9 years ago
|
Reporter | ||
Comment 1•9 years ago
|
||
Reporter | ||
Comment 2•9 years ago
|
||
Reporter | ||
Comment 3•9 years ago
|
||
Attachment #8545807 -
Attachment is obsolete: true
Reporter | ||
Comment 4•9 years ago
|
||
Reporter | ||
Comment 5•9 years ago
|
||
Reporter | ||
Comment 6•9 years ago
|
||
We have got many CPU spinning report recently, exspecially on Mac 10.9. Hope you can give some help and guidance here.
Comment 7•9 years ago
|
||
I think this bug is one of the CPU hogs I see on Mac in Firefox as well. Stack looks familiar. Doug?
Updated•9 years ago
|
Assignee: nobody → dougt
Yeah, I'll take a look.
Assignee: dougt → hurley
Flags: needinfo?(hurley)
Comment 10•9 years ago
|
||
So I can definitely reproduce with the code changes above (I have not yet seen this in regular use on my machine), but that's not a particularly interesting case. We're doing what we should do if poll returns -1 and we don't know why, which is try again. What would be more interesting is to find out what errno is after the call to poll fails, as that will tell us *why* poll is failing, and if there's anything we can do (for example, some values of errno from poll indicate that we messed up, which would be our bug, while others indicate that something went wrong at a lower layer we have no control over, in which case our only option is to try again). What I would need to see to get any further on this is the results of both PR_GetError and PR_GetOSError being called immediately after PR_Poll returns (which is in nsSocketTransportService2::Poll). xzxing, can you provide that information from an in-practice failure (in other words, not the failure you synthesized in comment 0)?
Flags: needinfo?(xzxing)
Comment 11•9 years ago
|
||
xzxing - one other question - do your network conditions happen to have changed just before seeing this behavior in normal practice? (switched from one wifi network to another, wifi->wired, changed the os x "network location" in settings?)
Reporter | ||
Comment 12•9 years ago
|
||
(In reply to Nicholas Hurley [:hurley] from comment #10) > So I can definitely reproduce with the code changes above (I have not yet > seen this in regular use on my machine), but that's not a particularly > interesting case. We're doing what we should do if poll returns -1 and we > don't know why, which is try again. What would be more interesting is to > find out what errno is after the call to poll fails, as that will tell us > *why* poll is failing, and if there's anything we can do (for example, some > values of errno from poll indicate that we messed up, which would be our > bug, while others indicate that something went wrong at a lower layer we > have no control over, in which case our only option is to try again). > > What I would need to see to get any further on this is the results of both > PR_GetError and PR_GetOSError being called immediately after PR_Poll returns > (which is in nsSocketTransportService2::Poll). xzxing, can you provide that > information from an in-practice failure (in other words, not the failure you > synthesized in comment 0)? Hi Nick, thanks for your help to look into this. Sorry currently I don't have the specific errno when poll fails, since it need to enable the NSPR log and run our product from command line. Most of the CPU spinning was reported by internal users, and I only reproduce twice by chance, but I could not reproduce the issue for quite a time.. I will try to contact with the internal users and ask for their help to enable the NSPR log and try to reproduce to see if we can get the useful errno. But as I mentioned, we CAN NOT find a regular way to make it stably reproduced, sometimes it just happened, mostly when access https websites, or under bad network connnection, but with NO guaranty when it will happen.. Firstly, I was suspect it was the issue as https://bugzilla.mozilla.org/show_bug.cgi?id=710176, since the symptoms is very alike, however when I look into the XULRunnerV10.0.6.esr source code with 710176's patch then found out that they are different, 710176 is based on XULRunnerV11 which had remove nsSSLThread. Then I look into several sample process file which taken when CPU spinning happened, and find out that each of this file had a common point is that there is an __error happend after poll/select, that's why I modify to source code to force PR_Poll return -1 to reproduce the same symptoms with the issue. And could you help to confirm the way the get the errno? 1. enable the NSPR Log export NSPR_LOG_MODULES=nsSocketTransport:4 export NSPR_LOG_FILE=/Mozilla/log 2. Run application from command line /Application/***.app/Contents/MacOS/*** 3. get the NSPR log in /Mozilla/log file and find "PR_Poll error" in the log if we reproduce the issue, since there is a log record when error happened in nsSocketTransportService2.cpp/DoPollIteration().. PRInt32 n = Poll(wait, &pollInterval); if (n < 0) { SOCKET_LOG((" PR_Poll error [%d]\n", PR_GetError())); } else { ...
Flags: needinfo?(xzxing)
Reporter | ||
Comment 13•9 years ago
|
||
(In reply to Nicholas Hurley [:hurley] from comment #11) > xzxing - one other question - do your network conditions happen to have > changed just before seeing this behavior in normal practice? (switched from > one wifi network to another, wifi->wired, changed the os x "network > location" in settings?) The network condition change might be a cause since I remember when I was reproduced the issue, my Mac machine was wake up from sleep. And some internal users reported the issue when they use Mac machine under bad network connection(eg, work from home via VPN) or move wired into wifi. So I think your suspection make sense. However, like I said, there is NO regular way to make it reproduce stably which is sad :(
Comment 14•9 years ago
|
||
It's unfortunate that there's no good reproducability, even with a network change. That will make this one much harder to track down. (In reply to xzxing@cn.ibm.com from comment #12) > And could you help to confirm the way the get the errno? > > 1. enable the NSPR Log > export NSPR_LOG_MODULES=nsSocketTransport:4 > export NSPR_LOG_FILE=/Mozilla/log This is correct, though you should really have the "timestamp" module in NSPR_LOG_MODULES as well, so your export would be export NSPR_LOG_MODULES=timestamp,nsSocketTransport:5 > 2. Run application from command line > /Application/***.app/Contents/MacOS/*** > > 3. get the NSPR log in /Mozilla/log file and find "PR_Poll error" in the log > if we reproduce the issue, since there is a log record when error happened > in nsSocketTransportService2.cpp/DoPollIteration().. > > PRInt32 n = Poll(wait, &pollInterval); > if (n < 0) { > SOCKET_LOG((" PR_Poll error [%d]\n", PR_GetError())); > } > else { > ... Correct. If you have the ability to recompile locally, it might be good to modify that SOCKET_LOG to also report PR_GetOSError, so it would look like SOCKET_LOG((" PR_Poll error [%d] os error [%d]\n", PR_GetError(), PR_GetOSError()));
Comment 15•9 years ago
|
||
nick, we should put that logging change into m-c
Comment 16•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #15) > nick, we should put that logging change into m-c Filed bug 1121530
Comment 17•9 years ago
|
||
I've been hitting this randomly for the last week or so (On 10.10), I've found YouTube seems to trigger it more than most other sites. I'm not sure how helpful this log segment is, once I hit it the log does become rather useless due to the amount of data it spits out.
Comment 18•9 years ago
|
||
I've also started seeing this issue in the last week on Nightly. Was Alex's log enough or would another be helpful?
Comment 19•9 years ago
|
||
So the errno Alex is seeing (22) indicates EINVAL - meaning we're passing a bad argument to poll (either nfds > OPEN_MAX, or timeout < -1, according to my man page on 10.10). I still have yet to see this locally, but I will (in the name of bugfixing!) play around on YouTube today, as Alex says that seems to trigger the behavior more often than others. :MattN - if you have any good repro sites, I'd appreciate hearing about them.
Comment 20•9 years ago
|
||
I don't know what specifically is causing it but I'm logged into Hello and saw the bug yesterday while restoring my session for a Nightly update. Only a few pinned tabs had successfully restored before hitting this and they were Google Voice and 2-3 tabs of different Gmail accounts. I also have over 20 pinned BMO tabs. I was on Mozilla MV Wi-Fi when I saw this btw. (I haven't used my ethernet for a few weeks at the office).
Comment 21•9 years ago
|
||
I would guess that the problem for me is related to "Network Predictive Actions" since the timing makes sense.
Comment 22•9 years ago
|
||
MattN: it's possible that the predictor exacerbates this problem (if, in fact, the problem is that we're hitting OPEN_MAX file descriptors), but as shown by the reporter, this issue has existed since long before the predictor landed in any form. I still haven't had any success reproducing this issue locally (even restoring a session with 20+ pinned tabs like yours), so I'm wondering if you've changed any prefs that might change our network behavior or session restore behavior? (the output of about:support would be nice, as I can take a peek at extensions you have that might be causing you to see this issue when I'm not seeing it.)
Flags: needinfo?(MattN+bmo)
Comment 23•9 years ago
|
||
Alex, if you wouldn't mind supplying the same info I asked :MattN for in comment 22, I'd appreciate it. Seeing if you two have anything in common might help me narrow down the issue more quickly.
Flags: needinfo?(the.decryptor)
Comment 24•9 years ago
|
||
https://bugzilla.mozilla.org/show_bug.cgi?id=939318#c206 has a log with the socket transport thread hung (which is slightly different) - it could be related.. the thinking over there is it is related to the network change event handling. If OP has a local repro, maybe that can be ruled out here independent of seer? (I think network-change has a pref).
Comment 25•9 years ago
|
||
an obvious question is if we leaked those descriptors somehow.
Comment 26•9 years ago
|
||
The network change events can be switched off with "network.notify.changed"
Comment 27•9 years ago
|
||
(In reply to Nicholas Hurley [:hurley] from comment #23) > Alex, if you wouldn't mind supplying the same info I asked :MattN for in > comment 22, I'd appreciate it. Seeing if you two have anything in common > might help me narrow down the issue more quickly. The only network preference I changed is "network.http.referer.trimmingPolicy", which shouldn't have an effect although I can certainly try with it off (I actually just reset my profile a few days ago for another issue and I've hit this issue again already), and the only addon I've got installed is ABP. And pinned tab wise I've only got 2, with about 30 unpinned tabs but I only ever load 4-5 of those in any session. I'll try disabling the network change events, I'm on a wired network with static addressing so it doesn't help me much anyway. If I hit it again I'll try disabling the predictor.
Flags: needinfo?(the.decryptor)
Comment 28•9 years ago
|
||
I tried with network change events disabled and hit it again, tried with the predictor disabled and hit it again as well. It could just be bad luck, but it took significantly longer to hit with the predictor disabled, around 6 hours vs. an hour or two of heavy browsing with network change events turned off. Annoyingly I couldn't hit it on YouTube today, but did hit it loading the CommonMark spec.
Comment 29•9 years ago
|
||
I haven't been seeing the problem recently so perhaps my issue was fixed by another bug. My about:support data is at http://pastebin.mozilla.org/8700097 (expiring in 1 month).
Flags: needinfo?(MattN+bmo)
Comment 30•9 years ago
|
||
I've just started seeing this again, after the last message it had entirely stopped, but whatever is causing it for me has flared up again.
Comment 31•9 years ago
|
||
I think I've found a more reliable STR, crashing the content process. When I restore all crashed tabs (in 2 separate windows) they start loading normally, but if I try loading a page in a new tab quickly I hit this issue fairly reliably (As in it's happened 5 times today and I've hit it every single time). Sometimes it actually sticks around and doesn't clear up, I have to restart Firefox to be able to open any more pages. Funnily enough YouTube still helps, but because of Bug 1145101 this time.
Comment 32•8 years ago
|
||
is this something you still see with the current nightly? There have been a number of things cleaned up in this space, though none of them I would specifically correlate with this problem
Flags: needinfo?(xzxing)
Whiteboard: [necko-backlog]
Comment 33•8 years ago
|
||
I haven't hit this issue at all since I last commented, whatever was causing it for me went away.
Comment 34•7 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #32) > is this something you still see with the current nightly? There have been a > number of things cleaned up in this space, though none of them I would > specifically correlate with this problem xzxing user email address is dead
Flags: needinfo?(xzxing) → needinfo?(mcmanus)
Comment 35•7 years ago
|
||
I'm closing this as fixed, as Alex can't reproduce anymore and the original poster isn't responding.
Status: UNCONFIRMED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Comment 36•7 years ago
|
||
(In reply to Daniel Stenberg [:bagder] from comment #35) > I'm closing this as fixed, as Alex can't reproduce anymore and the original > poster isn't responding. Hi Daniel, could you please look at this bug again? Even the original poster is gone, we are still facing the same problem. Produce details: 1, we can stable reproduced it on xulrunner: 10.0.6esr, OS: Mac OS 10.11 & 10.12. 2, we collected nspr logs when it reproduced. We can see from logs that there is a dead loop which might result from PR_Poll error [-5987]. 4, we cannot access any url anymore once it reproduced. 3, I have print out the value of n in "netwerk/base/src/nsSocketTransportService2.cpp/nsSocketTransportService::DoPollIteration(bool wait)" nsresult nsSocketTransportService::DoPollIteration(bool wait) { ... int32_t n = Poll(wait, &pollInterval); if (n < 0) { SOCKET_LOG((" PR_Poll error [%d]\n", PR_GetError())); } else { ... } return NS_OK; } and n always equals -1 when reproduced.
Comment 37•7 years ago
|
||
Updated•7 years ago
|
Status: RESOLVED → REOPENED
Ever confirmed: true
Resolution: FIXED → ---
Comment 38•7 years ago
|
||
(In reply to VincentTiantc from comment #36) > (In reply to Daniel Stenberg [:bagder] from comment #35) > > I'm closing this as fixed, as Alex can't reproduce anymore and the original > > poster isn't responding. > > Hi Daniel, could you please look at this bug again? Even the original poster > is gone, we are still facing the same problem. > > Produce details: > 1, we can stable reproduced it on xulrunner: 10.0.6esr, OS: Mac OS 10.11 & > 10.12. > 2, we collected nspr logs when it reproduced. We can see from logs that > there is a dead loop which might result from PR_Poll error [-5987]. > 4, we cannot access any url anymore once it reproduced. > 3, I have print out the value of n in > "netwerk/base/src/nsSocketTransportService2.cpp/nsSocketTransportService:: > DoPollIteration(bool wait)" > nsresult > nsSocketTransportService::DoPollIteration(bool wait) > { > ... > int32_t n = Poll(wait, &pollInterval); > > if (n < 0) { > SOCKET_LOG((" PR_Poll error [%d]\n", PR_GetError())); > } > else { > ... > } > return NS_OK; > } > and n always equals -1 when reproduced. After collecting more logs, we found: 1, Once 'Poll(wait, &pollInterval)' returns -1, all the following Polls will return -1 too, since there is no error handler in "if(n < 0) branch" except for an error log. 2, The 'PR_Poll error [-5987]' stands for INVALID_PARAM_ERROR. Current workaround: Since once there is an failed socket[invoked by int32_t n = Poll(wait, &pollInterval);] in mPollList, it won't disappear and the whole thread kept endless Polling with the same failed result -1, we add some error-handle-code in "if(n < 0) branch" to free and reset mPollList, mActiveList, mIdleList. This workaround do works, but it's kind of stupid, since it will discard the following Polls in queue, and it doesn't fix the root of this problem.
Comment 39•7 years ago
|
||
Dragana, is there a possibility for a quick fix here? Could it be something affecting more users?
Flags: needinfo?(dd.mozilla)
Comment 40•7 years ago
|
||
I will look into this. comment 38 is interesting.
Comment 41•7 years ago
|
||
Nick, can you make a patch to try to work around this problem or cahange poll parameters to fix this error. You already looked into this bug and poll errors.
Flags: needinfo?(dd.mozilla) → needinfo?(hurley)
Comment 42•7 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #41) > Nick, can you make a patch to try to work around this problem or cahange > poll parameters to fix this error. You already looked into this bug and poll > errors. I can put this on my personal backlog, but I won't be able to get to it until next week at the earliest, given my impending PTO and other, higher-priority things I need to get to today. Feel free to take this (or ask someone else to) if it needs doing before then.
Flags: needinfo?(hurley)
Comment 43•7 years ago
|
||
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Comment 44•7 years ago
|
||
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P1 → P3
Comment 45•5 years ago
|
||
I just encountered this bug on 67 on OSX 10.14.3.
Networking log file is > 2GB full of the same error over and over again.
How got into the bug was that I when I woke my laptop up from sleep it Firefox would no longer make any network connections. Before putting it to sleep (lid open) I a WebRTC call, which I ended normally and put the laptop at my desk and locked it via the OS.
Comment 46•3 years ago
|
||
Hey Nils,
Can you still reproduce this issue or should we close it?
Flags: needinfo?(drno)
Comment 47•3 years ago
|
||
Marking this as Resolved > Incomplete as per reporter's lack of response.
If anyone can still reproduce this issue re-open it or file a new bug.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 3 years ago
Resolution: --- → INCOMPLETE
You need to log in
before you can comment on or make changes to this bug.
Description
•