Closed Bug 1119160 Opened 9 years ago Closed 3 years ago

[Mac 10.9] Socket transport service thread pegs the CPU, spinning, when a system error occurs in Poll

Categories

(Core :: Networking, defect, P3)

Version: 34 Branch
Hardware: Other
OS: macOS
Type: defect

Tracking


RESOLVED INCOMPLETE

People

(Reporter: xzxing, Unassigned, NeedInfo)

Details

(Keywords: hang, perf, Whiteboard: [necko-backlog])

Attachments

(7 files, 1 obsolete file)

Attached file sample file1.txt
User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0
Build ID: 20141126041045

Steps to reproduce:

We use XULRunnerV10.0.6esr as an embedded browser, and recently found a severe CPU-spinning issue (CPU usage above 100%) on Mac 10.9 that blocks all networking: no websites can be opened any more.

From several process sample files of the spinning process, we found that when the CPU spinning happened, nsSocketTransportService::DoPollIteration() -> PR_Poll() always hit an error from libsystem_c.dylib:

 : | + 9 PR_Poll  (in libnspr4.dylib) + 482  [0x1180bcf22]
    + !           : | + ! 9 __error  (in libsystem_c.dylib) + 1,0,...  [0x7fff98ab5365,0x7fff98ab5364,...]

To try to reproduce the issue, I modified nsSocketTransportService::DoPollIteration(bool wait) in netwerk/base/src/nsSocketTransportService2.cpp as below, forcing Poll(wait, &pollInterval) to return n < 0; with that change the issue reproduces every time.

nsresult
nsSocketTransportService::DoPollIteration(bool wait)
{
    ...
    int32_t n = Poll(wait, &pollInterval);
    // xzxing
    n = -1;
    // end of xzxing
    if (n < 0) {
        SOCKET_LOG(("  PR_Poll error [%d]\n", PR_GetError()));
    }
    else {
        ...
    }
    return NS_OK;
}

Actual results:

It seems that when an error happens in Poll(), there is no error handling at all, either in nsSocketTransportService::DoPollIteration() or in nsSocketTransportService::Run(), which calls DoPollIteration(). As a result, if some system error occurs, Poll() returns the error on every iteration and the nsSocketTransportService thread pegs the CPU, spinning.

I have verified against the latest Firefox 34.0.5 build; the same issue exists there as well.


Expected results:

Expected behavior: there should be some error-handling mechanism when Poll() returns errors, instead of letting the nsSocketTransportService thread peg the CPU, spinning.
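
For illustration, a minimal sketch of the kind of handling meant here (hypothetical; the PR_PENDING_INTERRUPT_ERROR special case and the backoff constant are assumptions, not existing Gecko code):

// Hypothetical sketch only, not the actual fix: react to a persistent
// Poll() failure instead of re-entering the loop immediately and spinning.
int32_t n = Poll(wait, &pollInterval);
if (n < 0) {
    PRErrorCode err = PR_GetError();
    SOCKET_LOG(("  PR_Poll error [%d] os error [%d]\n", err, PR_GetOSError()));
    if (err == PR_PENDING_INTERRUPT_ERROR) {
        // Interrupted; retrying immediately is harmless.
        return NS_OK;
    }
    // Any other error will recur on the next iteration with the same poll
    // list, so back off briefly rather than burning a core.
    PR_Sleep(PR_MillisecondsToInterval(50));
    return NS_OK;
}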

I have attached snapshots of the process sample files from when the CPU spinning happened, and a screenshot of the busy thread taken with Instruments' Time Profiler, for your reference.

I noticed that the related code in nsSocketTransportService2.cpp has existed there for many releases without any bug report. In fact, the issue is hard to reproduce, since I am not sure when the system error happens during Poll/select; but when it does happen, the CPU spikes and keeps spinning, and the whole network service is blocked, which only closing the entire application can resolve.
Severity: normal → critical
Keywords: hang, perf
OS: Windows 7 → Mac OS X
Priority: -- → P1
Hardware: x86_64 → Other
Attached image Screen Shot 2014-04-16 at 4.12.21 PM.png (obsolete) —
Attachment #8545807 - Attachment is obsolete: true
We have got many CPU-spinning reports recently, especially on Mac 10.9. Hope you can give some help and guidance here.
I think this bug is one of the CPU hogs I see on Mac in Firefox as well. Stack looks familiar. Doug?
Assignee: nobody → dougt
nick - do you have a mac handy for this?
Flags: needinfo?(hurley)
Yeah, I'll take a look.
Assignee: dougt → hurley
Flags: needinfo?(hurley)
So I can definitely reproduce with the code changes above (I have not yet seen this in regular use on my machine), but that's not a particularly interesting case. We're doing what we should do if poll returns -1 and we don't know why, which is try again. What would be more interesting is to find out what errno is after the call to poll fails, as that will tell us *why* poll is failing, and if there's anything we can do (for example, some values of errno from poll indicate that we messed up, which would be our bug, while others indicate that something went wrong at a lower layer we have no control over, in which case our only option is to try again).

What I would need to see to get any further on this is the results of both PR_GetError and PR_GetOSError being called immediately after PR_Poll returns (which is in nsSocketTransportService2::Poll). xzxing, can you provide that information from an in-practice failure (in other words, not the failure you synthesized in comment 0)?
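
To make that distinction concrete, here is a standalone sketch; the grouping below is an assumption based on the poll(2) man page, not anything Gecko currently does:

// Illustration only: split poll() failures into "our bug" vs. "transient,
// just retry". Error grouping is per the poll(2) man page on OS X.
#include <cerrno>

enum class PollFailure { Transient, OurBug };

PollFailure ClassifyPollError(int err) {
    switch (err) {
        case EINTR:   // interrupted by a signal: harmless, retry
        case EAGAIN:  // internal allocation failed: may be retried
            return PollFailure::Transient;
        case EINVAL:  // bad nfds/timeout: we built a bad argument list
        case EFAULT:  // fds array outside our address space: our bug
        default:
            return PollFailure::OurBug;
    }
}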
Flags: needinfo?(xzxing)
xzxing - one other question - do your network conditions happen to have changed just before seeing this behavior in normal practice? (switched from one wifi network to another, wifi->wired, changed the os x "network location" in settings?)

(In reply to Nicholas Hurley [:hurley] from comment #10)
> So I can definitely reproduce with the code changes above (I have not yet
> seen this in regular use on my machine), but that's not a particularly
> interesting case. We're doing what we should do if poll returns -1 and we
> don't know why, which is try again. What would be more interesting is to
> find out what errno is after the call to poll fails, as that will tell us
> *why* poll is failing, and if there's anything we can do (for example, some
> values of errno from poll indicate that we messed up, which would be our
> bug, while others indicate that something went wrong at a lower layer we
> have no control over, in which case our only option is to try again).
> 
> What I would need to see to get any further on this is the results of both
> PR_GetError and PR_GetOSError being called immediately after PR_Poll returns
> (which is in nsSocketTransportService2::Poll). xzxing, can you provide that
> information from an in-practice failure (in other words, not the failure you
> synthesized in comment 0)?

Hi Nick, thanks for your help looking into this.
Sorry, I don't currently have the specific errno when poll fails, since getting it requires enabling the NSPR log and running our product from the command line. Most of the CPU-spinning cases were reported by internal users; I only reproduced it twice by chance, and I have not been able to reproduce the issue for quite a while.

I will try to contact the internal users and ask for their help to enable the NSPR log and reproduce, to see if we can get a useful errno. But as I mentioned, we CANNOT find a regular way to reproduce it stably; sometimes it just happens, mostly when accessing https websites or under a bad network connection, but with NO guarantee of when it will happen.

At first I suspected it was the same issue as https://bugzilla.mozilla.org/show_bug.cgi?id=710176, since the symptoms are very alike. However, when I looked into the XULRunnerV10.0.6esr source code with 710176's patch, I found out that they are different: 710176 is based on XULRunner 11, which had already removed nsSSLThread.

Then I looked into several process sample files taken when the CPU spinning happened, and found that they all have one thing in common: there is an __error after poll/select. That is why I modified the source code to force PR_Poll to return -1, reproducing the same symptoms as the issue.

Could you help confirm the way to get the errno?

1. enable the NSPR Log
export NSPR_LOG_MODULES=nsSocketTransport:4
export NSPR_LOG_FILE=/Mozilla/log

2. Run application from command line
/Application/***.app/Contents/MacOS/***

3. Get the NSPR log from the /Mozilla/log file and search for "PR_Poll error" if we reproduce the issue, since a log record is written when the error happens in nsSocketTransportService2.cpp/DoPollIteration():

PRInt32 n = Poll(wait, &pollInterval);
    if (n < 0) {
        SOCKET_LOG(("  PR_Poll error [%d]\n", PR_GetError()));
    }
    else {
...
Flags: needinfo?(xzxing)
(In reply to Nicholas Hurley [:hurley] from comment #11)
> xzxing - one other question - do your network conditions happen to have
> changed just before seeing this behavior in normal practice? (switched from
> one wifi network to another, wifi->wired, changed the os x "network
> location" in settings?)

A network-condition change might be a cause: I remember that when I reproduced the issue, my Mac machine had just woken up from sleep. And some internal users reported the issue when using their Mac under a bad network connection (e.g., working from home via VPN) or moving from wired to wifi. So I think your suspicion makes sense. However, as I said, there is NO regular way to reproduce it stably, which is sad :(
It's unfortunate that there's no good reproducibility, even with a network change. That will make this one much harder to track down.

(In reply to xzxing@cn.ibm.com from comment #12)
> And could you help to confirm the way the get the errno?
> 
> 1. enable the NSPR Log
> export NSPR_LOG_MODULES=nsSocketTransport:4
> export NSPR_LOG_FILE=/Mozilla/log

This is correct, though you should really have the "timestamp" module in NSPR_LOG_MODULES as well, so your export would be

export NSPR_LOG_MODULES=timestamp,nsSocketTransport:5

> 2. Run application from command line
> /Application/***.app/Contents/MacOS/***
> 
> 3. get the NSPR log in /Mozilla/log file and find "PR_Poll error" in the log
> if we reproduce the issue, since there is a log record when error happened
> in nsSocketTransportService2.cpp/DoPollIteration()..
> 
> PRInt32 n = Poll(wait, &pollInterval);
>     if (n < 0) {
>         SOCKET_LOG(("  PR_Poll error [%d]\n", PR_GetError()));
>     }
>     else {
> ...

Correct. If you have the ability to recompile locally, it might be good to modify that SOCKET_LOG to also report PR_GetOSError, so it would look like

SOCKET_LOG(("  PR_Poll error [%d] os error [%d]\n", PR_GetError(), PR_GetOSError()));
nick, we should put that logging change into m-c
(In reply to Patrick McManus [:mcmanus] from comment #15)
> nick, we should put that logging change into m-c

Filed bug 1121530
Attached file Poll errors
I've been hitting this randomly for the last week or so (on 10.10); I've found YouTube seems to trigger it more than most other sites. I'm not sure how helpful this log segment is; once I hit it, the log becomes rather useless due to the amount of data it spits out.
I've also started seeing this issue in the last week on Nightly. Was Alex's log enough or would another be helpful?
So the errno Alex is seeing (22) indicates EINVAL - meaning we're passing a bad argument to poll (either nfds > OPEN_MAX, or timeout < -1, according to my man page on 10.10). I still have yet to see this locally, but I will (in the name of bugfixing!) play around on YouTube today, as Alex says that seems to trigger the behavior more often than others.
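
For reference, a tiny standalone program exercising the two documented EINVAL triggers from that man page (illustration only, not code from the tree):

// Check the two EINVAL conditions the 10.10 poll(2) man page documents:
// timeout < -1, and nfds exceeding the open-file limit. Per the man page
// both fail immediately with EINVAL rather than blocking.
#include <poll.h>
#include <sys/resource.h>
#include <cerrno>
#include <cstdio>
#include <vector>

int main() {
    struct pollfd pfd = { 0 /* stdin */, POLLIN, 0 };

    // Condition 1: timeout < -1.
    if (poll(&pfd, 1, -2) < 0) {
        printf("timeout=-2 -> errno %d (%s)\n", errno,
               errno == EINVAL ? "EINVAL" : "other");
    }

    // Condition 2: nfds greater than the process's open-file limit.
    struct rlimit rl;
    getrlimit(RLIMIT_NOFILE, &rl);
    size_t nfds = (rl.rlim_cur == RLIM_INFINITY ? 1024 : rl.rlim_cur) + 1;
    std::vector<struct pollfd> fds(nfds, pfd);
    if (poll(fds.data(), fds.size(), 0) < 0) {
        printf("nfds=%zu -> errno %d (%s)\n", fds.size(), errno,
               errno == EINVAL ? "EINVAL" : "other");
    }
    return 0;
}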

:MattN - if you have any good repro sites, I'd appreciate hearing about them.
I don't know what specifically is causing it but I'm logged into Hello and saw the bug yesterday while restoring my session for a Nightly update. Only a few pinned tabs had successfully restored before hitting this and they were Google Voice and 2-3 tabs of different Gmail accounts. I also have over 20 pinned BMO tabs. I was on Mozilla MV Wi-Fi when I saw this btw. (I haven't used my ethernet for a few weeks at the office).
I would guess that the problem for me is related to "Network Predictive Actions" since the timing makes sense.
MattN: it's possible that the predictor exacerbates this problem (if, in fact, the problem is that we're hitting OPEN_MAX file descriptors), but as shown by the reporter, this issue has existed since long before the predictor landed in any form. I still haven't had any success reproducing this issue locally (even restoring a session with 20+ pinned tabs like yours), so I'm wondering if you've changed any prefs that might change our network behavior or session restore behavior? (the output of about:support would be nice, as I can take a peek at extensions you have that might be causing you to see this issue when I'm not seeing it.)
Flags: needinfo?(MattN+bmo)
Alex, if you wouldn't mind supplying the same info I asked :MattN for in comment 22, I'd appreciate it. Seeing if you two have anything in common might help me narrow down the issue more quickly.
Flags: needinfo?(the.decryptor)
https://bugzilla.mozilla.org/show_bug.cgi?id=939318#c206 has a log with the socket transport thread hung (which is slightly different); it could be related. The thinking over there is that it is related to the network-change event handling. If the OP has a local repro, maybe that can be ruled out here independently of the seer? (I think network-change has a pref.)
An obvious question is whether we leaked those descriptors somehow.
The network change events can be switched off with "network.notify.changed"
(In reply to Nicholas Hurley [:hurley] from comment #23)
> Alex, if you wouldn't mind supplying the same info I asked :MattN for in
> comment 22, I'd appreciate it. Seeing if you two have anything in common
> might help me narrow down the issue more quickly.

The only network preference I changed is "network.http.referer.trimmingPolicy", which shouldn't have an effect, although I can certainly try with it off (I actually just reset my profile a few days ago for another issue and have hit this issue again already). The only addon I've got installed is ABP. Pinned-tab-wise I've only got 2, with about 30 unpinned tabs, but I only ever load 4-5 of those in any session.

I'll try disabling the network change events, I'm on a wired network with static addressing so it doesn't help me much anyway. If I hit it again I'll try disabling the predictor.
Flags: needinfo?(the.decryptor)
I tried with network change events disabled and hit it again, tried with the predictor disabled and hit it again as well. It could just be bad luck, but it took significantly longer to hit with the predictor disabled, around 6 hours vs. an hour or two of heavy browsing with network change events turned off.

Annoyingly I couldn't hit it on YouTube today, but did hit it loading the CommonMark spec.
I haven't been seeing the problem recently so perhaps my issue was fixed by another bug. My about:support data is at http://pastebin.mozilla.org/8700097 (expiring in 1 month).
Flags: needinfo?(MattN+bmo)
I've just started seeing this again. After my last message it had entirely stopped, but whatever is causing it for me has flared up again.
I think I've found a more reliable STR: crashing the content process. When I restore all crashed tabs (in 2 separate windows) they start loading normally, but if I quickly try loading a page in a new tab I hit this issue fairly reliably (it's happened 5 times today and I've hit it every single time). Sometimes it actually sticks around and doesn't clear up, and I have to restart Firefox to be able to open any more pages.

Funnily enough YouTube still helps, but because of Bug 1145101 this time.
is this something you still see with the current nightly? There have been a number of things cleaned up in this space, though none of them I would specifically correlate with this problem
Flags: needinfo?(xzxing)
Whiteboard: [necko-backlog]
I haven't hit this issue at all since I last commented, whatever was causing it for me went away.
Assignee: hurley → nobody
(In reply to Patrick McManus [:mcmanus] from comment #32)
> is this something you still see with the current nightly? There have been a
> number of things cleaned up in this space, though none of them I would
> specifically correlate with this problem

The xzxing user's email address is dead.
Flags: needinfo?(xzxing) → needinfo?(mcmanus)
I'm closing this as fixed, as Alex can't reproduce anymore and the original poster isn't responding.
Status: UNCONFIRMED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
(In reply to Daniel Stenberg [:bagder] from comment #35)
> I'm closing this as fixed, as Alex can't reproduce anymore and the original
> poster isn't responding.

Hi Daniel, could you please look at this bug again? Even though the original poster is gone, we are still facing the same problem.

Repro details:
1. We can stably reproduce it on XULRunner 10.0.6esr, on Mac OS 10.11 & 10.12.
2. We collected NSPR logs when it reproduced. The logs show an infinite loop, which might result from PR_Poll error [-5987].
3. We cannot access any URL any more once it reproduces.
4. I printed out the value of n in netwerk/base/src/nsSocketTransportService2.cpp, nsSocketTransportService::DoPollIteration(bool wait):
nsresult
nsSocketTransportService::DoPollIteration(bool wait)
{
    ...
    int32_t n = Poll(wait, &pollInterval);

    if (n < 0) {
        SOCKET_LOG(("  PR_Poll error [%d]\n", PR_GetError()));
    }
    else {
        ...
    }
    return NS_OK;
}
and n always equals -1 when reproduced.
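
(Incidentally, NSPR can decode such error numbers to names; a quick sketch, assuming the NSPR headers and library are available:)

// Decode an NSPR error code such as -5987. Illustration only.
#include <prerror.h>
#include <cstdio>

int main() {
    PRErrorCode code = -5987;
    printf("%d = %s: %s\n", code, PR_ErrorToName(code),
           PR_ErrorToString(code, PR_LANGUAGE_I_DEFAULT));
    return 0;
}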
Status: RESOLVED → REOPENED
Ever confirmed: true
Resolution: FIXED → ---
(In reply to VincentTiantc from comment #36)
> (In reply to Daniel Stenberg [:bagder] from comment #35)
> > I'm closing this as fixed, as Alex can't reproduce anymore and the original
> > poster isn't responding.
> 
> Hi Daniel, could you please look at this bug again? Even the original poster
> is gone, we are still facing the same problem.
> 
> Produce details:
> 1, we can stable reproduced it on xulrunner: 10.0.6esr, OS: Mac OS 10.11 &
> 10.12.
> 2, we collected nspr logs when it reproduced. We can see from logs that
> there is a dead loop which might result from PR_Poll error [-5987].
> 4, we cannot access any url anymore once it reproduced.
> 3, I have print out the value of n in
> "netwerk/base/src/nsSocketTransportService2.cpp/nsSocketTransportService::
> DoPollIteration(bool wait)"
> nsresult
> nsSocketTransportService::DoPollIteration(bool wait)
> {
>    ...
>    int32_t n = Poll(wait, &pollInterval);
> 
>    if (n < 0) {
>         SOCKET_LOG(("  PR_Poll error [%d]\n", PR_GetError()));
>     }
>    else {
>       ...
>    }
>    return NS_OK;
> }
> and n always equals -1 when reproduced.

After collecting more logs, we found:
1. Once Poll(wait, &pollInterval) returns -1, all the following Polls return -1 too, since there is no error handling in the if (n < 0) branch beyond an error log.
2. The PR_Poll error [-5987] is PR_INVALID_ARGUMENT_ERROR (invalid argument).

Current workaround:
Since once there is a failed socket in mPollList (the one that makes Poll(wait, &pollInterval) return -1), it never disappears, and the whole thread keeps polling endlessly with the same failure result, we added some error-handling code in the if (n < 0) branch to free and reset mPollList, mActiveList, and mIdleList.

This workaround does work, but it is crude: it throws away the other sockets in the poll list along with the failed one, and it doesn't fix the root of the problem.
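
Roughly, the workaround looks like this (an approximate reconstruction, not a reviewed patch; the DetachSocket() calls and loop bounds are our best recollection):

// Sketch of our workaround: on a Poll() failure, drop every socket so the
// next iteration starts from a clean poll list instead of re-polling the
// bad descriptor. Approximate, reconstructed code.
if (n < 0) {
    SOCKET_LOG(("  PR_Poll error [%d] os error [%d]\n",
                PR_GetError(), PR_GetOSError()));
    // Detach from the back so list compaction doesn't skip entries.
    for (int32_t i = int32_t(mActiveCount) - 1; i >= 0; --i) {
        DetachSocket(mActiveList, &mActiveList[i]);
    }
    for (int32_t i = int32_t(mIdleCount) - 1; i >= 0; --i) {
        DetachSocket(mIdleList, &mIdleList[i]);
    }
}

This throws away live connections, which is why we consider it only a stopgap.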
Dragana, is there a possibility for a quick fix here?  Could it be something affecting more users?
Flags: needinfo?(dd.mozilla)
I will look into this. Comment 38 is interesting.
Nick, can you make a patch to work around this problem, or change the poll parameters to fix this error? You already looked into this bug and the poll errors.
Flags: needinfo?(dd.mozilla) → needinfo?(hurley)
(In reply to Dragana Damjanovic [:dragana] from comment #41)
> Nick, can you make a patch to try to work around this problem or cahange
> poll parameters to fix this error. You already looked into this bug and poll
> errors.

I can put this on my personal backlog, but I won't be able to get to it until next week at the earliest, given my impending PTO and other, higher-priority things I need to get to today. Feel free to take this (or ask someone else to) if it needs doing before then.
Flags: needinfo?(hurley)
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P1 → P3

I just encountered this bug on 67 on OS X 10.14.3.
The networking log file is > 2 GB, full of the same error over and over again.

How I got into the bug: when I woke my laptop up from sleep, Firefox would no longer make any network connections. Before putting it to sleep (lid open) I was in a WebRTC call, which I ended normally before putting the laptop at my desk and locking it via the OS.

Hey Nils,
Can you still reproduce this issue or should we close it?

Flags: needinfo?(drno)

Marking this as Resolved > Incomplete as per the reporter's lack of response.
If anyone can still reproduce this issue, re-open it or file a new bug.

Status: REOPENED → RESOLVED
Closed: 7 years ago → 3 years ago
Resolution: --- → INCOMPLETE

Declaring NI bankruptcy.

Flags: needinfo?(drno)