Closed Bug 1119160 Opened 9 years ago Closed 3 years ago

[Mac 10.9] Socket transport service thread pegs the CPU, spinning, when a system error occurs in Poll

Categories

(Core :: Networking, defect, P3)

Version: 34 Branch
Hardware: Other
OS: macOS
Type: defect

Tracking


RESOLVED INCOMPLETE

People

(Reporter: xzxing, Unassigned, NeedInfo)

Details

(Keywords: hang, perf, Whiteboard: [necko-backlog])

Attachments

(7 files, 1 obsolete file)

Attached file sample file1.txt
User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0
Build ID: 20141126041045

Steps to reproduce:

We use XULRunnerV10.0.6esr as an embedded browser, and recently found a severe CPU-spinning issue (CPU usage above 100%) on Mac 10.9 that blocks all networking: no websites can be opened any more.

From several process sample files of the spinning process, we found that when the CPU spinning happened, nsSocketTransportService::DoPollIteration() -> PR_Poll() always hit an error from libsystem_c.dylib:

 : | + 9 PR_Poll  (in libnspr4.dylib) + 482  [0x1180bcf22]
    + !           : | + ! 9 __error  (in libsystem_c.dylib) + 1,0,...  [0x7fff98ab5365,0x7fff98ab5364,...]

To try to reproduce the issue, I modified nsSocketTransportService::DoPollIteration(bool wait) in netwerk/base/src/nsSocketTransportService2.cpp as below, forcing Poll(wait, &pollInterval) to return n < 0; with that change the issue reproduces every time.

nsresult
nsSocketTransportService::DoPollIteration(bool wait)
{
    ...
    int32_t n = Poll(wait, &pollInterval);
    // xzxing
    n = -1;
    // end of xzxing
    if (n < 0) {
        SOCKET_LOG(("  PR_Poll error [%d]\n", PR_GetError()));
    }
    else {
        ...
    }
    return NS_OK;
}

Actual results:

It seems that when an error happens in Poll(), there is no error handling at all, either in nsSocketTransportService::DoPollIteration() or in nsSocketTransportService::Run(), which calls DoPollIteration(). As a result, if some system error occurs, Poll() returns the error on every iteration and the nsSocketTransportService thread pegs the CPU, spinning.

I have verified against the latest Firefox 34.0.5 build; the same issue exists there as well.


Expected results:

Expected behavior: there should be some error-handling mechanism when Poll() returns errors, instead of letting the nsSocketTransportService thread peg the CPU, spinning.
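
For illustration, a minimal sketch of the kind of handling meant here (hypothetical; the PR_PENDING_INTERRUPT_ERROR special case and the backoff constant are assumptions, not existing Gecko code):

// Hypothetical sketch only, not the actual fix: react to a persistent
// Poll() failure instead of re-entering the loop immediately and spinning.
int32_t n = Poll(wait, &pollInterval);
if (n < 0) {
    PRErrorCode err = PR_GetError();
    SOCKET_LOG(("  PR_Poll error [%d] os error [%d]\n", err, PR_GetOSError()));
    if (err == PR_PENDING_INTERRUPT_ERROR) {
        // Interrupted; retrying immediately is harmless.
        return NS_OK;
    }
    // Any other error will recur on the next iteration with the same poll
    // list, so back off briefly rather than burning a core.
    PR_Sleep(PR_MillisecondsToInterval(50));
    return NS_OK;
}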

I have attached snapshots of the process sample files from when the CPU spinning happened, and a screenshot of the busy thread taken with Instruments' Time Profiler, for your reference.

I noticed that the related code in nsSocketTransportService2.cpp has existed there for many releases without any bug report. In fact, the issue is hard to reproduce, since I am not sure when the system error happens during Poll/select; but when it does happen, the CPU spikes and keeps spinning, and the whole network service is blocked, which only closing the entire application can resolve.
Severity: normal → critical
Keywords: hang, perf
OS: Windows 7 → Mac OS X
Priority: -- → P1
Hardware: x86_64 → Other
Attached image Screen Shot 2014-04-16 at 4.12.21 PM.png (obsolete) —
Attachment #8545807 - Attachment is obsolete: true
We have got many CPU-spinning reports recently, especially on Mac 10.9. Hope you can give some help and guidance here.
I think this bug is one of the CPU hogs I see on Mac in Firefox as well. Stack looks familiar. Doug?
Assignee: nobody → dougt
nick - do you have a mac handy for this?
Flags: needinfo?(hurley)
Yeah, I'll take a look.
Assignee: dougt → hurley
Flags: needinfo?(hurley)
So I can definitely reproduce with the code changes above (I have not yet seen this in regular use on my machine), but that's not a particularly interesting case. We're doing what we should do if poll returns -1 and we don't know why, which is try again. What would be more interesting is to find out what errno is after the call to poll fails, as that will tell us *why* poll is failing, and if there's anything we can do (for example, some values of errno from poll indicate that we messed up, which would be our bug, while others indicate that something went wrong at a lower layer we have no control over, in which case our only option is to try again).

What I would need to see to get any further on this is the results of both PR_GetError and PR_GetOSError being called immediately after PR_Poll returns (which is in nsSocketTransportService2::Poll). xzxing, can you provide that information from an in-practice failure (in other words, not the failure you synthesized in comment 0)?
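
To make that distinction concrete, here is a standalone sketch; the grouping below is an assumption based on the poll(2) man page, not anything Gecko currently does:

// Illustration only: split poll() failures into "our bug" vs. "transient,
// just retry". Error grouping is per the poll(2) man page on OS X.
#include <cerrno>

enum class PollFailure { Transient, OurBug };

PollFailure ClassifyPollError(int err) {
    switch (err) {
        case EINTR:   // interrupted by a signal: harmless, retry
        case EAGAIN:  // internal allocation failed: may be retried
            return PollFailure::Transient;
        case EINVAL:  // bad nfds/timeout: we built a bad argument list
        case EFAULT:  // fds array outside our address space: our bug
        default:
            return PollFailure::OurBug;
    }
}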
Flags: needinfo?(xzxing)
xzxing - one other question - do your network conditions happen to have changed just before seeing this behavior in normal practice? (switched from one wifi network to another, wifi->wired, changed the os x "network location" in settings?)

(In reply to Nicholas Hurley [:hurley] from comment #10)
> So I can definitely reproduce with the code changes above (I have not yet
> seen this in regular use on my machine), but that's not a particularly
> interesting case. We're doing what we should do if poll returns -1 and we
> don't know why, which is try again. What would be more interesting is to
> find out what errno is after the call to poll fails, as that will tell us
> *why* poll is failing, and if there's anything we can do (for example, some
> values of errno from poll indicate that we messed up, which would be our
> bug, while others indicate that something went wrong at a lower layer we
> have no control over, in which case our only option is to try again).
> 
> What I would need to see to get any further on this is the results of both
> PR_GetError and PR_GetOSError being called immediately after PR_Poll returns
> (which is in nsSocketTransportService2::Poll). xzxing, can you provide that
> information from an in-practice failure (in other words, not the failure you
> synthesized in comment 0)?

Hi Nick, thanks for your help looking into this.
Sorry, I don't currently have the specific errno when poll fails, since getting it requires enabling the NSPR log and running our product from the command line. Most of the CPU-spinning cases were reported by internal users; I only reproduced it twice by chance, and I have not been able to reproduce the issue for quite a while.

I will try to contact the internal users and ask for their help to enable the NSPR log and reproduce, to see if we can get a useful errno. But as I mentioned, we CANNOT find a regular way to reproduce it stably; sometimes it just happens, mostly when accessing https websites or under a bad network connection, but with NO guarantee of when it will happen.

At first I suspected it was the same issue as https://bugzilla.mozilla.org/show_bug.cgi?id=710176, since the symptoms are very alike. However, when I looked into the XULRunnerV10.0.6esr source code with 710176's patch, I found out that they are different: 710176 is based on XULRunner 11, which had already removed nsSSLThread.

Then I looked into several process sample files taken when the CPU spinning happened, and found that they all have one thing in common: there is an __error after poll/select. That is why I modified the source code to force PR_Poll to return -1, reproducing the same symptoms as the issue.

Could you help confirm the way to get the errno?

1. enable the NSPR Log
export NSPR_LOG_MODULES=nsSocketTransport:4
export NSPR_LOG_FILE=/Mozilla/log

2. Run application from command line
/Application/***.app/Contents/MacOS/***

3. Get the NSPR log from the /Mozilla/log file and search for "PR_Poll error" if we reproduce the issue, since a log record is written when the error happens in nsSocketTransportService2.cpp/DoPollIteration():

PRInt32 n = Poll(wait, &pollInterval);
    if (n < 0) {
        SOCKET_LOG(("  PR_Poll error [%d]\n", PR_GetError()));
    }
    else {
...
Flags: needinfo?(xzxing)
(In reply to Nicholas Hurley [:hurley] from comment #11)
> xzxing - one other question - do your network conditions happen to have
> changed just before seeing this behavior in normal practice? (switched from
> one wifi network to another, wifi->wired, changed the os x "network
> location" in settings?)

A network-condition change might be a cause: I remember that when I reproduced the issue, my Mac machine had just woken up from sleep. And some internal users reported the issue when using their Mac under a bad network connection (e.g., working from home via VPN) or moving from wired to wifi. So I think your suspicion makes sense. However, as I said, there is NO regular way to reproduce it stably, which is sad :(
It's unfortunate that there's no good reproducibility, even with a network change. That will make this one much harder to track down.

(In reply to xzxing@cn.ibm.com from comment #12)
> And could you help to confirm the way the get the errno?
> 
> 1. enable the NSPR Log
> export NSPR_LOG_MODULES=nsSocketTransport:4
> export NSPR_LOG_FILE=/Mozilla/log

This is correct, though you should really have the "timestamp" module in NSPR_LOG_MODULES as well, so your export would be

export NSPR_LOG_MODULES=timestamp,nsSocketTransport:5

> 2. Run application from command line
> /Application/***.app/Contents/MacOS/***
> 
> 3. get the NSPR log in /Mozilla/log file and find "PR_Poll error" in the log
> if we reproduce the issue, since there is a log record when error happened
> in nsSocketTransportService2.cpp/DoPollIteration()..
> 
> PRInt32 n = Poll(wait, &pollInterval);
>     if (n < 0) {
>         SOCKET_LOG(("  PR_Poll error [%d]\n", PR_GetError()));
>     }
>     else {
> ...

Correct. If you have the ability to recompile locally, it might be good to modify that SOCKET_LOG to also report PR_GetOSError, so it would look like

SOCKET_LOG(("  PR_Poll error [%d] os error [%d]\n", PR_GetError(), PR_GetOSError()));
nick, we should put that logging change into m-c
(In reply to Patrick McManus [:mcmanus] from comment #15)
> nick, we should put that logging change into m-c

Filed bug 1121530
Attached file Poll errors
I've been hitting this randomly for the last week or so (on 10.10); I've found YouTube seems to trigger it more than most other sites. I'm not sure how helpful this log segment is; once I hit it, the log becomes rather useless due to the amount of data it spits out.
I've also started seeing this issue in the last week on Nightly. Was Alex's log enough or would another be helpful?
So the errno Alex is seeing (22) indicates EINVAL - meaning we're passing a bad argument to poll (either nfds > OPEN_MAX, or timeout < -1, according to my man page on 10.10). I still have yet to see this locally, but I will (in the name of bugfixing!) play around on YouTube today, as Alex says that seems to trigger the behavior more often than others.
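
For reference, a tiny standalone program exercising the two documented EINVAL triggers from that man page (illustration only, not code from the tree):

// Check the two EINVAL conditions the 10.10 poll(2) man page documents:
// timeout < -1, and nfds exceeding the open-file limit. Per the man page
// both fail immediately with EINVAL rather than blocking.
#include <poll.h>
#include <sys/resource.h>
#include <cerrno>
#include <cstdio>
#include <vector>

int main() {
    struct pollfd pfd = { 0 /* stdin */, POLLIN, 0 };

    // Condition 1: timeout < -1.
    if (poll(&pfd, 1, -2) < 0) {
        printf("timeout=-2 -> errno %d (%s)\n", errno,
               errno == EINVAL ? "EINVAL" : "other");
    }

    // Condition 2: nfds greater than the process's open-file limit.
    struct rlimit rl;
    getrlimit(RLIMIT_NOFILE, &rl);
    size_t nfds = (rl.rlim_cur == RLIM_INFINITY ? 1024 : rl.rlim_cur) + 1;
    std::vector<struct pollfd> fds(nfds, pfd);
    if (poll(fds.data(), fds.size(), 0) < 0) {
        printf("nfds=%zu -> errno %d (%s)\n", fds.size(), errno,
               errno == EINVAL ? "EINVAL" : "other");
    }
    return 0;
}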

:MattN - if you have any good repro sites, I'd appreciate hearing about them.
I don't know what specifically is causing it but I'm logged into Hello and saw the bug yesterday while restoring my session for a Nightly update. Only a few pinned tabs had successfully restored before hitting this and they were Google Voice and 2-3 tabs of different Gmail accounts. I also have over 20 pinned BMO tabs. I was on Mozilla MV Wi-Fi when I saw this btw. (I haven't used my ethernet for a few weeks at the office).
I would guess that the problem for me is related to "Network Predictive Actions" since the timing makes sense.
MattN: it's possible that the predictor exacerbates this problem (if, in fact, the problem is that we're hitting OPEN_MAX file descriptors), but as shown by the reporter, this issue has existed since long before the predictor landed in any form. I still haven't had any success reproducing this issue locally (even restoring a session with 20+ pinned tabs like yours), so I'm wondering if you've changed any prefs that might change our network behavior or session restore behavior? (the output of about:support would be nice, as I can take a peek at extensions you have that might be causing you to see this issue when I'm not seeing it.)
Flags: needinfo?(MattN+bmo)
Alex, if you wouldn't mind supplying the same info I asked :MattN for in comment 22, I'd appreciate it. Seeing if you two have anything in common might help me narrow down the issue more quickly.
Flags: needinfo?(the.decryptor)
https://bugzilla.mozilla.org/show_bug.cgi?id=939318#c206 has a log with the socket transport thread hung (which is slightly different); it could be related. The thinking over there is that it is related to the network-change event handling. If the OP has a local repro, maybe that can be ruled out here independently of the seer? (I think network-change has a pref.)
An obvious question is whether we leaked those descriptors somehow.
The network change events can be switched off with "network.notify.changed"
(In reply to Nicholas Hurley [:hurley] from comment #23)
> Alex, if you wouldn't mind supplying the same info I asked :MattN for in
> comment 22, I'd appreciate it. Seeing if you two have anything in common
> might help me narrow down the issue more quickly.

The only network preference I changed is "network.http.referer.trimmingPolicy", which shouldn't have an effect, although I can certainly try with it off (I actually just reset my profile a few days ago for another issue and have hit this issue again already). The only addon I've got installed is ABP. Pinned-tab-wise I've only got 2, with about 30 unpinned tabs, but I only ever load 4-5 of those in any session.

I'll try disabling the network change events, I'm on a wired network with static addressing so it doesn't help me much anyway. If I hit it again I'll try disabling the predictor.
Flags: needinfo?(the.decryptor)
I tried with network change events disabled and hit it again, tried with the predictor disabled and hit it again as well. It could just be bad luck, but it took significantly longer to hit with the predictor disabled, around 6 hours vs. an hour or two of heavy browsing with network change events turned off.

Annoyingly I couldn't hit it on YouTube today, but did hit it loading the CommonMark spec.
I haven't been seeing the problem recently so perhaps my issue was fixed by another bug. My about:support data is at http://pastebin.mozilla.org/8700097 (expiring in 1 month).
Flags: needinfo?(MattN+bmo)
I've just started seeing this again. After my last message it had entirely stopped, but whatever is causing it for me has flared up again.
I think I've found a more reliable STR: crashing the content process. When I restore all crashed tabs (in 2 separate windows) they start loading normally, but if I quickly try loading a page in a new tab I hit this issue fairly reliably (it's happened 5 times today and I've hit it every single time). Sometimes it actually sticks around and doesn't clear up, and I have to restart Firefox to be able to open any more pages.

Funnily enough YouTube still helps, but because of Bug 1145101 this time.
is this something you still see with the current nightly? There have been a number of things cleaned up in this space, though none of them I would specifically correlate with this problem
Flags: needinfo?(xzxing)
Whiteboard: [necko-backlog]
I haven't hit this issue at all since I last commented, whatever was causing it for me went away.
Assignee: hurley → nobody
(In reply to Patrick McManus [:mcmanus] from comment #32)
> is this something you still see with the current nightly? There have been a
> number of things cleaned up in this space, though none of them I would
> specifically correlate with this problem

The xzxing user's email address is dead.
Flags: needinfo?(xzxing) → needinfo?(mcmanus)
I'm closing this as fixed, as Alex can't reproduce anymore and the original poster isn't responding.
Status: UNCONFIRMED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
(In reply to Daniel Stenberg [:bagder] from comment #35)
> I'm closing this as fixed, as Alex can't reproduce anymore and the original
> poster isn't responding.

Hi Daniel, could you please look at this bug again? Even though the original poster is gone, we are still facing the same problem.

Repro details:
1. We can stably reproduce it on XULRunner 10.0.6esr, on Mac OS 10.11 & 10.12.
2. We collected NSPR logs when it reproduced. The logs show an infinite loop, which might result from PR_Poll error [-5987].
3. We cannot access any URL any more once it reproduces.
4. I printed out the value of n in netwerk/base/src/nsSocketTransportService2.cpp, nsSocketTransportService::DoPollIteration(bool wait):
nsresult
nsSocketTransportService::DoPollIteration(bool wait)
{
    ...
    int32_t n = Poll(wait, &pollInterval);

    if (n < 0) {
        SOCKET_LOG(("  PR_Poll error [%d]\n", PR_GetError()));
    }
    else {
        ...
    }
    return NS_OK;
}
and n always equals -1 when reproduced.
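
(Incidentally, NSPR can decode such error numbers to names; a quick sketch, assuming the NSPR headers and library are available:)

// Decode an NSPR error code such as -5987. Illustration only.
#include <prerror.h>
#include <cstdio>

int main() {
    PRErrorCode code = -5987;
    printf("%d = %s: %s\n", code, PR_ErrorToName(code),
           PR_ErrorToString(code, PR_LANGUAGE_I_DEFAULT));
    return 0;
}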
Status: RESOLVED → REOPENED
Ever confirmed: true
Resolution: FIXED → ---
(In reply to VincentTiantc from comment #36)
> (In reply to Daniel Stenberg [:bagder] from comment #35)
> > I'm closing this as fixed, as Alex can't reproduce anymore and the original
> > poster isn't responding.
> 
> Hi Daniel, could you please look at this bug again? Even the original poster
> is gone, we are still facing the same problem.
> 
> Produce details:
> 1, we can stable reproduced it on xulrunner: 10.0.6esr, OS: Mac OS 10.11 &
> 10.12.
> 2, we collected nspr logs when it reproduced. We can see from logs that
> there is a dead loop which might result from PR_Poll error [-5987].
> 4, we cannot access any url anymore once it reproduced.
> 3, I have print out the value of n in
> "netwerk/base/src/nsSocketTransportService2.cpp/nsSocketTransportService::
> DoPollIteration(bool wait)"
> nsresult
> nsSocketTransportService::DoPollIteration(bool wait)
> {
>    ...
>    int32_t n = Poll(wait, &pollInterval);
> 
>    if (n < 0) {
>         SOCKET_LOG(("  PR_Poll error [%d]\n", PR_GetError()));
>     }
>    else {
>       ...
>    }
>    return NS_OK;
> }
> and n always equals -1 when reproduced.

After collecting more logs, we found:
1. Once Poll(wait, &pollInterval) returns -1, all the following Polls return -1 too, since there is no error handling in the if (n < 0) branch beyond an error log.
2. The PR_Poll error [-5987] is PR_INVALID_ARGUMENT_ERROR (invalid argument).

Current workaround:
Since once there is a failed socket in mPollList (the one that makes Poll(wait, &pollInterval) return -1), it never disappears, and the whole thread keeps polling endlessly with the same failure result, we added some error-handling code in the if (n < 0) branch to free and reset mPollList, mActiveList, and mIdleList.

This workaround does work, but it is crude: it throws away the other sockets in the poll list along with the failed one, and it doesn't fix the root of the problem.
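
Roughly, the workaround looks like this (an approximate reconstruction, not a reviewed patch; the DetachSocket() calls and loop bounds are our best recollection):

// Sketch of our workaround: on a Poll() failure, drop every socket so the
// next iteration starts from a clean poll list instead of re-polling the
// bad descriptor. Approximate, reconstructed code.
if (n < 0) {
    SOCKET_LOG(("  PR_Poll error [%d] os error [%d]\n",
                PR_GetError(), PR_GetOSError()));
    // Detach from the back so list compaction doesn't skip entries.
    for (int32_t i = int32_t(mActiveCount) - 1; i >= 0; --i) {
        DetachSocket(mActiveList, &mActiveList[i]);
    }
    for (int32_t i = int32_t(mIdleCount) - 1; i >= 0; --i) {
        DetachSocket(mIdleList, &mIdleList[i]);
    }
}

This throws away live connections, which is why we consider it only a stopgap.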
Dragana, is there a possibility for a quick fix here?  Could it be something affecting more users?
Flags: needinfo?(dd.mozilla)
I will look into this. Comment 38 is interesting.
Nick, can you make a patch to work around this problem, or change the poll parameters to fix this error? You already looked into this bug and the poll errors.
Flags: needinfo?(dd.mozilla) → needinfo?(hurley)
(In reply to Dragana Damjanovic [:dragana] from comment #41)
> Nick, can you make a patch to try to work around this problem or cahange
> poll parameters to fix this error. You already looked into this bug and poll
> errors.

I can put this on my personal backlog, but I won't be able to get to it until next week at the earliest, given my impending PTO and other, higher-priority things I need to get to today. Feel free to take this (or ask someone else to) if it needs doing before then.
Flags: needinfo?(hurley)
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P1 → P3

I just encountered this bug on 67 on OS X 10.14.3.
The networking log file is > 2 GB, full of the same error over and over again.

How I got into the bug: when I woke my laptop up from sleep, Firefox would no longer make any network connections. Before putting it to sleep (lid open) I was in a WebRTC call, which I ended normally before putting the laptop at my desk and locking it via the OS.

Hey Nils,
Can you still reproduce this issue or should we close it?

Flags: needinfo?(drno)

Marking this as Resolved > Incomplete as per the reporter's lack of response.
If anyone can still reproduce this issue, re-open it or file a new bug.

Status: REOPENED → RESOLVED
Closed: 7 years ago → 3 years ago
Resolution: --- → INCOMPLETE

Declaring NI bankruptcy.

Flags: needinfo?(drno)